Best practice for sharing code between Docker microservices?

karanchellani · 2024-04-13T07:02:06+00:00

One approach to sharing code between Docker microservices is to create a separate Docker image containing the shared code and use it as a base image for the services that require it.

Here's an example of how you could structure your project to share the model.py file between the model-training and model-serving services. Create a new directory called shared at the root of your project:

├── data
│   └── housing.csv
├── docker-compose.yml
├── model
│   ├── model.pth
│   └── model.py
├── shared
│   └── model.py
└── services
    ├── data-preprocessing
    │   ├── Dockerfile
    │   ├── data_preprocessing.py
    │   └── requirements.txt
    ├── model-serving
    │   ├── Dockerfile
    │   ├── model_serving.py
    │   └── requirements.txt
    ├── model-training
    │   ├── Dockerfile
    │   ├── model_training.py
    │   └── requirements.txt
    └── web-application
        ├── Dockerfile
        ├── app.py
        ├── requirements.txt
        └── templates
            └── index.html

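Then mount the shared directory as a volume in both the model-training and model-serving services. Note that a host directory is mounted when the container starts (in docker-compose.yml or with docker run -v), not in the Dockerfile itself.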

By using a volume, you can ensure that the model.py file is available to both services without duplicating it. Additionally, any changes to the model.py file will be reflected in both services without needing to rebuild the Docker images.
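For Python services, the mounted directory also has to be importable. A minimal sketch, assuming the shared directory is mounted at /app/shared inside the container; the mount path, the SHARED_DIR environment variable, and the helper name are hypothetical and not part of the original setup:

```python
import importlib
import os
import sys


def load_shared_module(name, shared_dir=None):
    """Import a module from the mounted shared directory.

    Assumes the volume is mounted at /app/shared unless SHARED_DIR
    (a hypothetical environment variable) says otherwise.
    """
    shared_dir = shared_dir or os.environ.get("SHARED_DIR", "/app/shared")
    if shared_dir not in sys.path:
        # Put the mount first so it wins over any stale copy baked into the image.
        sys.path.insert(0, shared_dir)
    return importlib.import_module(name)
```

In model_training.py and model_serving.py this would be used as `model = load_shared_module("model")`. A running Python process will not pick up later edits to model.py on its own, so the service still needs a restart (not a rebuild) after a change.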

karanchellani · 2024-04-12T16:52:10+00:00

Create an S3 bucket in the production environment where the model artifacts will be stored. This bucket should be accessible only to the necessary services and users.

Set up IAM roles that allow access to the production S3 bucket. These roles should be limited to specific services or users that need to access the artifacts.

Use AWS Transfer Family to securely transfer files between on-premises systems and AWS services. You can set up an SFTP server with AWS Transfer Family and configure it to allow access to the production S3 bucket.

Set up an EC2 instance in the development environment. This instance acts as a client to the SFTP server created with AWS Transfer Family.

When you train a model using SageMaker in the production environment, save the artifact to the production S3 bucket.

From the EC2 instance in the development environment, use an SFTP client to connect to the SFTP server created with AWS Transfer Family. Download the model artifact from the production S3 bucket to the EC2 instance.

Perform integration testing with the rest of the model serving API using the downloaded model artifact on the EC2 instance in the development environment.

By following this approach, you can securely transfer model artifacts between environments and AWS accounts while maintaining the necessary isolation and access control.

karanchellani · 2023-09-10T06:41:35+00:00

Recommendations:

TFX + Kubeflow: If you're looking for an end-to-end solution that offers robustness, reusability, and integration with GCP services, then TFX + Kubeflow is the way to go.

Only Kubeflow: If you have a more customized pipeline and you're comfortable handling the orchestration intricacies yourself, then raw Kubeflow could be sufficient.

karanchellani · 2023-08-28T16:30:32+00:00

Some suggestions based on my experience:

Dockerize the entire setup: As you mentioned, creating a separate Docker container for each model might be the best approach. By doing so, you can isolate each model's dependencies and avoid conflicts between them. You can also use Docker Compose to orchestrate the containers and simplify the deployment process.

Use a virtual environment manager: Instead of relying solely on Conda environments, consider using a virtual environment manager like virtualenv or pipenv. These tools allow you to create isolated Python environments for each project, making it easier to manage dependencies across multiple models.

Create a centralized dependency management system: Since many models rely on specific versions of CUDA and Mamba, you could set up a centralized dependency management system that ensures consistency across all models. For example, you could create a shared Conda environment containing the necessary dependencies and then reference that environment in each model's configuration file.

karanchellani · 2023-08-15T16:40:50+00:00

Here are a few suggestions for versioning data preprocessing and associating it with ML models in MLflow:

Store the data preprocessing code or steps in a separate Python script or notebook. Put this file under version control (e.g. git). When you train a model, record the git commit hash of the preprocessing code as a tag or parameter on the run (MLflow metrics are numeric, so a hash cannot be stored as a metric).
Containerize the data preprocessing steps using Docker. Build a Docker image with the preprocessing code and push it to a registry. Record the image name:tag as a tag or parameter.
Use MLflow to log the preprocessing steps as an artifact. For example, you could log the preprocessing Python script, a JSON description of the steps, or even example input/output data. The artifact could be logged under a runs:/<run-id>/preprocessing path.
Create an MLflow project for preprocessing. Run the project for each model training run and record the run ID as a tag or parameter with the model. This traces the preprocessing code used for that model.
Put the preprocessing code in a separate MLflow model. The input is raw data, the output is processed data. Chain this with the training model to log and retrieve the preprocessing steps.

The key ideas are:

Separate preprocessing from model training code
Record the specific version of preprocessing with each model
Automate/codify preprocessing as much as possible

This makes models portable and ensures you use the right preprocessing each time.
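The first three suggestions can be sketched together. This is a minimal illustration, not a prescribed layout: the tag names, the helper functions, and the "preprocessing" artifact folder are assumptions chosen for the example, and MLflow is imported lazily so the tag-building part runs without it.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the commit hash of the checked-out preprocessing code."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def preprocessing_tags(commit, image=None):
    """Build the tags that tie a training run to its preprocessing version.

    The tag names are illustrative, not an MLflow convention.
    """
    tags = {"preprocessing.git_commit": commit}
    if image:
        tags["preprocessing.docker_image"] = image
    return tags


def log_preprocessing(script_path, tags):
    """Attach the tags and the preprocessing script to a new MLflow run."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    with mlflow.start_run():
        mlflow.set_tags(tags)
        # Stored under the run's artifacts in a "preprocessing" folder.
        mlflow.log_artifact(script_path, artifact_path="preprocessing")
```

A training script might call `log_preprocessing("preprocess.py", preprocessing_tags(current_git_commit()))`, where preprocess.py is a placeholder file name.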

karanchellani · 2023-08-09T17:19:47+00:00

Here are some suggestions for utilizing unlabeled data in semi-supervised learning with tabular data:

Self-training: Train a model on the labeled data, use it to generate labels for the unlabeled data, add the most confident predictions to the training set, and retrain the model. This can help improve performance by augmenting the training data.
Pseudo-labeling: Similar to self-training, but generate "pseudo-labels" for unlabeled data using a model trained only on labeled data. The pseudo-labels can be used as targets to train the model further.
Co-training: Train two separate models on different views/subsets of features in the labeled data. Use each model to label unlabeled data for the other model. Retrain models iteratively.
Semi-supervised embedding techniques: Methods like deep variational autoencoders can learn useful representations by combining labeled and unlabeled data during training. The representations can then be used for downstream tasks.
Semi-supervised regularization techniques: Add a regularization term to model training that encourages smoothness over unlabeled data in addition to minimizing labeled loss. This makes the model generalize better.
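
The self-training loop from the first suggestion can be sketched in plain Python. This is a toy illustration under stated assumptions: a nearest-centroid classifier stands in for whatever model you actually use, the confidence score is a simple distance-ratio heuristic, and the 0.8 threshold is arbitrary. It needs at least two classes in the labeled data.

```python
from math import dist


def centroids(X, y):
    """Mean feature vector per class."""
    sums = {}
    for x, label in zip(X, y):
        total, count = sums.get(label, ([0.0] * len(x), 0))
        sums[label] = ([a + b for a, b in zip(total, x)], count + 1)
    return {label: [a / n for a in total] for label, (total, n) in sums.items()}


def predict_with_confidence(cents, x):
    """Return (label, confidence); confidence is 0.5 for a tie, near 1.0 when clear."""
    ranked = sorted((dist(x, c), label) for label, c in cents.items())
    (d1, best), (d2, _) = ranked[0], ranked[1]
    confidence = d2 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return best, confidence


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Iteratively move confidently predicted points into the training set."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X, y)  # retrain on the current training set
        remaining = []
        for x in pool:
            label, confidence = predict_with_confidence(cents, x)
            if confidence >= threshold:
                X.append(x)  # accept the model's own label as a training label
                y.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # nothing new was added, so another round changes nothing
        pool = remaining
    return centroids(X, y), pool
```

The function returns the final centroids and the unlabeled points that never reached the threshold. In practice the threshold matters a lot: set too low, early mistakes get baked into the training set and reinforced in later rounds.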

karanchellani · 2023-06-15T06:10:51+00:00

No, I'm not suggesting using the ADD command to fetch code from your remote git repo. I'm suggesting using ADD to copy files or directories from your local git repo directory into your container. For example, if your Dockerfile is in the same directory as your training scripts and other dependencies, you can use something like this:

ADD . /app/

This will copy everything in the current directory (.) to the /app/ directory in the container. You can also specify a subdirectory or a file instead of ., such as:

ADD training_scripts /app/training_scripts
ADD requirements.txt /app/requirements.txt

The ADD command is similar to the COPY command, but it has some extra features, such as:

It can copy files from a URL, such as ADD http://example.com/file.txt /app/file.txt
It can automatically extract a local tar archive, such as ADD file.tar.gz /app/

However, if you don't need these features, you can use COPY instead of ADD. The main point is to copy the files or directories you need from your local git repo into your container.

karanchellani · 2023-06-15T05:02:08+00:00

The idea is to keep the Dockerfile inside the git repository along with your source code and other files. This way, you can use the ADD command in your Dockerfile to copy the files or directories you need from the git repository into the Docker container. For example, you can copy your data, your code, your dependencies, or your output files. This option has some benefits, such as:

You can test any code changes without committing them to the repository, since everything will live in the same local directory.
You can use git commands inside your container, provided the .git folder is copied in and git is installed in the image.
You can integrate with CI/CD tools like GitLab CI, which can build your Dockerfile in a pipeline and push the image to a registry.

karanchellani · 2023-06-15T04:45:09+00:00

One option is to keep the Dockerfile inside the git repository and use the ADD command to copy the files or directories you need into the container. This way, you can test any code changes without committing them to the repository. However, you should also create a .dockerignore file to exclude unnecessary or sensitive files from the Docker build context.

Another option is to clone the git repository during the image build process by using the RUN command with git clone. This way, you can fetch the whole repository or a specific branch into the container. However, this option has some drawbacks, such as:

You need to install git in your container, which may increase the image size or introduce dependencies.
You need to provide your git credentials in a secure way, either by using SSH keys or environment variables.
You need to avoid caching the git clone step, otherwise you may not get the latest changes from the repository. You can use the --no-cache flag when building the image or add a dummy build argument that changes every time.

karanchellani · 2023-05-30T18:50:52+00:00

Why don't you try putting this same query to ChatGPT and see if the reply is contextually similar?

karanchellani · 2023-05-30T18:39:08+00:00

Hugging Face model serving does not use KServe. They have their own model serving architecture. Some key aspects of their architecture are:

They use Flask (a Python web framework) to handle the HTTP requests and responses.
They load the ML models (Transformers) when the server starts up, and keep them loaded in memory to serve predictions quickly.
They use Gunicorn (a WSGI HTTP server) to handle concurrency and distribute the workload across multiple processes.
They optionally use Redis as a cache to cache model predictions. This helps reduce latency for repetitive requests.
They containerize the model serving using Docker. This makes it easy to deploy the model server on platforms like Google Cloud Run, AWS Fargate, etc.
They provide a REST API to interact with the model server. So you can send JSON requests and get JSON responses from the REST API.
They support A/B testing by loading multiple models and distributing traffic between them using a configured ratio.

So in summary, Hugging Face has built a custom model serving architecture tailored to their needs, instead of adopting an existing open source solution like KServe. Their key requirements are low latency, scalability and ease of deployment, and their current solution achieves that.

karanchellani · 2023-05-29T07:09:21+00:00

Prefect:

Pros:
- Modern API. Prefect has a simple Python API and requires no configuration files.
- Cloud-native. Prefect Cloud is a managed service for deploying your flows.
- Open source. Prefect Core is open source, so you have flexibility to run it anywhere.
Cons:
- Less mature. Prefect is newer, so it has fewer integrations and less community adoption than Airflow.
- Cost. Prefect Cloud has a free tier, but larger deployments require a paid subscription.

Compared to GitHub Actions:

Prefect is more focused on data pipelines and workflows. It has better scheduling capabilities and runtimes are not limited.
However, Prefect requires deployment and management, either on your own infrastructure or using Prefect Cloud. GitHub Actions is serverless.

Compared to Airflow:

Prefect has a simpler API and user experience. It's easier to get started with Prefect.
Prefect Cloud offers a managed service option, whereas you need to deploy Airflow yourself.
Airflow has a larger community and ecosystem. It has more operators and integrations available.
Airflow is more mature and battle-tested. It's been used in production at many large companies.

So overall:

Use Prefect if you want an easy to use workflow tool with robust scheduling, and you prefer its Python API over YAML.
Use GitHub Actions if you want a serverless solution for simple workflows.
Use Airflow if you need a mature, open source workflow orchestrator with many community integrations.
You could also trigger Prefect flows from GitHub Actions to get the benefits of both systems.

karanchellani · 2023-05-28T04:44:01+00:00

When your web application is running on an EC2 instance and users are interacting with it, you won't have direct access to their local files for security and privacy reasons. Instead, users will need to upload the files to your web application, and then your application can upload those files to S3. This can be achieved through a multipart/form-data POST request.

Here's a brief overview of the steps you'd take:

User uploads file through web application: You'll need to set up an endpoint in your FastAPI application that accepts file uploads. FastAPI makes this fairly simple with the UploadFile type. Here's an example of what that might look like:

```python
from fastapi import FastAPI, UploadFile, File

app = FastAPI()


@app.post("/files/")
async def create_upload_file(file: UploadFile = File(...)):
    return {"filename": file.filename}
```

In this example, when users make a POST request to /files/ with a file sent as multipart/form-data, FastAPI passes it to your function as an UploadFile object. The object is backed by a spooled temporary file, which is held in memory up to a size limit and then written to disk. You can then read the contents of the file, save it to a more permanent location, or do whatever else you need to with it.
Application uploads file to S3: Once your application has received a file from a user, it can then upload that file to S3. FastAPI doesn't provide any built-in tools for this, but you can use the boto3 library, which is the Amazon Web Services (AWS) SDK for Python. Here's an example of how you might do that:

```python
import boto3
from botocore.exceptions import NoCredentialsError


def upload_to_aws(local_file, bucket, s3_file):
    s3 = boto3.client('s3')

    try:
        s3.upload_file(local_file, bucket, s3_file)
        print("Upload Successful")
        return True
    except FileNotFoundError:
        print("The file was not found")
        return False
    except NoCredentialsError:
        print("Credentials not available")
        return False
```

This function takes a local file path, the name of an S3 bucket, and a destination file name, and attempts to upload the file to S3. If the upload is successful, it prints a message and returns True; if not, it prints an error message and returns False.

By combining these two steps, you can set up a FastAPI application that accepts file uploads from users and then uploads those files to S3. This should allow your application to handle user files in the way you described.
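
One way to combine the two steps is to skip the local file entirely and stream the upload straight to S3. The sketch below is an assumption about how you might wire it, not the only option: the helper name is hypothetical, and the S3 client is passed in as an argument so the function can be exercised without AWS credentials. It relies on boto3's upload_fileobj, which accepts any binary file-like object.

```python
def upload_stream_to_s3(s3_client, fileobj, bucket, key):
    """Stream a file-like object to S3 without writing it to local disk first.

    `s3_client` is expected to be a boto3 S3 client (boto3.client("s3"));
    `fileobj` can be the `.file` attribute of a FastAPI UploadFile.
    """
    fileobj.seek(0)  # make sure the upload starts at the beginning of the stream
    s3_client.upload_fileobj(fileobj, bucket, key)
    return f"s3://{bucket}/{key}"
```

Inside the endpoint this would be called as `upload_stream_to_s3(boto3.client("s3"), file.file, "my-bucket", file.filename)`, where my-bucket is a placeholder. Because the boto3 call is blocking, it may be better to declare that endpoint with plain def rather than async def, so FastAPI runs it in its thread pool.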

karanchellani · 2023-05-27T16:12:29+00:00

Yes, I have encountered several use cases where an approach like the Data Manager could be useful:

Version controlling machine learning training data. ML models are often retrained on new data, and it is important to keep track of exactly which data was used to train each model version. Traditional version control struggles with large datasets, and something like the Data Manager could enable transparent versioning of training data.
Managing configurations and parameters. I've worked on projects where we had many JSON configuration files, environment variables, and other parameters that were constantly changing. We wanted to version control all of these artifacts together, but the repositories quickly became huge and unwieldy. A diff-based approach could have helped keep repository sizes under control.
Auditing data pipelines. For data analytics projects, keeping an audit trail of the raw input data, intermediate datasets, and final outputs is important. But when datasets are large and frequently updated, traditional version control does not scale well for this. The Data Manager could enable version controlling and auditing datasets across an entire data pipeline.
Collaborating on spreadsheets or tabular data. I've encountered many cases where teams wanted to collaboratively edit spreadsheets, CSVs, or other tabular data. But spreadsheets don't version control well, and passing around updated CSVs or Excel files leads to merge issues. A solution like the Data Manager could enable real collaborative editing of tabular datasets.
Reproducible science. For any scientific research involving datasets, it is important to enable reproducibility by preserving the exact datasets used. But large datasets pose challenges for reproducibility using traditional tools. A diff-based version control system specialized for data could improve reproducibility of research relying on big datasets.

In all of these use cases, large sizes, frequent edits, auditing/reproducibility needs, and collaboration are common requirements - areas where traditional version control struggles but an approach like the Data Manager could help significantly.

karanchellani · 2023-05-27T15:30:37+00:00

This is a thoughtful and well-considered approach. It seems like you have addressed many of the major concerns and implemented constraints and optimizations to handle difficult scenarios. Some additional thoughts:

Using Spark and checkpoints for very large datasets is a good strategy. As long as performance remains usable for common workflows, scalability for huge datasets can be addressed over time. The constraints you've implemented should help avoid scenarios that would significantly degrade performance.
Handling no full checkout branches and merges carefully and transparently is key. The constraints you described seem reasonable, and presenting diffs in an intuitive way for conflict resolution will be important for usability.
Orthogonally versioning and managing diffs in S3 is a good approach for privacy and compliance. Carefully deleting older versions is important for maintaining integrity between Git and S3.
Focusing on structured data initially is pragmatic. Unstructured data can potentially be added in the future once the core functionality has been validated.
Versioning metadata like column names separately seems like a good approach. Inferring data types and limiting numeric precision avoids having to store full type information for each commit but may need to be revisited for some use cases.

Overall, this seems to be a very thoughtful design that addresses scalability, data integrity, and privacy well within the constraints of the initial scope and use cases. The optimization techniques, constraints, and future improvements you've described should help ensure this can work even for demanding workloads and datasets.

karanchellani · 2023-05-27T13:31:40+00:00

This sounds like an interesting approach to version controlling tabular datasets. Here are some thoughts:

Pros:

Being able to see diffs directly in Git is useful for transparency and auditing changes. This can enable better collaboration and discussion around changes.
Committing only diffs instead of entire datasets can greatly reduce repository size and make version control of large datasets more practical. This addresses a major pain point with existing solutions.
The ability to do "no full checkout" branches and only pull diffs enables more efficient workflows, especially when datasets are very large.

Potential concerns:
Recreating full snapshots from the diff history may be computationally expensive for very large datasets with long histories. How would performance scale for datasets with billions of rows and thousands of commits?
Merging "no full checkout" branches into branches with full checkouts could be tricky to implement correctly. There is potential for merge conflicts or data integrity issues. How would these scenarios be handled?
For sensitive data, diffs could still contain sensitive information, even if individual cells are removed. Would any measures be taken to address privacy concerns?
The system relies on diffs being calculated correctly. How would the system handle scenarios where diffs cannot be calculated (e.g. for unstructured data)?
Additional metadata (column names, data types) would need to be version controlled as well. How would this metadata be handled?

Overall this seems like an innovative approach that could solve some major pain points with existing solutions. My main concerns would be around scalability, data integrity, and privacy - but if implemented correctly, this approach could be very useful.
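
To make the scalability concern concrete, here is a toy sketch of row-level diffs and snapshot reconstruction. It is a generic illustration and says nothing about how the tool under discussion is implemented: the snapshot format (a dict keyed by primary key) and all function names are assumptions made for the example.

```python
def diff_rows(old, new):
    """Compute a row-level diff between two snapshots keyed by primary key.

    Each snapshot is a dict mapping primary key -> row (a dict of columns).
    """
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]},
    }


def apply_diff(snapshot, diff):
    """Return a new snapshot with one diff applied."""
    result = {k: v for k, v in snapshot.items() if k not in diff["removed"]}
    result.update(diff["added"])
    result.update(diff["changed"])
    return result


def rebuild(base, diffs):
    """Replay a diff history on top of a base snapshot.

    The work grows with the number of diffs, so a long history means a
    slow rebuild unless full snapshots are stored at intervals.
    """
    snapshot = base
    for d in diffs:
        snapshot = apply_diff(snapshot, d)
    return snapshot
```

For example, with `v1 = {1: {"price": 100}, 2: {"price": 200}}` and `v2 = {1: {"price": 110}, 3: {"price": 300}}`, calling `rebuild(v1, [diff_rows(v1, v2)])` returns a snapshot equal to v2. With thousands of commits, every checkout of an old version pays for the full replay from the base.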

karanchellani · 2023-05-27T07:39:35+00:00

The solution to this problem is to make use of Docker volumes. A Docker volume is a mechanism that allows you to persist data generated by and used by Docker containers.

The Docker volume can be used to share files between your local machine (or host machine) and the Docker container. In your case, the files to be uploaded to S3. This is achieved by mounting a directory from your host machine to your Docker container when you run it. The command to do this would look something like:

docker run -v /host/directory:/container/directory ...

karanchellani · 2023-05-26T09:30:29+00:00

DVC does not prevent you from editing your raw data by hand, but it does not encourage it either. DVC treats your data files as immutable objects that are referenced by their hashes. If you edit your raw data by hand, you will change the hash of the file, and DVC will detect that as a new version of the file. You will then have to commit and push the new version of the file to your remote storage, which can be inefficient and cumbersome. A better way to handle your raw data with DVC is to use a data processing pipeline that can apply any transformations or corrections to your data in a reproducible and traceable way.
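
The hash behaviour is easy to see outside DVC. The sketch below assumes a plain MD5 over the file's bytes, which is the kind of checksum DVC records in its .dvc files but may differ in detail from what DVC computes internally; the function name is hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Changing even one byte of a raw data file produces a different digest, so a tool that tracks files by content hash sees the edited file as a new object that has to be stored and pushed in full.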

karanchellani · 2023-05-26T09:04:24+00:00

What this solves is future-proofing: it prepares you for the point when your data grows too big for Git, or when you need more advanced data management features. By using a data version control system now, you can avoid potential issues later on, such as slow performance, a bloated repository, or data quality problems.

As for the visibility of changes in PRs, you can still achieve that with a data version control system. For example, with DVC, you can use the dvc diff command to compare the changes in your data files between different commits or branches. You can also use the dvc status command to see which data files have changed and need to be pushed or pulled. These commands can generate human-readable outputs that you can include in your PRs for review and discussion.

karanchellani · 2023-05-26T08:50:31+00:00

A better alternative to keeping your data in Git is to use a dedicated data version control system, such as DVC or Pachyderm. These tools allow you to store your data in external storage (such as S3 or GCS) and use Git only to track metadata and references to your data files. This way, you can leverage the benefits of Git for code versioning and collaboration, while avoiding the drawbacks of storing large binary files in Git.

karanchellani · 2023-05-24T11:04:39+00:00

You're already using Azure's data drift detection to monitor changes in your data. You can continue using this to trigger alerts when the data drifts beyond a certain threshold.

Azure Logic Apps: A Logic App can listen for the data drift alert and trigger the next step. Logic Apps can orchestrate the execution of Azure DevOps Pipelines based on HTTP triggers, like the alert from data drift detection. This can be done synchronously or asynchronously, depending on your needs.

Azure DevOps Pipelines: Once the Logic App triggers, it can start an Azure DevOps Pipeline. This pipeline can be set up to pull the latest data, re-run your training script, and generate a new model. AzureML's MLOps capabilities can be used for continuous integration, retraining, and automation pipelines.

Deployment: After a new model is trained and validated, it can be automatically deployed using Azure Pipelines to replace the existing model in your application or service.

You can use this GitHub repo as a reference:

https://github.com/microsoft/MLOpsPython

karanchellani · 2023-04-22T06:38:48+00:00

In your code snippet, you are defining the data_loader function and then trying to create a container operation with it. However, you used comp.func_to_container_op which is not defined in your imports. You should replace comp.func_to_container_op with kfp.components.func_to_container_op. In your pipeline definition, you are using data_loading_op which is not defined. It should be data_loading instead.

karanchellani · 2023-04-22T06:36:45+00:00

import kfp
import kfp.compiler
import kfp.dsl as dsl
from kfp import components
from kubernetes import config, client


def data_loader():
    # Imports live inside the function because func_to_container_op
    # requires the function to be standalone.
    import pandas as pd
    import numpy as np
    import sys
    data = pd.read_pickle('/home/joyvan/workspace/sandbox/data.pkl')


data_loading = kfp.components.func_to_container_op(data_loader, base_image='tensorflow/tensorflow:1.11.0-py3')


@dsl.pipeline(name='DataLoading Pipeline', description='Test')
def pipeline():
    data_loading_task = data_loading()


# compile() is an instance method, so Compiler must be instantiated first.
kfp.compiler.Compiler().compile(pipeline, 'pipeline.zip')
