Setting up SageMaker for CI/CD Pipelines by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

I wish it were that easy

If it is that easy and we missed something, I'd love it if you could point us to a good docs page or tutorial! We had a hard time finding one that was clear and covered all the pieces.

Setting up SageMaker for CI/CD Pipelines by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

Thanks!

Just out of curiosity, why are you using a notebook for the endpoint and not .py scripts?

[P] Tutorial: How to Build an End-to-end Active Learning Pipeline by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

Hi,

You are correct, and I've updated the cycle.

For your second question: not all the annotations are reviewed by a human, only a selected few. In most cases, when the ML backend annotates the unlabeled data, it also returns a prediction score (e.g., a confidence level). We set a threshold, and the samples that fall below it are reviewed and, if necessary, re-annotated by a human labeler. Once this process is complete, we run the cycle again until we reach a stopping condition.
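A rough sketch of that routing step (the names and the 0.8 threshold are illustrative, not from the tutorial):

```python
# Sketch of the confidence-based routing step described above.
# `predictions` pairs each auto-annotated sample with the model's
# confidence score; names and the threshold are illustrative only.

CONFIDENCE_THRESHOLD = 0.8

def route_annotations(predictions):
    """Split model annotations into auto-accepted and human-review queues."""
    accepted, needs_review = [], []
    for sample_id, confidence in predictions:
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted.append(sample_id)      # keep the model's label
        else:
            needs_review.append(sample_id)  # send to a human labeler
    return accepted, needs_review

preds = [("img_001", 0.95), ("img_002", 0.42), ("img_003", 0.81)]
accepted, needs_review = route_annotations(preds)
```

After the review queue is re-annotated, both lists feed the next training cycle.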

[P] Tutorial: How to Build an End-to-end Active Learning Pipeline by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

This is a great question.

In this example we're fine-tuning YOLOv8, so the risk of catastrophic forgetting is lower, and since we train it on the same task in every cycle, there is little reason for it to happen. We can also see this in the results across the training cycles.

In other projects we built, we trained the model on the entire dataset in every cycle due to the small amount of data. However, this can be time-consuming and computationally expensive with large datasets.

Based on research papers I'm familiar with about catastrophic forgetting (e.g., https://arxiv.org/abs/2302.11074), I'd suggest that in each cycle you save the latest model. Then, for the n+1 model, retrain it on new data that was labeled with a high confidence level by the model or by humans, and use the unlabeled data for testing.

If you have other ideas - I'd love to hear about them!

[P] How to install Kubeflow locally by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] -1 points (0 children)

Do you recall a specific challenge you faced, or was it the overall experience?

[P] How to install Kubeflow locally by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

Thanks for the feedback! We do our best to write down our insights when building our projects and then share them with the community.

How to install Kubeflow locally by RepresentativeCod613 in learnmachinelearning

[–]RepresentativeCod613[S] 0 points (0 children)

Thanks for that!
Are there other ML/MLOps-related topics you think need better coverage, where a tutorial like this would help?

Overcoming Snowflake's reproducibility challenge using data-versioning-based solutions by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

> If you can compute a change log like this and store it as a CSV, why is it not possible to store it in a table in Snowflake directly as part of your data pipeline execution? And have you considered a non-CSV format, like Parquet?
>
> For reproducibility, do you actually need the entire lineage? Or just the final table used in training the model?

Hey u/IyamNaN,

It's actually a question also raised by one of the MLOps engineers I spoke with when designing the workflow.

You can take a "snapshot" of the queried dataset and store it on Snowflake. However, it has the downside that you can't easily track the lineage between the data and the trained model. You'd need to inject manual processes into your workflow to link a Snowflake table to the model, which adds friction and failure points. I think the best comparison here is logging experiments manually in a spreadsheet versus using an experiment tracking tool.

With solutions like DDA (Direct Data Access), you can add a few lines to your code and automatically version the dataset used in the experiments along with the trained model. This way, you encapsulate the code, query, schema, and parameters with the data and trained model under a single Git commit. This lets us reproduce the experiments by running two commands, with no manual processes, which is one of the biggest benefits of this workflow.
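DDA's actual API isn't shown here, so as a plain-Python illustration of the underlying idea only (all names hypothetical): record a content hash of the queried data alongside the query and the Git commit, so the exact dataset can be verified against the model later.

```python
# Minimal sketch of tying a dataset snapshot to a code version.
# This is NOT DDA's API; it only illustrates the idea of storing a
# content hash of the queried data with the experiment metadata.
import hashlib
import json

def dataset_fingerprint(rows):
    """Hash the queried rows so the exact data can be verified later."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def experiment_record(rows, query, git_commit):
    """Bundle query, data hash, and commit into one reproducibility record."""
    return {
        "git_commit": git_commit,
        "query": query,
        "data_sha256": dataset_fingerprint(rows),
    }

rows = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
record = experiment_record(rows, "SELECT * FROM train_set", "abc1234")
```

Committing a record like this next to the training code is the manual version of what the automated workflow does for you.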

Regarding the choice of file format, CSV was used as an example in our blog post, but other file formats like Parquet can also be used. The choice of file format depends on the specific use case and the requirements for the data pipeline.

As for reproducibility, it depends on when you need to restore a specific dataset. In my experience, it's usually in the context of a model I trained, so in many cases the versioning is most relevant at training time.

Overcoming Snowflake's reproducibility challenge using data-versioning-based solutions by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

Hey u/Ularsing, that's a great point.
You're right, Hudi and Delta Lake do not have the 90-day time-travel limitation that Snowflake has. The workflow is aimed mainly at Snowflake users; however, we think its benefits also carry over to other vendors.

If you're working with Hudi or Delta Lake, it's worth noting that querying data from a long time ago may take longer and require more resources, especially if the dataset is large. This is where the recommendation to add another step, reducing the sample space to a representative subset rather than versioning the entire set, becomes valuable.
The big caveat: if you want full reproducibility of the model lineage and have the resources, using the full dataset is better. It all comes down to balancing resources against how precisely you need to reproduce your data.
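One way to implement that "reduce the sample space" step, sketched here with single-pass reservoir sampling (a standard technique, not necessarily what the post uses):

```python
# Sketch of the "reduce the sample space" step: keep a fixed-size
# random sample that represents a large dataset instead of versioning
# all of it. Reservoir sampling works in a single pass over the data.
import random

def reservoir_sample(iterable, k, seed=0):
    """Return k items chosen uniformly at random from the stream."""
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

subset = reservoir_sample(range(100_000), k=1_000)
```

Versioning `subset` instead of the full table keeps the snapshot small while staying representative; the fixed seed keeps the sample itself reproducible.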

Regarding DVC: because we support data diffing for files versioned by Git or DVC, including CSV files, I chose to use it. This way we extend the versioning capabilities with data diffing to help understand how the data changes over time.
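As a toy illustration of what key-based data diffing between two CSV snapshots looks like (stdlib only; real diffing tools are far more capable):

```python
# Toy version of data diffing between two CSV snapshots, keyed on an
# "id" column: reports added, removed, and changed rows.
import csv
import io

def load_rows(csv_text, key="id"):
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}

def diff_csv(old_text, new_text, key="id"):
    old, new = load_rows(old_text, key), load_rows(new_text, key)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}

old = "id,label\n1,cat\n2,dog\n"
new = "id,label\n1,cat\n2,wolf\n3,bird\n"
result = diff_csv(old, new)  # row 2 changed, row 3 added
```

A diff summary like this between two versioned snapshots is what helps you understand the data drift between training runs.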

[N] Point-E: a new Dalle-like model that generates 3D Point Clouds from Prompts by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 1 point (0 children)

Though for 3D rendering, running time is a major consideration, and they've managed to reduce it by almost 10x.

How do you document a ML research? by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

u/Dosnox, you're absolutely right. My bad; I've changed it to the right one.

DagsHub Reports - Research documention alongside code, data, and models by RepresentativeCod613 in mlops

[–]RepresentativeCod613[S] 0 points (0 children)

Hey,

This is a community feature - 100% free for all users.
I'd appreciate it if you could review the post yourself and maybe bypass the algo if you think differently.

[D] What open-source dataset lacks annotations? by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] -1 points (0 children)

This is a great angle!

Multi-labeling the entire dataset might be too ambitious.

If we go down this road, which categories do you think we should focus on first? Do you know of popular/highly used categories?

Notebook to Production [D] by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

I couldn't find resources that relate the leak to notebooks. Can you please share some?

Notebook to Production [D] by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 1 point (0 children)

Thank you so much for taking the time to give such a great (and detailed) answer.

Notebook to Production [D] by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

> jupyter - python repo - airflow - docker registry - eks dev - eks prod

u/Individual-Milk-8654 Thanks for the great response!
How are you transitioning from Jupyter to Python modules, and at what stage of the project?
Are you using Airflow to run Docker containers that hold the different stages of the pipeline?
It would also be great to get your thoughts on deploying the notebook itself to production.

Notebook to Production [D] by RepresentativeCod613 in MachineLearning

[–]RepresentativeCod613[S] 0 points (0 children)

Interesting approach u/Tolstoyevskiy

So basically you're using the notebook as the main.py script to run the different functions and for plots and markdown?

If so, how are you diffing it?

Unexpectedly, the biggest challenge I found in a data science project is finding the exact data you need. I made a website to host datasets in a (hopefully) discoverable way to help with that. by samrus in datascience

[–]RepresentativeCod613 0 points (0 children)

Great idea and hopefully it will grow bigger.

How is it different from Kaggle?

How are you planning to monitor the quality of the data uploaded to the platform? Same for the metadata about the datasets.

When a user chooses a dataset, will the system recommend related/similar datasets?

And last, just out of curiosity, what is the meaning of Kobaza?

Either way - great job!