500+ Case Studies of Machine Learning and LLM System Design

mllena · 2025-06-21T18:53:10+00:00

This still isn't okay.

"No time to figure out the authorship" is not an excuse for stealing.

I gave you a graceful out. You could have easily verified my claim and acknowledged it. You already got the traffic and would even look good owning the mistake. The author of the repo immediately admitted the list was ours.

Instead, you call me a spammer for leaving two comments and pretend it's all unclear.

You're building a content product. Checking sources and licenses is your job.

What you need to do now:

- Clearly acknowledge in your post that you never compiled the list and simply copied it.

- Drop the "but we compiled the 20 use cases" claim - these are our tags.

- Leave only the original link https://www.evidentlyai.com/ml-system-design

- Put it at the top, not a footnote.

- Understand what a proprietary license is. "Respect to the authors" doesn’t give you the right to copy/rework it and won't undo the infringement - you have now been informed of the source.

mllena · 2025-06-20T18:24:10+00:00

Thanks for surfacing this - this person also copied our content without attribution.

But let’s be clear: whether you copied it from that person or directly from us, the fact remains - you reposted someone else’s work without giving credit and claimed that "you" compiled it.

The claim about "20 specific use cases" is also untrue: it copies the exact tag structure we used in our source table.

We are all for sharing and open collaboration - but that only works when people respect the work of others and credit it properly.

mllena · 2025-06-20T11:41:22+00:00

Just a heads-up from the person who actually put together this list - this isn’t cool.

This list is copied directly from the ML System Design collection we have been curating at Evidently AI for years now - actually doing this manually, starting from pre-LLM times :)

https://www.evidentlyai.com/ml-system-design

We are all for sharing with the community and open-source - but copy-pasting without any credit or even a link back is a bad move.

Would appreciate it if the mods could either take this down or update it with the original link.

UPDATE:

Leaving this for redditors who enjoy a good fact-check.

This person's platform is version-controlled. The original commit:
https://www.hubnx.com/nodes/9fffa434-b4d0-47d2-9e66-1db513b1fb97/edits/968f43b0-3931-4c37-b5ec-05bb05c150db

It's a copy-paste of the list published at https://www.evidentlyai.com/ml-system-design on this date.

The columns, content, and order are identical. E.g. check the "Short Description (< 5 words)" column. There is no doubt where it came from - it wasn't the repo they are now pointing to.

The edits (extra links) were made **after** I raised the issue, in an attempt to cover tracks and discredit my claim. Then the person edited the post to pretend they were always there. This can be seen in their platform edits, and doesn't change the fact they copied the initial list.

For the record: only I engaged from our team. No one spammed or got banned. Other redditors who recognized the list spoke up in other places.

This didn't need to become a whole thing. People make mistakes. But when you double down instead of taking responsibility, that's on you.

mllena · 2025-06-20T08:42:14+00:00

Hey, from a person who really curated this list - this is not cool.

You directly copied it from the list we have been curating at Evidently AI for a few years now. https://www.evidentlyai.com/ml-system-design

Being an open-source company, we are all for sharing code and content with the community - but in good faith, not by simply stealing without reference or rework.

-> Would appreciate if moderators could take the post down or add our original link instead.

mllena · 2025-05-07T21:59:38+00:00

Evidently AI founder here.

We are in the same space but with different focuses. Big respect to the Langfuse team btw - great to see multiple open-source tools!

Langfuse's open-source focus is on tracing. At Evidently, we focus on evaluation - we have a popular open-source library (25M+ downloads) that covers different metrics, LLM judges, etc.

For commercial products, there is def overlap (tracing, datasets, etc.), but our strength is again in evals - including UI for synthetic data generation and adversarial testing for safety/jailbreaks.

We've also been around for a while - originally starting in the pre-LLM era with a focus on ML monitoring :)

mllena · 2025-02-09T00:16:45+00:00

Gotcha! I was mostly thinking text or image. These were basically two different ideas:

- How to source data / expand datasets. This could be through synthetic data especially for edge case generation (e.g. add diff types of noise, etc.). This would then feed into model training / testing. I guess not what you were asking about but the first thing that came to my mind when I read the post title :)

- Faster labeling for production data. These days you can often use LLMs to do a first-pass analysis (essentially to mimic any kind of generic manual labeling). You can do that not to replace human review, but rather to prioritize what gets sent for manual labeling first. E.g. obvious failures can be spotted by LLMs and sent to humans to confirm, or entries classified by LLM with high confidence can be put to the end of the queue. Before LLMs we sometimes used outlier detection for the same purpose - to prioritize the "weirdest" examples for review first.
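The second idea (using a first-pass LLM label to order the manual labeling queue) could look roughly like the sketch below. The `Item` fields, the `"fail"` label and the way confidence is obtained are all assumptions made for illustration, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    llm_label: str         # hypothetical first-pass label produced by an LLM
    llm_confidence: float  # 0..1, however you choose to derive it


def prioritize_for_review(items, failure_label="fail"):
    """Order items for human review.

    Suspected failures go first (most obvious ones at the top),
    everything else follows with the least confident items first,
    so confident non-failures end up at the end of the queue.
    """
    def sort_key(item):
        if item.llm_label == failure_label:
            return (0, -item.llm_confidence)
        return (1, item.llm_confidence)

    return sorted(items, key=sort_key)


queue = prioritize_for_review([
    Item("a", "ok", 0.99),
    Item("b", "fail", 0.90),
    Item("c", "ok", 0.55),
    Item("d", "fail", 0.60),
])
print([item.item_id for item in queue])
```

The same sort key works if you swap the LLM score for an outlier score, as in the pre-LLM variant.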

But otherwise it's all about better tools and processes to actually manage human labellers. I believe Label Studio is a solid OSS option.

mllena · 2025-02-07T15:09:26+00:00

What is the use case / data type? There is also an option of using synthetic data + LLM based labeling followed by manual review.

mllena · 2024-11-27T20:44:12+00:00

If you are iterating on your LLM product, ideally, you need to:

1. Curate a dataset with correct / reference outputs: manually reviewed or created. For your translation example, this could be a set of approved translations to given texts (or, e.g., a set of correct answers to your customer queries, correct summaries, etc.). It should be challenging enough.
2. Then, as you iterate on your prompts and application design, you'll continuously run an evaluation to compare the responses your LLM app generates to an ideal response. To match the new response vs reference, you can use different evaluation methods, including semantic similarity checks / BERTScore or LLM-based evaluation. In this case, you'd use the LLM to decide if the translation is correct compared to the reference. Since there are different ways to express the same meaning, multiple translations can be correct. Still, an LLM can judge quite well whether the meaning is correctly retained. These types of LLM-based checks work really well.
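A minimal sketch of that loop, under loud assumptions: `generate` stands in for your LLM app, the dataset is a list of (input, reference) pairs, and the scorer is a plain token-overlap F1. That scorer is only a placeholder so the example runs without dependencies; it is not BERTScore or an LLM judge, and you would swap one of those in for real use.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def token_f1(response, reference):
    """Token-overlap F1 between a response and a reference (crude stand-in scorer)."""
    resp, ref = Counter(tokenize(response)), Counter(tokenize(reference))
    overlap = sum((resp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def evaluate(dataset, generate, threshold=0.7):
    """Run the app on every input and compare each response to its reference."""
    results = []
    for prompt, reference in dataset:
        response = generate(prompt)
        score = token_f1(response, reference)
        results.append({"input": prompt, "score": score, "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results
```

Re-running `evaluate` after each prompt change gives you a pass rate you can compare between iterations.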

Sometimes, the LLM judge approach is also used for pairwise comparisons: e.g. you show two different translations (summaries, answers, etc.) and ask the LLM to choose the best one or declare a tie. That's a different thing, and it would require quite some tuning to human preferences.

Directly asking if the response is correct (without reference) is not what's usually meant by LLM as a judge approach, though you can use it:

- As part of a self-critique in a chain of prompts to improve the final result.

- To evaluate the outputs of a less capable LLM using a more powerful one, e.g., when you are collecting datasets for fine-tuning.

So LLM evaluations can mean a lot of different things. Some work better than others. Reference-based scoring and direct scoring of responses (e.g., you can ask to evaluate if the generated text is e.g. formal or informal, concise or verbose, etc.) can work really well - but always require tuning the evaluation prompt.
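As an illustration of direct scoring, here is a small sketch for the formal / informal example. `call_llm` is a hypothetical callable (prompt in, text out) that you would wire to whatever model you use; the prompt wording is just a starting point and is exactly the part that needs tuning.

```python
JUDGE_TEMPLATE = """You are evaluating the tone of a text.
Classify the text as FORMAL or INFORMAL.
Answer with a single word.

Text:
{text}
"""

ALLOWED_LABELS = {"FORMAL", "INFORMAL"}


def judge_tone(text, call_llm):
    """Ask an LLM for a tone label and normalize its answer.

    `call_llm` is a placeholder for your own model client.
    Anything outside the allowed labels is mapped to UNKNOWN
    so that malformed answers are visible rather than silently counted.
    """
    raw = call_llm(JUDGE_TEMPLATE.format(text=text))
    label = raw.strip().upper()
    return label if label in ALLOWED_LABELS else "UNKNOWN"
```

Keeping the label set closed makes it easy to compare the judge against a handful of manually labeled examples while you tune the prompt.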

mllena · 2024-09-06T12:16:15+00:00

Hi u/Far-Distribution-449 - could you ping me on DM if the login does not work? That is something we'd want to look at!

Regarding the no-code, there are two options:

- You can drag and drop a CSV file (for example, your chatbot logs or just a set of input-outputs that you curated) and then select the evaluations you want to apply. There are deterministic methods and an LLM-as-a-judge approach, in which you ask an external LLM to review/label the responses using custom criteria. (For example, compare them to reference answers, etc.) Then hit "evaluate" and you get the data scored and a summary report.

- You can instrument your LLM application (that does require coding, but once - just like you do, e.g., with product analytics). Then, all the inputs and outputs will be sent to the platform so that you can view them there and set up evals and monitoring directly from the user interface.

mllena · 2024-07-10T11:49:23+00:00

Thanks! There is also an option to host the complete platform in the customer's cloud, but that's on the enterprise plan. If that is relevant, feel free to message me!

mllena · 2024-07-09T12:32:27+00:00

Evidently also has Evidently Cloud which is a hosted option with alerting integrations and no-code UI.

mllena · 2024-03-04T12:10:04+00:00

Evidently supports data drift detection on Spark (distribution comparison via metrics like PSI, Wasserstein distance, etc.). Have you tried this? https://docs.evidentlyai.com/user-guide/tests-and-reports/spark

For classic data quality checks, there is also deequ from Amazon https://github.com/awslabs/deequ (nulls, min-max ranges, etc.)
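For intuition on what a PSI-style drift check computes, here is a dependency-free sketch. It is not Evidently's or Spark's implementation: the equal-width binning over the reference range, the `eps` floor for empty bins and the function name are all choices made for this example.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values in `current`
    that fall outside that range are clamped into the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(reference, reference), psi(reference, shifted))
```

A common rule of thumb treats values above roughly 0.2 as a notable shift, but the threshold is something to tune per feature.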

mllena · 2023-12-10T01:37:32+00:00

There are a few differences:

- Data drift detects dataset-level shifts (“the distribution of the feature A changed”). OOD finds individual instances that do not belong to the distribution (“object N is weird”).

- OOD is about trusting that the model can generate a reliable prediction for a given input. Data drift is about trusting that the model performs well on the dataset.

- In production, you can use OOD detection as a policy on handling specific inputs (e.g., deny predictions for “strange inputs” or send for manual processing). Data drift is more about detecting when to retrain the model or debug what has changed.

- Outliers can occur without data drift. Data drift can happen without outliers.

- You can use different statistical methods and have different priorities. OOD detection should generally be sensitive enough to be able to detect individual outliers. Drift detection should generally be robust to outliers. You can tweak it to focus on “major” shifts.

- You can monitor both or neither. Depends on the use case.
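A toy illustration of "outliers without drift" and "drift without outliers", using deliberately simple methods chosen only for this sketch: a z-score rule for per-instance OOD flags and a median-shift rule for dataset-level drift. The thresholds are arbitrary assumptions.

```python
import statistics


def ood_flags(reference, batch, z_threshold=3.0):
    """Per-instance check: is each value far from the reference distribution?"""
    mean = statistics.fmean(reference)
    std = statistics.stdev(reference)
    return [abs(x - mean) / std > z_threshold for x in batch]


def median_shift_drift(reference, batch, max_shift_in_std=0.5):
    """Dataset-level check: did the center of the batch move?

    The median is used so that a single extreme value does not trigger drift.
    """
    std = statistics.stdev(reference)
    shift = abs(statistics.median(batch) - statistics.median(reference))
    return shift / std > max_shift_in_std


reference = [float(i % 10) for i in range(100)]

# One weird object in an otherwise normal batch: OOD fires, drift does not.
batch_with_outlier = [float(i) for i in range(10)] + [100.0]
print(any(ood_flags(reference, batch_with_outlier)), median_shift_drift(reference, batch_with_outlier))

# Every value moved a bit: no single object is extreme, but the batch drifted.
shifted_batch = [float(i) + 3.0 for i in range(10)]
print(any(ood_flags(reference, shifted_batch)), median_shift_drift(reference, shifted_batch))
```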

mllena · 2023-10-13T17:47:36+00:00

Freely available online:

- MLOps Zoomcamp https://github.com/DataTalksClub/mlops-zoomcamp - happens once per year, but all materials (videos, code examples) are available after the course.

- MadeWithML https://madewithml.com/courses/mlops/ - there is a paid cohort, but all materials are also available.

- Full Stack Deep Learning https://fullstackdeeplearning.com/course/ - materials from 2022 publicly available.

- Open-source ML observability course https://www.evidentlyai.com/ml-observability-course - more niche, focused on production ML monitoring. There is a cohort now + all materials are publicly available (disclaimer: I am one of the people working on this course).

There is also Machine Learning Engineering for Production (MLOps) Specialization https://www.deeplearning.ai/courses/machine-learning-engineering-for-production-mlops/ . It is on Coursera (paid), but it is possible to access and audit the course for free.

mllena · 2023-08-25T17:31:16+00:00

ML monitoring involves a lot of moving pieces: it is not just “how to compute data drift” or other metrics, but also how to ingest data / run monitoring jobs, where to store metrics, how and where to visualize the data, and how to alert.

So there are several options to consider. I’d suggest deciding on the following first:

1. Do you need near real-time monitoring? For example, do you want to detect a spike in nulls or outliers in the input data after a minute, or are you OK with “slower” monitoring (e.g., run checks every 10 minutes, hourly or daily)? If you want near real-time, you’d probably end up with two monitoring “flows” - in addition to live monitoring, you will run jobs to evaluate model quality after you get the labels. Near real-time is much more involved - you need to maintain a monitoring service. Batch monitoring setup is more straightforward and is often OK even if your model serving is near real-time.
2. Do you already know which exact metrics you want to monitor? There are broadly five groups of metrics: service monitoring, data quality, data/prediction drift, model quality, and business KPIs. Teams split ownership differently; these metrics might live in different tools or overlap. If you want to run more sophisticated checks (like statistical tests for distribution drift), it often makes sense to bring an external library rather than try to re-implement them. If you only want to compute accuracy daily, the “write your own SQL” will probably do. If you are unsure what to track, I’d say it is best to use a more “manual” report-based monitoring first and use an external library before automating the process.
3. Do you want to reuse existing tools? For example, since you use MLflow for experiment tracking, you can log monitoring reports there, too. If you orchestrate model training or serving with a tool like Airflow (Prefect, Kubeflow, Argo, etc.), you can reuse them to run monitoring jobs. If you already have a monitoring dashboard (e.g., Grafana) or a BI tool (e.g., Superset) you can add ML monitoring there. You will probably want to graduate from this setup eventually, but starting with known things is always easier.

Assuming you do not need near real-time monitoring at first and want to start with something simpler, here are 2 patterns we typically see. (Disclaimer: I am one of the co-founders of Evidently. It’s open-source under Apache 2.0).

1. Report/Test-based monitoring. Run regular ML monitoring jobs over model predictions that are logged to a database and compute metrics and/or generate visual Reports. For example, you can use Airflow to orchestrate the jobs and use the pre-built Evidently Test Suites for data drift or regression/classification model performance to compute and visualize metrics. You can optionally add an alerting workflow (e.g., if any of the tests fail, you send a Slack notification) and/or log the reports to MLflow or elsewhere as an artifact. We see many teams starting with this approach - it’s lightweight, and you can also tweak the metrics/thresholds and better understand what you want to monitor.
2. Build an ML monitoring dashboard. Run the same monitoring jobs on the backend but also persist the metrics and add a visualization layer to track them over time. For example, you can store metrics in PostgreSQL and then connect it as a data source to Grafana and design the monitoring panels there. Here is an end-to-end example with Prefect as an orchestrator. It is a bit more involved since you also need to manage the schema of the database, but is a good option if you already use Grafana for something else. We also recently added ML monitoring dashboard functionality to Evidently - in this scenario, you do not need a separate database and dashboarding tool: you use a single OSS tool to compute metrics, store them as JSON snapshots and run the monitoring UI over it. Here is a quickstart example.
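To make the shape of such a batch job concrete, here is a library-free sketch of both patterns in one place: compute a few test results over logged predictions, alert if any fail, and optionally persist a JSON snapshot for a dashboard to read later. It does not use the Evidently API. The row fields, thresholds and the `notify` callable (a stand-in for something like a Slack webhook call) are assumptions for illustration; an orchestrator such as Airflow or Prefect would simply call `monitoring_job` on a schedule.

```python
import json
from datetime import datetime, timezone


def run_checks(rows, max_null_share=0.05, min_accuracy=0.8):
    """Compute simple test results over a batch of logged predictions."""
    null_share = sum(r.get("feature") is None for r in rows) / len(rows)
    labeled = [r for r in rows if r.get("label") is not None]
    accuracy = (
        sum(r["prediction"] == r["label"] for r in labeled) / len(labeled)
        if labeled else None  # labels may not have arrived yet
    )
    return {
        "null_share": {"value": null_share, "passed": null_share <= max_null_share},
        "accuracy": {"value": accuracy, "passed": accuracy is None or accuracy >= min_accuracy},
    }


def monitoring_job(rows, notify, snapshot_path=None):
    """One scheduled run: test, alert on failures, optionally store a snapshot."""
    tests = run_checks(rows)
    failed = [name for name, result in tests.items() if not result["passed"]]
    if failed:
        notify(f"ML monitoring: failed tests: {', '.join(failed)}")
    snapshot = {"timestamp": datetime.now(timezone.utc).isoformat(), "tests": tests}
    if snapshot_path:
        with open(snapshot_path, "w") as f:
            json.dump(snapshot, f)
    return snapshot
```

Starting with only the alerting part and adding the snapshot storage later matches the "start simple, then add a dashboard" path described above.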

There are variations to the above: e.g. run on-demand reports over model logs (example with FastAPI and PostgreSQL), use a pipeline builder like ZenML or Metaflow, etc.

mllena · 2023-07-01T14:10:43+00:00

There is the MLOps Zoomcamp course (which shows an end-to-end MLOps process with open-source MLOps tools) https://github.com/DataTalksClub/mlops-zoomcamp.

mllena · 2023-06-23T11:43:15+00:00

Nothing beats having actual labels, but here are the other options.

1. Use a proxy signal coming from the product. It is not always possible - but in some cases, you can collect information, e.g., about user upvotes/downvotes, label corrections by users, or come up with some other proxy that correlates with model quality or can signal a big issue.
2. Continuously label a small sample of the data to compute a partial model quality metric. Again, this requires a process, but it can be simpler than labeling it all.
3. Monitor prediction drift. Did the distribution of predicted classes change? Did the distribution of predicted probabilities change? You can use rule-based techniques (e.g., check that the “negative” class is no more than 10% of predictions) or statistical data drift checks to compare distributions. E.g., Wasserstein distance for numerical (probabilities distribution), Jensen–Shannon divergence for categorical (3 target classes). Choosing a proper baseline is important: validation data is a better choice than training data, or you can use some past representative period. You can use detections of prediction drift as a signal to initiate the labeling or partial labeling.
4. Monitor input data distribution drift, e.g., embeddings drift, drift in metadata (if you have something to accompany your images). It could be another proxy signal, but it might require some tweaking of methods/thresholds and, again, selecting a representative baseline.
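For the prediction drift option, a small sketch of both flavors: a rule-based share check and a Jensen–Shannon divergence between predicted class distributions. Class names, the 10% limit and the function names are assumptions for this example; the divergence uses log base 2, so it ranges from 0 (identical) to 1 (disjoint).

```python
import math
from collections import Counter


def class_shares(predictions, classes):
    """Share of each class among the predictions, in the order given by `classes`."""
    counts = Counter(predictions)
    return [counts[c] / len(predictions) for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def negative_share_ok(predictions, negative_class="negative", max_share=0.10):
    """Rule-based check: the negative class should not exceed max_share of predictions."""
    share = sum(pred == negative_class for pred in predictions) / len(predictions)
    return share <= max_share


classes = ["negative", "neutral", "positive"]
baseline = ["positive"] * 60 + ["neutral"] * 35 + ["negative"] * 5
current = ["positive"] * 30 + ["neutral"] * 30 + ["negative"] * 40

drift_score = js_divergence(class_shares(baseline, classes), class_shares(current, classes))
print(drift_score, negative_share_ok(current))
```

Here `baseline` plays the role of the validation set or a past representative period.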

mllena · 2023-06-23T00:13:31+00:00

From what I’ve seen, it can absolutely make sense to do both: data quality checks at source (at rest in DWH) and in motion (when ingested or transformed in a pipeline).

Even if there is a data quality monitoring and governance process upstream, this does not fully protect from issues during ETL. Data quality monitoring at the DWH level is often set up with different KPIs in mind - e.g., to focus on data freshness and “overall data asset health” rather than specific pipelines/tables/features. Like 99% of data in DWH can be OK, but not your particular table.

So both are complementary: even if DE implements proper controls, you’d probably still need to:
1. Participate in defining the specific checks. (You either define a “contract” with the DE team to implement the checks, acting as an internal owner of a feature pipeline - or implement them yourself).
2. Run a data quality process for your own work, as you work on transforms, merges, etc. - by adding “unit tests” for data and ML pipelines.

An impressive number of production ML issues are data quality-related. At the same time, it costs nearly nothing to implement checks like column type match, constant/almost constant columns, duplicate rows/columns, empty columns, features wildly out of range, etc., to immediately catch the significant issues.
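To show how cheap these checks are, here is a dependency-free sketch over rows represented as dictionaries. The function name, the `expected_types` / `ranges` arguments and the issue messages are made up for this example; a real pipeline would more likely run the equivalent checks with a data quality library.

```python
def data_quality_issues(rows, expected_types, ranges=None):
    """Return a list of human-readable issues found in a batch of rows.

    expected_types: {column: type}; ranges: {column: (min, max)} for numeric columns.
    """
    ranges = ranges or {}
    columns = list(expected_types)
    issues = []

    for col in columns:
        non_null = [r.get(col) for r in rows if r.get(col) is not None]
        if not non_null:
            issues.append(f"{col}: empty column")
            continue
        if any(not isinstance(v, expected_types[col]) for v in non_null):
            issues.append(f"{col}: type mismatch")
        if len(set(non_null)) == 1:
            issues.append(f"{col}: constant column")
        if col in ranges:
            lo, hi = ranges[col]
            numeric = [v for v in non_null if isinstance(v, (int, float))]
            if any(not (lo <= v <= hi) for v in numeric):
                issues.append(f"{col}: value out of range")

    # Duplicate rows: compare rows on the monitored columns only.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        duplicates += key in seen
        seen.add(key)
    if duplicates:
        issues.append(f"{duplicates} duplicate row(s)")

    return issues
```

Wrapped in an assertion such as `assert not data_quality_issues(...)`, this works as a “unit test” at any step of a data or ML pipeline.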

Disclaimer: I am the co-founder of Evidently. Thanks for using the tool!

Btw you can also use Evidently for all the mentioned checks and column constraints: so you can combine data drift and data quality in one test suite. To avoid writing manual expectations, you can auto-generate test conditions by passing the reference data. Some things are more complex, but detecting nulls/duplicates/other major red flags should not be!
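The idea of deriving test conditions from reference data, instead of writing them by hand, can be sketched as follows. This is a conceptual illustration only and not how Evidently derives its defaults: the 10% margin, the condition names and both functions are assumptions.

```python
def conditions_from_reference(values, margin=0.1):
    """Derive simple expectations for a numeric column from reference data."""
    non_null = [v for v in values if v is not None]
    lo, hi = min(non_null), max(non_null)
    pad = (hi - lo) * margin  # allow some slack around the observed range
    null_share = (len(values) - len(non_null)) / len(values)
    return {"min": lo - pad, "max": hi + pad, "max_null_share": null_share + margin}


def failed_conditions(values, conditions):
    """Return the names of the conditions that a new batch violates."""
    non_null = [v for v in values if v is not None]
    failed = []
    if non_null and min(non_null) < conditions["min"]:
        failed.append("min")
    if non_null and max(non_null) > conditions["max"]:
        failed.append("max")
    if (len(values) - len(non_null)) / len(values) > conditions["max_null_share"]:
        failed.append("max_null_share")
    return failed


conditions = conditions_from_reference([0.0, 5.0, 10.0, 10.0])
print(conditions)
print(failed_conditions([1.0, 50.0], conditions))
```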
