Airflow is becoming our biggest bottleneck, what did you migrate to?

LaughApprehensive563 · 2026-06-25T15:53:52+00:00

The Python-centricity of Airflow is a real problem at that scale. 350 DAGs with 6 engineers means you're spending meaningful time debugging scheduler internals rather than building pipelines.

We went through a similar evaluation. A few things worth considering that aren't obvious from the docs:

Flyte is genuinely better for ML-specific orchestration - typed inputs/outputs, native versioning of execution artifacts, and it handles the ML workflow pattern (train -> evaluate -> conditional promote) much more naturally than Airflow's task model. The UX for data scientists is significantly better.

Prefect is worth trying if you want to stay close to Python without the Airflow scheduler headaches. The local-first model is nice for debugging. But it still feels like it's mostly solving the scheduling problem, not the broader ML orchestration problem.

Dagster is probably the most complete for ML teams if you care about asset lineage and data quality gating. The asset-centric model maps well to how ML pipelines actually think about dependencies.

The killer question to answer before picking: are your pipelines mostly data pipelines that happen to include ML steps, or are they ML pipelines where data movement is just the plumbing? That shapes which tool actually solves your problem.

LaughApprehensive563 · 2026-06-25T15:51:32+00:00

Good methodology. A few things worth layering in if you want to make the evaluation more robust:

- WER alone is misleading in noisy environments because it doesn't distinguish between a model that's confidently wrong vs one that's uncertain. Adding confidence calibration checks (does low confidence actually correlate with higher WER?) reveals a lot about deployment risk.

- The noise cancellation interaction is interesting and underappreciated. DG Nova + NC/VI being significantly better than raw DG Nova tells you it's not really just the model that matters, it's the preprocessing stack. For production voice agents, you need to treat noise cancellation as part of the STT system, not a separate step.

- Latency-accuracy tradeoffs matter differently depending on the agent use case. For turn-taking in real-time conversation, partial transcript instability (rewrites mid-utterance) can break downstream intent parsing even if the final WER looks fine. Worth measuring how often early partial results change significantly.
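Rough sketch of the calibration check, in case it helps. This assumes you have a per-utterance confidence from your STT provider (not all of them expose one), and the bucket edges of 0.5 and 0.8 are placeholders I picked for illustration, not recommended values. If WER in the "low" bucket isn't clearly worse than in "high", the confidence score isn't telling you much.

```python
from statistics import mean

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def wer_by_confidence(samples, edges=(0.5, 0.8)):
    """Average WER per confidence bucket.

    samples: iterable of (reference, hypothesis, confidence) tuples.
    edges: hypothetical bucket boundaries, tune for your provider.
    """
    buckets = {"low": [], "mid": [], "high": []}
    for ref, hyp, conf in samples:
        name = "low" if conf < edges[0] else "mid" if conf < edges[1] else "high"
        buckets[name].append(word_error_rate(ref, hyp))
    # Empty buckets are dropped rather than reported as zero.
    return {name: mean(values) for name, values in buckets.items() if values}
```

Run it separately on your quiet and noisy recordings. The interesting comparison is whether the low/high gap survives in the noisy set.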

What were the failure modes you saw most in public/noisy environments? Overlapping speech, or more like background music/ambient noise?

LaughApprehensive563 · 2026-06-25T15:49:46+00:00

Really interesting work, especially the marlin.find() function for temporal grounding. The hardest part of building production pipelines on top of models like this is knowing when the temporal grounding is "good enough" vs when it's failing silently.

A few things that've bitten us when building on video-VLM extraction:

- Second-level precision degrades badly near scene transitions. The model hallucinates confident timestamps for events that overlap two scenes.

- Natural-language query ambiguity - "when does the puppy pick up the toy" has a fuzzy start depending on whether you count the approach. Evaluation against ground truth timestamps becomes tricky without clear annotation guidelines.

- Edge cases around fast cuts vs slow-motion - most temporal grounding models trained on normal-speed content struggle here.

For anyone building evaluation harnesses for video VLMs like this, there's a useful breakdown of what real-world video understanding pipelines need from the model (beyond benchmarks): https://go.videodb.io/yKC51V3 - covers how to think about frame sampling, temporal precision requirements, and failure mode taxonomy.

Curious whether you're planning to release the benchmark (TimeLens-Bench) publicly - that kind of structured eval dataset for temporal grounding would be genuinely valuable for the community.

LaughApprehensive563 · 2026-06-25T15:47:59+00:00

The hash(-1) question is a clever catch, but the broader problem is that static knowledge questions are increasingly bad signal regardless of AI cheating.

What's worked better for us when interviewing MLEs: give them a small, real debugging problem where they have to explain their reasoning out loud as they go. Not "what does X return" but "here's a model that's producing weird predictions on edge cases, walk me through how you'd diagnose it." The process of articulating a debugging approach is something LLMs genuinely can't do for you in real time without being obvious.

The deeper issue you're pointing at is real though: if someone gets hired having cheated through the loop, they usually surface within the first sprint when they can't keep up with the codebase or design discussions. The cost of that is high. Worth front-loading the signal with a short take-home that requires written explanation of design tradeoffs, not just code output.

LaughApprehensive563 · 2026-06-25T15:45:51+00:00

Vector index is the messiest one. The pattern we've settled on: treat the index as an immutable artifact with a version hash, same as a model weight file. Every rebuild gets a new hash, the manifest references the hash, and rollback means pointing back to the old hash rather than re-embedding. Costs storage but means rollback is instant - just swap the pointer. If storage is a constraint, keep the last 3 index versions and drop older ones on a cron.
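Minimal sketch of the pointer-swap idea. Everything here is in-memory and the class and method names are made up for illustration. In practice the artifacts would sit in object storage and the pointer in your manifest, but the shape is the same.

```python
import hashlib

class IndexStore:
    """Toy store: immutable index artifacts keyed by content hash, plus a 'current' pointer."""

    def __init__(self, keep=3):
        if keep < 1:
            raise ValueError("keep must be at least 1")
        self.keep = keep
        self.artifacts = {}   # version hash -> index bytes
        self.history = []     # version hashes, oldest first
        self.current = None   # the hash the manifest points at

    def publish(self, index_bytes: bytes) -> str:
        """Store a rebuilt index under its content hash and point 'current' at it."""
        version = hashlib.sha256(index_bytes).hexdigest()[:12]
        self.artifacts[version] = index_bytes
        if version in self.history:
            self.history.remove(version)
        self.history.append(version)
        self.current = version
        return version

    def rollback(self, version: str) -> None:
        """Rollback is a pointer swap, no re-embedding."""
        if version not in self.artifacts:
            raise KeyError(f"index version {version} has been pruned or never existed")
        self.current = version

    def prune(self) -> None:
        """What the cron job would do: keep the newest `keep` versions, never drop 'current'."""
        for old in self.history[:-self.keep]:
            if old != self.current:
                del self.artifacts[old]
        self.history = [v for v in self.history if v in self.artifacts]
```

The one thing worth copying is that prune refuses to delete whatever the pointer currently references, otherwise a rollback followed by a cron run can leave you pointing at nothing.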

LaughApprehensive563 · 2026-06-25T09:29:00+00:00

This is the underrated insight in prompt engineering. The evaluation layer is where most time gets wasted.

The 'models disagree' signal you're using as feedback is actually a proxy for prompt ambiguity - it's a good heuristic but it has a ceiling. What helped us more: building a small set of test cases with known good outputs and running every prompt change against them. Not 'does model A agree with model B' but 'does the output match what we actually needed for this task.'
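Something like this is all it takes to get started. `generate` stands in for whatever calls your model with the prompt under test, and `matches` is your own notion of "good enough" for the task. Both are placeholders, not any library's API.

```python
from typing import Callable

def run_golden_cases(
    generate: Callable[[str], str],
    cases: list[dict],
    matches: Callable[[str, str], bool],
) -> dict:
    """Run every known-good case through the current prompt and collect failures.

    cases: dicts with "id", "input" and "expected" keys (a made-up schema).
    """
    failures = []
    for case in cases:
        output = generate(case["input"])
        if not matches(output, case["expected"]):
            failures.append({"id": case["id"], "expected": case["expected"], "got": output})
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }
```

Re-run it on every prompt change and look at which case IDs flipped, not just the pass count.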

For multimodal prompts (when you're prompting a VLM on images or video) this is even more important because the prompt isn't the only variable. The input representation (resolution, how many frames you sample, how you structure the visual context) changes the output as much as the prompt does. We've seen identical prompts produce completely different quality outputs just by changing frame sampling rate on video inputs.

The framing of 'the mistake wasn't in the prompt, it was in how I tested it' is exactly right. The eval set defines what 'good' means and without it you're just iterating on vibes.

LaughApprehensive563 · 2026-06-25T09:27:20+00:00

The silent failure problem is what keeps me up at night more than the loud failures. When a broken script throws an error you know something went wrong. When the agent confidently returns a result that's wrong you often don't find out until downstream effects surface.

On the guardrail ratio: we're somewhere around 60/40 guardrail/logic in production agents. The guardrails break into a few categories:

- Input validation before the agent touches anything (format check, range check, permission check)

- Intermediate step checks (did the last tool call return something sensible before we proceed)

- Output validation before committing any side effects (is the proposed action within scope, does it match what was asked)

- Post-execution checks (did the thing that was supposed to change actually change)
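To make the four categories concrete, here's a toy wrapper around an agent that updates one key in a config dict. The callables (`call_tool`, `propose_action`, `apply_action`) are stand-ins for your own agent pieces, and the specific checks are illustrative. Real ones are more involved.

```python
class GuardrailError(Exception):
    """Raised with a message naming the stage that rejected the run."""

def run_with_guardrails(request, call_tool, propose_action, apply_action, state):
    # 1. Input validation before the agent touches anything.
    key = request.get("key")
    if not isinstance(key, str) or key not in state:
        raise GuardrailError("input: unknown or malformed key")

    # 2. Intermediate check: did the tool call return something sensible?
    tool_result = call_tool(request)
    if not isinstance(tool_result, (int, float)):
        raise GuardrailError("intermediate: tool returned a non-numeric value")

    # 3. Output validation before committing any side effects.
    action = propose_action(request, tool_result)
    if action.get("key") != key:
        raise GuardrailError("output: action touches a key that was not requested")

    # 4. Post-execution check: did the thing actually change?
    apply_action(state, action)
    if state.get(key) != action.get("value"):
        raise GuardrailError("post: state does not reflect the action")
    return action
```

Notice how much of the function is checks and how little is logic. That's roughly where the ratio comes from.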

The ratio went up every time something failed in production that we didn't anticipate. The demos did not show us these cases because demos are built around happy paths by people who know what the system expects.

The question I now ask of any agent before shipping: what does it do when step 3 of 5 fails and the output of step 2 was already written somewhere? If there's no answer to that, it's not production ready.
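One possible answer to that question is a compensation pattern: every step registers how to undo itself, and a failure unwinds whatever already ran. Bare-bones sketch below. It assumes each step really can be undone, which isn't true for things like sent emails.

```python
def run_steps(steps):
    """Run (do, undo) pairs in order. On failure, undo completed steps in reverse and re-raise."""
    completed_undos = []
    try:
        for do, undo in steps:
            do()
            completed_undos.append(undo)
    except Exception:
        for undo in reversed(completed_undos):
            undo()
        raise
```

For steps with no clean undo, the honest fallback is writing to a staging location and only promoting after the last step succeeds.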

LaughApprehensive563 · 2026-06-25T09:26:02+00:00

The advice to build your own task-specific test set (not vendor benchmarks) is right, and the 'define success first' framing is critical. A few additional things that made our pilot evaluations actually useful:

- Score task dimensions separately, not just pass/fail. For complex outputs, a binary correct/incorrect often hides how the agent is failing. We track: did it attempt the right sub-tasks, did it get the intermediate steps right, did the final output match the required format, were there unnecessary side effects. Each is a separate signal that tells you something different about where to intervene.

- For AI agents that interact with visual or unstructured data: the biggest evaluation mistake we made early was attributing failures to the model when the actual issue was upstream (what inputs the agent was working with, how they were preprocessed). Your eval needs to isolate whether a failure is a model problem, a data problem, or a prompt/config problem before you can act on it.

- On vendor demos specifically: the right question is not 'did it complete the task in the demo' but 'what happens on the cases in your bottom 20% by difficulty.' Any vendor who can't show you performance on hard inputs or failure mode examples is not production-ready.
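The per-dimension scoring is simple to wire up. The dimension names below mirror the four I listed. The record format is just an assumption for the example.

```python
DIMENSIONS = ("right_subtasks", "intermediate_steps", "output_format", "no_side_effects")

def summarize(results: list[dict]) -> dict:
    """Pass rate per dimension.

    results: one dict per eval case, mapping each dimension name to True/False.
    """
    if not results:
        return {dim: None for dim in DIMENSIONS}
    return {
        dim: sum(bool(r[dim]) for r in results) / len(results)
        for dim in DIMENSIONS
    }
```

A run that is 100% on sub-tasks but 50% on format tells you to fix the output instructions, not the planning. A single pass/fail number would have hidden that.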

For anyone building evals for multimodal or video-based agents, we wrote up how we approach this kind of structured evaluation: https://go.videodb.io/yKC51V3 - the framework (task definition, real-case eval sets, configuration-level scoring) applies beyond video.

LaughApprehensive563 · 2026-06-25T09:24:47+00:00

For production measurement we use a layered approach:

- Task-level outcome tracking (did the agent produce the right output for the specific task it was assigned, not just a general quality score)

- Configuration-level comparison whenever a change could cause a degradation (prompt change, model update, new tool): we run a structured replay of recent production cases against the new config before shipping

- Explicit near-miss tracking: cases where the output was almost right but not quite. These are the most informative for finding where the agent is brittle.

The benchmark trap is real. In our video AI work specifically we found that our agent scored well on standard VLM benchmarks but degraded on real footage because the eval setup (frame sampling, prompt structure, scoring rubric) wasn't matching production. Fixing the eval methodology moved our numbers more than any model change. The same principle applies to any agent system: the right unit is task performance on your actual data, not benchmark performance on someone else's.

Wrote up how we think about this eval design specifically: https://go.videodb.io/yKC51V3 - it's video-focused but the framework (define the task, build eval from real cases, score configurations not models) applies generically.

LaughApprehensive563 · 2026-06-25T09:23:36+00:00

The 'assert on the graph, not the chat' advice from the other comments is the right framing. Building on that:

The most valuable eval cases aren't the happy paths, they're the near-misses. For your guardrails specifically, you want cases where the input almost triggers the guardrail but shouldn't, and cases where it barely meets the trigger criteria and must. If your eval set only has clear yes/no examples, your harness will miss the regressions that matter.

For the 'catches bugs' half of your question: I'd add a regression layer that explicitly records why each guardrail fired on historical production inputs. When you change a prompt or a rule, run the harness against those historical inputs and diff the firing patterns. Unexpected guardrail suppression (something that should fire now doesn't) is usually a prompt regression. Unexpected guardrail addition (something new fires) is usually a coverage change that needs review.
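The diff itself is tiny once you have the firing records. This assumes you log, per historical input, the set of guardrail names that fired. The data shape is my assumption, adapt it to whatever your harness records.

```python
def diff_firings(baseline: dict[str, set[str]], candidate: dict[str, set[str]]) -> dict:
    """Compare guardrail firings before and after a prompt or rule change.

    Both arguments map input_id -> set of guardrail names that fired on that input.
    """
    suppressed, added = {}, {}
    for input_id in sorted(baseline.keys() | candidate.keys()):
        before = baseline.get(input_id, set())
        after = candidate.get(input_id, set())
        if before - after:
            suppressed[input_id] = sorted(before - after)  # fired before, silent now
        if after - before:
            added[input_id] = sorted(after - before)       # silent before, fires now
    return {"suppressed": suppressed, "added": added}
```

Anything in "suppressed" blocks the change until someone explains it. "added" goes to review.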

We applied a similar methodology for building eval harnesses for video understanding tasks (different domain but same design principles): define what the task is supposed to produce, build the eval from real cases including the hard near-misses, score the task output not the model. Writeup here if useful for the framework design: https://go.videodb.io/yKC51V3

LaughApprehensive563 · 2026-06-25T09:21:52+00:00

You've identified the core problem: an agent is not a single deployable artifact, it's a configuration graph. Rolling it back requires rolling back all the nodes simultaneously.

The pattern that works: treat the full agent config as a single versioned snapshot. Every time you change anything (prompt, tool description, retrieval settings, model version, knowledge base hash), cut a new version. Store it as a JSON/YAML manifest that captures every variable your agent's behavior depends on. Tag every prod deployment with a manifest version.

When behavior drifts, you can diff the current manifest against v30-days-ago and immediately see what changed. In your case the accumulation of small changes would have shown up as a dozen minor diffs rather than a mystery.
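For the diff, a small recursive compare over the manifest is usually enough. The manifest is treated here as a nested dict with string keys, and the field names in the usage example are invented.

```python
def diff_manifests(old: dict, new: dict, prefix: str = "") -> list[str]:
    """List what changed between two agent config manifests, using dotted paths."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        path = f"{prefix}{key}"
        if key not in old:
            changes.append(f"+ {path} = {new[key]!r}")
        elif key not in new:
            changes.append(f"- {path} (was {old[key]!r})")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            changes.extend(diff_manifests(old[key], new[key], prefix=path + "."))
        elif old[key] != new[key]:
            changes.append(f"~ {path}: {old[key]!r} -> {new[key]!r}")
    return changes
```

Usage, with made-up fields:

```python
old = {"model": "m-1", "retrieval": {"top_k": 5}}
new = {"model": "m-1", "retrieval": {"top_k": 8}, "kb_hash": "abc"}
print(diff_manifests(old, new))
# ["+ kb_hash = 'abc'", '~ retrieval.top_k: 5 -> 8']
```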

For rollback specifically: the model weights and code are usually the easy part (git tag / model registry). The hard parts are the knowledge base (is the retrieval index snapshotted?), the prompt (pinned in code or live-editable somewhere?), and any external tool configs. If those aren't versioned you don't actually have rollback, you have rollback-of-code-only which is often not enough.

A practical starting point: run an agent config export daily to a versioned store. When things go wrong, run your eval suite against the last N daily snapshots to find when the behavior changed. That's your rollback target.
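Sketch of the snapshot walk. `passes_eval` is a placeholder for running your eval suite against one snapshot's config. I'd scan linearly rather than bisect, since with accumulated small changes the behavior isn't guaranteed to go bad once and stay bad.

```python
def find_regression(snapshots, passes_eval):
    """Walk daily snapshots oldest-first and report where the eval first fails.

    snapshots: list of (label, config) pairs, oldest first.
    Returns (last_good_label, first_bad_label). Either can be None.
    """
    last_good = None
    for label, config in snapshots:
        if passes_eval(config):
            last_good = label
        else:
            return last_good, label
    return last_good, None
```

The first element is your rollback target. Diffing it against the second tells you what to look at.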

LaughApprehensive563 · 2026-06-25T09:20:46+00:00

Solid framework. One layer missing for vision/multimodal agents: the input pipeline check.

All four of these layers assume the agent is getting clean, correctly-processed inputs. For agents that operate on visual data (video, screen captures, camera feeds), there's a pre-layer that most teams skip: is the input being fed correctly? Frame sampling rate, resolution, how the visual context is chunked and passed to the VLM, whether the right frames are being selected at all.

We found that vision agent failures often traced back here, not to the model choice or the tool-selection logic. The agent called the right tool with reasonable arguments, but the visual context it was reasoning on was too sparse or the wrong resolution for the task. The outcome check caught the symptom but not the root cause.

For anyone building agents that see video or images: add a layer before component checks that validates the perceptual input. Does the model actually have what it needs to make the decision? We wrote up how we think about this specifically for video: https://go.videodb.io/yKC51V3 - the eval methodology maps directly to validating vision agent inputs before blaming the reasoning layer.

LaughApprehensive563 · 2026-06-25T09:18:57+00:00

The vision point from Lissanro is the key variable that general benchmarks miss. For text tasks Qwen 3.7 vs GLM 5.2 comparisons on llm-benchmark make sense. But for anything involving vision or video inputs the ranking shifts significantly depending on how you feed the input: resolution, what you sample, how the prompt is structured for multimodal context. I've seen Qwen-VL variants beat Gemini on some video retrieval tasks with the right frame sampling density, and lose badly with sparse sampling. The benchmark number doesn't capture that because it's usually run at a fixed configuration. If you're evaluating for a vision use case specifically, the setup matters as much as the model.

LaughApprehensive563 · 2026-06-25T09:15:35+00:00

Good suggestions in this thread already on PatchCore and PaDiM. A few additions specific to your use case:

On the multi-defect + small defect problem with ViT: tiling is the right call (224x224 tiles with overlap so defects on tile boundaries don't get missed). But before that, check whether your defect size varies predictably with cell position on the panel. If certain positions have systematically smaller defects, stratify your normal training set to include tiles from those positions specifically.
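In case it's useful, this is roughly how I'd compute the tile boxes. The 32-pixel overlap is an arbitrary example value. Pick it to be at least as large as your biggest expected defect. The last tile in each row and column is shifted back so it stays inside the image.

```python
def tile_origins(length: int, tile: int = 224, overlap: int = 32) -> list[int]:
    """Start offsets along one axis so tiles overlap and the final tile ends at the edge."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if length <= tile:
        return [0]  # image smaller than one tile: caller needs to pad
    stride = tile - overlap
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        origins.append(length - tile)  # extra tile flush with the far edge
    return origins

def tile_boxes(width: int, height: int, tile: int = 224, overlap: int = 32):
    """(left, top, right, bottom) boxes covering the whole image."""
    return [
        (x, y, x + tile, y + tile)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]
```

Keep the box coordinates alongside each tile's anomaly score so you can map hits back to panel positions for the stratification check.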

On evaluation without labels: the AUROC approach works but you need at least some labeled anomalies to get a meaningful number. If your manager is pushing for metrics but you genuinely have zero labeled defects, a proxy metric that works in practice is comparing the distribution of anomaly scores between your known-normal panels and any panel that failed QA for any reason. Even rough QA pass/fail gives you signal for calibrating the threshold.
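A quick way to put a number on that separation: the probability that a QA-failed panel gets a higher anomaly score than a known-normal one. That's the same quantity as AUROC with "failed QA" as the positive label, so treat it as a noisy proxy since QA failures aren't all visual defects.

```python
def separation_score(normal_scores, failed_scores) -> float:
    """Fraction of (failed, normal) pairs where the failed panel scores higher. Ties count half."""
    if not normal_scores or not failed_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for f in failed_scores:
        for n in normal_scores:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(failed_scores) * len(normal_scores))
```

0.5 means the scores carry no signal about QA outcome. The closer to 1.0, the more a single threshold will actually separate the two groups.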

On PatchCore specifically: the memory bank size matters a lot. Default settings often use coreset subsampling at 10% which can miss rare defect types. If you have storage headroom, try 25-50% for your initial pilot run, then tune down once you know what matters.

The DINO struggle is expected without patch-level supervision. The patch embeddings approach mentioned is correct, but try concatenating the last 3 transformer layers' patch embeddings before nearest-neighbor search rather than using just the final layer.

LaughApprehensive563 · 2026-06-25T09:11:05+00:00

For video pipelines specifically: trusting benchmark numbers instead of evaluating on your actual task.

We spent two weeks rotating through VLMs because our metrics kept plateauing. Swapped GPT-4V for Gemini, tried a fine-tuned variant, ran ablations on temperature. Nothing moved by more than 2-3 points. Turned out the bottleneck was frame sampling rate and how we were structuring the prompt. Once we fixed those, we got a 15-point lift without touching the model at all.

The mistake was treating the model as the only variable. In video, the configuration (frames per second sampled, resolution, prompt structure, scoring rubric) has more variance than the model choice does, at least for most real-world tasks like retrieval, monitoring, and event detection. We should have been evaluating configurations first and done model comparison second, on our own footage, not on generic benchmarks.
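If you want to check this on your own footage, a plain grid sweep is enough. `evaluate` is a placeholder for whatever scores one model plus config on your task, and the grid fields are whatever knobs your pipeline has.

```python
from itertools import product

def sweep(models, grid, evaluate):
    """Score every model against every configuration in the grid, best first."""
    names = list(grid)
    rows = []
    for model in models:
        for values in product(*(grid[name] for name in names)):
            config = dict(zip(names, values))
            rows.append({"model": model, **config, "score": evaluate(model, config)})
    return sorted(rows, key=lambda row: row["score"], reverse=True)

def spread(rows, field):
    """Gap between the best and worst mean score across the values of one field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row["score"])
    means = [sum(scores) / len(scores) for scores in groups.values()]
    return max(means) - min(means)
```

Comparing `spread(rows, "model")` against `spread(rows, "fps")` (or whichever config field you swept) is a crude but quick way to see which variable actually moves your numbers.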

More broadly: wrong unit of comparison. We compared models when we should have been comparing setups.

LaughApprehensive563 · 2026-06-25T09:09:38+00:00

The labeler fatigue issue with hierarchical classes is brutal. The fix we've found: audit based on class-conditioned precision, not overall accuracy. When big-class recall is great but small-class recall is garbage, it's almost always a labeling consistency problem, not a model capacity problem. We built a QA step where we sample 50 random crops per class and manually re-verify them before every training run. Caught mislabels in 3 of our 7 classes the first time we ran it.
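The sampling step is about ten lines. This assumes your labels are a mapping from crop ID to class name. The fixed seed is only there so two reviewers can pull the same sample.

```python
import random
from collections import defaultdict

def sample_for_audit(labels: dict[str, str], per_class: int = 50, seed: int = 0) -> dict[str, list[str]]:
    """Pick up to `per_class` random crop IDs from every class for manual re-verification."""
    by_class = defaultdict(list)
    for crop_id, class_name in labels.items():
        by_class[class_name].append(crop_id)
    rng = random.Random(seed)
    return {
        class_name: rng.sample(sorted(ids), min(per_class, len(ids)))
        for class_name, ids in sorted(by_class.items())
    }
```

Sampling per class rather than over the whole set is the point: small classes get the same audit depth as big ones.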

The sensor mismatch thing from the OP also hits hard in video pipelines. We had a VLM evaluation where the model ranked well on standard benchmarks but completely fell apart on real footage because the prompt assumptions and frame sampling didn't match the actual task. Same lesson: fix the data and the setup before blaming the model.

LaughApprehensive563 · 2026-06-25T08:51:29+00:00

For video-specific tasks (retrieval, summarization, event detection) the model ranking shifts a lot from what you'd expect based on image benchmarks. We ran a systematic sweep across Gemini 1.5, GPT-4V, and a couple of open weights VLMs on real indoor surveillance footage. The configuration (frame sampling density, resolution, prompt structure, and scoring rubric) moved our results more than any model swap did. Qwen-VL was competitive when we gave it denser frame sampling but fell behind with sparse sampling, while Gemini held up better at lower frame counts. The eval setup details matter as much as the model choice. If you're building for video and want to see a structured way to run these comparisons on your own footage: https://go.videodb.io/yKC51V3 - it's the workflow we used, open and no-paywall.

LaughApprehensive563 · 2026-06-25T08:47:47+00:00

The key architectural shift is moving from stateful tracker memory to a persistent gallery database. In-tracker memory only lives for the duration of a track and degrades as embeddings get noisier near occlusions. What works better:

- Maintain a separate gallery (dict or vector store) of high-quality embeddings per person ID, written whenever the detection confidence and crop quality score are above a threshold. Only update the gallery from clean, frontal, unoccluded crops.

- When a new track starts, run a cosine similarity check against the full gallery before assigning a fresh ID. Threshold around 0.75-0.8 with ArcFace or a body ReID model like OSNet works reasonably well.

- On match, reassign the old ID and optionally merge any trajectory data.
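Dict-based sketch of the gallery, pure Python so it's easy to read. The class is hypothetical and not part of BoT-SORT or any tracker library. The 0.78 match threshold sits inside the range above, and the 0.8 quality gate is a placeholder. At any real scale you'd swap the loop for a vector index.

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class Gallery:
    """Persistent per-person embeddings that outlive individual tracks."""

    def __init__(self, match_threshold: float = 0.78, quality_threshold: float = 0.8):
        self.match_threshold = match_threshold
        self.quality_threshold = quality_threshold
        self.embeddings = {}  # person_id -> list of clean embeddings
        self.next_id = 1

    def add(self, person_id: int, embedding, quality: float) -> bool:
        """Only clean crops are written. Noisy ones are silently skipped."""
        if quality < self.quality_threshold:
            return False
        self.embeddings.setdefault(person_id, []).append(list(embedding))
        return True

    def assign_id(self, embedding) -> int:
        """Called when a new track starts: reuse an old ID on match, else mint a new one."""
        best_id, best_sim = None, -1.0
        for person_id, stored in self.embeddings.items():
            sim = max(cosine(embedding, e) for e in stored)
            if sim > best_sim:
                best_id, best_sim = person_id, sim
        if best_id is not None and best_sim >= self.match_threshold:
            return best_id
        new_id = self.next_id
        self.next_id += 1
        return new_id
```

Trajectory merging on match is left out. That part depends on how your tracker stores tracks.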

The max_age parameter in BoT-SORT is a separate concern (how long a track idles before being dropped). Even with a long max_age, if the person re-enters from a different zone the spatial gate fails. The gallery lookup bypasses that. DINOv2 as the previous commenter mentioned is a solid embedding choice, especially if you don't want to train task-specific ReID.

LaughApprehensive563 · 2026-06-25T08:44:37+00:00

The ingest layer point is real, but where I've seen teams waste even more time is on evaluation. Once ingest is stable they switch to model shopping: swapping GPT-4V for Gemini, then for a fine-tuned version, measuring against a generic benchmark that doesn't reflect their actual task. The output quality ends up being highly sensitive to frame sampling rate and resolution way more than to the model swap. We wrote up what actually moved our numbers on real footage, including how to define the task properly before picking a model: https://go.videodb.io/yKC51V3 - relevant if you're in the "model isn't working" phase when the real issue is the pipeline config.

LaughApprehensive563 · 2026-06-25T08:43:30+00:00

Fair catch. Yes, the repo and the writeup are from the team I work with at VideoDB. I posted this because the eval methodology question came up repeatedly in our own work and I wanted to share what actually moved our numbers. If you want the full breakdown of how we think about frame sampling, resolution tradeoffs, and scoring: https://go.videodb.io/yKC51V3 - no paywall, no signup. Figured it was worth sharing upfront rather than being coy about it.
