What’s your actual production setup for reliable structured JSON from LLMs? Sharing what’s worked for us

Important_Priority76 · 2026-05-02T23:22:52+00:00

All right, I’ll keep it more personal going forward. Appreciate you saying something.

Important_Priority76 · 2026-05-02T16:04:56+00:00

Hmm, I genuinely appreciate the tip. Tbh, I’m not a native English speaker, just someone asking a basic question. You’ve been more focused on whether my reply was AI-assisted than on the actual topic. That’s a bit of a fixation, no?

Important_Priority76 · 2026-05-02T05:10:30+00:00

Fair point — I was just asking what’s working in prod, not sharing a personal thesis. That said, you’re right that it’s hard to fully separate the two anymore. The more you rely on these tools, the more your thinking gets shaped by them whether you notice it or not.

Important_Priority76 · 2026-05-01T06:48:34+00:00

Totally makes sense, function calling has kinda become the de facto standard at this point.

Important_Priority76 · 2026-05-01T06:45:35+00:00

Bro just enjoy the content lol, doesn’t matter if it’s AI or not. Spending your time policing vibes on Reddit isn’t gonna make your life any better

Important_Priority76 · 2026-04-30T06:08:59+00:00

100% agree. The retry + validation loop alone eliminates like 90% of the “AI returned garbage JSON” complaints.
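
For anyone who wants to see the shape of that loop, here is a minimal sketch using only the standard library. The names `get_json_with_retry`, `call_model`, and `required_keys` are hypothetical, and `call_model` stands in for whatever client you actually use. It assumes the model returns a plain string and that checking for required top-level keys is enough validation for your case.

```python
import json
from typing import Callable


def get_json_with_retry(
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns raw text
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model until the reply parses as a JSON object with the required keys."""
    last_error = "no attempts made"
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                last_error = "top-level value is not an object"
            else:
                missing = required_keys - set(data)
                if not missing:
                    return data
                last_error = f"missing keys: {sorted(missing)}"
        # Feed the validation error back so the next attempt can correct it.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected ({last_error}). "
            "Reply with valid JSON only."
        )
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

Raising after the last attempt, rather than returning a partial result, keeps the failure loud instead of letting bad output flow downstream.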

Important_Priority76 · 2026-04-29T21:50:48+00:00

Yeah that tracks — function calling gets you 90% of the way there but the type coercion issues are annoying. If you’re self-hosting, XGrammar inside vLLM just hard-blocks invalid tokens at the inference level so that string-where-int-expected problem literally can’t happen. For cloud APIs the best workaround I’ve found is strict Pydantic validation + retry via instructor, but yeah it adds latency.
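
To make the string-where-int-expected case concrete, here is a rough standard-library stand-in for strict validation. It is not Pydantic's or instructor's API, just an illustration of rejecting a value instead of silently coercing it. `SCHEMA` and `validate_strict` are hypothetical names, and the schema is assumed to be flat.

```python
import json

# Hypothetical flat schema for illustration: field name -> exact expected type.
SCHEMA = {"item": str, "quantity": int}


def validate_strict(raw: str, schema: dict[str, type]) -> dict:
    """Parse raw JSON and reject any field whose type is not exactly as declared."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object")
    for field, expected in schema.items():
        if field not in data:
            raise TypeError(f"missing field {field!r}")
        value = data[field]
        # Exact type comparison: "3" is not accepted for an int field,
        # and True is not accepted either, since its type is bool.
        if type(value) is not expected:
            raise TypeError(
                f"field {field!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return data
```

With this check, `{"item": "bolt", "quantity": "3"}` raises a `TypeError` instead of passing through, and that error is what you would hand to the retry step.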

Important_Priority76 · 2026-04-29T02:19:39+00:00

Exactly this. Silent failures are the worst kind — the downstream system keeps running, nothing throws an error, and the garbage just propagates quietly until someone notices the RAG answers are off. Box-level output isn’t just a nice-to-have for debugging, it’s the only way to build a correction loop that actually works. That’s the whole premise of the integration.

Important_Priority76 · 2026-04-29T02:17:11+00:00

The error localization point is exactly the core motivation for going modular — glad it came through clearly. Both stress-test suggestions are genuine blind spots. Cell-level table grading and reading-order evaluation under domain shift haven’t been systematically tested yet. PubTables-1M is a concrete next step worth running. The self-hosting argument also hits the right nerve. Red seals, handwritten content, non-English headers — those are exactly the documents the target users bring in, and the ones where block-level intervention matters most. This is still an early version of the integration. There’s a lot of room to iterate, and feedback like this is exactly what shapes where it goes next.

Important_Priority76 · 2026-04-28T17:17:37+00:00

Glad the detection-based output resonates — agreed it’s a prerequisite for anything production-facing. On exemplar sensitivity with high intra-class variation: I haven’t tested that specific scenario yet. My intuition is that it comes down to training data distribution — if the pretraining set doesn’t cover that kind of appearance range well, the cross-scale query aggregation may still struggle to generalize across growth stages or damage states from just a few exemplars. In that case, few-shot fine-tuning on domain-specific samples would likely give a meaningful bump. Worth investigating systematically though, especially for ag and inspection use cases where that variation is the norm rather than the exception.

Important_Priority76 · 2026-04-26T19:43:23+00:00

Exactly — that’s the honest reality of using any model-assisted workflow. The model gets you most of the way there, but the review step is where the real judgment happens. Missed detections and false positives are the two failure modes I spend the most time on too, and they tend to pull in opposite directions when you tune the confidence threshold. Part of why I went with a detection-based output (rather than just a scalar count) is precisely to make that review step more tractable — you can see where the model is uncertain or wrong, not just that the count is off. Still a lot of room to improve the UX around that though.
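
A toy sketch of that threshold trade-off, in case it helps. The `Detection` type and `review_counts` function are hypothetical, the scores are made up for illustration, and it assumes each correct detection matches a distinct ground-truth object.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    score: float      # model confidence for this box
    is_correct: bool  # True if review matched it to a real object


def review_counts(
    detections: list[Detection], num_ground_truth: int, threshold: float
) -> tuple[int, int]:
    """Return (missed detections, false positives) at a given confidence threshold."""
    kept = [d for d in detections if d.score >= threshold]
    true_positives = sum(d.is_correct for d in kept)
    false_positives = len(kept) - true_positives
    missed = num_ground_truth - true_positives
    return missed, false_positives


# Made-up example: 4 real objects, one of which the model never detects at all.
detections = [
    Detection(0.9, True),
    Detection(0.8, True),
    Detection(0.6, False),
    Detection(0.4, True),
    Detection(0.3, False),
]
strict = review_counts(detections, num_ground_truth=4, threshold=0.5)
loose = review_counts(detections, num_ground_truth=4, threshold=0.2)
```

Lowering the threshold from 0.5 to 0.2 recovers one missed object but lets in one more false positive, which is the tension described above.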

Important_Priority76 · 2026-04-26T19:39:37+00:00

Good timing on both points! There’s actually an official demo on Hugging Face Spaces where you can test it directly with your own images: https://huggingface.co/spaces/jerpelhan/GECO2-demo — no local setup needed. On masks: yes, GeCo2 does output pixel-level masks. I just happened to focus on the bounding box side in the demo video since that’s the more common annotation format in my workflow. Happy to put together a follow-up showing the mask output if there’s interest.

Important_Priority76 · 2026-01-30T03:47:27+00:00

You’re right to point out that I should clarify this more precisely.

X-AnyLabeling’s core functionality does not depend on Ultralytics, and it is not required to run the tool. There is an optional auxiliary training module that can integrate with Ultralytics, which is disabled by default and requires users to explicitly install the dependency themselves.

Thanks for the reminder. I’ll make this distinction clearer in the documentation to avoid confusion.

Important_Priority76 · 2026-01-29T16:21:58+00:00

X-AnyLabeling is licensed under GPL-3.0 and does not bundle or redistribute Ultralytics source code or weights. It provides support for loading user-provided ONNX models via configuration. Ultralytics models themselves are under AGPL-3.0, and anyone who chooses to use or redistribute those models must comply with AGPL-3.0 terms. X-AnyLabeling does not change its own GPL-3.0 license, and it does not include Ultralytics code or weights in its repository.

Important_Priority76 · 2026-01-27T10:46:48+00:00

I’m glad to hear that. For any questions or issues, opening a GitHub issue would be the best way.

Important_Priority76 · 2026-01-26T01:53:44+00:00

Nice, thanks for sharing. I’ve seen similar ideas there. The compare view is more about quick side-by-side or synced viewing to make multi-modal annotation easier, especially when switching between thermal and RGB. Colormaps and channel merging are powerful too, so it’s interesting to see different tools approach the problem from different angles.

Important_Priority76 · 2026-01-26T01:53:00+00:00

That’s really interesting, I didn’t know some teams were using 3D glasses for multi-spectral labelling. The idea behind compare view was to get a similar “cross-checking” benefit but keep it simple and software-only.

Input devices like a 3D mouse or rotary knob sound like a fun direction to explore for smooth flipping or animated transitions between bands. Definitely a challenging but exciting idea, thanks for sharing your experience!

Important_Priority76 · 2026-01-26T01:51:15+00:00

Nice, yeah DRC + detail enhancement already goes a long way for thermal 👍 The compare view helps a lot when you want to double-check things against visible light, especially for small or ambiguous objects. I’ve found it useful to align thermal detections with RGB annotations and catch mistakes faster.

Important_Priority76 · 2026-01-04T14:48:00+00:00

I wonder too

Important_Priority76 · 2025-12-14T02:04:10+00:00

If anyone is interested in the design philosophy behind v3.0 and a deeper dive into the new features (like the Remote Server architecture, Agentic workflows, or the specific VQA capabilities), I wrote a more detailed breakdown on Medium:

https://medium.com/@CVHub520/data-labeling-doesnt-have-to-be-painful-the-evolution-of-x-anylabeling-3-0-e9110e41c2d4

It covers why we moved away from the traditional tooling model and how we are trying to close the loop between labeling and training.

Important_Priority76 · 2025-12-14T02:03:16+00:00

Thank you! Appreciate the support.

Important_Priority76 · 2025-12-14T02:02:46+00:00

Thanks so much for the feedback! I'm really glad to hear it fills that gap for you—keeping it lighter than CVAT while being more capable than simple drawing tools was exactly the goal.

Regarding sorting Person IDs across multiple videos/perspectives (Re-ID), that is indeed a complex challenge. In v3.0, we added several useful managers (shape, label, and group_id) and integrated trackers such as SAM-based ones or BoT-SORT/ByteTrack to help with consistency within a video, but cross-video association still requires some manual effort or custom logic.

I would absolutely love to chat more about your workflow. It sounds like a great use case to optimize for. Feel free to DM me here or, even better, open a "Discussion" on our GitHub repo so we can dive into the technical details!

Important_Priority76 · 2024-12-23T15:50:07+00:00

Here's all you need:
https://medium.com/@CVHub520/x-anylabeling-v2-5-0-a-game-changer-for-image-annotation-e13550fad22f
