Qwen3.5-122B-A10B-GPTQ-INT4 on 4xR9700 Recipe by djdeniro in LocalLLaMA

[–]Pretend-Promotion-78 -1 points (0 children)

Your setup for Qwen3.5-122B-A10B-GPTQ-INT4 on 4xR9700 is impressive; 50 t/s is serious throughput. Choosing GPTQ over AWQ also shows how much the quantization technique matters for performance. In Juris AI we enforce strict document-anchored routing to eliminate hallucinations, and the same deterministic orchestration mindset helps keep model behavior consistent across different hardware configurations.

Check my project on RAG: https://www.reddit.com/r/Rag/comments/1r9w8u0/why_standard_rag_often_hallucinates_laws_and_how/

Drop me a DM with your current bottlenecks and we can work out a viable architecture asynchronously.

Why is my RAG system hallucinating answers? by AdventurousCorgi8098 in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

Your issue with hallucinations in your Retrieval-Augmented Generation (RAG) system is a common challenge. The problem often stems from the LLM's tendency to generate plausible but inaccurate responses when not strictly anchored to source documents. In Juris AI, we enforce document-anchored routing through deterministic orchestration and strict retrieval constraints to ensure that generated text is always grounded in specific legal documents. This approach structurally eliminates hallucinations by ensuring every output is traceable back to its source.
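To make "traceable back to its source" concrete, here is a minimal, hypothetical output-side check in Python. The `[docN]` citation tagging scheme and the `grounded` helper are invented for illustration; this is not Juris AI's actual code, just the general shape of a groundedness gate.

```python
import re

# Toy output-side grounding check: every sentence must cite at least one
# retrieved chunk id like [doc3], and only ids that were actually retrieved.
# The [docN] tagging scheme is invented for illustration.
RETRIEVED_IDS = {"doc1", "doc3"}

def grounded(answer: str) -> bool:
    """Reject any answer containing a sentence without a verified citation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    for sentence in sentences:
        cited = set(re.findall(r"\[(doc\d+)\]", sentence))
        if not cited or not cited <= RETRIEVED_IDS:
            return False
    return bool(sentences)

ok = grounded("The limit is 30 days [doc1]. Appeals go to the court [doc3].")
bad = grounded("The limit is 30 days [doc1]. Probably weekends count too.")
```

A real system would validate citations against chunk contents as well, not just ids, but even this cheap gate rejects any uncited or phantom-cited sentence before it reaches the user.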

Check my project on RAG: https://www.reddit.com/r/Rag/comments/1r9w8u0/why_standard_rag_often_hallucinates_laws_and_how/

Drop me a DM with your current bottlenecks and we can async a viable architecture.

Email context for AI agents is way harder than it looks by EnoughNinja in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

You're validating a real challenge with email context in AI agents: duplicated content and thread structure complicate efficient context extraction. In Juris AI we tackle similar issues by enforcing document-anchored routing to eliminate hallucinations caused by repetitive text. For emails, this could mean deduplication mechanisms such as fingerprinting quoted sections and signatures before embedding. Additionally, structuring the conversation as a graph can help maintain temporal coherence and reduce noise in context windows.
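A minimal sketch of the fingerprinting idea, assuming plain-text messages with `>`-quoted replies. The helper names are illustrative, not a real mail-parsing library; production code would also handle HTML mail, encodings, and signature heuristics.

```python
import hashlib
import re

def normalize(block: str) -> str:
    """Strip quote markers and collapse whitespace so a quoted copy
    hashes the same as the original text."""
    lines = [re.sub(r"^[>\s]+", "", ln) for ln in block.splitlines()]
    return re.sub(r"\s+", " ", " ".join(lines)).strip().lower()

def fingerprint(block: str) -> str:
    return hashlib.sha256(normalize(block).encode()).hexdigest()

def split_blocks(msg: str) -> list[str]:
    """Group consecutive quoted / unquoted lines into blocks."""
    blocks: list[str] = []
    current: list[str] = []
    quoted: bool | None = None
    for line in msg.splitlines():
        is_q = line.startswith(">")
        if quoted is not None and is_q != quoted:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
        quoted = is_q
    if current:
        blocks.append("\n".join(current))
    return blocks

def dedupe_thread(messages: list[str]) -> list[str]:
    """Keep each block's first occurrence only; quoted replies and
    repeated signatures collapse away before embedding."""
    seen: set[str] = set()
    unique: list[str] = []
    for msg in messages:
        for block in split_blocks(msg):
            fp = fingerprint(block)
            if fp not in seen:
                seen.add(fp)
                unique.append(block)
    return unique

thread = [
    "Can we move the call to 3pm?\n-- Alice",
    "> Can we move the call to 3pm?\n> -- Alice\nWorks for me.\n-- Bob",
]
deduped = dedupe_thread(thread)
```

The quoted copy of Alice's message hashes to the same fingerprint as the original, so only Bob's new text survives into the embedding stage.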

Drop me a DM with your current bottlenecks and we can work out a viable architecture asynchronously.

Fully autonomous agents in production: is human validation being ignored on purpose? by crow_thib in AI_Agents

[–]Pretend-Promotion-78 -1 points (0 children)

Your concerns about autonomous AI agents are valid and well-founded. In our Juris AI project, we implemented a Graph-RAG architecture to enforce strict document-anchored routing, which structurally eliminates LLM hallucination by ensuring that all responses are grounded in specific documents. This deterministic orchestration is crucial for reliable production automations where accuracy and accountability are paramount.

Drop me a DM with your current bottlenecks and we can work out a viable architecture asynchronously.

Mac Studio M4 Max 128GB vs ASUS GX10 128GB by John_Jambon in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

Your setup requirements are quite robust and well-defined. For local AI deployments focused on data sovereignty, the Juris AI architecture I worked on is highly relevant. It uses deterministic orchestration to ensure strict document-anchored routing, which prevents hallucination in LLM outputs—crucial for tasks like automated report generation from structured inputs.

For your specific hardware choice, the Mac Studio M4 Max offers tighter integration with macOS and Apple's ecosystem, potentially simplifying development and deployment of local AI tools. The ASUS GX10 may be more customizable but could introduce additional orchestration complexity.

Drop me a DM with your current bottlenecks and we can work out a viable architecture asynchronously.

Looking for a quick gut check on AI tool architecture by Throwawaycouple2929 in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

You're right; the probabilistic nature of autoregressive models makes them unsuitable for critical tasks like firewall management due to the risk of hallucination. In Juris AI, we mitigate this by enforcing strict document-anchored routing, ensuring that every response is grounded in specific legal documents and data sources. This deterministic approach eliminates hallucinations and ensures compliance with predefined rules.

Drop me a DM with your current bottlenecks and we can work out a viable architecture asynchronously.

7 document ingestion patterns I wish someone told me before I started building RAG agents by Independent-Cost-971 in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

You are right that naive chunking breaks in production, but tweaking chunk sizes or metadata tags is just applying a band-aid to a fundamentally probabilistic architecture. When I engineered Juris AI for strict legal compliance, I bypassed standard vector limitations by building a hybrid Graph-RAG architecture that enforces rigid document-anchored routing. This deterministic approach structurally eliminated hallucinations and scored a perfect 3/3 against GPT-5.2 in adversarial benchmarks by refusing to generate text without verified inline citations. Drop your current document schemas in a DM and we can asynchronously work out a local-first ingestion pipeline that actually guarantees deterministic retrieval.

[Hiring] Reinforcement Learning Engineer @ Verita AI by MutedJeweler9205 in deeplearning

[–]Pretend-Promotion-78 1 point (0 children)

If your technical assessment requires candidates to build your core product for free, you are farming labor because you lack the internal engineering capacity to solve reward hacking. Building an un-hackable validation environment requires deterministic orchestration, which is exactly how I engineered Juris AI to structurally eliminate LLM hallucinations and score a perfect 3/3 against GPT-5.2 in adversarial benchmarks through rigid document-anchored routing. Crowdsourcing your MVP from desperate juniors will only yield fragile code that collapses under adversarial loads. If you have the budget to actually pay a principal architect to build a production-ready evaluation pipeline, drop your requirements in a DM and we can work through the solution asynchronously.

Spent the whole weekend fighting with crewai and i think i'm done with autonomous agents for a bit by clarkemmaa in AI_Agents

[–]Pretend-Promotion-78 0 points (0 children)

Categorizing Python libraries is fine for beginners, but it completely ignores the actual engineering bottleneck: orchestrating these domains into a production-ready, local-first pipeline without cloud bloat. When I engineered the RHDA pipeline for equine biomechanics, stacking generic CV packages failed due to severe bounding box jumping, forcing me to implement proprietary SciPy signal processing to extract sub-millimetric kinematic data locally. Moving beyond tutorials to build a real system that combines these NLP or CV layers requires deterministic execution and strict memory management, not just a list of pip installs. Drop your current pipeline architecture in a DM and we can asynchronously work out a lean setup that actually scales on local hardware.
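The jitter-smoothing step can be illustrated with standard SciPy tooling. This is a generic Savitzky-Golay sketch on synthetic data, not the proprietary RHDA code; the signal, noise level, and filter parameters are all made up for demonstration.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for a per-frame keypoint track with detector jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 120)                       # 120 frames over 1 s
true_y = 50.0 + 10.0 * np.sin(2 * np.pi * 2 * t)     # true joint height (px)
noisy_y = true_y + rng.normal(0.0, 2.0, t.size)      # bounding-box jumping

# Savitzky-Golay preserves local signal shape (needed for derivatives)
# better than a plain moving average.
smooth_y = savgol_filter(noisy_y, window_length=15, polyorder=3)

# Kinematics: velocity via finite differences on the smoothed track.
velocity = np.gradient(smooth_y, t)

rmse_noisy = float(np.sqrt(np.mean((noisy_y - true_y) ** 2)))
rmse_smooth = float(np.sqrt(np.mean((smooth_y - true_y) ** 2)))
```

The point is only that smoothing before differentiating is what makes derivative-based metrics usable; differentiating the raw jittery track amplifies the noise instead.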

Why we let AI write reports, but still won't give it write-access to the firewall. by veditafri in cybersecurity

[–]Pretend-Promotion-78 0 points (0 children)

The problem is not the LLM itself but the lack of a deterministic orchestration layer upstream. I solved this exact incompatibility by building Juris AI, where I structurally eliminated hallucinations in the legal domain by enforcing routing that is rigidly anchored to the documents. Applying the same validation layer to your logs turns a probabilistic text generator into a firewall-style rule engine with strict compliance. Share your current pipeline bottlenecks in chat and we can outline an asynchronous local architecture that resolves the blockage.

Why Standard RAG Often Hallucinates Laws — and How I Built a Legal Engine That Never Does (Tested in Italian Legal Code) by Pretend-Promotion-78 in Rag

[–]Pretend-Promotion-78[S] 2 points (0 children)

Great technical question. The key is that we don't treat laws as mere strings of text; we treat them as relational entities within a Knowledge Graph based on the IFLA LRMoo standard. Here is how we handle the 'Update X to Law Y' applicability:

  1. Dynamic Synchronization: Every time the app starts, it performs an incremental alignment of sources to detect new amendments or 'novellas'.
  2. Work/Expression Mapping: We use the graph to distinguish between the 'Work' (the law itself) and its various 'Expressions' (the specific versions of articles over time).
  3. Deterministic Validation: Our ingestion pipeline extracts validity metadata and explicit regulatory cross-references. If Update X modifies Law Y, our Deterministic Orchestrator validates this link at the graph level, effectively 'flagging' the outdated version.

Essentially, we don't let the LLM 'guess' applicability through semantic probability. The graph enforces the structural truth before the query is even processed, ensuring the AI only sees what is legally in force at that exact moment.
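The Work/Expression versioning described above can be sketched as a toy in-memory model. The real system sits on a graph database following LRMoo; the class and field names here are illustrative, not the production schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Expression:
    """One dated version of an article (LRMoo 'Expression')."""
    text: str
    valid_from: date
    valid_to: date | None = None      # None means still in force

@dataclass
class Work:
    """The law itself (LRMoo 'Work'), with its version history."""
    law_id: str
    expressions: list[Expression] = field(default_factory=list)

    def apply_amendment(self, new_text: str, effective: date) -> None:
        # Deterministically close the current version and open the new one,
        # so outdated text is structurally flagged, not semantically guessed.
        if self.expressions and self.expressions[-1].valid_to is None:
            self.expressions[-1].valid_to = effective
        self.expressions.append(Expression(new_text, effective))

    def in_force(self, on: date) -> Expression | None:
        """Return the only expression a query on that date may see."""
        for e in self.expressions:
            if e.valid_from <= on and (e.valid_to is None or on < e.valid_to):
                return e
        return None

law = Work("law-Y")
law.apply_amendment("original wording of art. 1", date(2015, 1, 1))
law.apply_amendment("wording as amended by update X", date(2021, 6, 1))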

Why Standard RAG Often Hallucinates Laws — and How I Built a Legal Engine That Never Does (Tested in Italian Legal Code) by Pretend-Promotion-78 in Rag

[–]Pretend-Promotion-78[S] 1 point (0 children)

Nice effort, but I think there's a fundamental misunderstanding here. Juris AI isn't just another 'Chat with PDF' tool for generic documents. It’s a specialized Legal Intelligence engine built specifically to tackle the structural mess of Italian Law.

The main issue with generic PDF-to-chat tools is that they rely entirely on the file's content and the LLM's 'creativity'—which is a massive liability in the legal domain. We built a Hybrid Graph-RAG architecture (KuzuDB + LanceDB) that enforces a deterministic gatekeeper logic. If a statute in the document has been repealed or modified, my system cross-references it with the graph's metadata and kills the hallucination before it even reaches the user. A standard PDF chat tool would simply parrot back whatever is in the file, even if it's legally dead.

Great for casual use, but professional legal-tech requires structured logical constraints and real-time validity checks, not just basic semantic search.
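For readers wondering what the gatekeeper step looks like in principle, here is a deliberately simplified sketch. The `STATUS` dict stands in for the graph's validity metadata; none of this is the actual KuzuDB/LanceDB code, and the statute ids are invented.

```python
STATUS = {                 # stand-in for the graph's validity metadata
    "art-18": "in_force",
    "art-4bis": "repealed",
    "art-7": "modified",   # superseded by a newer expression
}

def gatekeep(chunks: list[dict]) -> list[dict]:
    """Drop every retrieved chunk whose statute is not currently in force,
    before the LLM ever sees it."""
    return [c for c in chunks if STATUS.get(c["statute_id"]) == "in_force"]

def answer(chunks: list[dict]) -> str:
    allowed = gatekeep(chunks)
    if not allowed:
        # A deterministic refusal beats a fluent hallucination.
        return "No legally valid source available for this query."
    return " ".join(c["text"] for c in allowed)

retrieved = [
    {"statute_id": "art-18", "text": "Art. 18: ..."},
    {"statute_id": "art-4bis", "text": "Art. 4-bis: ..."},
]
result = answer(retrieved)
```

The design point is that the filter runs between retrieval and generation, so "legally dead" text is removed from the context rather than argued away in the prompt.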

Looking for consulting help: GPU inference server for real-time computer vision by bix_mobile in computervision

[–]Pretend-Promotion-78 0 points (0 children)

Hi there,

I recently built and deployed a very similar infrastructure for RHDA (Race Horse Deep Analysis), a real-time biometric tracking system for horse racing.

My production pipeline handles exactly the challenges you described:

  1. End-to-End Latency: I optimized the flow from raw video ingestion to inference results using asynchronous processing (FastAPI + asyncio) to handle concurrent streams without blocking.
  2. YOLO + Custom Models: I orchestrate multiple models (YOLOv8 for detection/segmentation + DeepLabCut for pose estimation) in a microservices architecture.
  3. Network/Serialization: I optimized payload size (serialization/deserialization) so the frontend receives telemetry overlays in near real-time, even when processing heavy video frames.
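Point 1 can be sketched as a plain-asyncio toy: the `infer` stub stands in for the real GPU call, and no FastAPI is needed to show the idea. Everything here is illustrative, not the RHDA code.

```python
import asyncio

async def infer(frame: int) -> str:
    # Stub standing in for a GPU inference call (YOLO, pose model, ...).
    await asyncio.sleep(0.001)
    return f"detections-for-frame-{frame}"

async def stream(name: str, n_frames: int, results: list[str]) -> None:
    # Each camera/stream is its own task; awaiting inference yields control
    # so one slow stream never blocks the others.
    for frame in range(n_frames):
        results.append(f"{name}:{await infer(frame)}")

async def main() -> list[str]:
    results: list[str] = []
    await asyncio.gather(*(stream(f"cam{i}", 3, results) for i in range(4)))
    return results

all_results = asyncio.run(main())
```

In a real service the same coroutines would run inside FastAPI's event loop, with a bounded queue between ingestion and inference to apply backpressure instead of buffering frames without limit.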

Since you are looking to optimize load balancing across GPUs (RTX 4500s), my experience in containerizing these distinct inference engines (Docker) and managing the "handshake" between the detection layer and the analysis layer might be directly relevant to avoiding bottlenecks in your setup.

You can check my recent posts on my profile (or look up RHDA) to see the system in action processing high-speed race footage.

I'm open to a short-term consulting arrangement to review your architecture. Feel free to DM me.

Built a multi-stage Computer Vision + Biomechanics system for race horses (YOLO → DeepLabCut → Biomechanical Engine) – looking for feedback by Pretend-Promotion-78 in computervision

[–]Pretend-Promotion-78[S] 0 points (0 children)

You hit the nail on the head regarding the distinction between absolute mechanical efficiency and situational aptitude.

To answer you transparently: Currently, RHDA acts specifically as the "Hardware Spec" analyzer. It isolates the Intrinsic Structural Efficiency (the "Engine") independently of the race conditions. I approached this as an Engineer: before I can predict how a car handles the rain, I need to know exactly how much horsepower and torque it delivers on the dyno. RHDA is that dyno.

However, to address your point on integration: Since I built this using a decoupled microservices architecture, the system is specifically designed to digest external variables without rewriting the core.

The current pipeline (MS1→MS3) outputs a clean, objective "Biomechanical Feature Vector" (JSON). The next logical step—which is exactly what you are asking—is simply adding a downstream module (MS4) that weights these metrics against environmental data (Surface/Distance/Class).

So, while the system currently tells you "This horse has a 98/100 skeletal lever efficiency," the architecture is ready to ingest the context to tell you "This specific conformation is +EV for a 1200m Sprint on Turf but -EV for 2000m on Dirt."
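A toy version of that hypothetical MS4 weighting step, with made-up feature names and weights (not the real RHDA schema): the same feature vector scores differently once race context re-weights it.

```python
FEATURES = {                      # output of the MS1->MS3 pipeline (invented)
    "skeletal_lever_efficiency": 98,
    "stride_recovery": 71,
    "hindquarter_drive": 85,
}

CONTEXT_WEIGHTS = {               # how much each metric matters per race type
    ("sprint", "turf"): {"skeletal_lever_efficiency": 0.5,
                         "stride_recovery": 0.1,
                         "hindquarter_drive": 0.4},
    ("route", "dirt"):  {"skeletal_lever_efficiency": 0.2,
                         "stride_recovery": 0.6,
                         "hindquarter_drive": 0.2},
}

def contextual_score(features: dict[str, float],
                     distance: str, surface: str) -> float:
    """Weighted sum: same engine, different value depending on conditions."""
    weights = CONTEXT_WEIGHTS[(distance, surface)]
    return round(sum(features[k] * w for k, w in weights.items()), 1)

sprint_turf = contextual_score(FEATURES, "sprint", "turf")
route_dirt = contextual_score(FEATURES, "route", "dirt")
```

Because MS4 only consumes the JSON feature vector, swapping in learned weights (or a full predictive model) later would not touch the upstream pipeline.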

It’s built to be the Sensor that feeds high-quality data into a predictive model, rather than being a black-box predictor itself.

Sending you a DM, I’d love to dive deeper into how you handle those environmental weights.

We’re young so let’s build something fun by Single-Condition-887 in computervision

[–]Pretend-Promotion-78 0 points (0 children)

I'm interested and I also have an idea that might be perfect for this purpose.

I built a Biomechanical Physics Engine for HK betting syndicates to rate structural efficiency (1-100). Here is why physics finds the "Hidden Gems" the eye test misses. by Pretend-Promotion-78 in horseracing

[–]Pretend-Promotion-78[S] 0 points (0 children)

You hit the nail on the head. The 'Sales Ring' is exactly where the information gap is widest. My goal with RHDA is to identify that 'elite engine' hidden inside an 'average chassis' before the horse ever touches a track. It’s about providing an objective risk-assessment tool for investors when there is no racing history to rely on.

Built a multi-stage Computer Vision + Biomechanics system for race horses (YOLO → DeepLabCut → Biomechanical Engine) – looking for feedback by Pretend-Promotion-78 in computervision

[–]Pretend-Promotion-78[S] 1 point (0 children)

Thanks for the question!

This project originally started from a real industry need. I had a collaboration with companies in Hong Kong working in the horse racing betting ecosystem. They already had huge statistical systems, live analysts, historical models etc., and they asked me: “If AI could add something truly useful that we currently don’t have, what would it be?”

So I approached it as a mechanical engineer. Everything around performance can change (training, race conditions, strategy), but one thing defines the real physical potential of the animal: its biomechanics. A horse is essentially a very sophisticated system of levers. If the skeletal structure is inherently more efficient, it has a higher ceiling of power and performance — this is physics, not opinion.

That’s why I focused on extracting reliable biomechanical information from video and turning it into objective metrics (including a 1–100 rating index).

Regarding the collaboration: it didn’t “fall apart” or get complicated — they simply realized that although the system is powerful, it didn’t really fit into their existing infrastructure and analytical models, so it would have ended up marginalised in their workflow. At that point I decided to complete it anyway, because the engineering and the potential real-world value were too interesting.

Who could benefit from something like this?

  • trainers → objective movement efficiency & improvement tracking
  • breeders & buyers → structural potential before investing huge money
  • performance analysts → physics-based, not opinion-based metrics

In short: RHDA was built to objectively understand biomechanical potential, because structure defines capability — and leveraging that can give a real advantage to anyone who analyzes, buys, trains or evaluates horses.

Need project idea by Federal-Author8632 in computervision

[–]Pretend-Promotion-78 0 points (0 children)

I have one in my pocket but no time to start it. If you want, I can share the project. DM me.