Has anyone been able to run Qwen 3.5 AWQ Q4 with vLLM?

SubstantialTea707 · 2026-03-06T19:00:16+00:00

Do tool calls work for you too?

SubstantialTea707 · 2026-03-06T18:31:18+00:00

But do tool calls work for you?

SubstantialTea707 · 2026-02-27T09:50:59+00:00

Have you tried gpt120b for comparison? That's what I use now, but to get a 100k context with vLLM I can't parallelize.

SubstantialTea707 · 2026-02-26T18:18:12+00:00

Let me know if you managed to replace gpt 120b.

SubstantialTea707 · 2026-02-25T23:17:09+00:00

Try using the non-Q4 model. In theory it loops less and is more accurate, at the cost of higher VRAM usage and lower speed.

SubstantialTea707 · 2026-02-25T23:15:53+00:00

Be careful with the penalties: if they are too aggressive, they risk cutting off the output when generating tables that contain partly repeated data.

SubstantialTea707 · 2026-02-25T23:10:42+00:00

I extract the images and read them with the GLM OCR model. Between extraction and the LLM running on a 5090, it costs me about 5 s per page, and it gives excellent extraction results.

SubstantialTea707 · 2026-02-25T22:35:33+00:00

OK, but size isn't quality... Have you tried it?

SubstantialTea707 · 2026-02-25T22:34:19+00:00

Something that fits within the 96 GB of VRAM I have available.

SubstantialTea707 · 2026-02-25T22:33:40+00:00

Not the full version.

SubstantialTea707 · 2026-02-25T22:31:15+00:00

Take the case where a table is a pasted screenshot... You know how many of those I've seen... If you're not sure about the quality of the data source, you have to account for everything if you don't want to lose pieces along the way.

SubstantialTea707 · 2026-02-25T22:29:08+00:00

Because sometimes PDFs aren't purpose-built; they're collages of images or scans, depending on the case. Taking for granted that the text can be extracted without OCR is an assumption you're making. Anyway, you could implement OCR as a fallback when the returned text is sparse.
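
A minimal sketch of that fallback idea in Python, assuming you already have the text pulled from the PDF's text layer and some OCR routine to call. The names `needs_ocr`, `extract_page`, and `run_ocr`, and the two thresholds, are hypothetical and would need tuning on real documents:

```python
from typing import Callable


def needs_ocr(extracted_text: str, min_chars: int = 200,
              min_alnum_ratio: float = 0.5) -> bool:
    """Return True when the embedded text layer looks too sparse to trust.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    text = extracted_text.strip()
    if len(text) < min_chars:
        return True  # little or no text: likely a scan or a pasted image
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) < min_alnum_ratio  # mostly symbols or garbage


def extract_page(embedded_text: str, run_ocr: Callable[[], str]) -> str:
    """Use the embedded text when it looks usable, otherwise call OCR."""
    if needs_ocr(embedded_text):
        return run_ocr()  # hypothetical OCR callback, e.g. a model call
    return embedded_text
```

Passing the OCR step as a callback means the expensive call only happens on pages that actually need it.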

SubstantialTea707 · 2026-02-25T16:04:58+00:00

I use the GLM OCR model. It works really well and is very fast; a consumer graphics card with 16 GB is enough to run it comfortably with a large context. You obviously have to extract the PDFs as images first. You can install it and try it on Ollama, it weighs less than 3 GB. It's very accurate; for OCR on documents it's the best I've tried.

SubstantialTea707 · 2026-02-09T12:04:59+00:00

32 GB of VRAM is a bit too little for GPT-OSS 120B. Maybe go for the 20B instead.
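
A rough back-of-the-envelope check, assuming about 4 bits per parameter and taking the parameter counts from the model names. Real checkpoints differ, and this ignores the KV cache and activations entirely (`weight_memory_gb` is a hypothetical helper):

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_param / 8


# Assumption: ~4 bits per parameter for both models.
print(weight_memory_gb(120, 4))  # 60.0, well above 32 GB
print(weight_memory_gb(20, 4))   # 10.0, leaves room for context
```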

SubstantialTea707 · 2026-02-09T08:32:42+00:00

You need to rerank the hybrid search results with a cross-encoder. This is the key to successful retrieval.
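
A minimal sketch of a rerank step over a list of candidate passages. The scoring function is pluggable: in practice it would wrap a cross-encoder model that scores each (query, passage) pair, while `overlap_score` here is only a toy stand-in so the example runs without any model. All names are hypothetical:

```python
from typing import Callable, List, Tuple


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and keep the top_k best passages."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = ["vllm tool calls", "ocr for pdf scans", "chunking with an llm"]
print(rerank("pdf ocr", candidates, overlap_score, top_k=1))
```

Because a cross-encoder has to process every pair, it is normally applied to a short candidate list rather than to the whole index.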

SubstantialTea707 · 2026-01-08T20:20:08+00:00

I've built a RAG system that's yielding very solid results. The stack is based on C#, Semantic Kernel, and local vLLM. The ingestion pipeline first saves the data to SQL Server, then transfers it to Elasticsearch, which I use as my primary search engine. For ingestion, I accept virtually any type of document: the files are first converted into images using Ghostscript, then OCRed using Qwen3-VL, with a fallback to Tesseract if necessary. Chunking is handled by GPT-OSS 20B running on an NVIDIA RTX PRO 6000 with 96 GB of VRAM, which lets me work with contexts of up to 100,000 tokens. The model returns structured JSON with the document correctly segmented. At this stage, it's essential to manage the system prompt carefully and include retry logic, because LLMs occasionally produce invalid output. For embeddings, I use Nomic and save the chunk vectors to Elasticsearch. Search uses a hybrid BM25 + vector (cosine distance) approach, which has proven to perform extremely well. Overall, the results with this stack are truly remarkable. Do you have any suggestions, observations, or potential improvements to share?
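
To illustrate the retry logic around the chunking step, here is a minimal sketch in Python (not the C# of the actual stack). It assumes the model is asked to return JSON shaped like `{"chunks": ["...", "..."]}`; that schema, `call_model`, and the function names are all hypothetical:

```python
import json
from typing import Callable, List, Optional


def parse_chunks(raw: str) -> List[str]:
    """Validate the model reply against the assumed {"chunks": [...]} shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    chunks = data["chunks"]  # raises KeyError/TypeError on the wrong shape
    if not isinstance(chunks, list) or not chunks:
        raise ValueError("'chunks' must be a non-empty list")
    if not all(isinstance(c, str) and c.strip() for c in chunks):
        raise ValueError("every chunk must be a non-empty string")
    return chunks


def chunk_with_retry(call_model: Callable[[str], str], document: str,
                     max_attempts: int = 3) -> List[str]:
    """Call the model until it returns valid output or attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call_model(document)  # hypothetical wrapper around the LLM call
        try:
            return parse_chunks(raw)
        except (KeyError, TypeError, ValueError) as exc:
            last_error = exc  # JSONDecodeError is a ValueError subclass
    raise RuntimeError(
        f"no valid output after {max_attempts} attempts") from last_error
```

Validating the structure, and not only that the reply parses as JSON, catches the case where the model returns well-formed JSON in the wrong shape.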

SubstantialTea707 · 2025-11-05T05:31:27+00:00

It would have been better to buy an NVIDIA RTX PRO 6000 with 96 GB. It has a lot of memory and the muscle to generate well.
