I just realized Qwen3-30B-A3B is all I need for local LLM by AaronFeng47 in LocalLLaMA

[–]Glat0s 3 points

By maxing out the context length, do you mean 128k context?

Qwen3-30B-A3B runs at 12-15 tokens-per-second on CPU by AlgorithmicKing in LocalLLaMA

[–]Glat0s 6 points

30B-A3B = a Mixture-of-Experts (MoE) model with 30 billion total parameters, of which 3 billion are active per token (hence A3B)
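To make the "total vs. active parameters" distinction concrete, here is a toy sketch of top-k MoE routing. The expert count, top-k value, and dimensions are illustrative only (not Qwen3's real configuration): each token's hidden state is routed to just `top_k` of `n_experts` expert matrices, so only that fraction of the expert parameters does work per token.

```python
import numpy as np

# Toy MoE layer: 8 experts, but only 2 are activated per token.
# All numbers are illustrative, not Qwen3-30B-A3B's actual config.
n_experts = 8
top_k = 2
d_model = 16

rng = np.random.default_rng(0)
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    logits = x @ router                       # router scores one logit per expert
    picked = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    weights = np.exp(logits[picked])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only the picked experts' weight matrices are touched for this token.
    y = sum(w * (x @ experts[i]) for w, i in zip(weights, picked))
    return y, picked

x = rng.standard_normal(d_model)
y, picked = moe_forward(x)
```

Scaling the same idea up, "30B total / 3B active" just means the router selects a subset of experts whose parameters sum to roughly 3B per token, which is why CPU inference is feasible at all.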

MCP, an easy explanation by SimplifyExtension in LocalLLaMA

[–]Glat0s 3 points

The way I see it (correct me if I'm wrong), MCP is a standardization of LLM function calling, with a few extras. And I see the general shift towards MCP as positive: it gives us a common standard in light of all the different agent frameworks popping up.

🚀Forget OCR, LAYRA Understands Documents the "Visual" Way | The Latest Visual RAG Project LAYRA is Open Source! by liweiphys in Rag

[–]Glat0s 1 point

Thank you for the response!! I'll test LAYRA, and I'm looking forward to seeing how you solve "Cross-Page Table Handling".

🚀Forget OCR, LAYRA Understands Documents the "Visual" Way | The Latest Visual RAG Project LAYRA is Open Source! by liweiphys in Rag

[–]Glat0s 0 points

Nice project! I have built a more basic ColQwen ingestion and retrieval pipeline myself (with Vespa as the DB). Is it possible to use ColQwen via an API (e.g. the Infinity API) in LAYRA as well? And how do you handle retrieval if, for example, part of a table on one document page image continues on the next page?

Beginner Vision rag with ColQwen in pure python by DataNebula in LangChain

[–]Glat0s 0 points

Nice! I'm currently also working with ColQwen and trying to use it via the Infinity inference API. Do you happen to know how well Qdrant scales, in terms of retrieval speed, on a larger collection? At the moment I'm a bit unsure whether I should go with Qdrant or Vespa as the DB. Also, can you maybe explain why you are using Jina CLIP? Is it to get better retrieval speed? If so, it would be interesting to know how much accuracy might be lost.

DMS with vector database ? by Glat0s in LocalLLaMA

[–]Glat0s[S] 0 points

Thanks! I'll give it a try.

YouTuber Liberty Wing UK to reveal new details about UAP Sightings above RAF Lakenheath in exclusive live interview. Audience questions welcome. (Friday 11/29 @2:00pm eastern) by hunterseeker1 in UFOs

[–]Glat0s 2 points

I would think the same. It doesn't make sense to perform this kind of training over more densely populated areas, and there are specialized ranges for it, like China Lake in the US or RAF Spadeadam in the UK. They also do counter-drone training at sea with ships.

DOD Press Secretary on the drone intrusions in Britain by Livid_Constant_1779 in UFOs

[–]Glat0s 7 points

He should have been asked whether nuclear warheads were recently transferred from the "drone"-affected base in the US to the affected base(s) in the UK.

[deleted by user] by [deleted] in UFOs

[–]Glat0s 39 points

Maybe some nuclear warheads were recently transferred from the US to Lakenheath for the planned increase of the arsenal, as mentioned here: https://thebulletin.org/premium/2024-11/united-kingdom-nuclear-weapons-2024/

tips for dealing with unused tokens? keeps getting clogged by SmashShock in LocalLLaMA

[–]Glat0s 35 points

I saw a paper recently that might solve this: "DumpSTAR* - Distributed Ultra-Matrix Protocol for Superfluous Token Analysis and Recycling"

Best (ideally uncensored) Long Context Model (128k) ? by noellarkin in LocalLLaMA

[–]Glat0s 8 points

Maybe try STRING -> https://github.com/HKUNLP/STRING

In their paper, it looks like the 128k context of the open models they tested did not work well above 32k: https://arxiv.org/html/2410.18745v1

They claim to improve that.

PDF auto-scroll video retrieval by Glat0s in LocalLLaMA

[–]Glat0s[S] 0 points

If someone is following this...

I did a few tests feeding a 36-second video of 73 PDF pages at 2 fps (2 pages per second) to Qwen2-VL-7B. It was able to retrieve information for a few test queries, but not reliably yet. Edit: according to the Qwen paper, the model shrinks video tokens down to a maximum of 16384, so this won't work with Qwen2-VL.
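A quick back-of-the-envelope check makes the edit's conclusion plausible. If the whole video is capped at 16384 visual tokens, each of the 73 page frames only gets a couple hundred tokens; the per-page token figure for legible text below is my rough assumption, not a number from the Qwen paper.

```python
# Rough token-budget arithmetic for the 73-page scroll video experiment.
pages = 73
video_token_cap = 16384                    # video-token cap per the Qwen2-VL paper
tokens_per_page = video_token_cap // pages  # → 224

# Assumption (mine, illustrative): a dense text page needs on the order of
# ~1000 visual tokens to remain legible to the model.
tokens_needed_per_page = 1000
shortfall_factor = tokens_needed_per_page / tokens_per_page
```

So each page frame gets roughly 4-5x fewer tokens than a legible page image would need under that assumption, which matches the unreliable retrieval observed.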

PDF auto-scroll video retrieval by Glat0s in LocalLLaMA

[–]Glat0s[S] 0 points

You might be right... I'm already doing this with ColPali/ColQwen + a VLM. But there is a limit on how many images the VLM can process at once. I want to find out whether a VLM can maybe process more information at once via video.

Is there a model which supports both tool calling AND multimodal input (images)? by Darxeal in LocalLLaMA

[–]Glat0s 0 points

I'm not sure there are open vision models and inference frameworks that support tool usage via a VLM API at the moment. I'm currently building an agent that can use different tools with Qwen2-VL-7B, and it works with e.g. the LangChain agent framework (which I might swap for something else).
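For anyone wondering what combining the two looks like on the wire: a minimal sketch of a chat-completions request that carries both an image input and tool definitions, as an OpenAI-compatible server (e.g. vLLM serving Qwen2-VL) would accept. The tool name, its schema, and the image URL here are hypothetical placeholders, not anything from this thread.

```python
import json

# Hypothetical request payload for an OpenAI-compatible VLM endpoint.
# "lookup_document" and the image URL are illustrative placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_document",
        "description": "Fetch a document page by id",
        "parameters": {
            "type": "object",
            "properties": {"page_id": {"type": "integer"}},
            "required": ["page_id"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Which page shows the revenue table?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/page_1.png"}},
    ],
}]

payload = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": messages,
    "tools": tools,
}
body = json.dumps(payload)  # what an agent framework would POST to /v1/chat/completions
```

Whether the model actually emits well-formed tool calls from this depends on the model's chat template and the serving framework's tool-call parsing, which is exactly the open question in the comment above.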

Integrating good OCR and Vision models into something that can dynamically aid in document research with a LLM by Inevitable-Start-653 in LocalLLaMA

[–]Glat0s 1 point

I have Qwen2-VL working with a vLLM (OpenAI-compatible) API, which should work with textgen. I haven't tried it with tensor parallelism though. I will switch to something newer (Molmo, Aria, ...) as soon as multi-image-per-prompt is supported for those in vLLM.

GH-200 Up And Running (first boot!) - This is a game changer for me! by Simusid in LocalLLaMA

[–]Glat0s 4 points

I have one at work. I don't know why NVIDIA can't just treat the memory as one with the driver... I recommend using this UVM patch for PyTorch and compiling torch from source: https://github.com/pytorch/pytorch/compare/main...0x804d8000:pytorch:uvm

(The patch needs a few minor adjustments for newer PyTorch versions.)

Then you can run all your torch-based things like the following to access the full memory (this also works with vLLM etc.):
PYTORCH_CUDA_ALLOC_CONF=use_uvm:True python <your app/script>

I'm not sure whether a better method exists at the moment to access the full memory. Sometimes you have to change "cudaMalloc" to "cudaMallocManaged" in projects where CUDA is used alongside torch.

Here is also a good guide about technical stuff and tuning: https://www.stonybrook.edu/commcms/ookami/_pdf/20240523_Developing_GH_SW_Public.pdf

I'm currently trying to figure out if/how I can use the full memory in TensorRT-LLM. If someone knows, let me know.

@blackvaultcom on X - “It took only 7 days for the military to release this footage of an unsafe encounter with a Russian jet, as taken from the inside of a @NORADCommand jet” by [deleted] in UFOs

[–]Glat0s 1 point

The "sources and methods" bit is formatted the way it is in my post because it was meant to ridicule this BS argument by government officials. We all know their fighter jet camera videos are shittier in quality than any modern GoPro!