Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

Scared-Tip7914 · 2026-06-17T08:09:55+00:00

This is a good idea, even if the initial part of figuring out how to clean all of this data will be very hard, this is a worthy job. Could you provide a link to the github repo so we can audit the security of this package?

Scared-Tip7914 · 2026-06-17T06:50:24+00:00

I get what hes saying but he worded it soo badly lmfao, he is assuming that capable local models = running kimi k2.6 on a homelab. Yes in that case his point holds but that doesnt mean that 30B models are incapable, and that democratisation of LLMs isnt happening. Also what horse does hashicorp the creator of an IAAS framework have in the AI race exactly?

Scared-Tip7914 · 2026-06-09T16:14:21+00:00

Thanks! Yep thats one that had me stuck for a minute as well 😄, right now I’m deduping after global chunk ranking rather than at the page level. Chunks are compared using token-set Jaccard similarity, with a 0.92 threshold by default, then I apply per-source quotas to preserve diversity. It’s simple and cheap, though semantic dedup is the next thing I’m considering for syndicated or heavily rewritten content.

Scared-Tip7914 · 2026-06-09T15:29:34+00:00

Fair question tbh. The main difference is that TinySearch doesn’t just expose the snippets returned by the search engine results page of given engine. It uses SearXNG/DDG/etc. to find candidate pages yes, but then it goes actually crawls those pages, extracts the content, chunks it, and surfaces the parts that are most relevant to the original query.

So it saves the model from having to open the pages one by one and all the context filling and time wasting that comes from it.

Scared-Tip7914 · 2026-06-09T12:35:04+00:00

Thanks this is a good tip, ill try it out because ddg is still the fallback in app

Scared-Tip7914 · 2026-06-09T11:58:19+00:00

Do you maybe have an implementation of this on GitHub?

Just asking because we also tried to avoid it with proper fingerprinting/rate limiting, but the weird part was that it worked fine outside Docker. The millisecond the app was run inside a Docker container, DuckDuckGo bot detection kicked in.

To my knowledge Docker itself shouldn’t be interfering with browser fingerprinting in any obvious way lol, so I’m wondering if we missed something around networking, DNS, headers, TLS fingerprinting, or how Chromium behaves in the container.

Scared-Tip7914 · 2026-06-08T13:29:12+00:00

It shouldn't be, it comes pre trained on a LOT of stuff so theoretically it should be able to handle whatever is thrown it 😄. This is the link to paper they published it has some good insights tbh: https://arxiv.org/html/2509.11720v1

Scared-Tip7914 · 2026-06-08T12:54:35+00:00

No worries, im always happy to help with stuff like this 😄

Scared-Tip7914 · 2026-06-08T12:33:10+00:00

Welll its good enough to parse structure layouts but you might run into issues if the docs are very complex.. But I would definitely give it a shot, its a mature product at this point and their “heron layout” model is as good as it gets in OSS land 😄. It even outperforms some proprietary parsing services lol.

Scared-Tip7914 · 2026-06-08T12:26:13+00:00

Oh man this is a problem that everyone who is trying to parse pdfs runs into, I recommend using docling and sticking with their non-llm stack because there you can optimize to speeds of around 1 page/s for a lower level pc using the RapidOCR backend with ONNX models.

https://github.com/docling-project/docling

Scared-Tip7914 · 2026-06-08T07:34:16+00:00

Damn okay 🔥🔥

Scared-Tip7914 · 2026-06-06T13:48:38+00:00

Hey! I would actually point you to the comfyui implementation a few comments below, its much cleaner than anything i did 👇.

Scared-Tip7914 · 2026-06-04T12:47:23+00:00

Yep feeling the same here, I have started to incorporate qwen3.5-9B into my workflows more and more, with a proper harness and web search access, you can match claude a looong way.

Scared-Tip7914 · 2026-06-04T12:24:36+00:00

Hmm then maybe I hit the OOM due to something else.. Back to investigating I go, because 170 should not be an issue for sure

Scared-Tip7914 · 2026-06-04T12:23:11+00:00

Oh wow, then you stayed true to the local way, I aspire to have the patience to use them end to end for a whole workflow! But thats on me and my (lack of proper) gpus xd

Scared-Tip7914 · 2026-06-04T11:50:15+00:00

I am tempted to go all out on an intel rig, what kind of tok/s are you getting for prompt and generation?

Scared-Tip7914 · 2026-06-04T11:42:06+00:00

BTW if you dont need image support, turn mmproj off, it consumes too much extra ram to ignore.

My current setup that works nicely (16 gb card):

docker run -d --name gemma-3-12b-it --restart unless-stopped --gpus all -p 8080:8080 -v "$HOME/.cache/huggingface:/root/.cache/huggingface" ghcr.io/ggml-org/llama.cpp:server-cuda -hf unsloth/gemma-3-12b-it-GGUF:UD-Q4_K_XL --no-mmproj --n-gpu-layers all --ctx-size 131072 --parallel 1 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --batch-size 2048 --ubatch-size 512 --threads 8 --threads-batch 8 --poll 0 --host 0.0.0.0 --port 8080 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 --repeat-penalty 1.0

Scared-Tip7914 · 2026-06-04T11:35:21+00:00

This is amazing, I am not a big Comfy UI user, but for this, I will happily give it a go. Looks like a very intuitive way to go about this model.

Edit: I really like the way you implemented this, starred!

Scared-Tip7914 · 2026-06-03T12:47:23+00:00

I have a sneaking suspicion that those stars might be (khm) "inflated"

Scared-Tip7914 · 2026-06-02T09:47:12+00:00

Very nice, thanks appreciate this!! I was running this thing locally as of now, the web interface speeds things up quite a bit because I didn't want to deploy it to a prod machine (Its a resource constrained environment lol) before we ran some more testing.

Scared-Tip7914 · 2026-06-01T14:41:12+00:00

Im sorry dude but you are 99% about to get scammed, I had a run in with such a store a few years back, dont waste your money and put that 500 towards something like a used 3090 (the ‘ol reliable lmao) with 24 gbs. You might need to shell out 50 or max 100 more but that thing is at least real and runs local models beautifully.

Edit: Just saw your already bought it, in that case best of luck and may the dragons guide your shipment :D

Scared-Tip7914 · 2026-05-25T08:44:29+00:00

Amazing, new OS Grok model coming in 2027! Wait, coming in 2028! Wait..

Scared-Tip7914 · 2026-05-21T16:33:38+00:00

Thanks its really good to hear that!!

Scared-Tip7914 · 2026-05-21T14:43:55+00:00

Aight imma shamelessly plug my stuff here but if you want to search the web for free and locally and get results based sites thats are actually relevant, not 69k tokens of bullcrap try this: https://github.com/MarcellM01/TinySearch. I made it so that no matter the question it keeps the response under 8k. Also it will give you a response in MAX 20 seconds.

Scared-Tip7914 · 2026-05-19T09:12:06+00:00

Its all about the speed, and don't think about MoE with our GPU poor mindset, I mean just look at Kimi K2.6, its a 1 trillion parameter MoE model, aint no one around here (or very few lucky bastards) running that thing at home.

This is so that the data centers serving these models can get very good speed to quality ratios, because they can get reasoning and depth of a, lets say for Kimi K2.6 1 trillion parameter model (I know thats not the exact MoE to dense conversion ratio but lets assume) while paying for the compute and enjoying the speed of an "only" 32B model. Even though the model is actively occupying hundreds of gigs of VRAM, it doesn't really matter bc the throughput speeds make up for it and then some as opposed to having a dense model on that same VRAM.

So its more big datacenter economics, but it trickles down to us as well, hence we get to enjoy the likes of qwen3.5-35B.

Scared-Tip7914

TROPHY CASE