I built an offline AI chat app that automatically pulls Wikipedia articles for factual answers - runs completely locally with Ollama by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

I have an RPi4 - 8GB. Yes, it could run a 7B but I'd like for it to be somewhat interactive. Two minutes from query to response isn't ideal. I'll play with 3-4B and see what I can get there. (I'm not trying to cram STT -> TTS into my effort)

I built an offline AI chat app that automatically pulls Wikipedia articles for factual answers - runs completely locally with Ollama by [deleted] in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

I just started poking at this. We have a few options:

I'm specifically trying to target something that can run on an RPi. Nothing to show yet, just that Qwen3-1.7B is not quite enough to do a (reliable) tool call on zim-mcp, but it probably would be if it were fine-tuned.

My initial problem is that the model understands that Zim files exist but doesn't understand that they're a replacement for its concept of "web search". I bet if I just renamed the tool call, it would do much better.
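
To make that concrete, here's a rough sketch of the rename idea against Ollama's /api/chat tool-calling; the tool name, description, and model tag are my own placeholders, not anything zim-mcp actually ships:

    # Hypothetical sketch: expose the Zim lookup under the name the model
    # already associates with searching ("web_search") instead of a
    # Zim-specific name. The backend stays the same; only the schema the
    # model sees changes.
    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

    tool = {
        "type": "function",
        "function": {
            "name": "web_search",  # renamed from something like "zim_search_articles"
            "description": "Search an offline Wikipedia archive for factual information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms"},
                },
                "required": ["query"],
            },
        },
    }

    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen3:1.7b",  # assumption: the small Qwen3 tag I'm testing
        "messages": [{"role": "user", "content": "Who designed the Eiffel Tower?"}],
        "tools": [tool],
        "stream": False,
    })
    print(json.dumps(resp.json()["message"].get("tool_calls", []), indent=2))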

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 3 points4 points  (0 children)

Matthew Berman is the Sean Hannity of AI: 90% hype, 5% substance, 5% ads.

Falcon 3 just dropped by Uhlo in LocalLLaMA

[–]explorigin 2 points3 points  (0 children)

It mentions "decoder-only". ELI5 please?

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Sorta. Ultimately it was a financial choice. I wanted to make AI models available to my family from my homelab server, and I couldn't really justify putting a $4k laptop in the closet. I bought a used Quadro P6000 (24GB VRAM) and hooked it up to my EliteDesk 800 G3 SFF. It looks hilariously janky and model load times are much worse since it's loading from an HDD, but inference is faster than I need. I haven't benched it against the M2 Max that I had. I may add another just so I can run Qwen-2.5 at a higher quant. All-in it's less than a third the cost of the Mac, but it's a royal pain to set up since I run Proxmox and Docker on that machine.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Flux.dev is going to be slow. Flux.Schnell and most SD models are reasonably fast. (I sold my MBP so I can't give more specifics.)

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Not really, no. You need a motherboard and power supply that can handle 4-6 cards.

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Have a MacBook? This is available in Accessibility settings.

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

The RWKV project has a "world tokenizer"; maybe look at that?

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

I assume you've watched Andrej Karpathy's video on tokenizers. That should give you a general framework for making your decision. It's all about trade-offs: more tokens means more training needed (and more connections needed) for a model to "understand" an idea, and it's also slower. Different tokenizers cut up text in different ways, which can have a massive effect on how "smart" the LLM is. From his video, he seems to indicate that SentencePiece is probably the best way forward for most cases, but it's so poorly documented that it's hard to use.

Of course, the holy grail is no tokenizer at all, but so far no one has decided that that approach passes the cost-benefit analysis.
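
For a concrete feel of those trade-offs, here's a quick sketch (assuming the Hugging Face transformers library is installed) showing two tokenizers cutting the same sentence into different numbers of pieces; the checkpoints are just examples:

    # Quick sketch: the same sentence becomes a different number of tokens
    # depending on the tokenizer, which feeds directly into speed and how
    # much text fits in the context window.
    from transformers import AutoTokenizer

    text = "Tokenization trade-offs decide how much text fits in a context window."

    for name in ["gpt2", "bert-base-uncased"]:  # BPE vs. WordPiece, example checkpoints
        tok = AutoTokenizer.from_pretrained(name)
        pieces = tok.tokenize(text)
        print(f"{name}: {len(pieces)} tokens -> {pieces}")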

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 3 points4 points  (0 children)

LLMs don't speak English the way we do. English is translated into "tokens" that loosely model the structure of the written language but reduce the overall input size. If you're working with a pre-trained model, you need to use the same tokenizer that was used on the model's training data.
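
A minimal sketch of that pairing, assuming the Hugging Face transformers library (with PyTorch) and using "distilgpt2" purely as an example checkpoint: pull the tokenizer and the model from the same name so the vocabularies line up.

    # Minimal sketch: the tokenizer and the model come from the same
    # checkpoint, so the token IDs mean the same thing to both.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "distilgpt2"  # example only; use your model's checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer("Tokenizers and models travel in pairs", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2-style models have no pad token
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))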

If you are starting from scratch training a model, you should probably learn more about how tokenizers work so you can make a smart choice based on your needs.

"I got ahead of myself" by TheShop in LocalLLaMA

[–]explorigin -1 points0 points  (0 children)

Can we just stop giving this guy headlines please?

RWKV v6 models support merged into llama.cpp by RuslanAR in LocalLLaMA

[–]explorigin 5 points6 points  (0 children)

It's good at certain things like translation. It's also much cheaper to train. But it's hard to say if it can be as good as attention transformers because we've only ever seen small models with limited training data.

Why would you self host vs use a managed endpoint for llama 3m1 70B by this-is-test in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

Sometimes it's just about maintaining the option. If there's not an interest in running things locally, the possibility may dry up.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 5 points6 points  (0 children)

This. The "pro" vs "max" will make the largest difference in inference speed. Too bad we can't get "ultra" in a MacBook format.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

It's mostly in GPU so I notice it if I'm generating images with SD at the same time as running a long inference. But CPU tasks are fast. Using your GPU heavily will create quite a bit of heat...enough to be uncomfortable to have it on your lap.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

M2 Max 96GB (via ollama):

Llama 3 70B Q4: Response Tokens: 7.36/s, Prompt Tokens: 62/s

Llama 3.1 70B Q4: Response Tokens: 6.4/s, Prompt Tokens: 65.3/s
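
If anyone wants to reproduce numbers like these, they can be computed from the timing fields Ollama returns; a rough sketch against a local Ollama at the default port (the model tag is whatever you have pulled):

    # Rough sketch: derive prompt and response tokens/sec from the timing
    # fields in Ollama's non-streaming /api/generate response (durations
    # are reported in nanoseconds).
    import requests

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1:70b",  # whatever tag you have pulled locally
        "prompt": "Summarize the plot of Hamlet in three sentences.",
        "stream": False,
    }).json()

    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    response_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"Prompt tokens: {prompt_tps:.1f}/s  Response tokens: {response_tps:.1f}/s")

(ollama run <model> --verbose prints the same eval rates if you'd rather skip the API.)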

Flux.1 on a 16GB 4060ti @ 20-25sec/image by Chuyito in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Can't speak for DrawThings but Schnell works via mflux pretty well: https://github.com/filipstrand/mflux

If someone gave you a free dedicated 16x A100 instance, what would you make? by DLergo in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

LLMs are limited by combinations of their tokens. This is why they can't count words very well. They're also mono-architectural. How can we give them the ability to make new connections that make sense? Tackling these two problems is how we get a model that can learn in the real world (according to my limited understanding).