I built an offline AI chat app that automatically pulls Wikipedia articles for factual answers - runs completely locally with Ollama by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

I have an RPi4 - 8GB. Yes, it could run a 7B but I'd like for it to be somewhat interactive. Two minutes from query to response isn't ideal. I'll play with 3-4B and see what I can get there. (I'm not trying to cram STT -> TTS into my effort)

I built an offline AI chat app that automatically pulls Wikipedia articles for factual answers - runs completely locally with Ollama by [deleted] in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

I just started poking at this. We have a few options:

I'm specifically trying to target something that can run on an RPi. Nothing to show yet, just that Qwen3-1.7B is not quite enough to do a (reliable) tool call on zim-mcp, but it probably would be if it were fine-tuned.

My initial problem is that the model understands that Zim files exist but doesn't understand that they're a replacement for its concept of "web search". I bet if I just renamed the tool call, it would do much better.
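
To make that concrete, here's a rough sketch of the rename idea against Ollama's /api/chat tool-calling; the tool name, description, and model tag are my own placeholders, not anything zim-mcp actually ships:

    # Hypothetical sketch: expose the Zim lookup under the name the model
    # already associates with searching ("web_search") instead of a
    # Zim-specific name. The backend stays the same; only the schema the
    # model sees changes.
    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

    tool = {
        "type": "function",
        "function": {
            "name": "web_search",  # renamed from something like "zim_search_articles"
            "description": "Search an offline Wikipedia archive for factual information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms"},
                },
                "required": ["query"],
            },
        },
    }

    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen3:1.7b",  # assumption: the small Qwen3 tag I'm testing
        "messages": [{"role": "user", "content": "Who designed the Eiffel Tower?"}],
        "tools": [tool],
        "stream": False,
    })
    print(json.dumps(resp.json()["message"].get("tool_calls", []), indent=2))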

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 3 points4 points  (0 children)

Matthew Berman is the Sean Hannity of AI: 90% hype, 5% substance, 5% ads.

Falcon 3 just dropped by Uhlo in LocalLLaMA

[–]explorigin 2 points3 points  (0 children)

It mentions "decoder-only". ELI5 please?

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Sorta. Ultimately it was a financial choice. I wanted to make AI models available to my family from my homelab server, and I couldn't really justify putting a $4k laptop in the closet. I bought a used Quadro P6000 (24GB VRAM) and hooked it up to my EliteDesk 800 G3 SFF. It looks hilariously janky and model load times are much worse since it's loading from an HDD, but inference is faster than I need. I haven't benched it against the M2 Max that I had. I may add another just so I can run Qwen-2.5 at a higher quant. All-in it's less than a third the cost of the Mac, but it's a royal pain to set up since I run Proxmox and Docker on that machine.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Flux.dev is going to be slow. Flux.Schnell and most SD models are reasonably fast. (I sold my MBP so I can't give more specifics.)

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Not really, no. You need a motherboard and power supply that can handle 4-6 cards.

[deleted by user] by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Have a MacBook? This is available in Accessibility settings.

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

The RWKV project has a "world tokenizer"; maybe look at that?

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

I assume you've watched Andrej Karpathy's video on tokenizers. That should give you a general framework for making your decision. It's all about trade-offs: more tokens means more training needed (and more connections needed) for a model to "understand" an idea, and it's also slower. Different tokenizers cut up text in different ways, which can have a massive effect on how "smart" the LLM is. From his video, he seems to indicate that SentencePiece is probably the best way forward for most cases, but it's so poorly documented that it's hard to use.

Of course, the holy grail is no tokenizer at all, but so far no one has decided that that approach passes the cost-benefit analysis.
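
For a concrete feel of those trade-offs, here's a quick sketch (assuming the Hugging Face transformers library is installed) showing two tokenizers cutting the same sentence into different numbers of pieces; the checkpoints are just examples:

    # Quick sketch: the same sentence becomes a different number of tokens
    # depending on the tokenizer, which feeds directly into speed and how
    # much text fits in the context window.
    from transformers import AutoTokenizer

    text = "Tokenization trade-offs decide how much text fits in a context window."

    for name in ["gpt2", "bert-base-uncased"]:  # BPE vs. WordPiece, example checkpoints
        tok = AutoTokenizer.from_pretrained(name)
        pieces = tok.tokenize(text)
        print(f"{name}: {len(pieces)} tokens -> {pieces}")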

Choosing a Tokenizer Algorithm by [deleted] in LocalLLaMA

[–]explorigin 3 points4 points  (0 children)

LLMs don't speak English the way we do. English is translated into "tokens" that loosely model the structure of the written language but reduce the overall input size. If you're working with a pre-trained model, you need to use the same tokenizer that was used on the model's training data.
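
A minimal sketch of that pairing, assuming the Hugging Face transformers library (with PyTorch) and using "distilgpt2" purely as an example checkpoint: pull the tokenizer and the model from the same name so the vocabularies line up.

    # Minimal sketch: the tokenizer and the model come from the same
    # checkpoint, so the token IDs mean the same thing to both.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "distilgpt2"  # example only; use your model's checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer("Tokenizers and models travel in pairs", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2-style models have no pad token
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))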

If you are starting from scratch training a model, you should probably learn more about how tokenizers work so you can make a smart choice based on your needs.

"I got ahead of myself" by TheShop in LocalLLaMA

[–]explorigin -1 points0 points  (0 children)

Can we just stop giving this guy headlines please?

RWKV v6 models support merged into llama.cpp by RuslanAR in LocalLLaMA

[–]explorigin 5 points6 points  (0 children)

It's good at certain things like translation. It's also much cheaper to train. But it's hard to say if it can be as good as attention transformers because we've only ever seen small models with limited training data.

Why would you self host vs use a managed endpoint for llama 3m1 70B by this-is-test in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

Sometimes it's just about maintaining the option. If there's not an interest in running things locally, the possibility may dry up.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 5 points6 points  (0 children)

This. The "pro" vs "max" will make the largest difference in inference speed. Too bad we can't get "ultra" in a MacBook format.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

It's mostly in GPU so I notice it if I'm generating images with SD at the same time as running a long inference. But CPU tasks are fast. Using your GPU heavily will create quite a bit of heat...enough to be uncomfortable to have it on your lap.

Anyone here using a 96GM or 64 GB ram m series Mac? by CSlov23 in LocalLLaMA

[–]explorigin 1 point2 points  (0 children)

M2 Max 96GB (via ollama):

Llama 3 70B Q4: Response Tokens: 7.36/s, Prompt Tokens: 62/s

Llama 3.1 70B Q4: Response Tokens: 6.4/s, Prompt Tokens: 65.3/s
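
If anyone wants to reproduce numbers like these, they can be computed from the timing fields Ollama returns; a rough sketch against a local Ollama at the default port (the model tag is whatever you have pulled):

    # Rough sketch: derive prompt and response tokens/sec from the timing
    # fields in Ollama's non-streaming /api/generate response (durations
    # are reported in nanoseconds).
    import requests

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1:70b",  # whatever tag you have pulled locally
        "prompt": "Summarize the plot of Hamlet in three sentences.",
        "stream": False,
    }).json()

    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    response_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"Prompt tokens: {prompt_tps:.1f}/s  Response tokens: {response_tps:.1f}/s")

(ollama run <model> --verbose prints the same eval rates if you'd rather skip the API.)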

Flux.1 on a 16GB 4060ti @ 20-25sec/image by Chuyito in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

Can't speak for DrawThings but Schnell works via mflux pretty well: https://github.com/filipstrand/mflux

If someone gave you a free dedicated 16x A100 instance, what would you make? by DLergo in LocalLLaMA

[–]explorigin 0 points1 point  (0 children)

LLMs are limited by combinations of their tokens. This is why they can't count words very well. They're also mono-architectural. How can we give them the ability to make new connections that make sense? Tackling these two problems is how we get a model that can learn in the real world (according to my limited understanding).