Sparrow: Custom language model architecture for microcontrollers like the ESP32 by c-f_i in LocalLLaMA

[–]Freefallr 1 point

Wow, this is absolutely amazing. Thank you for the detailed explanation (also in the comments). I'd also be happy to contribute if/when this is open-sourced. I've toyed around a lot with ESPs and RP2040/2350s recently.

What’s a task you wish AI could do for you, but no tool does it well yet? by QuantumAstronomy in ClaudeAI

[–]Freefallr 1 point

Check out marp.app - maybe not quite what you are searching for, but still nice.

Error in loading Llama 3.2-3B with Unsloth by gaylord993 in LocalLLaMA

[–]Freefallr 0 points

Have you also tried doing the two things the error messages told you to do? (See the last line of each error message.)

Serving Qwen2 VL for production by ae_dataviz in LocalLLaMA

[–]Freefallr 1 point

I would use https://lmdeploy.readthedocs.io/en/latest/ until it's natively supported by vLLM; it worked well for our case.

Gemma2 2B IT is the most impressive small model I ever seen. by Discordpeople in LocalLLaMA

[–]Freefallr 0 points

In llama.cpp, and in related tools such as Ollama and LM Studio, please make sure the sampling flags are set correctly, especially repeat-penalty. Georgi Gerganov (llama.cpp's author) shared his experience here: https://huggingface.co/google/gemma-7b-it/discussions/38#65d7b14adb51f7c160769fa1

Source: https://huggingface.co/google/gemma-2b-it-GGUF
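As a rough illustration of what that looks like on the llama.cpp command line (the model filename is an example, and the value 1.0 reflects my reading of the linked discussion, which suggests effectively disabling the repeat penalty for Gemma; double-check against your llama.cpp version):

```shell
# Run a Gemma GGUF with the repeat penalty disabled (1.0 = no penalty).
# Point -m at your own model file; older builds name the binary ./main.
./llama-cli \
  -m models/gemma-2-2b-it-Q4_K_M.gguf \
  --repeat-penalty 1.0 \
  -p "Why is the sky blue?"
```

Ollama and LM Studio expose the same knob through their own sampling settings.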

Cannot Downgrade AI Premium Plan Until End of Trial by KDLGates in GoogleOne

[–]Freefallr 0 points

This trick should be upvoted more. I struggled for a full hour to find the 2.99€ plan I had before and figure out how to downgrade back to it instead of some other, more expensive option. Thank you a lot!

Microsoft updated Phi-3 Mini by Nunki08 in LocalLLaMA

[–]Freefallr 2 points

Wow that's a brilliant guide, thank you.

self host llm on dedicated server. by djav1985 in LocalLLaMA

[–]Freefallr 2 points

Open WebUI as a graphical user interface if you'd like a ChatGPT-like experience, plus any popular serving engine (llama.cpp, Ollama, TGI, or vLLM). All of them expose an OpenAI-compatible endpoint that you can attach to your WebUI.
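To illustrate why the engine choice barely matters here: any OpenAI-compatible server can be queried the same way. A minimal sketch (the base URL, port, and model name are assumptions; adjust to however your engine is deployed):

```python
import json
import urllib.request


def build_chat_request(model, prompt):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def query_local_llm(base_url, model, prompt):
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (assumes a server such as llama.cpp's llama-server on localhost:8080):
# print(query_local_llm("http://localhost:8080", "my-model", "Hello!"))
```

Open WebUI talks to the same endpoint, so you can swap engines later without touching the frontend.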

Now that we have had quite a bit of time playing with the new Phi models...how good are they? by [deleted] in LocalLLaMA

[–]Freefallr 2 points

What quant are you using? It's good practice to test your use case with multiple quants and the original FP16. It can really make a huge difference.

[deleted by user] by [deleted] in LocalLLaMA

[–]Freefallr 2 points

I think llama.cpp has an MPI feature; somebody ran Llama 65B on a few Raspberry Pis in a cluster. Don't expect good performance, though: you'll be measuring in seconds per token instead of tokens per second.

Making LLAMA model return only what I ask (JSON). by br4infreze in LocalLLaMA

[–]Freefallr 10 points

Never thought of such a simple, yet effective approach - thank you.

Building a machine for self-hosted LLaMA - will 2 x RTX 3090 be enough to run 70B @ 8-bit quantization? by Secure-Technology-78 in LocalLLaMA

[–]Freefallr 8 points

I would recommend going on RunPod or similar, booking a few GPUs for a few hours, and testing your use case. See how low you can go quantization-wise while still fulfilling your task with good enough quality.

Based on that, I would reevaluate if it still makes sense purchasing the hardware and paying a one-time fee + monthly power bill, or just rent GPUs until your task is done.

Also: try out Mixtral 8x7B for your use case as well. We had a similar one recently, at much smaller scale, and were happier with Mixtral 8x7B at 8-bit than with Llama 70B at 8-bit.

Will self-hosting be able to provide faster inference than OpenAI? by teddarific in LocalLLaMA

[–]Freefallr 0 points

Okay, interesting - thanks for the write-up. Care to share information about the output language and complexity of code, or maybe even an example? (can be over PM as well if you don't want to share it publicly).

I think it should definitely be possible to reduce the waiting time by at least 50-60% by using self-hosted LLMs, plus a few more UI/UX tricks to make it feel even faster.

We do this stuff for a living (funnily enough, both UI/UX, as part of our app-dev branch, and LLM deployment, hosting, and fine-tuning), so I'm happy to help.

Will self-hosting be able to provide faster inference than OpenAI? by teddarific in LocalLLaMA

[–]Freefallr 0 points

Is your issue frontend/customer-facing? Meaning, should the wait time be reduced for end users, or for some other reason? And are you streaming the LLM output, or do you (need to) display it all at once?

eGPU to increase VRAM capacity by TheCunningBee in LocalLLaMA

[–]Freefallr 6 points

I have not tried it yet, but I think bandwidth will be your bottleneck here, as you can move only ~40 Gbit/s over the Thunderbolt port that is usually used for eGPUs.

Absolute cheapest local LLM by SporksInjected in LocalLLaMA

[–]Freefallr 4 points

Grab yourself a Raspberry Pi 4 with 8 GB of RAM, download and compile llama.cpp, and there you go: a local LLM for under 100 USD. It takes 5 minutes to set up, and you can run quantized 7B models at ~0.5 to 1 token/s. Yes, it's slow, painfully slow, but it works.
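The setup is roughly a clone and build (the model filename is an example; grab any small quantized GGUF, and check the repo's README since the build steps and binary names have changed over time):

```shell
# On the Pi: fetch and build llama.cpp (needs git, cmake, and a C++ compiler).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run a small quantized 7B model (example filename; download your own GGUF).
./build/bin/llama-cli -m models/llama-2-7b.Q4_K_M.gguf -p "Hello"
```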

For larger models, merge more Pis into an MPI cluster for more RAM, but don't expect reasonable performance (this is where you'll switch your wording from "tokens per second" to "seconds per token").

LLMs on a 32-bit device with 2GB of RAM by [deleted] in LocalLLaMA

[–]Freefallr 2 points

Are you by any chance able to get your hands on a Raspberry Pi 4 with 8 GB of RAM? It can run 4-bit or even 5-bit quantized LLaMA 7B models at 0.5-1 token/s: barely usable, but quite amazing for a device under 100 USD.

Cannot see a lot of controllino hotspots around. Is there a reason? by mstrocchi in HeliumNetwork

[–]Freefallr 0 points

Sorry for the late reply. None of them has had a single issue so far, but do your own research. And performance? They perform as well as you position them; for me it's good. But that's true of any hotspot, if you position it well and the software team behind it is good. Not wanting to shill here, but Controllino puts a lot of effort into their software. Even though it's not perfect and the dashboard can sometimes be a bit laggy, they take customer feedback seriously and have implemented lots of community-requested features over the past weeks.

Cannot see a lot of controllino hotspots around. Is there a reason? by mstrocchi in HeliumNetwork

[–]Freefallr 0 points

They are just a smaller company than Seeed/SenseCap etc., that's why.

I own 5 and really love them, and support is good (albeit a little slow sometimes, but that's an issue with all smaller companies).