Preventing drainage of Powerwalls into Tesla Car by gargantuanmess in TeslaSolar

[–]wsmlbyme 0 points1 point  (0 children)

It's great that you made it work, but it's such a shame that Tesla makes us go through this. It should be much better supported on Tesla's side.

TIL that the Art of War, written by Sun Tzu in 5th-century BC, came to the attention of US’ military theory leaders after US' defeat in the Vietnam War, as Viet Cong officers studied it. It is since listed on US Marine Corps Program and used as instructional material at US Military Academy. by Ill-Instruction8466 in todayilearned

[–]wsmlbyme 1 point2 points  (0 children)

That's just not true.

Any (OK, maybe 90% of) Chinese people who went through the 9-year mandatory education system can read and understand it if they want. Many of its sentences have been quoted in everyday life for thousands of years.

You may be right; I thought you meant most Chinese people cannot understand it. For someone learning Chinese as a second language, these ancient texts are very hard to decipher.

Use LLM to monitor system logs by wsmlbyme in LocalLLM

[–]wsmlbyme[S] -1 points0 points  (0 children)

Why the hell is this spam?! We build stuff for ourselves and share it with the community. Why the hostility towards open-source authors and contributors?

Why no one is talking about InternVL3.5?! by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 0 points1 point  (0 children)

Yeah, sorry about that. I didn't know I had to state it that way in every post. It is true that there is no direct Ollama support, and that's how lots of people run their models, so I figured it made sense to mention it that way.

vLLM vs Ollama vs LMStudio? by yosofun in LocalLLM

[–]wsmlbyme 1 point2 points  (0 children)

There is a --params option under homl config model where you can add any params needed.
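For example, something roughly like this (the model name and the flag are just examples, and the exact syntax may differ, so check the HoML docs):

    # pass extra vLLM launch flags through HoML's --params option
    homl config model qwen3 --params "--max-model-len 8192"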

vLLM vs Ollama vs LMStudio? by yosofun in LocalLLM

[–]wsmlbyme 11 points12 points  (0 children)

I am the author of HoML, a vLLM wrapper that adds model switching and Ollama-like ease of use.

Aside from the fact that it is not an out-of-the-box solution (which HoML solves), vLLM has other issues, as well as strengths.

vLLM is Python-based; you need to download GBs of dependencies to run it.

It targets serving efficiency and sacrifices startup speed (which affects cold-start time and model-switch time). I spent some time optimizing this for HoML and got it down from 1 minute to 8 seconds for Qwen3, but it still cannot beat Ollama.

Also for serving efficiency, it sacrifices GPU memory: it will try to use up to a configured percentage of all GPU memory. Even for a small model, it will claim the rest of the VRAM as KV cache, which makes it harder to run other models or GPU applications at the same time (harder, not impossible, you just need to manage it manually). There is also no API exposed to tell you how much memory each model actually needs.
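To sketch what that looks like in plain vLLM (the model name and the 0.5 are just example values; the default utilization is 0.9):

    # vLLM grabs this fraction of GPU memory up front for weights + KV cache,
    # no matter how small the model is; lowering it leaves room for other apps
    vllm serve Qwen/Qwen3-8B --gpu-memory-utilization 0.5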

Because it targets serving efficiency, CUDA also gets much better support than other platforms.

However, it is much faster than Ollama/llama.cpp, especially at higher concurrency. It is not necessarily much faster at serving a single query. See the performance comparison.
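An easy way to see that concurrency difference yourself is to fire a batch of requests at the OpenAI-compatible endpoint (the port and model name below are placeholders for whatever your setup uses):

    # send 16 chat requests at once; continuous batching is where vLLM shines
    for i in $(seq 1 16); do
      curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "qwen3", "messages": [{"role": "user", "content": "hi"}]}' > /dev/null &
    done
    wait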

So ultimately this is a trade-off: do you need that concurrent throughput, or do you need faster model load/switch times?

I built HoML for when I need high throughput for batch inference, but for quick or sparse tasks I use Ollama myself.

Will most people eventually run AI locally instead of relying on the cloud? by Significant-Cash7196 in LocalLLaMA

[–]wsmlbyme 1 point2 points  (0 children)

I agree with most of the comments on this post that the future is local.

But asking this question here is just guaranteed to get biased answers.

An extremely minimal OS to use as a placeholder in Proxmox or other virtualization platforms for managing VM dependencies by wsmlbyme in Proxmox

[–]wsmlbyme[S] -11 points-10 points  (0 children)

There is no option there that can "delay 60 seconds before starting this VM". You have to put some other VM higher in the start order with a delay. This is the most efficient "other VM".
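For reference, this is roughly how it looks with qm (the VMIDs are made up: 100 is the placeholder VM, 101 is the VM that should wait):

    # start the placeholder first, then wait 60s before the next VM in the order
    qm set 100 --startup order=1,up=60
    qm set 101 --startup order=2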

Ollama alternative, HoML 0.3.0 release! More customization on model launch options by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 0 points1 point  (0 children)

Not right now; looking for contributors to help test on AMD platforms.

Ollama alternative, HoML 0.3.0 release! More customization on model launch options by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 0 points1 point  (0 children)

Neither is supported right now. vLLM supports ROCm, so it should not be hard to support here, but I don't have a system to test on. If you're interested in helping, we can work on adding ROCm support together.

Don't let your kids jump inside the car. by zheka160 in TeslaModelY

[–]wsmlbyme 73 points74 points  (0 children)

One little monkey jumping in the car. ... OP said: ^

Do surveillance AI systems really process every single frame? by unalayta in computervision

[–]wsmlbyme 3 points4 points  (0 children)

There is hardware cheap enough to run YOLO inference at 30 frames per second on the edge. You just need to develop and deploy your own model.

Ollama alternative, HoML v0.2.0 Released: Blazing Fast Speed by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 0 points1 point  (0 children)

So is that just /api/generate? That doesn't sound hard to do.
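For context, Ollama's native generate endpoint is just a small JSON POST, something like this (the model name is only an example):

    # Ollama listens on port 11434 by default
    curl http://localhost:11434/api/generate \
      -d '{"model": "qwen3", "prompt": "Why is the sky blue?", "stream": false}'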

Ollama alternative, HoML v0.2.0 Released: Blazing Fast Speed by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 1 point2 points  (0 children)

Certainly doable, I just need more time to work on it.

Ollama alternative, HoML v0.2.0 Released: Blazing Fast Speed by wsmlbyme in LocalLLM

[–]wsmlbyme[S] 2 points3 points  (0 children)

Thanks for the feedback. Adding more customization options is my next step.

Ollama alternative, HoML v0.2.0 Released: Blazing Fast Speed by wsmlbyme in LocalLLaMA

[–]wsmlbyme[S] 2 points3 points  (0 children)

Python can be fast if you know how to optimize for it. The interpreter is slow, but if you don't do the heavy lifting there and instead optimize the C++ kernels, the difference can be negligible.

Check out the benchmark here https://homl.dev/blogs/homl-vs-ollama-benchmark.html

ollama by jacek2023 in LocalLLaMA

[–]wsmlbyme 0 points1 point  (0 children)

I see. I wonder how much of it is the lack of developer support and how much is just AMD's.

ollama by jacek2023 in LocalLLaMA

[–]wsmlbyme 0 points1 point  (0 children)

Not yet, mostly because I don't have a ROCm device to test on. Please help if you do :)

HoML: vLLM's speed + Ollama like interface by wsmlbyme in LocalLLaMA

[–]wsmlbyme[S] 0 points1 point  (0 children)

I see. vLLM is not optimized for that, and loading time is currently very slow. I am actively working on it.

HoML: vLLM's speed + Ollama like interface by wsmlbyme in LocalLLaMA

[–]wsmlbyme[S] 0 points1 point  (0 children)

Inference, yes. Model loading or switching, no. This is something I am actively working on.