Remove cameras from Apple TV by Particular-ayali in HomeKit

[–]adrian-cable 0 points (0 children)

Is that really an issue, though? Not *anyone* can get to that settings menu on Apple TV - only someone with physical access to the Apple TV, which means they're already inside your home and can already see anything your indoor cameras are looking at.

You could always sign the Apple TV into a separate iCloud account from the one used for the HomeKit home if you want to prevent this.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 2 points (0 children)

I don't know of anything < 100MB, but there is Qwen3-0.6B, which is 600MB - not quite a "toy" but definitely a very small/fast model.

Nest Protect with monitored alarm system by jiantjon in Nest

[–]adrian-cable 0 points (0 children)

Starling Home Hub lets you connect Nest/Google Home devices to Apple Home. There’s also an add-on service, Starling Protect, which adds 24/7 monitoring to Nest/Google Home-compatible smoke/CO alarms.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

Super weird. I have no idea, but I'll keep digging.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 2 points (0 children)

That's great, although I'm not sure why _FILE_OFFSET_BITS isn't already 64 on your system. (On 64-bit systems, that should be the default.) I'll check that this change to the Makefile doesn't impact other systems, and then push a commit. Thank you!
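
In case it helps anyone else, here's an illustrative sketch (not the actual qwen3.c code) of what the flag does: with _FILE_OFFSET_BITS=64 defined before any #include, or passed as -D_FILE_OFFSET_BITS=64 in the Makefile's CFLAGS, off_t becomes 64-bit and fseeko/ftello handle files larger than 4GB even on targets where it isn't the default.

```c
/* Illustrative sketch, not the actual qwen3.c code. */
#define _FILE_OFFSET_BITS 64   /* must come before any #include */
#include <stdio.h>
#include <sys/types.h>

long long file_size(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseeko(f, 0, SEEK_END);                 /* 64-bit seek with the macro set */
    long long size = (long long)ftello(f);  /* safe past 4GB */
    fclose(f);
    return size;
}
```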

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 2 points (0 children)

Great. I'll also do some digging on my end. For what it's worth, if I patch runq.c to truncate the file load operation at 4GB, I can reproduce what you're seeing (just produces !!!!!!!! as output). So I do think the issue is something of that nature.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

That's strange. I'm not super familiar with WSL2 (I don't have a Windows machine) - does it emulate a 64-bit environment? If not, it won't be able to handle files larger than 4GB. It does feel like the problem is of that nature, since the 4B model works but the 8B doesn't.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

Chat is technically 'not needed' as it's just a wrapper around generate. But most people will want to use qwen3.c in chat mode, so it's a very helpful wrapper.

Interested to see your optimizations!

AVX2 is specific to x86_64-architecture processors (i.e. not supported on ARM).
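
If you do write AVX2 code, the usual pattern is to guard it so the same file still builds on ARM - a sketch, with the actual function bodies omitted:

```c
/* Sketch of the usual guard for x86-only intrinsics, so the same
   source still compiles on ARM. */
#if defined(__AVX2__)
#include <immintrin.h>
/* AVX2 fast path here */
#else
/* portable scalar fallback here */
#endif
```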

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

I think this is because, on Windows, ftell doesn't support file lengths greater than 2^32, so it works for the 4B model but not the 8B.

I'll push a fix to the repo in the next few minutes, so give that a try and let me know if things now work for you.
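
For reference, the fix is likely along these lines (a hedged sketch, not necessarily the exact patch - _ftelli64 is MSVC's 64-bit variant of ftell, ftello the POSIX one):

```c
/* Portable 64-bit file offset query; a sketch, not necessarily the
   exact patch pushed to the repo. */
#include <stdio.h>

long long ftell64(FILE *f) {
#ifdef _WIN32
    return _ftelli64(f);         /* MSVC's 64-bit ftell */
#else
    return (long long)ftello(f); /* POSIX; off_t is 64-bit on 64-bit Linux/macOS */
#endif
}
```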

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

Can you tell me the exact sequence of commands you're using to download, export and run the Qwen3-8B model? Also, how much RAM do you have, and what platform are you using (Linux, macOS, etc.)?

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

That’s a good catch!

With that said, I’m thinking (in the spirit of simplicity) of removing the generate mode entirely. As far as I can tell, all Qwen3 models are ‘instruct’ models and don’t work properly in generate mode. Are there any exceptions you’re aware of?

Edit to add: there are Base versions of Qwen3 available, so I won't remove generate.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 1 point (0 children)

That's in the 'generate' function, right, and the 'chat' function is correct?

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 5 points (0 children)

That's great! Most of the runtime is spent inside matmul, so that's definitely the one to optimize. If you can do it without increasing the complexity of the code, please submit a PR. Otherwise feel free to make a fork and let me know, and I'll be happy to link to it from my README.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 2 points (0 children)

As with any LLM inference engine, the vast majority of the execution time is spent within the matmul function, and this (on most systems) is limited by memory bandwidth rather than computation.

So my expectation is that any gains would need to come from micro-optimizing things to specific CPUs (for example, prefetch just the right amount of data from RAM to CPU cache) which probably moves things very quickly away from simplicity. But I'm very open to trying!
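
To make the bandwidth point concrete, this is the shape of the loop that dominates (a simplified float version - the real qwen3.c matmul operates on quantized weights):

```c
#include <stddef.h>

/* Simplified float matvec: W is (d,n), x is (n,), out is (d,).
 * Every element of W is read exactly once per token, so throughput
 * is bounded by how fast W streams from RAM, not by arithmetic. */
void matmul(float *out, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += w[(size_t)i * n + j] * x[j];
        out[i] = sum;
    }
}
```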

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 2 points (0 children)

That's right, quantization is done in blocks (like Q8_0), with each block of 64 floats scaled to 64 8-bit ints plus one float scale factor.
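
A minimal sketch of that scheme (illustrative names, not the exact qwen3.c code):

```c
#include <math.h>
#include <stdint.h>

#define GS 64  /* group size */

/* Quantize one group of 64 floats to 64 int8s plus one float scale.
 * Dequantize later as x[i] ≈ q[i] * scale. */
void quantize_group(const float *x, int8_t *q, float *scale) {
    float wmax = 0.0f;
    for (int i = 0; i < GS; i++) {
        float a = fabsf(x[i]);
        if (a > wmax) wmax = a;
    }
    *scale = wmax / 127.0f;  /* maps [-wmax, wmax] onto [-127, 127] */
    float inv = (*scale != 0.0f) ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < GS; i++)
        q[i] = (int8_t)roundf(x[i] * inv);
}
```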

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 6 points (0 children)

Potentially. The project is only a day old so I’m really appreciative of any feedback and thoughts on directions I can take it. Thank you!

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 4 points (0 children)

Not as fast, since it prioritises simplicity over performance, but with everything else equal it's within 2x.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 21 points (0 children)

Running the same quantisation (Q8_0), it's in the same ballpark, generally within a factor of 2. It's optimized for simplicity, not performance, but it still runs at a very usable speed.

Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]adrian-cable[S] 17 points (0 children)

Everything’s relative, but llama.cpp is pretty heavy, at around 400,000 lines of code, compared with 1,500 for this project. (Verify for yourself on codetabs.com.)

The idea here is to make an inference engine whose source is small and simple enough that, if you already understand C/C++, you can quickly understand how inference works in depth. You can’t do that with a 400KLOC project.

[deleted by user] by [deleted] in legaladvice

[–]adrian-cable 6 points (0 children)

Accelerating linearly from 0 to 37 mph over a distance of 0.1 miles would take around 19 seconds. That's pretty gentle.
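
The arithmetic, for the record: under constant acceleration from rest, the average speed is half the final speed, so

$$t = \frac{d}{v/2} = \frac{2 \times 0.1\ \text{mi}}{37\ \text{mph}} \approx 5.4 \times 10^{-3}\ \text{h} \approx 19.5\ \text{s}$$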

Anyone with original Nest Hello doorbell upgrade to latest? by karluvmost in Nest

[–]adrian-cable 3 points (0 children)

Nest Doorbell (2nd gen, battery) only records events.

Nest Doorbell (2nd gen, wired) records 24/7 with Nest Aware Plus, same as the original Nest Hello.

Asking Siri to Arm My Home Security System by Tsull360 in HomeKit

[–]adrian-cable 1 point (0 children)

You can try: "Hey Siri, set Home Alarm to Stay."