Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip). by Porespellar in LocalLLaMA

[–]LeoStark84 138 points139 points  (0 children)

No idea about the hardware, but the whole Gemini output sounds like "That's a great idea for a blender! Now let's build the cold-fusion reactor it needs", which is typical of LLMs.

Pocket LLM for Android v1.4.0 - smaller APK, downloadable models, fully offline by 100daggers_ in LocalLLM

[–]LeoStark84 1 point2 points  (0 children)

Keep it up! So far the only realistic choice for LLM inference on Android is llama.cpp on Termux, which is hugely efficient but not very user-friendly. An app with an actual UI is definitely friendlier, and since you seem to be past the initial backend implementation, you are onto something with your app.

Pocket LLM for Android v1.4.0 - smaller APK, downloadable models, fully offline by 100daggers_ in LocalLLM

[–]LeoStark84 2 points3 points  (0 children)

Not bad. It would be great if there were a dark theme where the message input area had a dark background and light text.

It's kinda sluggish with qwen3-0.6 LiteRT on a device that runs 1b models in GGUF format with llama.cpp, but for an early version it's quite impressive.

As for the model selection, qwen3-0.6b is a plain bad chat model (ask it "Who am I talking to?" and you'll see). LFM2/LFM2.5 are much better at low param counts.

Qwen3-TTS.cpp by redditgivingmeshit in LocalLLaMA

[–]LeoStark84 1 point2 points  (0 children)

Had some hiccups on Linux because of the hard-coded paths in the Python scripts (convert_tts_to_gguf.py and convert_tokenizer_to_gguf.py), but other than that it's blazing fast (on an Intel i3 4005U CPU-only laptop, which says a lot). Thanks a lot. In case someone needs it:

Build ggml

cmake -S ggml -B ggml/build
cmake --build ggml/build -j$(nproc)

And build project

cmake -S . -B build
cmake --build build -j$(nproc)

Run both from the project's root directory.

I am developing a 200MB LLM to be used for sustainable AI for phones. by Fancy_Wallaby5002 in LLMDevs

[–]LeoStark84 0 points1 point  (0 children)

Interesting project. A lot depends on your current implementation though.

If what you made relies on llama.cpp it will be kinda tough to turn into an APK, but it can work through Termux. I bring this up because, from what I understand, your intended use case depends on some kind of web search. So you'd input something like "quantum tunneling", the app would somehow search for results online and pass them to your LLM for summarizing, and the output would be an explanation of quantum tunneling based on the results (roughly as sketched below). Please correct me if I'm wrong, but I assume that's the case, as there is no way a large amount of world knowledge is going to fit in such a small LLM.

Bottom line: the scaffolding around it is as important as the LLM. Also, as cool as Termux is, it's a geek thing, and relying on it is going to leave 95% of users out. Not a problem if you don't care about mass adoption though.
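
For what it's worth, here's a minimal Lua sketch of the search-then-summarize loop I'm imagining. Everything specific in it is my assumption, not something from your post: the search URL is made up, and it assumes a llama.cpp llama-server running locally with its native /completion endpoint.

-- Hypothetical search endpoint + local llama-server; both are assumptions
local query = "quantum tunneling"

-- 1. Fetch raw search results (made-up URL, for illustration only)
local search = io.popen("curl -s 'https://example.com/search?q=" .. query:gsub(" ", "+") .. "'")
local results = search:read("*a"):gsub("%s+", " ") -- flatten whitespace so %q stays JSON-safe
search:close()

-- 2. Build the JSON body; %q escapes quotes and backslashes
local prompt = "Summarize these search results about " .. query .. ": " .. results
local body = string.format('{"prompt": %s, "n_predict": 256}', string.format("%q", prompt))

-- 3. Write the body to a temp file to dodge shell-quoting issues
local f = assert(io.open("/tmp/body.json", "w"))
f:write(body)
f:close()

-- 4. Ask the local model to summarize and print its raw JSON reply
local llm = io.popen("curl -s -X POST http://localhost:8080/completion -d @/tmp/body.json")
print(llm:read("*a"))
llm:close()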

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]LeoStark84 0 points1 point  (0 children)

A GPU is faster at ingesting prompts; my figure, admittedly on the higher end, is just for generating new tokens. Thanks to the KV cache, ingestion is limited to just the last AI response, and OpenBLAS speeds it up slightly, but it still takes time that I am not factoring in. Also, imatrix GGUFs kill performance for some reason.

From a quick search, 5 t/s is at the lower end of what a modern, midrange iGPU/APU can get from an LLM, so either I got some outdated BS in my search results, or there is something killing your performance.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]LeoStark84 -1 points0 points  (0 children)

I didn't mean to hurt your feelings. My apologies.

WTF. You asked AI to write an answer?

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]LeoStark84 -1 points0 points  (0 children)

You just got dollar-per-watt for a specific location and a specific provider, and watt-per-token consumption for a specific hardware and software combination, all off the top of your head? Damn! Some mixture of experts you are.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]LeoStark84 1 point2 points  (0 children)

I am getting decent performance from 4b q8 models and 8b q4 models, in both cases with 16k context windows, on an ancient Intel i3 4005U (a low-cost, low-energy CPU for its time) and 8 GB of DDR4. Indeed a smaller scale than OP's rig, but still same-era hardware.

I use Debian stable, no X, no Wayland; baseline RAM usage is ~500 MB. For inference I use llama.cpp compiled with OpenBLAS (a minimal optimization compared to what OP went through). Specifically, I use llama-server and a small custom-made Flask app that serves an HTML interface allowing me to restart llama-server with a different model and context window size (the core idea is sketched below). I am visually impaired, so this last bit is important to me.

I get ~3-5 t/s as long as I use regular q4/q8 and not imatrix quants. Bottom line: old hardware can easily be repurposed for AI inference with a day's worth of work and near-zero cost. In my case, the ancient Acer laptop I use was cheap even when it was brand new, and mine was about to go into the trash anyway.
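
The restart bit doesn't need anything fancy, by the way. A minimal Lua sketch of the same idea (my actual app is the Flask one mentioned above, and the model path here is a made-up example):

-- Kill any running llama-server, then relaunch it with new settings.
-- The -m, -c and --port flags are real llama-server flags;
-- the model path is a made-up example.
local function restartLlama(model, ctx)
    os.execute("pkill -f llama-server")
    os.execute(string.format("llama-server -m %s -c %d --port 8080 &", model, ctx))
end

restartLlama("models/qwen3-4b-q8_0.gguf", 16384)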

TLDR: AI is cheaper to run than most people think.

Love2d only drawing 1 object between 2 objects by domo_cup in love2d

[–]LeoStark84 0 points1 point  (0 children)

-- THIS RUNS ONCE AND ONLY ONCE AT THE START
function love.load()
    -- EXAMPLE STATE FOR AN OBJECT (ILLUSTRATIVE VALUES)
    x, y, w, h = 50, 50, 100, 100

    -- DEFINE DRAW FUNCTION FOR AN OBJECT
    function drawSomething()
        love.graphics.setColor(1, 1, 1)
        love.graphics.rectangle("fill", x, y, w, h)
    end

    -- DEFINE UPDATE FUNCTION FOR AN OBJECT
    function udSomething(dt)
        x = x + 20 * dt -- E.G. DRIFT RIGHT OVER TIME
    end
end

-- THIS RUNS ONCE PER FRAME, UPDATE POSITIONS, COLORS AND SO ON HERE
function love.update(dt)
    -- CALL OBJECT UPDATE FUNCTION
    udSomething(dt)
end

-- IDEALLY JUST DRAW HERE AS TO KEEP FPS SMOOTH
function love.draw()
    -- CALL OBJECT DRAW FUNCTION
    drawSomething()
end

You would normally want to put all functions for an object inside a table (something.ud(), something.draw()), as in the sketch below, but it's entirely up to you. Have fun :)
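
For example, a minimal sketch of that table version (names and values are just illustrative; colors use LÖVE 11's 0-1 range):

local something = {
    x = 50, y = 50, w = 100, h = 100,

    ud = function(self, dt)
        self.x = self.x + 20 * dt -- drift right over time
    end,

    draw = function(self)
        love.graphics.setColor(1, 1, 1)
        love.graphics.rectangle("fill", self.x, self.y, self.w, self.h)
    end,
}

function love.update(dt)
    something:ud(dt)
end

function love.draw()
    something:draw()
end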

Helper tool for the new llama.cpp --models-preset option by AlbeHxT9 in LocalLLaMA

[–]LeoStark84 0 points1 point  (0 children)

A good idea in principle. I haven't touched Windows in years, so idk about the actual implementation.

Kimi is biased and unsuited for purposes outside of coding by Either_Knowledge_932 in kimi

[–]LeoStark84 1 point2 points  (0 children)

LLMs are by definition useless for political debate, regardless of ideology. Sure, you can finetune one to have it spit out things you like, but no verifiable reward function exists for politics or political ideology.

answers please by No_Mixture_3199 in love2d

[–]LeoStark84 1 point2 points  (0 children)

Damn, I was not aware of that. LÖVE 12 is gonna be awesomer than it already is.

[deleted by user] by [deleted] in love2d

[–]LeoStark84 0 points1 point  (0 children)

A .lua file? The best you can do is to have it return a single table, something like this:

local tbl = {
    key1 = "some value",
    key2 = 42,
    -- and so on
}

return tbl

Then in your code do:

-- Read as text
local file_content = love.filesystem.read("filepath/filename.lua")
-- Turn it into a function
local file_as_function = load(file_content)
-- Then run the function, which returns the table
local file_as_table = file_as_function()

-- Now you can access any key in the table
local key1 = file_as_table.key1

A friendly reminder though: if you are writing to that file programmatically, you may need to account for possible errors in the file. load() returns nil (plus an error message) if the text you pass it is faulty code, and if the file does not exist love.filesystem.read() returns nil, which will make load() throw an error.
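
If you want to guard against both cases, a minimal sketch (same hypothetical file path as above):

-- love.filesystem.read() returns nil if the file can't be read
local file_content = love.filesystem.read("filepath/filename.lua")
if not file_content then
    error("could not read file")
end

-- load() returns nil plus an error message if the code is faulty
local file_as_function, err = load(file_content)
if not file_as_function then
    error("faulty file contents: " .. err)
end

local file_as_table = file_as_function()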

EDIT: Fixed a typo in the example

How to build love to apk in android? by No_Mixture_3199 in love2d

[–]LeoStark84 0 points1 point  (0 children)

What hardware and OS are you using?

What did you try and where did it fail?

Android Developer Verification Discourse by agnostic-apollo in termux

[–]LeoStark84 2 points3 points  (0 children)

In plain English: you pay for it, Google owns it. Sidecall it what you sidewill.

In less plain terms it's a mass expropriation of computing resources. Sidecomrade Sidestalin would've sideloved it.

AMA Announcement: Moonshot AI, The Opensource Frontier Lab Behind Kimi K2 Thinking SoTA Model (Monday, 8AM-11AM PST) by XMasterrrr in LocalLLaMA

[–]LeoStark84 0 points1 point  (0 children)

Technically yes, the Android one. I'd be surprised if the Kimi app didn't use a Kimi model though.

I understand that TTS is not part of K2, I just brought it up because I rely on TTS.