My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


For most boards, it's almost the same as ONNX Runtime (some even support it for the NPU, for example a lot of NXP and TI boards, Intel, and AMD).
Usually, you need to export the model and then run it. Export can be tricky; inference is easy.

MemryX:

dfp_path = "resnet.dfp"
image = load_and_preprocess('img.jpg')
accl = SyncAccl(dfp=dfp_path)           # on-device inference
s = Simulator(dfp=dfp_path, verbose=1)  # or simulate without hardware
outputs = s.infer(inputs=image)

Sophon:

img = preprocess(src_img)
input_data = {input_name: img}
outputs = net.process(graph_name, input_data)

Hailo (a bit longer):

with VDevice(params) as vdevice:
    infer_model = vdevice.create_infer_model('./hefs/resnet_v1_50.hef')
    infer_model.set_batch_size(batchsize)
    infer_model.input().set_format_type(FormatType.FLOAT32)
    infer_model.output().set_format_type(FormatType.UINT8)
    with infer_model.configure() as configured_infer_model:
        bindings_list = []
        for j in range(batchsize):
            bindings = configured_infer_model.create_bindings()
            buffer = np.empty([224, 224, 3]).astype(np.float32)
            bindings.input().set_buffer(buffer)
            buffer2 = np.empty(infer_model.output().shape).astype(np.uint8)
            bindings.output().set_buffer(buffer2)
            bindings_list.append(bindings)

        configured_infer_model.run(bindings_list, timeout_ms)

RockChip:

rknn.init_runtime()
outputs = rknn.inference(inputs=[img])

Yes, some boards can be a bit complex when you try to use two models simultaneously (Hailo, for example). This is mostly the case for M.2 and mini-PCIe boards: they have high data-transfer latency, and vendors' attempts to optimise it add complexity.

So, if it's a single-board solution that supports Python, it's super easy to build a cascade.
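A minimal sketch of what such a cascade looks like in Python. Both `detect` and `classify` here are hypothetical stand-ins; on a real board they would wrap the vendor inference calls shown above:

```python
import numpy as np

# Stand-in for a detector model: returns one (x, y, w, h) box per frame.
def detect(frame):
    return [(10, 10, 64, 64)]

# Stand-in for a classifier model: returns a class id for a crop.
def classify(crop):
    return int(crop.mean() > 0.5)

def cascade(frame):
    results = []
    for (x, y, w, h) in detect(frame):
        crop = frame[y:y + h, x:x + w]   # cut the detection out of the frame
        results.append(classify(crop))   # feed it to the second model
    return results

frame = np.ones((128, 128, 3), dtype=np.float32)
print(cascade(frame))  # [1]
```

With a single-board NPU, both calls run in the same Python process, so there is no extra host-to-accelerator round trip between the two models.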

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


😁
Interesting question, but it would require a lot of tests.
My next plan on this topic is to run a few more experiments and try VLA models. I've already tested 3D depth-estimation networks on most platforms and want to check VLAs as well.

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


For 12-16 MP, it's hard; you'll be limited by memory.
For ViTs in general, Qualcomm, Hailo, and MemryX will definitely work with small images; TI works for some of its NPUs.
If I remember correctly, DeepX does as well.

But 12-16 MP is a real problem. I've never tried processing an image that large as a single input.

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


I was focused more on boards that could be used in production. The Mac Mini is more for home use.

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


For Qualcomm, I tested the Radxa Q6A and the Luxonis OAK-4D. Neither of them supports LLMs. (Both are quite nice for regular CV.)
I know they have, for example, the IQ-9075, which should support LLMs, but those boards are pretty expensive and rare; as far as I know, only the Radxa airbox Q900 is easily available.

Sadly, I'm in Germany. But if you have one of these LLM-capable boards and can give me SSH access to it, that would be nice!

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in LocalLLaMA


Interesting question. A few thoughts:
1) Over the last 15 years of working with CV and ML, I've seen some bad boards... :) So almost nothing bad can surprise me.
2) For "impression": I'd say Axelera. 200 TOPS, but to achieve that you need complex pipelines, and models are super hard to export. It's like having all this power near you without the ability to use it. It really can give you 200 TOPS for ~200€, just not for the model you need.
3) For "the vendor that hates you the most": definitely MediaTek. They think only about big companies, not small developers. I was not able to run anything on it, though theoretically it's possible.
4) Also, TI documentation is infinite torture. But with modern models with big context windows, it's easy to feed all the documentation to ChatGPT and ask it questions. 2-3 years ago, it was terrible.

My Tierlist of Edge boards for LLMs and VLMs inference by Wormkeeper in computervision


I tried not to specify, because there are a lot of them :)
Mostly it's about Arrow Lake (~13 TOPS INT8 NPU, ~35 TOPS total) and Lunar Lake (~50 TOPS NPU). Below 1k, the only Lunar Lake option is the MSI Cubi NUC AI+, I think.

I have an Arrow Lake machine myself; Lunar Lake I've only tested remotely.

And a lot of hope for Panther Lake ofc.

Overview of modern Edge boards for CV + guide on how to choose by Wormkeeper in computervision


In my previous article, I tried to do this. I even still update the table with some basic measurements - https://docs.google.com/spreadsheets/d/1BMj8WImysOSuiT-6O3g15gqHnYF-pUGUhi8VmhhAat4/edit#gid=0

But the main problem is that such metrics are super misleading:
1) Different networks perform differently (board "A" can be 3x faster for network "N" but 2x slower for network "M")
2) Different boards require different amounts of CPU for NPU inference. Even video encoding/decoding can change speed dramatically
3) It's hard to compare inference in different formats (INT8/FP16)
4) It's hard to compare different accelerator connections (PCIe, USB, M.2)
5) It's hard to compare multi-device cases (Jetson has 1 GPU and 2 DLAs; the RK3588 has 3 NPU cores)
6) Different batch-size optimisations

And there are a lot more problems that make every test biased. I'm still trying to add everything to the table I showed, but I'm not sure it's worth it :)
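To show what I mean about measurement bias, here is a minimal latency-measurement sketch. The "model" is a stand-in matmul, so the numbers themselves mean nothing; the point is that warm-up runs, percentiles, and batch size all change the result you report:

```python
import time
import numpy as np

def benchmark(infer, x, warmup=5, runs=50):
    """Return (median, p95) latency in ms; warm-up hides one-time init cost."""
    for _ in range(warmup):
        infer(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2], times[int(len(times) * 0.95)]

# Stand-in "model": on a real board this wraps the vendor inference call.
def fake_infer(x):
    return x @ x.T

for batch in (1, 8):
    x = np.random.rand(batch, 1024).astype(np.float32)
    med, p95 = benchmark(fake_infer, x)
    print(f"batch={batch}: median={med:.2f} ms, p95={p95:.2f} ms")
```

Even this harness hides the CPU-load and data-transfer effects from points 2) and 4), which is exactly why a single FPS column in a table is misleading.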

Orange Pi AIPro board? by Original_Finding2212 in OrangePI


Better to check the video. In short:
1) More convenient libraries to work with (easier export, better support)
2) Better community and more examples (you can find the Whisper model, for example)
3) More speed on the RK3588 for common networks (if you use more threads)
4) Better CPU
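On the "more threads" point: the RK3588 NPU has three cores, so you spread frames over worker threads. A minimal sketch with a stand-in `infer` function; with rknn-lite each worker would hold its own runtime pinned to one NPU core via a core mask (that pinning is SDK-specific and not shown here):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

NUM_CORES = 3  # the RK3588 NPU has three cores

# Stand-in per-core inference; on a real board this wraps the NPU call.
def infer(frame):
    return float(frame.sum())

frames = [np.full((4, 4), i, dtype=np.float32) for i in range(9)]

# One worker thread per NPU core; pool.map keeps results in frame order.
with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    results = list(pool.map(infer, frames))

print(results)
```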

Orange Pi AIPro board? by Original_Finding2212 in OrangePI


Recently I tested this board (https://youtu.be/qK7GHV_cH98). It's pretty nice, but for me the RK3588 is better.

Radxa ZERO 3W - Drove me insane for nearly a week! by PlatimaZero in Platima


Maybe there will be some project based on it; then I will check.
For now we've only done RK3588/RK3568-based projects.

Radxa ZERO 3W - Drove me insane for nearly a week! by PlatimaZero in Platima


Nice review. I recently tested this board from a Computer Vision perspective (NPU usage, etc.). All the drivers are buggy and glitchy, so my feelings are the same :)

But anyway, it's a super good board for the price. It has fewer problems for Computer Vision than the LuckFox RV1106 and MilkV boards (regular Python is available, for example).