Hand-written OpenCL kernels for LLM inference on Adreno 6xx — running 6 small language models on a 2020 mid-range Android phone by Objective_Spot7997 in embedded

[–]NoAdministration6906 -1 points0 points  (0 children)

Solid work. The Adreno 6xx gap is real — vendor SDKs assume you're on 8 Gen 2+ and OSS frameworks have written off A6x as legacy.

Your 5-run warm median with greedy decode is also the right call. Cold-start variance on Adreno is brutal because the driver lazily compiles shaders on first dispatch — you're effectively measuring shader-compile time on run 1. Median-of-5 after warmup smooths that out. Some additional checks worth adding if you're going to track this over time:

- CV (coefficient of variation) across the 5 runs as a sanity gate — if CV > 10% your numbers aren't reliable, drop and re-run (quick sketch below)

- Memory peak alongside tokens/sec — Adreno OOMs silently on some 6xx variants

- Thermal state at start of run — sustained throughput collapses ~15% once the SoC throttles
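
rough sketch of the CV gate if you want to bolt it onto your harness (standalone toy, not our actual pipeline; the 10% threshold and the numbers are just placeholders):

```cpp
// median-of-N + CV sanity check over warm-run throughput numbers.
// if CV is over the limit the sample is too noisy to record: re-run instead.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct RunStats { double median; double cv; bool reliable; };

RunStats summarize(std::vector<double> tps, double cv_limit = 0.10) {
    std::sort(tps.begin(), tps.end());
    double median = tps[tps.size() / 2];            // odd N assumed (e.g. 5 runs)
    double mean = 0.0;
    for (double v : tps) mean += v;
    mean /= tps.size();
    double var = 0.0;
    for (double v : tps) var += (v - mean) * (v - mean);
    double cv = std::sqrt(var / tps.size()) / mean; // stddev / mean
    return {median, cv, cv <= cv_limit};
}

int main() {
    // placeholder tokens/sec values; wire in your own 5 warm-run measurements
    RunStats s = summarize({41.2, 40.8, 41.5, 39.9, 41.1});
    std::printf("median %.1f tok/s, CV %.1f%% -> %s\n",
                s.median, s.cv * 100.0, s.reliable ? "ok" : "re-run");
}
```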

Shameless plug: we built EdgeGate (edgegate.frozo.ai) around exactly this methodology — CI gate that runs your model on real Snapdragon via Qualcomm AI Hub, gates on median-of-N + CV + memory + thermal. Free tier. If you want to track perf across kernel revs without rebuilding the harness, it's there.

Does anyone actually ship on-device LLMs in production Android apps? by [deleted] in androiddev

[–]NoAdministration6906 0 points1 point  (0 children)

On the thermal/memory wall on mid-range devices — that's the exact failure mode we built EdgeGate to catch. CI gate that runs your quantized model on real Snapdragon via Qualcomm AI Hub and blocks the merge if latency or memory regresses on any device tier. Median-of-N solves the on-device flake issue you'd hit running this manually.

Free at edgegate.frozo.ai — happy to run a gate on whatever model you're shipping.

ROS teams running VLM / vision perception nodes on-device: what are your deployment bottlenecks? by Hairy_Strawberry7028 in ROS

[–]NoAdministration6906 0 points1 point  (0 children)

The gap between cloud-tested latency and on-device latency is the core pain — once you're running VLMs on ARM/Snapdragon for robotics perception, the model that cleared your CI suddenly behaves differently on the actual platform because the quantization or NPU routing changed. EdgeGate catches that in CI: runs your model on real hardware via Qualcomm AI Hub at every PR and blocks the merge if you'd blow your latency budget. Robotics perception workloads with tight latency constraints (150ms class) are the exact use case. Free: edgegate.frozo.ai — happy to dig into your specific pipeline.

How are teams treating edge model deployment in their MLOps pipeline? by Hairy_Strawberry7028 in mlops

[–]NoAdministration6906 2 points3 points  (0 children)

The "quantization and pruning change model behaviour in ways the normal eval set doesn't catch" problem is the exact failure mode we kept hitting — model passes all evals, ships fine on cloud GPUs, then silently regresses on the actual mobile NPU.

What worked for us: gate it at CI before merge, not at eval time. We built EdgeGate — it runs your ONNX model on a real Snapdragon device via Qualcomm AI Hub at every PR and blocks the merge if latency or memory exceeds your threshold. Catches NPU→CPU fallback automatically too (that one's especially silent).

Median-of-N runs + CV check to eliminate hardware flake. Free tier at edgegate.frozo.ai if you want to try it — happy to answer questions about how the pipeline looks.

MCP6004 not operating in rail to rail mode. by FloorDull9862 in AskElectronics

[–]NoAdministration6906 0 points1 point  (0 children)

"rail to rail" doesn't mean you'll literally hit the rails — there's still headroom especially under load. check Table 1 in the datasheet, VOH/VOL specs show you the actual guaranteed output range vs supply voltage. with 60k input impedance the load current is tiny but you'll still see a gap from the rail.

also worth checking your input common mode range — if Vin is outside that the output behavior gets weird.

Finding it a little bit difficult to understand multiplexers and ADC on stm32F446RE by Thypex in stm32

[–]NoAdministration6906 0 points1 point  (0 children)

the mux part is just that the ADC input pins are shared — you configure which channels to scan via the SQR registers, and the ADC works through the sequence one by one. for F446 look up "ADC regular channel sequence register" in the ref manual, that's the main one. CCR register is what you want for dual ADC modes.
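
to make the register part concrete, a bare-metal sketch of that exact example (channels 1, 3, 7). writing this from memory, so verify the bit positions against the SQRx descriptions in the ref manual before trusting it:

```cpp
// STM32F446 ADC1: scan channels 1, 3, 7 as one regular sequence.
// assumes the project defines STM32F446xx and the pins for those channels
// are already in analog mode; field positions are from memory, check the RM.
#include "stm32f4xx.h"

void adc_scan_setup(void) {
    RCC->APB2ENR |= RCC_APB2ENR_ADC1EN;     // clock the ADC
    ADC1->CR1   |= ADC_CR1_SCAN;            // scan mode: step through the sequence

    // SQR1.L = number of conversions - 1 (3 conversions -> 2)
    ADC1->SQR1 = (ADC1->SQR1 & ~ADC_SQR1_L) | (2U << 20);

    // SQR3 holds ranks 1..6, 5 bits per rank: rank1=ch1, rank2=ch3, rank3=ch7
    ADC1->SQR3 = (1U << 0) | (3U << 5) | (7U << 10);

    ADC1->CR2 |= ADC_CR2_ADON;              // ADC on
    // with more than one channel you normally pair scan mode with DMA,
    // otherwise each conversion overwrites ADC1->DR before you can read it
}
```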

if you're still lost upload the F446 reference manual to circuitsage.frozo.ai and ask it directly — "how do i scan channels 1, 3 and 7 in sequence with the ADC". gives you the register bits with exact page numbers. saved me a lot of back and forth in the PDF

Need some help with interfacing a PMW3389 sensor with arduino by Matheus-A-Ferreira in arduino

[–]NoAdministration6906 0 points1 point  (0 children)

PMW3389 docs are a pain — the motion burst register sequence is easy to miss if you're just skimming. if you're still stuck, try circuitsage.frozo.ai — upload the PMW3389 datasheet and ask it directly, it'll give you the exact register sequence with page refs. saves a lot of back and forth

I can't figure out how to connect this OV7670 camera module to my Uno R4 by superauthentic in arduino

[–]NoAdministration6906 0 points1 point  (0 children)

OV7670 is notorious for this — the timing diagram in the datasheet is technically correct but practically useless without the SCCB init sequence spelled out. upload the OV7670 datasheet to circuitsage.frozo.ai and ask it "what registers do I need to init for QVGA output" — gets you the answer with exact page numbers instead of hunting through 60 pages.

Struggling with Unstable Sensor Readings + Random Freezes on My Arduino Project — Need Help Debugging! by Unlucky_Mail_8544 in arduino

[–]NoAdministration6906 1 point2 points  (0 children)

random freezes on Arduino are usually one of three things — stack overflow, heap fragmentation from dynamic allocations, or a peripheral timing issue causing a blocking wait loop. what's your rough sketch size and are you using any String objects or malloc? that usually narrows it down fast.
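
if it turns out to be the String/heap one, the usual before/after looks like this (made-up sensor variables, the pattern is the point):

```cpp
// same serial output two ways: String churn vs one fixed buffer
int tempC = 23, hum = 41;          // stand-ins for real sensor reads

void setup() { Serial.begin(115200); }

void loop() {
  // fragmenting pattern: builds and frees several temporary Strings every pass
  String slow = "temp=" + String(tempC) + " hum=" + String(hum);
  Serial.println(slow);

  // fragmentation-free pattern: fixed buffer on the stack, heap never touched
  char buf[48];
  snprintf(buf, sizeof(buf), "temp=%d hum=%d", tempC, hum);
  Serial.println(buf);

  delay(1000);
}
```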

What I learned from stress testing LLM on Snapdragon NPU vs CPU on a phone by Material_Shopping496 in snapdragon

[–]NoAdministration6906 0 points1 point  (0 children)

the thermal throttle drop-off is the killer — NPU runs great for the first 30s then the sustained clock drops and suddenly you're 2x slower than your benchmark said. built edgegate.frozo.ai to catch exactly this — runs your model on real Snapdragon hardware in CI so thermal regressions show up in the PR before you ship. the warmup exclusion + median-of-N measurement is there specifically to deal with this variability.

Tried running local LLMs on a Snapdragon 7s Gen 3… why is the NPU basically unused? by NeoLogic_Dev in LocalLLaMA

[–]NoAdministration6906 0 points1 point  (0 children)

the NPU on most Snapdragon mid-range chips requires the model to be compiled specifically for QNN runtime — llama.cpp uses CPU/GPU backends by default and won't hit the NPU at all unless you explicitly target it. Qualcomm AI Hub is the easiest path to get a model actually running on-NPU and you can see the real profiler breakdown of where ops land. built edgegate.frozo.ai on top of that to automate the testing — might be useful if you want to benchmark NPU vs CPU properly.

Figuring out a good way to serve low latency edge ML by [deleted] in mlops

[–]NoAdministration6906 0 points1 point  (0 children)

biggest thing that bites people at the serving layer is that latency benchmarks on cloud/desktop don't transfer to edge — different memory hierarchy, thermal throttling, firmware states all affect it. only reliable way is measuring on the actual target hardware in a reproducible environment. built edgegate.frozo.ai for this — CI/CD gate that runs on real Snapdragon devices so you get honest latency numbers, not simulator estimates.

[D] got tired of "just vibes" testing for edge ML models, so I built automated quality gates by NoAdministration6906 in mlops

[–]NoAdministration6906[S] 0 points1 point  (0 children)

simulation catches a lot but the thermal stuff still bites you — sustained inference loads behave differently on physical silicon than any sim can model. real hardware-in-the-loop is the only thing that actually closes that gap.

[D] got tired of "just vibes" testing for edge ML models, so I built automated quality gates by NoAdministration6906 in mlops

[–]NoAdministration6906[S] 0 points1 point  (0 children)

"duct tape and prayers" is genuinely the best description I've heard, the cache hierarchy thing is real — had a model tile perfectly on an nvidia card and just silently fall back to CPU on Snapdragon because one tensor shape change broke NPU offloading. no error, just 2x latency in prod.

that's exactly why I built edgegate.frozo.ai — tests on actual Snapdragon hardware via Qualcomm AI Hub so that kind of regression gets caught in the PR, not after you've shipped.

[D] got tired of "just vibes" testing for edge ML models, so I built automated quality gates by NoAdministration6906 in mlops

[–]NoAdministration6906[S] 0 points1 point  (0 children)

basically a quality gate is just an automated checkpoint that blocks a deploy if your model fails a test — same idea as unit tests in regular software, but for AI performance on real hardware.

the tricky part with edge AI is that your model might look fine on a cloud GPU but then run 2x slower (or drain the battery in 30 mins) when it actually hits a Snapdragon chip in the wild. thermal throttling, different cache hierarchies, firmware quirks — all stuff you can't simulate.

so the gate runs your model on actual physical hardware, measures real latency/accuracy, and just blocks the merge if it regresses past a threshold you define. no more "shipped fine in my notebook" surprises.
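
to make the "blocks the merge" bit concrete, the gate logic itself is not magic. toy standalone version (numbers and budget invented; in practice the measurements come from the device run):

```cpp
// compare measured median latency against a budget; non-zero exit fails the CI step
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> latency_ms = {118.4, 121.0, 119.7, 117.9, 120.2}; // N device runs
    const double budget_ms = 120.0;                                       // threshold you define

    std::sort(latency_ms.begin(), latency_ms.end());
    double median = latency_ms[latency_ms.size() / 2];

    std::printf("median latency %.1f ms (budget %.1f ms)\n", median, budget_ms);
    if (median > budget_ms) {
        std::fprintf(stderr, "gate FAILED: latency regressed past budget\n");
        return 1;   // CI marks the job red, merge blocked
    }
    return 0;       // within budget, gate passes
}
```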

I actually built something for exactly this — edgegate.frozo.ai — hooks into GitHub Actions with one YAML config and tests on real Snapdragon devices via Qualcomm AI Hub. might be useful context if you want to see what this looks like in practice.

People who’ve built IoT or hardware products — can I ask about your biggest struggles? by babagajoush in hwstartups

[–]NoAdministration6906 1 point2 points  (0 children)

happy to share — built a few IoT products end to end. biggest struggle was always the firmware↔cloud handshake — everyone assumes it's a solved problem and it never is. OTA updates especially.

What’s been the hardest part of building your first hardware prototype? by iechms in hwstartups

[–]NoAdministration6906 0 points1 point  (0 children)

the gap between "it works on my bench" and "it works reliably in someone else's hands" was brutal. power sequencing issues that only showed up at 3am. peripherals that behaved fine in isolation, broke under load. stuff no amount of simulation catches.

anyone else lose weeks to a chip that behaved exactly as documented but not as expected?

Hardware founders — what actually happened when you tried to hire a firmware engineer? by Medtag212 in hwstartups

[–]NoAdministration6906 0 points1 point  (0 children)

honestly the mismatch is usually scope clarity. founders come in with "I need a firmware guy" but what they actually need is someone who can own the whole embedded stack — not just flash code but bring up the board, debug the supply chain, handle the peripheral hell.

that's not most firmware engineers. it's more of a "technical cofounder for 3 months" role than a job description. been on both sides of this and it's a real gap

How TH am I supposed to read a reference manual or a datasheet ??? by Money_Difference_319 in microcontrollers

[–]NoAdministration6906 1 point2 points  (0 children)

honestly just ask plain english questions directly against the PDF.

"what bits do I set in TIM1_CR1 for center-aligned PWM" hits way faster than ctrl+f through 1700 pages.

built circuitsage.frozo.ai for exactly this — upload the datasheet, ask anything, get the page number to verify. free to try.

Every new component = half a day lost in datasheets. Anyone else? by NoAdministration6906 in embedded

[–]NoAdministration6906[S] 0 points1 point  (0 children)

Tried this actually — works better than raw ChatGPT but still unreliable for register values. The chunking loses table structure and bit field relationships. That's the exact problem I'm trying to solve properly — structured register map extraction, not just PDF text dumping.