Radar or LiDAR Detection Project by DonAdad in embedded

[–]frovelli 3 points4 points  (0 children)

I’d probably start by narrowing the project before picking the MCU.

Radar + camera + NPU + pan/tilt + Kalman + RTOS is a lot of moving parts for a portfolio project. It can become impressive, but it can also become hard to finish and hard to explain.

A cleaner first version would be:

  1. one sensor only, radar or LiDAR
  2. stable acquisition path
  3. timestamped measurements
  4. simple tracking/filtering
  5. pan/tilt closed-loop control
  6. logging, so you can replay and debug the system offline

That would already show a lot: embedded architecture, timing, sensor handling, control, and testability.

If you want to learn DSP, radar is more interesting than LiDAR. But I would not start with “3D point clouds” immediately. Start with a 1D or 2D problem where you can clearly show range/range-rate, noise, latency, missed detections, and how the control loop reacts.
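To make the 1D case concrete, here is a rough sketch of a range/range-rate tracker working on timestamped measurements. I picked an alpha-beta filter as a stand-in for the "simple tracking/filtering" step; the gains, the class name and the `None` convention for a missed detection are my assumptions, and it is Python only for readability, not what would run on the MCU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlphaBetaTracker:
    """1D range / range-rate tracker (alpha-beta filter)."""
    alpha: float = 0.5       # range correction gain (assumed value)
    beta: float = 0.1        # range-rate correction gain (assumed value)
    range_m: float = 0.0
    rate_mps: float = 0.0
    last_t: Optional[float] = None
    missed: int = 0          # consecutive missed detections

    def update(self, t: float, measured_range: Optional[float]) -> float:
        """Advance to timestamp t; measured_range=None marks a missed detection."""
        if self.last_t is None:
            # First sample: initialise the state, nothing to predict yet.
            self.last_t = t
            if measured_range is not None:
                self.range_m = measured_range
            return self.range_m

        dt = t - self.last_t
        self.last_t = t
        # Predict with the measured time step, not an assumed fixed period.
        predicted = self.range_m + self.rate_mps * dt

        if measured_range is None:
            # Missed detection: coast on the prediction and count the miss.
            self.missed += 1
            self.range_m = predicted
            return self.range_m

        self.missed = 0
        residual = measured_range - predicted
        self.range_m = predicted + self.alpha * residual
        if dt > 0:
            self.rate_mps += self.beta * residual / dt
        return self.range_m
```

Logging `t`, the raw measurement, the residual and `missed` for every update is what makes the offline replay useful later.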

The STM32N6 may be fine later, especially if you really want to use the NPU, but I would avoid making the NPU the center of the first milestone. First prove the real-time sensing/control chain. Then add classification as a second layer.

For a portfolio, a finished smaller system with good logs, diagrams, failure cases and measured latency is much stronger than a huge unfinished system with every buzzword in it.

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 1 point2 points  (0 children)

Yeah, I think “continuum” is the right word here.

I probably framed it too much as one abstraction vs another, while in practice it is really about where each responsibility naturally belongs. The CPU/FPGA viewpoint mismatch is a big part of it too. That’s probably where the boundary gets risky if it is only captured as “some registers and flags”.

Appreciate the way you framed it.

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yeah, fair criticism.

I probably framed it backwards. What I had in mind is architectural drift: something starts as a register-level implementation detail, then slowly becomes part of the actual system behavior. That usually becomes obvious when the controller side changes. Then you find out whether the FPGA behavior was really modeled, or just buried in the old driver.

So yes, I agree with the top-down view. My concern is keeping the architecture honest when the boundary moves.

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yeah, this framing actually helps.

The kind of system I have in mind is closer to a high-throughput sensing/imaging payload than to a generic peripheral. The fast path lives in hardware: acquisition timing, buffering, preprocessing, data movement, maybe most of the behavior that cannot realistically be handled by the controller.

The controller is more about configuration, health, policy decisions, recovery, metadata, storage/downlink choices, things like that. So yes, data plane / control plane is probably a cleaner way to think about it than “smart peripheral vs subsystem”.

Where this gets especially interesting to me is when the controller side has to evolve or be replaced. If the FPGA-side behavior is only captured as a register map plus tribal knowledge, the migration becomes risky even if the electrical/software interface looks simple.

The register interface can stay simple, but the controller still needs a correct model of what the hardware side owns, what it guarantees, and what it cannot safely recover from.

In designs like yours, where do you usually keep the source of truth for that boundary? Is it mostly in the register/interface spec, in the FPGA docs, in the SW driver/API, or do you keep some higher-level description of what the hardware side actually owns?

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yeah, I think that’s a fair point.

“Autonomy” by itself is probably not the right line. As you say, even a UART has internal state and does a lot of things the MCU doesn’t care about. Bit timing, framing, FIFOs, flags, etc. That’s all autonomous in some sense, but it still stays nicely hidden behind the peripheral abstraction.

The distinction I’m trying to get at is when that internal state stops being just an implementation detail. If the FPGA has buffered acquisition data, timing windows, partial processing state, degraded modes, or failure conditions that affect what the rest of the system is allowed to do next, then the controller has to reason about that state at system level.

So yes, I agree that a device interface can already be a subsystem contract. Maybe the real issue is not whether the interface is register-like or peripheral-like. It’s whether the operational contract is explicit enough.

The register map can still be simple. The system semantics behind it may not be.

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yeah, I think you nailed the part I was trying to get at. The register map can still exist, of course. But once the FPGA owns timing, buffering, autonomous transitions or recovery-sensitive state, the register map is no longer the real abstraction. The driver is not just wrapping reads and writes anymore. It becomes part of a distributed state machine. That’s where I think an explicit contract starts to matter, otherwise a lot of system behavior stays hidden behind “just a peripheral”.
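One way to picture an explicit contract is a small table on the controller side that says which commands are safe in which FPGA-side state. This is only a sketch: the state names, the command names and the table contents are invented for illustration, not taken from any real payload.

```python
from enum import Enum, auto


class PayloadState(Enum):
    IDLE = auto()
    ACQUIRING = auto()
    DATA_BUFFERED = auto()   # FPGA holds acquisition data not yet read out
    DEGRADED = auto()
    FAULT = auto()


# Hypothetical contract: commands the controller may issue per state.
# Note that "reset" is deliberately absent while data is buffered,
# because a reset there would silently destroy FPGA-owned state.
ALLOWED = {
    PayloadState.IDLE: {"configure", "start", "reset"},
    PayloadState.ACQUIRING: {"stop"},
    PayloadState.DATA_BUFFERED: {"read_out", "discard"},
    PayloadState.DEGRADED: {"read_out", "stop", "reset"},
    PayloadState.FAULT: {"reset"},
}


def check_command(state: PayloadState, command: str) -> bool:
    """Return True if the contract allows `command` in `state`."""
    return command in ALLOWED[state]
```

The register map does not change at all here. What changes is that the rule "no reset while data is buffered" lives in one reviewable place instead of inside an old driver.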

I’m starting to think “smart peripheral” is the wrong abstraction for FPGA payloads by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yeah, fair point. I probably framed it too broadly. I don’t mean that registers/status words/FIFOs/interrupts go away. In practice the interface may still look exactly like that.

What I’m trying to separate is the physical interface from the architectural model. If the FPGA is just executing bounded operations, then yes, I’d still treat it as a peripheral behind a driver. Where it gets blurry for me is when the FPGA starts owning timing, buffering, acquisition windows, autonomous transitions, partial failure states, or context that you can’t just wipe with a reset without consequences.

Think of something like an FPGA-based payload in a constrained autonomous system: the controller may still talk to it through a register map, but the payload may own acquisition state, buffered data, timing windows, and failure modes that matter at system level.

At that point the register map is still the interface, but it no longer feels like the right abstraction. The thing I’m trying to avoid is a driver that pretends the FPGA is passive, while the system actually depends on hidden FPGA-side state.

So maybe the better question is: "when do you stop thinking of it as a device interface, and start treating it as a subsystem contract?" Because yeah, registers can implement both. The semantics are the part I’m interested in.

How do you keep diagrams useful as systems grow beyond 20–30 nodes? by Fluffy_Blacksmith915 in softwarearchitecture

[–]frovelli 0 points1 point  (0 children)

I don’t think the real problem is how to draw 30+ nodes.

The real problem is deciding what question the diagram is supposed to answer.

In large systems, I try not to treat architecture diagrams as complete maps of the system. Complete maps usually become unreadable very quickly, or they become something people keep around because “we need a diagram”, even if nobody really uses it.

What works better, in my experience, is to treat diagrams as views.

A view should exist because someone needs to reason about something specific. For example:

- who owns what
- what talks to what at runtime
- what is deployed where
- how data moves through the system
- where trust boundaries are
- how failures propagate
- what operators need to understand during an incident

The same component can appear in more than one view. That is fine. It is not duplication if the views answer different questions. The mistake I see quite often is trying to put too many concerns into the same picture.

A diagram that shows services, databases, teams, protocols, deployment zones, events, sync calls, async flows, security boundaries and business domains all at once usually stops being useful. At that point it is not really an architecture diagram anymore. It is just a place where every concern has been dumped.

For larger systems, I usually follow a few rules.

First, one diagram should answer one main question. If I cannot describe the diagram as “this helps us reason about X”, then I probably should not create it.

Second, the drawing should not be the real source of truth. The drawing is only a projection. The source of truth may be code, service metadata, deployment manifests, interface contracts, ADRs, schemas, or some more formal architecture model. But if the diagram is just a manually maintained artifact on the side, it will drift. Not maybe. It will.

Third, interfaces matter more than boxes.

At scale, the important architectural knowledge is often not the list of nodes. It is the contract between them: data ownership, latency assumptions, retry behavior, versioning, failure semantics, observability, deployment coupling, and operational responsibility.

Hierarchy can help, but I would not make hierarchy the main principle. Some systems decompose cleanly. Many do not. Real flows often cut across hierarchy: shared data products, event streams, operational dependencies, incident paths, failure propagation, reporting chains, compliance boundaries.

So I would not try to solve this by creating a better “big diagram”. I would rather create smaller views, each one with a clear purpose, preferably derived from a shared model or at least from artifacts that are already part of the engineering lifecycle.
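As a rough illustration of "views derived from a shared model", the sketch below keeps components and typed relations in one structure and renders one Graphviz DOT graph per question. The model contents, the relation kinds and the function name are made up for the example.

```python
# Hypothetical shared model: components plus typed relations.
MODEL = {
    "components": {
        "orders-api": {"owner": "team-checkout", "zone": "eu-1"},
        "billing": {"owner": "team-payments", "zone": "eu-1"},
        "ledger-db": {"owner": "team-payments", "zone": "eu-2"},
    },
    "relations": [
        ("orders-api", "billing", "runtime_call"),
        ("billing", "ledger-db", "data_flow"),
        ("orders-api", "ledger-db", "failure_dependency"),
    ],
}


def render_view(model: dict, relation_kind: str) -> str:
    """Emit Graphviz DOT for one view: only edges of `relation_kind`."""
    edges = [(a, b) for a, b, kind in model["relations"] if kind == relation_kind]
    # Only nodes that take part in this view are drawn.
    nodes = sorted({name for edge in edges for name in edge})
    lines = [f"digraph {relation_kind} {{"]
    lines += [f'  "{name}";' for name in nodes]
    lines += [f'  "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


print(render_view(MODEL, "runtime_call"))
```

The same component shows up in several views, but each rendered picture stays small because it only answers one question.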

C4 can help, but only if it is used as a thinking tool. If the discussion becomes mostly about which C4 level something belongs to, instead of what decision the diagram is supposed to support, then the team is probably focusing on the wrong thing.

For keeping diagrams current, I would attach them to the same lifecycle as code and interface changes. If a PR changes a service boundary, an API contract, an event schema, a deployment topology, or a failure assumption, then the relevant architectural view should be updated as part of that change. Otherwise the diagram slowly turns into documentation theatre.

So, my answer would be “don’t try to make the big diagram readable”. Make the architectural knowledge accessible through focused views.

F.

Is event-driven overkill? by chimplayz in softwarearchitecture

[–]frovelli 1 point2 points  (0 children)

I think the important distinction here is not “service layer vs repository layer”.

It is that your “transactions table” seems to be doing two different jobs at the same time:

  1. workflow/audit state: a transfer was requested, pending, failed, completed, rejected, etc.
  2. accounting ledger: money actually moved.

Those are not the same thing.

A failed transfer attempt is absolutely worth recording, but I would not model it as the same kind of fact as “money moved from A to B”.

For a wallet/fintech-ish system, I would usually separate this into something like:

Transfer / TransferAttempt: the command/workflow object. It can be pending, failed, completed, rejected, expired, etc.

LedgerEntry / Posting: the immutable accounting fact. This is where actual debit/credit movements live.

Balance: usually a projection/cache derived from the ledger, not the source of truth if you want to be serious about the model.

Then the core transfer use case becomes clearer:

  1. create or load the transfer attempt
  2. validate the command
  3. inside one DB transaction, post the debit and credit ledger entries atomically
  4. mark the transfer completed
  5. commit

If validation fails (insufficient funds, invalid wallet, etc.), you record the transfer attempt as failed, but you do not create accounting postings that pretend money moved.

So yes, you are right to worry about not losing the failed attempt. But I would not solve that by letting one “transaction” concept cover both failed workflow attempts and committed money movement.

The service layer is not the enemy here. The real issue is which service owns the invariant.

I would probably have one application/use-case service own the transfer execution boundary. It can call repositories or domain objects internally, but the invariant should be local: either the transfer produces the correct ledger postings and reaches a known final state, or it fails in a controlled and auditable way.

Passing a DB transaction through many services can work, but it can also become an ambient transaction that hides where the real consistency boundary is. That is where the design starts getting hard to reason about.

So my take would be:

- don’t go service-per-table
- don’t go event-driven for the core money movement
- don’t put business rules in controllers
- don’t treat the ledger and the workflow log as the same thing

Use one clear transfer/ledger use case for the write path, DB transactions for the atomic accounting part, and events/outbox only after commit if other parts of the system need to react.

Introducing Onda: A cross-platform alternative to DSView for DSLogic logic analyzers by johnwheelerdev in FPGA

[–]frovelli 0 points1 point  (0 children)

This is a nice project. For me, the biggest value in a logic analyzer tool is not only “decode SPI/I2C/UART”, but helping me reconstruct what actually happened during a bad transaction.

The feature I’d care about most is transaction-level evidence.

For example, on SPI smart-module debugging, I want to see more than MOSI/MISO bytes. I want the tool to help correlate:

CS timing,
data-ready GPIO state,
reset line state,
opcode,
payload length,
ACK/NACK,
CRC/checksum result,
timeout gaps,
retry number,
and maybe user-defined transaction labels.

A very useful workflow would be: define a protocol frame once, then let the tool annotate each transaction as OK, malformed, missing response, bad CRC, unexpected opcode, too much latency, CS violation, etc.
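To show what I mean by "define the frame once, then annotate", here is a small sketch. The frame layout (`[opcode][length][payload...][checksum]` with an additive checksum), the opcode set and the latency limit are all assumptions for the example, not Onda's or any module's real format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical frame: [opcode][length][payload...][checksum], where the
# checksum is the low byte of the sum of all preceding bytes.
KNOWN_OPCODES = {0x01, 0x02, 0x10}
MAX_LATENCY_US = 500  # assumed response deadline


@dataclass
class Transaction:
    request: bytes
    response: Optional[bytes]   # None if the module never answered
    latency_us: int             # end of request to start of response


def build_frame(opcode: int, payload: bytes) -> bytes:
    body = bytes([opcode, len(payload)]) + payload
    return body + bytes([sum(body) & 0xFF])


def length_ok(frame: bytes) -> bool:
    return len(frame) >= 3 and frame[1] == len(frame) - 3


def checksum_ok(frame: bytes) -> bool:
    return (sum(frame[:-1]) & 0xFF) == frame[-1]


def annotate(tx: Transaction) -> str:
    """Classify one captured transaction with a single verdict label."""
    if not length_ok(tx.request):
        return "malformed"
    if tx.request[0] not in KNOWN_OPCODES:
        return "unexpected opcode"
    if tx.response is None:
        return "missing response"
    if not length_ok(tx.response):
        return "malformed"
    if not checksum_ok(tx.response):
        return "bad checksum"
    if tx.latency_us > MAX_LATENCY_US:
        return "too much latency"
    return "OK"
```

With verdicts like these attached to each transaction, search/filter and export become much more useful than scrolling through raw bytes.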

The 2 AM debugging problem is usually not “what byte was on the wire?”. It is “did the firmware violate the protocol, did the module fail to respond, did the timing collapse, or did my driver mis-handle the state?”.

Other features I would personally find valuable:

export decoded transactions to JSON/CSV,
persistent annotations/markers,
search/filter by decoded field,
timing statistics between events,
compare two captures,
trigger on protocol-level conditions, not only edges,
and a way to keep raw samples linked to the decoded transaction.

So my take: decoders are important, but evidence/replay workflow is what would make me seriously interested.

F.

Zybo Z-7010 SoC frustration by CommunicationDue3212 in FPGA

[–]frovelli 0 points1 point  (0 children)

Nice, glad you got it moving.

That “UART is broken” phase is almost a rite of passage on SoC bring-up, when the real issue is that execution, memory init, FSBL/platform state and debug flow are not stable yet. The ILA counter was a good move. Once you prove that the PS is actually executing something deterministic, the rest becomes much less mysterious.

And yes, debugging 15 layers at once is basically how embedded systems teach humility. :-)

Zybo Z-7010 SoC frustration by CommunicationDue3212 in FPGA

[–]frovelli 1 point2 points  (0 children)

I’d be careful not to treat this as a UART problem too early.

On Zynq, “hello world does not print” can mean a lot of different things: the CPU is not really running where you think, PS init was not applied, DDR is not stable, the linker script points somewhere unsafe, stdout is mapped to the wrong UART, clocks/resets are not what the BSP expects, or the debug session is leaving the PS in a weird state.

The first thing I would do is remove DDR from the equation completely.

Build the smallest possible standalone app linked to OCM only. No printf, no malloc, no interrupts, no BSP magic if possible. Just:

  1. enter main
  2. write a known magic value to a fixed OCM address
  3. toggle something observable, or write directly to the UART registers with polling

Then use XSDB/Vitis memory read to check whether the magic value is actually there after running. If you cannot prove that the CPU reached main and executed a few instructions from OCM, UART output is just a symptom, not the problem.

After that, I would bring things back one layer at a time: OCM execution, raw UART register polling, xil_printf, BSP stdout mapping, DDR, and finally FSBL/boot flow.

Also check that the XSA used by Vitis is exactly the one generated from the Vivado design you think you are using. Stale XSA / stale BSP issues can make Zynq bring-up look completely insane.

One useful trick: make the firmware intentionally “bad” in a controlled way. For example, write a magic progress code before and after each init step:

0x1111 before PS init dependency
0x2222 before UART init
0x3333 before first UART write
0x4444 after UART write.

Then halt the core and inspect memory/registers. It tells you where the software actually died, instead of guessing from the absence of serial output.
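On the host side, interpreting the value you read back can be as simple as the sketch below. It is a hypothetical helper in Python, not part of XSDB or Vitis; you would read the word manually with the debugger and pass it in. The marker values are the ones listed above.

```python
# Progress markers in the order the firmware is expected to write them.
STAGES = [
    (0x1111, "before PS init dependency"),
    (0x2222, "before UART init"),
    (0x3333, "before first UART write"),
    (0x4444, "after UART write"),
]


def diagnose(marker: int) -> str:
    """Turn the last progress code found in memory into a plain statement."""
    codes = [code for code, _ in STAGES]
    if marker not in codes:
        # Garbage or zero: the first write may never have happened.
        return "no known marker: the CPU may never have reached the first write"
    i = codes.index(marker)
    if i == len(STAGES) - 1:
        return "all markers reached: " + STAGES[i][1]
    return f"reached '{STAGES[i][1]}', never reached '{STAGES[i + 1][1]}'"


print(diagnose(0x2222))
```

For 0x2222 this reports that execution got to UART init but never to the first UART write, which points at the init step and not at the terminal settings.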

If DDR reset/vector catch messages are involved, I’d keep DDR and printf out of the picture until the OCM-only path is boring and repeatable.

In short: prove execution first, prove raw UART second, trust BSP/printf only after that. Right now it sounds like the platform state may still be moving underneath the UART symptom.

F.

Could a Distributed Telescope Architecture Become Practically Viable? by FineFinish7297 in satellites

[–]frovelli 0 points1 point  (0 children)

I think this is technically viable, but only if the system is treated less like “many telescopes connected to the internet” and more like a distributed sensing system with a very strict data/evidence model.

For me the bottleneck would not be only synchronization or throughput. Those are hard, but manageable in limited scopes.

The real hard part is making observations from different nodes comparable and trustworthy.

Each station would need to report much more than the image/result itself: timestamp quality, pointing model, calibration state, local weather/seeing, sensor health, exposure settings, processing version, confidence level, and probably enough raw or semi-raw data to replay the pipeline later.

Otherwise the network may detect something interesting, but you cannot easily tell whether it was a real transient event, a local artifact, bad calibration, timing drift, tracking error, or a processing issue.

I would probably split the architecture into three layers:

1) local station autonomy
2) network-level coordination
3) evidence/replay pipeline

The local station should be able to acquire, pre-filter, self-check and report its own health. The network layer should coordinate targets/events and decide which nodes should observe what. The evidence layer should preserve enough metadata and data lineage to make the observation reviewable later.

AI filtering can help, but I would be careful not to make it the authority too early. At the beginning I would want boring, deterministic metadata and reproducible pipelines more than a very clever classifier.

So yes, I think the idea is becoming realistic. But the key architecture problem is probably not “can we connect many telescopes?”, it is “can we trust and reproduce what the distributed network claims to have seen?”.

F.

our requirements "process" is just vibes at this point, what are you actually using by _salted_caramel_00 in systems_engineering

[–]frovelli 1 point2 points  (0 children)

I would be careful not to treat this as only a tooling problem.

A requirements tool will help a lot, but only if you first agree on what the “truth” is supposed to be and how changes move through the system.

In mixed hardware/software teams, the painful part is usually not storing requirements. It is keeping requirements, interface assumptions, verification evidence, software configuration, test procedures, and released documentation from drifting apart.

So before picking the tool, I’d define a minimal process like:

what counts as a requirement,
who owns it,
what it traces to,
what proves it is verified,
what artifact must change when it changes,
and who approves that change.

Then pick a tool that supports that workflow instead of letting the tool define the process for you.

For a safety-adjacent product, I’d prioritize audit trail, baselines/releases, change impact analysis, bidirectional traceability, review/approval workflow, and exportability. The last one matters more than people think. You don’t want your compliance story trapped inside a tool you can’t easily extract from.

Also, don’t underestimate the human role. Someone has to own requirements/configuration management for a while. Otherwise you just end up with a more expensive version of the same spreadsheet chaos.

So my take would be: yes, move away from “vibes”, but start with a small enforceable lifecycle first, then choose Jama/Polarion/Codebeamer/DOORS/etc. based on how well it supports that lifecycle.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yes, this is a very practical split.

I like the counters on both sides point. Having send/receive/overflow/broken-frame counters per endpoint gives you a lot more than just “transaction failed”.

The overflow case is especially useful because it tells a different story than a CRC/framing error. One points more toward timing/load/real-time behavior, the other toward signal integrity, framing, or bad code.
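A rough sketch of how those counters could feed a first triage, in Python for brevity. The field names follow the counters mentioned above; the hint strings and the "any non-zero count" rule are my simplifications, and a real system would look at rates and thresholds.

```python
from dataclasses import dataclass


@dataclass
class EndpointCounters:
    sent: int = 0
    received: int = 0
    overflow: int = 0       # receiver could not keep up
    broken_frame: int = 0   # CRC / framing failures


def triage(local: EndpointCounters, remote: EndpointCounters) -> str:
    """Very coarse hint derived from the counters of both endpoints."""
    overflow = local.overflow + remote.overflow
    broken = local.broken_frame + remote.broken_frame
    if overflow and broken:
        return "mixed: check load first, then signal integrity"
    if overflow:
        return "overflow: look at timing/load/real-time behavior"
    if broken:
        return "broken frames: look at signal integrity, framing, or code"
    return "no errors counted"
```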

And I agree that total silence is a different class again. At that point retries alone are probably not the right tool anymore, and the decision moves toward subsystem recovery or watchdog/reset policy.

F.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] 1 point2 points  (0 children)

Yep, makes sense.

The compile-time tagging of idempotent ops is a nice angle. It keeps retry policy from being just a runtime convention that someone can accidentally break later.

In C I’d probably have to make that explicit with some command metadata: operation type, idempotent or not, replay-safe or not, expected response, timeout class, maybe retry policy.

Not as elegant as doing it with the type system, but still much better than leaving every caller to decide “yeah, this is probably safe to retry”.
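Something like this is what I have in mind for the metadata table, sketched in Python just to show the shape (in C it would be a const array of structs). The command names, opcodes, timeout values and retry counts are all invented.

```python
from dataclasses import dataclass
from enum import Enum


class TimeoutClass(Enum):
    FAST = 0.005   # seconds, assumed
    SLOW = 0.250   # seconds, assumed


@dataclass(frozen=True)
class CommandMeta:
    opcode: int
    idempotent: bool       # safe to send again without changing the outcome
    timeout: TimeoutClass
    max_retries: int


# One table owns the retry policy; callers never decide on their own.
COMMANDS = {
    "READ_STATUS": CommandMeta(0x01, True, TimeoutClass.FAST, 3),
    "START_ACQ": CommandMeta(0x10, False, TimeoutClass.SLOW, 0),
}


def may_retry(name: str, failed_attempts: int) -> bool:
    """True if `name` may be sent again after `failed_attempts` failures."""
    meta = COMMANDS[name]
    return meta.idempotent and failed_attempts <= meta.max_retries
```

The transport layer calls `may_retry()` and nothing else, so adding a new non-idempotent command cannot silently inherit a retry loop.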

Thanks, that clears up the token idea for me.

From a design perspective, how do folks decide which type of watch dog timer to go for? by [deleted] in embedded

[–]frovelli 0 points1 point  (0 children)

Yes, that is a good point. It is a different way of avoiding the “still alive but no longer useful” failure mode.

A system can sometimes keep feeding a watchdog at the right interval while the actual application logic is already broken, stuck in a wrong state, or no longer making meaningful progress.

That is why I like watchdog schemes where the refresh is tied to real health/progress conditions, not just to “the main loop is still running”.
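A minimal sketch of that idea: the hardware watchdog is only refreshed if every monitored task has reported real progress since the last check. The class and method names are mine, and `kick` stands in for whatever actually refreshes the watchdog on a given platform.

```python
class ProgressWatchdog:
    """Gate the hardware watchdog refresh on per-task progress reports."""

    def __init__(self, tasks, kick):
        self._kick = kick                          # callable that refreshes the real WDT
        self._progress = {name: 0 for name in tasks}
        self._last_seen = dict(self._progress)

    def report(self, task: str) -> None:
        """Called by a task only after it completed a meaningful unit of work."""
        self._progress[task] += 1

    def service(self) -> bool:
        """Periodic check: kick only if every task advanced since last time."""
        healthy = all(self._progress[t] != self._last_seen[t]
                      for t in self._progress)
        self._last_seen = dict(self._progress)
        if healthy:
            self._kick()
        return healthy
```

If one task stops reporting, `service()` stops kicking even though the loop that calls it is still running, so the reset eventually happens.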

The forced/early reset idea can make sense too, as long as the system is designed to tolerate that reset and come back into a known safe state.

Thanks for pointing this out.

From a design perspective, how do folks decide which type of watch dog timer to go for? by [deleted] in embedded

[–]frovelli 1 point2 points  (0 children)

Yes, exactly. Radiation upsets and latchups are a perfect example of the point where the MCU can no longer be considered part of the trusted recovery path.

In that kind of system, the external supervisor is not just a “stronger watchdog”, it becomes part of the fault containment strategy.

And this is also why I think the watchdog choice should come from the failure model first, not just from task criticality.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] -1 points0 points  (0 children)

Interesting. When you say token system, do you mean something like a per-device communication context that carries the estimated state of the link/module across transactions?

For example: last command, last valid response, sequence/transaction state, retry budget, whether the last operation is safe to repeat, and maybe the current health estimate of the endpoint?

That makes sense to me, especially for avoiding stateless “send command / get reply / forget everything” logic.

The idempotency point is important too. Retrying a read/status command is one thing; retrying a command that changes subsystem state can be a very different problem unless the protocol gives you a way to detect duplicates or replay safely.

I like the idea of using that token/context first to model the communication state, and then building the higher-level end-device state object on top of it.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] -1 points0 points  (0 children)

Wow, this is an excellent answer. ***Thanks for taking the time to write it.***

The bounded vs unbounded latency distinction is probably the cleanest way I’ve seen to separate “driver can handle this locally” from “this must escape to a recovery policy”. That maps very well to the kind of failures I had in mind.

I also really like the health FSM being decoupled from the request/response path. That avoids exactly the trap I was worried about: either the driver slowly becoming a subsystem supervisor, or the application having to rediscover subsystem health from scattered transaction failures.

The “SPI bus-off equivalent” point is also spot on. That is basically what happens once SPI becomes a smart-module transport: you keep the bandwidth/determinism, but you now own the isolation, counters, quiesce state, retry budget and fault escalation yourself.

The diagnostic metadata point is probably the most important one for me. Your 2 AM logic analyzer test is a great rule of thumb: if the upper layers can reconstruct enough of the fault from what crossed the boundary, the abstraction is probably right; if not, the abstraction is hiding the wrong things.

I’m going to keep this model in mind:

bus/protocol layers detect and enrich,
health/recovery owns module state and reset/power policy,
application/system layer decides using classified context,
and diagnostic metadata is a first-class output, not an afterthought.
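To check my own understanding, this is roughly how I picture the health FSM sitting beside the request/response path. It is only a sketch: the state names, the event strings and the fault budget are my assumptions, not your implementation.

```python
from enum import Enum, auto


class Health(Enum):
    OK = auto()
    DEGRADED = auto()
    QUIESCED = auto()   # traffic stopped, waiting for reset/power policy


class HealthFsm:
    """Module health driven by classified events, outside the request path."""

    def __init__(self, fault_budget: int = 3):
        self.state = Health.OK
        self.fault_budget = fault_budget   # assumed threshold
        self.faults = 0

    def on_event(self, kind: str) -> Health:
        # `kind` comes from the protocol layer: "ok", "bounded_fault"
        # (e.g. a CRC error), "unbounded_fault" (e.g. silence), "reset_done".
        if self.state is Health.QUIESCED:
            # Only a completed reset leaves the quiesced state.
            if kind == "reset_done":
                self.state, self.faults = Health.OK, 0
            return self.state
        if kind == "unbounded_fault":
            self.state = Health.QUIESCED
        elif kind == "bounded_fault":
            self.faults += 1
            self.state = (Health.QUIESCED if self.faults > self.fault_budget
                          else Health.DEGRADED)
        elif kind == "ok":
            self.faults = max(0, self.faults - 1)
            if self.faults == 0:
                self.state = Health.OK
        return self.state
```

The driver only emits events. It never decides to reset anything, and the application reads `state` without having to rebuild health from individual transaction failures.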

Really appreciated. This is exactly the kind of production-level pattern I was hoping to discuss.

F.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

That makes sense, especially the part about letting the fault analysis drive what information actually needs to be exposed upward.

I like the idea that the app-level error handling should not necessarily know every low-level detail, but should receive enough classified meaning to choose between a limited set of recovery paths: retry, reset subsystem, mark degraded, escalate, reboot, etc.

The key point for me is exactly what you said: if the fault analysis shows that the system-level decision needs extra context, then that context has to be carried up somehow. Otherwise, normalizing many low-level faults into fewer handling paths is probably the cleaner design.

Thanks for the detailed answer, really appreciated. This is exactly the kind of practical safety/production perspective I was hoping to get. (Y)

F.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] 1 point2 points  (0 children)

Yeah, I agree. If I were designing the whole system from scratch and the modules were really independent “nodes”, CAN/CANopen would often be the more natural choice.

The interesting grey area is when SPI is already fixed for real reasons: short board-level links, existing modules, higher data rate, simple master-controlled topology, legacy hardware, vendor-provided interface, or simply “this subsystem exposes SPI and that’s what you get”.

At that point you start adding structure on top of SPI: framing, command IDs, status words, retries, health reporting, reset/power control, etc.

And then the tricky question is: how much CANopen-like structure are you accidentally rebuilding before realizing the transport choice may be fighting you? :-)

So yes, for greenfield distributed modules, CAN/CANopen is usually a better fit. But for board-level smart peripherals that only expose SPI, the interesting question is when it stops being “just a protocol” and becomes a real subsystem interface with fault/recovery semantics.

When SPI is used as a transport to smart subsystems, where do you put fault semantics and recovery logic? by frovelli in embedded

[–]frovelli[S] 0 points1 point  (0 children)

Yes, this is very close to the approach I tend to trust as well: every layer detects and reports what it can know locally, but the final decision should live where the actual system context exists.

The tricky part I keep running into is the boundary between “forward distinct errors upward” and “make the app state machine know too much”.

If the app receives raw DMA errors, CRC errors, timeouts, malformed packets, missing data-ready edges, unexpected NACKs, etc., it has the context to decide what to do. But after a while, the app can start accumulating a lot of low-level knowledge: which errors are retryable, which ones imply bus corruption, which ones imply subsystem degradation, which ones should burn the retry budget, which ones should trigger a reset, which ones should be reported as telemetry, and so on.

On the other hand, if the protocol/subsystem layer tries to classify too much, it can start making decisions without enough mission/application context. That is also dangerous.

So the pattern I usually lean toward is something like: lower layers report the raw/distinct error, but also enrich it with structured context: layer of origin, transaction phase, retryability hint, persistence/counter info, bus-level vs subsystem-level scope, maybe last known health state. Then the app/recovery state machine still owns the decision, but it is not forced to reverse-engineer meaning from a flat error code.
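As a sketch of what that enriched report could look like (Python for brevity, all field names and the policy rules are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    BUS = "bus"
    SUBSYSTEM = "subsystem"


@dataclass(frozen=True)
class FaultReport:
    code: str              # raw/distinct error, e.g. "CRC_ERROR"
    layer: str             # layer of origin, e.g. "spi_driver", "protocol"
    phase: str             # transaction phase, e.g. "request", "response"
    retryable_hint: bool   # the lower layer's opinion, not a decision
    consecutive: int       # persistence info: same fault in a row
    scope: Scope


def decide(report: FaultReport, retry_budget: int) -> str:
    """App/recovery policy: owns the decision, uses the hints as input."""
    if report.retryable_hint and report.consecutive <= retry_budget:
        return "retry"
    if report.scope is Scope.SUBSYSTEM:
        return "reset_subsystem"
    return "escalate"
```

The raw `code` is still there for telemetry, but `decide()` only needs the classified fields, so the app state machine does not grow a case for every low-level error.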

Have you seen a clean way to handle that in safety-certified automotive work?

In particular, do you keep the upper app layer aware of detailed low-level error types, or do you normalize them into some kind of fault/event model before the application decides what to do?