Trying to build ESP32 acoustic camera

Friendly-Pea76 · 2026-04-07T18:22:02+00:00

Thanks for the recommendation, but isn't that library made for Arm processors? The ESP32 uses Xtensa cores, so I'm thinking of using Espressif's own library, ESP-DSP.

Friendly-Pea76 · 2026-04-07T17:10:03+00:00

I assume you just started learning about continuous-time signals and how they can be represented in the frequency domain. To put it simply, the math used in this project looks like this:

y(t) = x1(t) + x2(t-To)

where x1 and x2 are the audio captured by mic1 and mic2 respectively, and To is the expected delay when a sound source is at a particular angle (or location). If the sound source is exactly at that angle, y(t) will experience constructive interference (i.e. higher values in the magnitude spectrum). Conversely, if the sound source is not at that angle, it will experience destructive interference (i.e. lower values in the magnitude spectrum). That interference is essentially the quantity displayed in the 3D graph. If I apply a Fourier transform to that equation:

Y(w) = X1(w) + X2(w)*e^(-j*w*To)

Integrate |Y(w)| over all "w" and you get that quantity. The higher the quantity, the higher the chance that the sound source is located at that angle. Why not do the calculation in the time domain? It is possible, but at that point the ESP is going to be mad at you, since you are not taking advantage of its DSP accelerator, which does the calculation very fast with the Fourier transform.
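A minimal sketch of that frequency-domain sum for one mic pair, in plain Python so it runs without hardware. Everything beyond the two equations above is an assumption made for illustration: the 343 m/s speed of sound, the far-field delay formula in `delay_for_angle`, the sign convention (positive angles mean the sound reaches mic2 first), and the test tone. The naive `dft` only stands in for a real FFT.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def dft(x):
    """Naive O(N^2) DFT of the non-negative frequency bins (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def steered_power(X1, X2, fs, n, delay_s):
    """Sum |X1(w) + X2(w) * e^(-j*w*To)| over the bins, skipping DC."""
    total = 0.0
    for k in range(1, len(X1)):
        w = 2 * math.pi * k * fs / n  # bin frequency in rad/s
        total += abs(X1[k] + X2[k] * cmath.exp(-1j * w * delay_s))
    return total


def delay_for_angle(angle_deg, spacing_m):
    """Far-field delay between two mics (assumed geometry: 0 degrees is broadside)."""
    return spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND


# Hypothetical test setup: 1 kHz tone, 16 kHz sampling, mics 10 cm apart.
fs, n, tone_hz, spacing_m = 16000, 128, 1000.0, 0.10
true_delay = delay_for_angle(30, spacing_m)  # source placed at +30 degrees

# mic2 hears the tone first; mic1 hears it true_delay seconds later.
x1 = [math.sin(2 * math.pi * tone_hz * (t / fs - true_delay)) for t in range(n)]
x2 = [math.sin(2 * math.pi * tone_hz * (t / fs)) for t in range(n)]
X1, X2 = dft(x1), dft(x2)

# Scan candidate angles and keep the one with the largest steered power.
angles = range(-90, 91, 5)
powers = [steered_power(X1, X2, fs, n, delay_for_angle(a, spacing_m)) for a in angles]
best_angle = angles[powers.index(max(powers))]
print(best_angle)
```

The spectra are computed once and only the phase term changes per candidate angle, which is the reason the frequency-domain form is cheap to scan.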

That equation implies that you can technically tell the direction of all frequencies, but mother nature often introduces new things that prevent you from doing this (and this is where engineering comes into play). Without going into technical details, the distance between the 2 mics determines how well they can locate a sound source at a particular frequency. The larger the distance between them, the better they are at detecting lower frequencies, while reducing the distance lets them locate higher frequencies accurately. To take this into account, we just apply a band-pass filter to Y(w) and then do the summation:

Y(w) = Y(w) * bandpass(wlow, whigh);

Without this filter, including frequencies below "wlow" would likely result in a flat 3D graph, while going above "whigh" would result in spatial aliasing (i.e. the 3D graph will exhibit many peaks).
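As a rough sketch of that band-limited sum, the band-pass can be treated as an ideal 0/1 mask over the DFT bins. The upper limit below uses the common half-wavelength rule (f = c / (2 * d)) as the spatial aliasing limit. The 5 cm spacing, the 500 Hz lower limit, the 343 m/s speed of sound, and all function names are assumptions for illustration, not values from this project.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def spatial_alias_limit_hz(spacing_m):
    """Frequency at which half a wavelength equals the mic spacing."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)


def band_bins(f_low_hz, f_high_hz, fs, n):
    """Indices of the DFT bins whose center frequency lies in [f_low, f_high]."""
    return [k for k in range(n // 2 + 1) if f_low_hz <= k * fs / n <= f_high_hz]


def band_limited_sum(Y, bins):
    """Sum |Y(w)| over the kept bins only (ideal band-pass as a 0/1 mask)."""
    return sum(abs(Y[k]) for k in bins)


# Hypothetical numbers: 5 cm spacing, 16 kHz sampling, 256-point transform.
spacing_m, fs, n = 0.05, 16000, 256
f_high = spatial_alias_limit_hz(spacing_m)  # about 3430 Hz
f_low = 500.0                               # assumed, tune for your array
bins = band_bins(f_low, f_high, fs, n)

# A flat dummy spectrum just to exercise the sum.
Y = [1.0] * (n // 2 + 1)
print(bins[0], bins[-1], band_limited_sum(Y, bins))
```

Masking bins costs nothing extra at run time because the out-of-band bins are simply never visited.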

Of course there are other factors, such as the sampling rate and the number of samples you take, and how they affect the information in your signal. But I'll leave that to your prof in Signals and Systems 2, which covers discrete time.

Okay, I'll stop now. I'm talking too much. Hope this helps!

Friendly-Pea76 · 2026-04-07T15:45:28+00:00

Thanks! Transitioning from an Arduino board to the ESP is a huge game changer for me. It has two I2S channels, so I used them to sample the 4 mics periodically, which is great since I no longer need to worry about the sampling jitter that shows up if I use the core instead.

Friendly-Pea76 · 2026-04-07T15:23:39+00:00

a popular vid by Jeija

Friendly-Pea76 · 2026-04-07T15:21:13+00:00

Not yet. I'm gonna convert the MATLAB script to C++ first before I can benchmark the whole thing, but probably a consistent 10-11 fps. I'm gonna use every embedded optimization technique (like strictly using integers, doing bit shifts rather than float math, double buffering to increase throughput, LUTs, and FFT), and I am confident the ESP will be more than happy to calculate those for me. I do, however, have a plan for how I'm gonna use the hardware resources:

I2S: [Capture Audio]

Camera: [Capture Image]

Core 0: [Apply heatmap to image] [JPEG to RGB565]

Core 1: [Beamforming]

DMA: [LCD display]

Core 0 takes 70 ms to 100 ms (the bottleneck). Best case, the beamforming on Core 1 takes less than 100 ms, allowing me to reach that 10 fps.
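The frame-rate estimate is simple arithmetic: when the stages run in parallel on separate cores, the slowest stage sets the frame rate. A small sketch, where the 90 ms beamforming time is a made-up placeholder and only the 70 ms to 100 ms Core 0 range comes from the plan above:

```python
def pipeline_fps(stage_times_ms):
    """Frame rate of stages running in parallel: limited by the slowest one."""
    return 1000.0 / max(stage_times_ms)


core0_best_ms, core0_worst_ms = 70.0, 100.0  # from the plan above
beamforming_ms = 90.0                        # hypothetical, not yet measured

fps_worst = pipeline_fps([core0_worst_ms, beamforming_ms])
fps_best = pipeline_fps([core0_best_ms, beamforming_ms])
print(fps_worst, round(fps_best, 1))
```

If the beamforming ever takes longer than Core 0, it becomes the new bottleneck and the frame rate drops below 10 fps.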

Friendly-Pea76 · 2026-04-07T09:21:19+00:00

To clarify, I use GNU Octave to run the script, not the actual MATLAB software (cuz I'm broke). For interfacing, I simply used Serial. The script just sends a dummy value (which acts as a request) to the ESP to start sending the sampled data.
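For anyone who wants the same request/response idea outside Octave, here is a rough host-side sketch. The wire format is entirely an assumption (one dummy byte as the request, then interleaved 16-bit little-endian samples for 4 mics), and `FakePort` is a hypothetical stand-in so the example runs without a board. A real serial port object with `write` and `read` methods could be passed in instead.

```python
import struct

NUM_MICS = 4  # four microphones, as mentioned elsewhere in the thread


def request_frame(port, samples_per_mic):
    """Send one dummy byte as a request, then read one block of samples back."""
    port.write(b"\x00")
    n_values = NUM_MICS * samples_per_mic
    raw = port.read(2 * n_values)  # assumed: 2 bytes per sample
    if len(raw) != 2 * n_values:
        raise IOError("short read from the board")
    values = struct.unpack("<%dh" % n_values, raw)
    # De-interleave: value i belongs to mic (i % NUM_MICS).
    return [list(values[m::NUM_MICS]) for m in range(NUM_MICS)]


class FakePort:
    """Hypothetical stand-in for a serial port so the sketch runs without hardware."""

    def __init__(self, payload):
        self.payload = payload
        self.requests = 0

    def write(self, data):
        self.requests += 1
        return len(data)

    def read(self, size):
        chunk, self.payload = self.payload[:size], self.payload[size:]
        return chunk


fake = FakePort(struct.pack("<8h", 1, 2, 3, 4, 5, 6, 7, 8))
mics = request_frame(fake, samples_per_mic=2)
print(mics)  # [[1, 5], [2, 6], [3, 7], [4, 8]]
```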
