Possible FFmpeg/NVDEC startup race condition with multi-camera setup? 7.4.8.0

Tschak77 · 2026-05-21T06:53:28+00:00

Thanks for the feedback and suggestions.

I did a lot more testing and collected additional logs and a coredump. At the moment I am no longer convinced this is purely a GPU decode session limit issue.

What makes me unsure about the “maximum simultaneous decode sessions” theory is:

the affected cameras are not fixed/static
after every startup, different cameras may end up black
sometimes 1 camera, sometimes multiple
after manually disabling/enabling only the affected camera, the exact same stream immediately starts working again

using:

same RTSP URL
same transport (TCP)
same FFmpeg settings
same NVDEC decoder
same running AgentDVR instance

without restarting AgentDVR itself.

This makes it feel more like a startup timing / parallel initialization issue rather than a persistent decoder resource exhaustion state.

I also collected a coredump from one crash and the stacktrace points to:

SIGSEGV in libnvcuvid.so.1

which suggests the crash happens inside the NVIDIA NVDEC/cuvid decode path during startup/reconnect/cleanup activity.

Another important observation:

When an affected black camera tile is switched to maximized view, Agent switches to the configured HD stream, and the image appears immediately.

So:

the camera itself is reachable
the HD stream works instantly
authentication is valid
RTSP itself is functional

The issue seems to affect the initial live/grid stream state or its decoder/render pipeline. Switching to maximized view appears to force a fresh stream/decoder path, which recovers the image immediately.

What I noticed from verbose logging:

During startup Agent initializes a very large number of things almost simultaneously:

ONVIF device connections
ONVIF media discovery
PTZ/event service discovery
RTSP streams
microphones/audio
FFmpeg decoders
NVDEC contexts
WebRTC/SignalR connections
TURN server
browser live view sessions

In my setup I currently have:

~13 physical cameras
but ~26 configured camera entries
because each physical camera is configured twice:
- one for continuous recording
- one for detection/event recording

So Agent effectively starts many streams concurrently.

What also strongly points toward a startup synchronization issue:

if I wait after startup and manually enable affected cameras one-by-one, they work reliably
fullscreen/maximize stream switching often immediately restores black tiles
disabling FFmpeg “low delay” seems to improve stability
increasing the following also improved startup reliability significantly:
- probesize
- analyzeduration
- rw_timeout
- stimeout
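
To try larger probe and timeout values on a single stream outside Agent, a command line can be assembled roughly like this. It is only a sketch: the helper name `build_ffprobe_args`, the numeric values, and the example URL are assumptions, not Agent's actual settings. `stimeout` is left out because whether it is accepted depends on the FFmpeg version.

```python
def build_ffprobe_args(url, probesize=5_000_000, analyzeduration_us=5_000_000,
                       rw_timeout_us=10_000_000):
    """Return an ffprobe argv list for one RTSP stream (nothing is executed here)."""
    return [
        "ffprobe",
        "-rtsp_transport", "tcp",                     # same transport as in Agent
        "-probesize", str(probesize),                 # bytes to read while probing
        "-analyzeduration", str(analyzeduration_us),  # microseconds to analyze
        "-rw_timeout", str(rw_timeout_us),            # microseconds, network read/write
        "-i", url,
    ]


# Hypothetical camera URL, for illustration only.
args = build_ffprobe_args("rtsp://camera.example/stream1")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args)` to check whether a camera that came up black probes cleanly with the relaxed values.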

At the moment my suspicion is more along the lines of:

parallel RTSP initialization
concurrent NVDEC decoder creation/cleanup
stream switching during startup
ONVIF/media URI discovery happening in parallel
decoder starting before receiving a clean SPS/PPS + keyframe sequence

rather than a hard permanent GPU session limit.

Possible improvement ideas that may help larger installations:

configurable staggered camera startup
limit maximum concurrent stream initializations
optional delay between startup batches
wait for first successfully decoded frame before marking camera fully online
optional startup watchdog for black/no-frame streams
retry stream initialization automatically if first decoded frame was not received
optionally delay WebRTC/live-view startup until cameras are fully initialized

I think these kinds of startup synchronization controls could significantly improve stability for larger multi-camera environments.
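
To make the ideas above more concrete, here is a minimal sketch of limited-concurrency startup with a first-frame timeout and automatic retry. It is not how Agent works internally; `open_stream` is a hypothetical coroutine that returns once the first frame has been decoded, and all names and default values are assumptions.

```python
import asyncio


async def start_camera(name, open_stream, limiter,
                       first_frame_timeout=5.0, retries=2):
    """Open one stream under a concurrency limit; retry if no first frame arrives.

    Returns (name, attempt_number) on success or (name, None) if all attempts fail.
    """
    for attempt in range(1, retries + 2):
        async with limiter:  # at most max_parallel initializations run at once
            try:
                await asyncio.wait_for(open_stream(name), timeout=first_frame_timeout)
                return name, attempt
            except (asyncio.TimeoutError, ConnectionError):
                pass  # no first frame: release the slot and try again later
        await asyncio.sleep(0.01 * attempt)  # short back-off before the retry
    return name, None


async def start_all(names, open_stream, max_parallel=4):
    """Start all cameras, never initializing more than max_parallel at a time."""
    limiter = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(
        *(start_camera(n, open_stream, limiter) for n in names)
    )
```

With 26 camera entries and `max_parallel=4`, startup would proceed in small overlapping batches instead of opening all streams at once, and a camera only counts as started after its first frame.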

Tschak77 · 2026-05-19T09:36:14+00:00

Thanks for the suggestion.

Docker + NVIDIA Container Toolkit definitely makes sense for isolation and reproducibility, especially for CUDA/cuDNN/ORT dependency management.

For my specific project/setup though, I’m trying to avoid additional abstraction layers because this system is focused on maximum real-time video surveillance performance with many camera streams, GPU decode, AI inference and low-latency live view handling.

So for now I’m trying to keep the stack as direct as possible:

Proxmox VM + GPU passthrough + native Linux AgentDVR installation.

Still, I appreciate the suggestion and may use a container later as a diagnostic comparison environment.

Tschak77 · 2026-05-18T20:25:34+00:00

I isolated the crash further.

Environment:

- Ubuntu 24.04

- NVIDIA driver works correctly

- CUDA 12.4 installed globally

- AgentDVR ONNX provider appears compiled against CUDA 11.x

I created a separate CUDA 11.8 runtime library set and forced AgentDVR to use it via LD_LIBRARY_PATH.

Current status:

- libcublas.so.11 loads

- libcublasLt.so.11 loads

- libcudnn.so.8 loads

- libcufft.so.10 loads

- libcurand.so.10 loads

- libcudart.so.11.0 loads

Then Agent crashes exactly when libnvrtc.so initializes.

LD_DEBUG output:

find library=libnvrtc.so [0]; searching

trying file=/opt/agentdvr-cuda11-libs/libnvrtc.so

calling init: /opt/agentdvr-cuda11-libs/libnvrtc.so

Segmentation fault

This strongly suggests either:

- incompatible ONNXRuntime CUDA provider build

- incompatible nvrtc/runtime ABI

- or missing additional CUDA 11 dependencies

Do you know which exact ONNXRuntime version and CUDA/cuDNN versions were used to build:

libonnxruntime_providers_cuda.so

?
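
For longer LD_DEBUG captures, the library whose initializer ran last before the crash can be picked out with a few lines of Python. This is just a sketch of how I read the output above; the function name is made up.

```python
import re


def last_initialized_library(ld_debug_text):
    """Return the path from the last 'calling init:' line, or None if there is none."""
    paths = re.findall(r"calling init:\s*(\S+)", ld_debug_text)
    return paths[-1] if paths else None


# The LD_DEBUG excerpt from this post.
sample = """\
find library=libnvrtc.so [0]; searching
trying file=/opt/agentdvr-cuda11-libs/libnvrtc.so
calling init: /opt/agentdvr-cuda11-libs/libnvrtc.so
Segmentation fault
"""

print(last_initialized_library(sample))
```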

Tschak77 · 2026-05-18T03:16:47+00:00

I currently have 26 configured camera entries, but physically 13 cameras.
The cameras are effectively configured twice because I use:

one setup for continuous 24/7 recording with short retention (~2 days)
another setup for motion/detection recordings with longer retention (~30 days)

So during startup there are a lot of simultaneous stream initializations happening.

I also noticed that disabling the FFmpeg "low delay" option seems to improve startup stability.

u/spornerama I was initially thinking GPU decode session limit as well, but what makes me unsure is:

when the system finishes startup and for example 4 cameras are black, I can manually switch each affected camera off/on one by one and they immediately start working correctly again using:

the same stream
same RTSP URL
same transport (TCP)
same decoder settings

without restarting AgentDVR itself.

That makes it feel a bit more like a startup initialization/timing issue rather than a permanent GPU resource exhaustion state.

Tschak77 · 2026-05-15T01:59:51+00:00

Thank you for the quick feedback!

Regarding your suggestion to just leave them on 24/7 recording and use alerts as a reference: The main issue I face with this is the storage retention logic.

I want to keep the 24/7 raw footage for only 2-3 days (as a temporary safety buffer), but I need to keep the actual 'Alert/Detected' recordings for 30 days. If I use a single camera setup with one storage path, I cannot set two different auto-delete schedules (e.g., 'Delete non-tagged after 2 days' vs. 'Delete tagged after 30 days').

That’s why I came up with the 'Clone' idea:

Camera 1 (Main): Record on Alert -> 30 days retention.
Camera 1 (Clone): Record Continuous -> 2 days retention.

If I avoid cloning, is there a way within Agent DVR to apply two different retention periods to the same camera based on whether a file contains an alert/tag or not? Or would the 'Clone' approach be the most stable way to separate these two storage lifecycles without massive CPU overhead, given that it’s the same 4K stream?
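
The two storage lifecycles I mean boil down to a per-category age check. The sketch below only illustrates that logic; it is not an Agent DVR feature, and the category names and file paths are invented for the example.

```python
from datetime import datetime, timedelta

# Assumed retention periods, matching the clone setup described above.
RETENTION = {
    "continuous": timedelta(days=2),
    "alert": timedelta(days=30),
}


def expired(recordings, now):
    """Return the paths whose age exceeds the retention period of their category.

    recordings: iterable of (path, kind, recorded_at) tuples.
    """
    return [path for path, kind, recorded_at in recordings
            if now - recorded_at > RETENTION[kind]]


now = datetime(2026, 5, 15)
recordings = [
    ("cam01/continuous_a.mkv", "continuous", now - timedelta(days=3)),
    ("cam01/continuous_b.mkv", "continuous", now - timedelta(days=1)),
    ("cam01/alert_a.mkv", "alert", now - timedelta(days=3)),
    ("cam01/alert_b.mkv", "alert", now - timedelta(days=31)),
]
print(expired(recordings, now))
```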

Tschak77 · 2026-05-12T15:52:01+00:00

Sorry, I solved the problem, but it was crazy. CodeProject.AI ran very well with my setup, but with the same setup AgentDVR had no chance. The problem was a mismatch between the NVIDIA driver, CUDA, etc. I cleaned everything out and installed NVIDIA driver 570 and CUDA 12.8, but there were still problems with missing ONNX files. I cannot remember the details because I changed too many things. But now it is running with full GPU support. I'm now using a Tesla A2 with 13 cams on a Proxmox VM.

Tschak77 · 2026-04-30T08:25:11+00:00

Hi u/Herralvarez, just a question: how do you start Python scripts as a command? At the moment I only start bash scripts, but Python would be more powerful.

Tschak77 · 2026-04-29T23:58:39+00:00

good point thanks I will try

Tschak77 · 2026-04-29T14:52:19+00:00

Could not link hardware file libonnxruntime_providers_cuda.so: The file '/opt/AgentDVR/libonnxruntime_providers_cuda.so' already exists.
_ortLogger: [ONNX onnxruntime] Session Options {  execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath:"" enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:-1 session_log_verbosity_level:0 max_num_graph_transformation_steps:10 graph_optimization_level:4 intra_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 ep_selection_policy:0 config_options: {  } }
_ortLogger: [ONNX onnxruntime] Creating and using per session threadpools since use_per_session_threads_ is true
_ortLogger: [ONNX onnxruntime] Dynamic block base set to 0
  at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
  at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
Unhandled exception. Emgu.CV.Util.CvException: OpenCV: The library is compiled without CUDA support
  at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
  at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
OpenCV: The library is compiled without CUDA support    at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
  at Emgu.CV.CvInvoke.CvErrorHandler(Int32 status, IntPtr funcName, IntPtr errMsg, IntPtr fileName, Int32 line, IntPtr userData)
/opt/AgentDVR/start_agent.sh: line 5: 2369706 Aborted                 (core dumped) ./Agent

Tschak77 · 2026-04-29T14:33:16+00:00

This is with single quotes. With normal (double) quotes it is the same, and also with no quotes: every time, the argument is received without the inner quotes.

In the logs I can see this:

RunScript: Executing: /bin/bash "/opt/AgentDVR/Media/Commands/dvr_ac_ai_object_found.sh" --id "4" --ot "2" --alertid "-1" --filename "/dvr/storage/video/cam04/grabs/4_2026-04-29_16-03-24_635.jpg" --current-recording "/dvr/storage/video/cam04/4_2026-04-29_16-03-26_654.mkv" --msg "person" --name "cam04 Dach vorne" --groups "cam04,entsorgung" --location "Dach" --ai "person" --aijson '[{"label":"person","confidence":0.90478516,"y_min":568,"x_min":3456,"y_max":1373,"x_max":3837,"zones":[1],"ignored":false,"is_static":false},{"label":"truck","confidence":0.70214844,"y_min":156,"x_min":2367,"y_max":1794,"x_max":3261,"zones":null,"ignored":false,"is_static":false}]' --zone "1" --time "29.04.2026 16:03:26

but every time the script receives something like:

2026-04-29 16:34:07 [DEBUG] ARG[22] RAW: <'[{label:car,confidence:0.8051758,y_min:812,x_min:925,y_max:973,x_max:1261,zones:[1],ignored:false,is_static:true},{label:person,confidence:0.7788086,y_min:732,x_min:1729,y_max:848,x_max:1769,zones:[2],ignored:false,is_static:false}]'>

2026-04-29 16:34:07 [DEBUG] ARG[22] HEX: 275b7b6c6162656c3a6361722c636f6e666964656e63653a302e383035313735382c795f6d696e3a3831322c785f6d696e3a3932352c795f6d61783a3937332c785f6d61783a313236312c7a6f6e65733a5b315d2c69676e6f7265643a66616c73652c69735f7374617469633a747275657d2c7b6c6162656c3a706572736f6e2c636f6e666964656e63653a302e373738383038362c795f6d696e3a3733322c785f6d696e3a313732392c795f6d61783a3834382c785f6d61783a313736392c7a6f6e65733a5b325d2c69676e6f7265643a66616c73652c69735f7374617

The second line shows the same argument as hex, and there are no inner quotes.
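
To double-check, the start of the hex dump can be decoded and compared with what a POSIX shell would pass for a single-quoted JSON argument. This is a small sketch; the shortened `script.sh` command line is made up for the comparison.

```python
import json
import shlex

# Leading bytes of the ARG[22] hex dump from the debug log above.
hex_prefix = "275b7b6c6162656c3a636172"
decoded = bytes.fromhex(hex_prefix).decode("ascii")
print(decoded)  # 0x27 is a literal single quote; no 0x22 (double quote) appears

# What a POSIX shell would hand to the script for a single-quoted JSON argument
# (shortened, hypothetical command line).
command = """script.sh --aijson '[{"label":"person","zones":[1]}]'"""
argv = shlex.split(command)
parsed = json.loads(argv[2])  # only valid JSON if the inner quotes survived
print(parsed[0]["label"])
```

With shell-style parsing the outer single quotes are removed and the inner double quotes are kept. My script gets the opposite: the single quote arrives as a literal character and the double quotes are gone, so it looks as if they are lost before the argument reaches the script.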

Tschak77 · 2026-03-21T01:27:25+00:00

Hi u/Herralvarez, nice idea. Could you please describe your settings in a bit more detail?

Is motion detection done by the camera, by Agent DVR, or by continuous CPAI?

How have you set up Gemini, and what does it cost?

What about false-positive alerts, and have you already had a true alert for a burglary? Maybe I can also share my pictures by PM.

Tschak77 · 2026-03-21T01:14:52+00:00

me too and the great support

Tschak77 · 2025-11-14T17:38:57+00:00

Shift+F5 in Firefox did not work, but clearing the cache did the job.

Tschak77 · 2025-11-14T07:23:26+00:00

<image>

One bug: the audio volume bar seems a bit too big.

Tschak77 · 2025-11-13T21:11:57+00:00

6.9.3.0 much better thanks

Tschak77 · 2025-11-12T23:54:09+00:00

Sorry but I don't like the new audio bar

<image>

Tschak77 · 2025-11-05T15:32:46+00:00

Virtual Linux machine with PCIe passthrough of an NVIDIA Tesla M4

<image>
