Fix cgroup-network spawn server cleanup on fatal exit by ktsaou · Pull Request #21080 · netdata/netdata · GitHub

Conversation

@ktsaou ktsaou commented Sep 30, 2025

Summary

  • Fix spawn server orphaning in cgroup-network when fatal() is called
  • Prevent zombie processes when netdata runs as PID 1 in Docker

Root Cause

When cgroup-network calls fatal() after creating its spawn server (line 759), the spawn server process is orphaned and reparented to PID 1. In Docker, netdata can run as PID 1 (this is not the default). In that case, when the orphaned spawn server exits it becomes a zombie, because netdata does not reap processes it did not start itself.
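For context, the sketch below shows what reaping looks like for a process in the PID 1 role: it collects the exit status of every exited child, including orphans that the kernel has reparented to it. This is not what this PR changes (the PR avoids creating the orphan in the first place); `reap_exited_children()` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of netdata.
 * Reaps exited children of the calling process and returns how many were
 * collected. Pass WNOHANG to return immediately when no child has exited
 * yet, or 0 to block until every child has exited. */
static int reap_exited_children(int flags) {
    int reaped = 0;
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, NULL, flags);
        if (pid > 0) {
            reaped++;            /* one zombie collected */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        break;
    }

    /* With valid flags the loop ends either because children exist but none
     * has exited yet (pid == 0, WNOHANG only) or because no children are
     * left (ECHILD). */
    assert(pid == 0 || errno == ECHILD);
    return reaped;
}
```

A process that never makes a call like this for reparented orphans leaves them in the zombie state after they exit, which is the symptom reported in the linked issue.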

Solution

Register a fatal callback that properly destroys the spawn server before exit:

  • Add cleanup_spawn_server_on_fatal() callback
  • Register callback immediately after spawn server creation
  • Set spawn_server = NULL after destruction in both cleanup paths (fatal callback + normal exit)
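The pattern in the list above can be sketched as follows. This is a minimal, self-contained illustration and not the code merged in this PR: only `cleanup_spawn_server_on_fatal()` and the `spawn_server` variable are named in the description. The `SPAWN_SERVER` struct layout and every `demo_*` function are hypothetical stand-ins for the real netdata spawn server and fatal-hook APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real spawn server handle. */
typedef struct spawn_server { int alive; } SPAWN_SERVER;

static SPAWN_SERVER *spawn_server = NULL;
static int destroy_calls = 0;          /* demo-only counter */

/* Hypothetical stand-ins for creating/destroying the spawn server. */
static SPAWN_SERVER *demo_spawn_server_create(void) {
    SPAWN_SERVER *s = calloc(1, sizeof(*s));
    if (s) s->alive = 1;
    return s;
}

static void demo_spawn_server_destroy(SPAWN_SERVER *s) {
    free(s);
    destroy_calls++;
}

/* Hypothetical fatal-hook mechanism: one callback run before exiting. */
typedef void (*fatal_hook_t)(void);
static fatal_hook_t fatal_hook = NULL;

static void demo_register_fatal_hook(fatal_hook_t cb) {
    assert(cb != NULL);
    fatal_hook = cb;
}

/* What a fatal() implementation would call just before terminating. */
static void demo_run_fatal_hook(void) {
    if (fatal_hook)
        fatal_hook();
}

/* Destroy the spawn server and clear the pointer, so that a second cleanup
 * path (for example the normal exit path) becomes a no-op. */
static void cleanup_spawn_server_on_fatal(void) {
    if (spawn_server) {
        demo_spawn_server_destroy(spawn_server);
        spawn_server = NULL;
    }
}
```

In this sketch, a caller would assign `spawn_server = demo_spawn_server_create()` and immediately call `demo_register_fatal_hook(cleanup_spawn_server_on_fatal)`. Resetting the pointer to NULL is what makes it safe for both the fatal path and the normal exit path to attempt cleanup without destroying the server twice.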

Safety Analysis

  • The fatal() call (line 759) occurs before any fork() (line 825)
  • No race conditions with parent/child cleanup
  • Only one fatal() call exists in the entire program

Test Plan

  • Verify compilation
  • Test with invalid NETDATA_HOST_PREFIX to trigger fatal()
  • Confirm no zombie processes remain in Docker

Fixes #20565

@ktsaou ktsaou requested a review from thiagoftsm as a code owner September 30, 2025 21:28
@github-actions github-actions bot added area/collectors Everything related to data collection collectors/cgroups labels Sep 30, 2025
When cgroup-network calls fatal() after creating its spawn server,
the spawn server process is orphaned and reparented to PID 1 (netdata
in Docker). When the orphaned spawn server exits, it becomes a zombie
because netdata is not reaping non-child processes.

This fix registers a fatal callback that properly destroys the spawn
server before exit, preventing orphaning.

Changes:
- Add cleanup_spawn_server_on_fatal() callback
- Register callback immediately after spawn server creation
- Set spawn_server = NULL after destruction in both cleanup paths

Fixes: netdata#20565
@ilyam8 ilyam8 merged commit 09a2bd5 into netdata:master Oct 3, 2025
108 checks passed
stelfrag pushed a commit to stelfrag/netdata that referenced this pull request Oct 3, 2025
@stelfrag stelfrag mentioned this pull request Oct 3, 2025
Ferroin pushed a commit that referenced this pull request Oct 15, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: Netdata Docker container leaves zombie processes
