Deadlock in SNMPService.Check due to nested RLock when a writer is waiting #679
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar#679
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub.
Original GitHub issue: #2141
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2141
Original created: 2025-12-16T05:14:49Z
Summary
Checkmethod inSNMPServiceis used to determine the overall health status of all SNMP targets being monitored. It's called by health check endpoints to report service availability.Checkmethod acquires a read lock (RLock) and then callsGetStatus, which tries to acquire another read lock on the same mutex. This causes a deadlock when a writer (such ashandleDataPoint) is waiting to acquire a write lock.RLockattempt inGetStatusblocks indefinitely waiting for the writer to complete. However, the writer cannot complete because it's waiting for the firstRLock(held byCheck) to be released. Expected behavior is thatCheckshould complete without deadlock.Code with bug
Evidence
Example
Consider this execution sequence:
Check) acquiress.mu.RLock()at line 32handleDataPoint) tries to acquires.mu.Lock()at line 354 and blocks (waiting for read lock to be released)Check) callsGetStatus(ctx)at line 36GetStatus, Goroutine A tries to acquires.mu.RLock()at line 178RLockblocks because Goroutine B (the writer) is waitingGo's
sync.RWMuteximplements write-preferring semantics: when a writer is waiting for a lock, new read lock requests are blocked to prevent writer starvation. This causes the deadlock.Failing test
Test script
Test output
The test times out (exit code 124), confirming the deadlock occurs. The test does not produce any output because it hangs indefinitely.
Full context
The
Checkmethod is part of theSNMPServicewhich implements a health check interface. It's called periodically by health check systems to determine if the SNMP monitoring service is functioning correctly.The method was modified in commit
bde2169e23beace1a4bdcdfcb127d0db764b7fb6(June 30, 2025) to callGetStatusinstead of directly iterating over collectors. The previous implementation was:The new implementation was likely changed to:
GetStatusthat combines service and collector statusHowever, this introduced the deadlock bug because
GetStatusalso acquires a read lock.The
handleDataPointmethod runs in separate goroutines (one per target) created byprocessResults(line 325 in service.go). These goroutines continuously process data points from SNMP collectors and update the service's internal status by acquiring write locks. This creates frequent contention with theCheckmethod's read locks.External documentation
Why has this bug gone undetected?
This bug has likely gone undetected for several reasons:
Timing-dependent: The deadlock only occurs when the exact sequence of events happens:
CheckacquiresRLockhandleDataPoint(or another writer) tries to acquireLockand waitsCheckthen callsGetStatuswhich tries to acquire anotherRLockThis requires precise timing where the writer request arrives between the two
RLockcalls.Low write frequency: The
handleDataPointmethod is only called when new SNMP data arrives (based on the polling interval, typically 30-60 seconds). If status checks happen infrequently or at different times than data updates, the deadlock won't occur.Short critical sections: Both
CheckandGetStatushold their locks for relatively short periods (just reading maps and copying data), so the window for the race is small.Test environment limitations: Unit tests use mocks and may not exercise the concurrent paths. Integration tests may not run long enough or with sufficient concurrency to trigger the deadlock.
Production monitoring gaps: When a deadlock occurs, the service simply hangs rather than crashing. If health check timeouts are configured, the service might be restarted before the deadlock is noticed or investigated.
Recent introduction: The bug was introduced in commit
bde2169eon June 30, 2025, so it's relatively recent and may not have been deployed widely or run under high concurrency conditions.Recommended fix
Remove the redundant lock acquisition in
Checkby not callingGetStatuswhile holding the lock:This fix simply removes lines 32-33 (the
s.mu.RLock()anddefer s.mu.RUnlock()). The lock protection is still provided byGetStatus, so there's no loss of thread safety. This restores the ability forCheckto complete without nested locking.