Deadlock in SNMPService.Check due to nested RLock when a writer is waiting #679

Closed
opened 2026-03-28 04:27:21 +00:00 by mfreeman451 · 0 comments
Owner

Imported from GitHub.

Original GitHub issue: #2141
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2141
Original created: 2025-12-16T05:14:49Z


Summary

  • Context: The Check method in SNMPService is used to determine the overall health status of all SNMP targets being monitored. It's called by health check endpoints to report service availability.
  • Bug: The Check method acquires a read lock (RLock) and then calls GetStatus, which tries to acquire another read lock on the same mutex. This causes a deadlock when a writer (such as handleDataPoint) is waiting to acquire a write lock.
  • Actual vs. expected: When a writer is waiting for the lock, the second RLock attempt in GetStatus blocks indefinitely until the writer completes. However, the writer cannot complete because it's waiting for the first RLock (held by Check) to be released. Expected behavior is that Check should complete without deadlock.
  • Impact: The service can deadlock during normal operation when status checks and data point updates occur concurrently, causing the entire SNMP monitoring service to hang and become unresponsive.

Code with bug

// pkg/checker/snmp/service.go:31-59
func (s *SNMPService) Check(ctx context.Context) (available bool, msg string) {
    s.mu.RLock()      // <-- BUG 🔴 First read lock acquired
    defer s.mu.RUnlock()

    // Re-using the GetStatus logic to get the detailed map
    statusMap, err := s.GetStatus(ctx)  // <-- BUG 🔴 Calls GetStatus which tries to acquire another RLock
    if err != nil {
        return false, string(jsonError(fmt.Sprintf("Error getting detailed SNMP status: %v", err)))
    }

    // Marshal the status map to JSON for the message content
    statusJSON, err := json.Marshal(statusMap)
    if err != nil {
        return false, string(jsonError(fmt.Sprintf("Error marshaling SNMP status to JSON: %v", err)))
    }

    // Determine overall availability
    overallAvailable := true

    for _, targetStatus := range statusMap {
        if !targetStatus.Available {
            overallAvailable = false
            break
        }
    }

    // Always return the marshaled JSON in the message, regardless of overall availability
    return overallAvailable, string(statusJSON)
}
// pkg/checker/snmp/service.go:177-179
func (s *SNMPService) GetStatus(_ context.Context) (map[string]TargetStatus, error) {
    s.mu.RLock()      // <-- BUG 🔴 Second read lock - blocks when writer is waiting
    defer s.mu.RUnlock()
    // ... rest of method
}
// pkg/checker/snmp/service.go:353-355
func (s *SNMPService) handleDataPoint(targetName string, point *DataPoint, aggregator Aggregator) {
    s.mu.Lock()      // <-- Writer that blocks between the two RLocks
    defer s.mu.Unlock()
    // ... rest of method
}

Evidence

Example

Consider this execution sequence:

  1. T=0ms: Goroutine A (Check) acquires s.mu.RLock() at line 32
  2. T=10ms: Goroutine B (handleDataPoint) tries to acquire s.mu.Lock() at line 354 and blocks (waiting for read lock to be released)
  3. T=20ms: Goroutine A (Check) calls GetStatus(ctx) at line 36
  4. T=21ms: Inside GetStatus, Goroutine A tries to acquire s.mu.RLock() at line 178
  5. DEADLOCK:
    • Goroutine A's second RLock blocks because Goroutine B (the writer) is waiting
    • Goroutine B cannot acquire the write lock because Goroutine A still holds the first read lock
    • Goroutine A cannot release the first read lock because it's blocked waiting for the second read lock

Go's sync.RWMutex implements write-preferring semantics: when a writer is waiting for a lock, new read lock requests are blocked to prevent writer starvation. This causes the deadlock.

Failing test

Test script

/*
 * Copyright 2025 Carver Automation Corporation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// Package snmp pkg/checker/snmp/service_deadlock_test.go
package snmp

import (
	"context"
	"testing"
	"time"

	"github.com/carverauto/serviceradar/pkg/logger"
)

// TestCheckDeadlockWithConcurrentWrite demonstrates the deadlock bug in the Check method.
// The bug occurs when:
// 1. Check() acquires RLock
// 2. A writer (e.g., handleDataPoint) tries to acquire Lock and waits
// 3. Check() calls GetStatus() which tries to acquire another RLock
// 4. The second RLock blocks because there's a waiting writer
// 5. Deadlock: Check holds first RLock, writer waits for it, Check waits for second RLock
func TestCheckDeadlockWithConcurrentWrite(t *testing.T) {
	config := &SNMPConfig{
		NodeAddress: "localhost:50051",
		ListenAddr:  ":50052",
		Partition:   "test-partition",
		Targets: []Target{
			{
				Name:      "test-target",
				Host:      "192.168.1.1",
				Port:      161,
				Community: "public",
				Version:   Version2c,
				Interval:  Duration(30 * time.Second),
				Timeout:   Duration(5 * time.Second),
				Retries:   2,
				MaxPoints: 1000,
				OIDs: []OIDConfig{
					{
						OID:      ".1.3.6.1.2.1.1.3.0",
						Name:     "sysUptime",
						DataType: TypeGauge,
						Scale:    1.0,
					},
				},
			},
		},
	}

	testLogger := logger.NewTestLogger()
	service, err := NewSNMPService(config, testLogger)
	if err != nil {
		t.Fatalf("Failed to create service: %v", err)
	}

	ctx := context.Background()

	// Initialize service status
	service.status["test-target"] = TargetStatus{
		Available: true,
		LastPoll:  time.Now(),
		OIDStatus: make(map[string]OIDStatus),
	}

	// Create an aggregator (required by handleDataPoint)
	service.aggregators["test-target"] = NewAggregator(30*time.Second, 1000)

	// Channel to detect if Check completes
	checkDone := make(chan bool, 1)

	// Start Check() - this will acquire RLock and then try to acquire another RLock in GetStatus()
	go func() {
		// Simulate Check(): acquire the first RLock...
		service.mu.RLock()

		// Hold the RLock long enough for the writer goroutine below to
		// start waiting on Lock()
		time.Sleep(50 * time.Millisecond)

		// ...then call GetStatus, which tries to acquire a second RLock
		// on the same mutex
		_, _ = service.GetStatus(ctx)

		service.mu.RUnlock()
		checkDone <- true
	}()

	// Give Check() time to acquire its first RLock
	time.Sleep(25 * time.Millisecond)

	// Now start a writer (handleDataPoint) that will wait for the RLock to be released
	writerDone := make(chan bool, 1)
	go func() {
		// This will try to acquire a write lock, which will block because of the read lock
		// Once this writer starts waiting, any new RLock attempts will also block
		service.mu.Lock()
		time.Sleep(10 * time.Millisecond)
		service.mu.Unlock()
		writerDone <- true
	}()

	// Wait for either:
	// 1. Check completes (no deadlock)
	// 2. Timeout (deadlock detected)
	select {
	case <-checkDone:
		t.Log("Check completed successfully - no deadlock")
	case <-time.After(2 * time.Second):
		t.Fatal("DEADLOCK DETECTED: Check() did not complete within timeout. " +
			"This demonstrates the bug where Check() acquires RLock, then calls GetStatus() " +
			"which tries to acquire another RLock while a writer is waiting.")
	}

	// Clean up: wait for writer to complete
	select {
	case <-writerDone:
		t.Log("Writer completed")
	case <-time.After(1 * time.Second):
		t.Log("Writer still waiting (expected if Check completed)")
	}
}

Test output

$ export PATH=$PATH:/usr/local/go/bin && timeout 10 go test -v -count=1 ./pkg/checker/snmp -run TestCheckDeadlockWithConcurrentWrite
exit code: 124 (timeout - deadlock detected)

The test times out (exit code 124), confirming the deadlock occurs. The test does not produce any output because it hangs indefinitely.

Full context

The Check method is part of the SNMPService which implements a health check interface. It's called periodically by health check systems to determine if the SNMP monitoring service is functioning correctly.

The method was modified in commit bde2169e23beace1a4bdcdfcb127d0db764b7fb6 (June 30, 2025) to call GetStatus instead of directly iterating over collectors. The previous implementation was:

// Original implementation (before bde2169e)
func (s *SNMPService) Check(ctx context.Context) (bool, string) {
    s.mu.RLock()
    defer s.mu.RUnlock()

    // Check each target's status
    for name, collector := range s.collectors {
        status := collector.GetStatus()
        if !status.Available {
            return false, fmt.Sprintf("target %s is unavailable: %s", name, status.Error)
        }
    }
    return true, ""
}

The new implementation was presumably intended to:

  1. Return richer status information (JSON format)
  2. Reuse the merging logic in GetStatus that combines service and collector status

However, this introduced the deadlock bug because GetStatus also acquires a read lock.

The handleDataPoint method runs in separate goroutines (one per target) created by processResults (line 325 in service.go). These goroutines continuously process data points from SNMP collectors and update the service's internal status by acquiring write locks. This creates frequent contention with the Check method's read locks.

External documentation

From the Go sync.RWMutex documentation (https://pkg.go.dev/sync#RWMutex):

If a goroutine holds a RLock for reading and another goroutine might call Lock,
no goroutine should expect to be able to acquire a read lock until the initial
read lock is released. In particular, this prohibits recursive read locking.
This is to ensure that the lock eventually becomes available; a blocked Lock
call excludes new readers from acquiring the lock.

Why has this bug gone undetected?

This bug has likely gone undetected for several reasons:

  1. Timing-dependent: The deadlock only occurs when the exact sequence of events happens:

    • Check acquires RLock
    • handleDataPoint (or another writer) tries to acquire Lock and waits
    • Check then calls GetStatus which tries to acquire another RLock

    This requires precise timing where the writer request arrives between the two RLock calls.

  2. Low write frequency: The handleDataPoint method is only called when new SNMP data arrives (based on the polling interval, typically 30-60 seconds). If status checks happen infrequently or at different times than data updates, the deadlock won't occur.

  3. Short critical sections: Both Check and GetStatus hold their locks for relatively short periods (just reading maps and copying data), so the window for the race is small.

  4. Test environment limitations: Unit tests use mocks and may not exercise the concurrent paths. Integration tests may not run long enough or with sufficient concurrency to trigger the deadlock.

  5. Production monitoring gaps: When a deadlock occurs, the service simply hangs rather than crashing. If health check timeouts are configured, the service might be restarted before the deadlock is noticed or investigated.

  6. Recent introduction: The bug was introduced in commit bde2169e on June 30, 2025, so it's relatively recent and may not have been deployed widely or run under high concurrency conditions.

Recommended fix

Remove the redundant lock acquisition in Check by not calling GetStatus while holding the lock:

// pkg/checker/snmp/service.go:31-59
func (s *SNMPService) Check(ctx context.Context) (available bool, msg string) {
    // Don't acquire lock here - let GetStatus handle locking  // <-- FIX 🟢

    // Re-using the GetStatus logic to get the detailed map
    statusMap, err := s.GetStatus(ctx)  // <-- FIX 🟢 GetStatus acquires its own lock
    if err != nil {
        return false, string(jsonError(fmt.Sprintf("Error getting detailed SNMP status: %v", err)))
    }

    // Marshal the status map to JSON for the message content
    statusJSON, err := json.Marshal(statusMap)
    if err != nil {
        return false, string(jsonError(fmt.Sprintf("Error marshaling SNMP status to JSON: %v", err)))
    }

    // Determine overall availability
    overallAvailable := true

    for _, targetStatus := range statusMap {
        if !targetStatus.Available {
            overallAvailable = false
            break
        }
    }

    // Always return the marshaled JSON in the message, regardless of overall availability
    return overallAvailable, string(statusJSON)
}

This fix simply removes lines 32-33 (the s.mu.RLock() and defer s.mu.RUnlock()). The lock protection is still provided by GetStatus, so there's no loss of thread safety. This restores the ability for Check to complete without nested locking.