Shallow copies of HostResult in summary collection/stream cause data races #686
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar#686
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub.
Original GitHub issue: #2148
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2148
Original created: 2025-12-16T05:17:19Z
Summary
collectShardSummariesandprocessShardForSummaryfunctions inbase_processor.gocollect host information from sharded maps to return aggregated sweep summaries to callers.HostResultstructs using*host, which copies only the struct fields but shares the underlying data of pointer and slice fields (PortResults,PortMap,ICMPStatus).HostResultstructs continue to reference the same underlying slices and maps as the originals, allowing concurrent read/write access. The expected behavior is that the returned data should be independent copies that can be safely accessed without holding the shard lock.Code with bug
Bug in collectShardSummaries (line 442)
Bug in processShardForSummary (line 792)
Evidence
Failing test
Test script
Test output
Example
Consider this scenario:
Initial state: Host "192.168.1.1" exists in shard 5 with one open port (80). The
HostResultcontains:PortResults: slice with 1 element pointing to PortResult{Port: 80}PortMap: map with entry80 -> PortResult{Port: 80}Thread A calls GetSummary():
shard.mu.RLock()at line 418summary.hosts = append(summary.hosts, *host)at line 442HostResultstruct, butPortResultsandPortMapstill point to the same underlying datashard.mu.RUnlock()at line 419Thread A continues to use the summary (lock now released):
summary.Hosts[0].PortMap[80]to read port informationThread B calls Process() for port 443 on the same host:
shard.mu.Lock()at line 281updatePortStatus()at line 314host.PortMap[result.Target.Port] = portResultat line 366Data race: Thread A reads from
PortMapwhile Thread B writes to it, without any synchronization.Full context
The
BaseProcessorinpkg/sweeper/base_processor.gois a concurrent data structure that maintains network sweep results across multiple sharded maps. Each shard protects its host data with a read-write mutex (sync.RWMutex).The processor provides two methods to retrieve aggregated sweep summaries:
GetSummary()(line 509) - collects all host results into memory and returns themGetSummaryStream()(line 838) - streams host results through a channel to avoid large memory allocationsBoth methods call internal functions that copy host data from shards:
collectShardSummaries()(line 405) is called byGetSummary()processShardForSummary()(line 769) is called byGetSummaryStream()The issue is that these functions acquire read locks, copy the data, and then release the locks before the copied data is used by callers. The shallow copy means the returned data structures still reference the same underlying memory that can be modified by concurrent
Process()calls.The
HostResultstruct (defined inpkg/models/sweep.go) contains several fields that require deep copying:When new ports are discovered for a host,
updatePortStatus()(line 318) modifies bothPortResultsandPortMap:host.PortResults = append(host.PortResults, portResult)host.PortMap[result.Target.Port] = portResultThese modifications can happen concurrently with reads from the "copied" summaries, creating data races.
The same pattern exists in two additional locations in
pkg/sweeper/memory_store.go:convertToSlice()Why has this bug gone undetected?
This data race has likely gone undetected for several reasons:
Timing-dependent: The race window is very small - it only occurs if a summary is being accessed after its shard lock is released AND a concurrent
Process()call modifies the same host data. In typical usage patterns, summaries might be consumed quickly or the specific timing doesn't align.Go's memory model: Even with the race condition, Go's runtime may not crash immediately. The behavior is undefined, but often "appears to work" - especially on systems with strong memory ordering (like x86-64). The corruption might manifest as stale data rather than crashes.
Not detected without race detector: The bug is only reliably detectable with Go's race detector (
go test -race). Standard testing without the race detector will typically pass, as the undefined behavior may not cause obvious failures.Read-heavy workload: If the typical workload involves more summary reads than concurrent modifications, or if modifications happen in between summary retrievals rather than during, the race is less likely to manifest.
Slice/map implementation details: Modern Go runtime implementations of slices and maps have some internal safety mechanisms that might mask corruption in certain scenarios, though they don't guarantee race-free access.
Testing patterns: The existing test suite may not have concurrent scenarios that exercise both summary retrieval and host modification simultaneously. The tests in
base_processor_test.gooften usedefer processor.cleanup()which suggests they test one operation at a time rather than true concurrent access patterns.Recommended fix
The fix requires performing deep copies of the
HostResultdata structures. Here's the recommended approach:Create a helper function to deep copy a
HostResult:Then replace the shallow copy operations:
summary.hosts = append(summary.hosts, *host)tosummary.hosts = append(summary.hosts, deepCopyHostResult(host))hostCh <- *hosttohostCh <- deepCopyHostResult(host)The same fix should be applied to
pkg/sweeper/memory_store.goat lines 408 and 511.Related bugs
The same shallow copy pattern exists in
pkg/sweeper/memory_store.go:convertToSlice()functionThese locations should be reviewed and fixed using the same deep copy approach.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2148#issuecomment-3662134269
Original created: 2025-12-16T19:51:47Z
closing as completed