Batch identifier lookup ignores partition, merging devices across tenants #678
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar#678
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub.
Original GitHub issue: #2140
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2140
Original created: 2025-12-16T05:13:31Z
Summary
batchLookupByStrongIdentifiersfunction in the Identity Engine is responsible for efficiently resolving device identifiers to canonical device IDs for batch operations during device update processing.Code with bug
The database function being called:
The SQL query used:
Evidence
Example
Consider a multi-tenant ServiceRadar deployment with two tenants:
Initial State:
AA:BB:CC:DD:EE:FF→ Device IDsr:tenant-a-device-123AA:BB:CC:DD:EE:FF→ Device IDsr:tenant-b-device-456These are legitimately different physical devices in different locations, managed by different tenants.
Batch Processing:
Two device updates arrive in the same batch:
{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-A", IP: "10.1.1.1"}{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-B", IP: "10.2.1.1"}Buggy Behavior:
batchLookupByStrongIdentifierscollects both MACs:["AA:BB:CC:DD:EE:FF"]BatchGetDeviceIDsByIdentifier(ctx, "mac", ["AA:BB:CC:DD:EE:FF"]){"AA:BB:CC:DD:EE:FF": "sr:tenant-a-device-123"}(first match found)sr:tenant-a-device-123Expected Behavior:
sr:tenant-a-device-123sr:tenant-b-device-456Inconsistency within the codebase
Reference code
pkg/registry/identity_engine.go:444-478The single lookup version correctly passes the partition to the database layer:
pkg/db/cnpg_identity_engine.go:63-87The SQL for single lookup:
Current code
pkg/registry/identity_engine.go:480-573pkg/db/cnpg_identity_engine.go:91-117The SQL for batch lookup:
Contradiction
The single-lookup path correctly implements partition isolation by:
The batch-lookup path violates partition isolation by:
This inconsistency means that the batch processing optimization introduces a correctness bug that doesn't exist in the single-item processing path.
Inconsistency with database schema
Database schema
pkg/db/cnpg/migrations/00000000000001_schema.up.sqlContradiction
The database schema's unique constraint
(identifier_type, identifier_value, partition)explicitly allows the same identifier value to exist in multiple partitions. This is by design to support multi-tenant isolation.However, the
BatchGetDeviceIDsByIdentifierfunction queries without considering partition, which means:The code violates the schema's design intent of partition-based isolation.
Failing test
Test script
Test output
The test clearly shows that when two updates with the same identifier but different partitions are processed in a batch, both incorrectly receive the same device ID from one partition, violating partition isolation.
Full context
The Identity Engine (
IdentityEngineinpkg/registry/identity_engine.go) is ServiceRadar's central system for resolving device identities. It's called during device update processing in the Device Registry'sProcessBatchDeviceUpdatesfunction. When multiple device updates arrive (e.g., from SNMP discovery, NetFlow, or other sources), the engine:device_identifiersdatabase table to find existing device IDssr:UUID) to each updateThe engine has two query paths:
lookupByStrongIdentifiers- used for individual device lookups, correctly filters by partitionbatchLookupByStrongIdentifiers- used for performance optimization when processing many updates, MISSING partition filteringThe batch path is an optimization introduced to reduce database round trips. Instead of querying for each identifier individually, it collects all identifiers of the same type and queries them in a single database call using PostgreSQL's
ANYoperator.The buggy batch lookup is called from:
ResolveDeviceIDs(line 318): The main batch processing function used by the Device RegistryServiceRadar supports multi-tenant deployments where different customers/organizations are isolated by partition. The partition is a string field in device updates (e.g., "tenant-A", "tenant-B", "default"). The system relies on partition-aware lookups to maintain data isolation. The
device_identifierstable enforces uniqueness per partition with the constraintUNIQUE (identifier_type, identifier_value, partition), explicitly allowing the same MAC address or other identifier to exist in multiple partitions.When device updates from different partitions are processed in the same batch (which is normal - there's no partition-based batching in
ProcessBatchDeviceUpdates), the bug causes cross-partition device ID assignment.External documentation
PostgreSQL ANY operator
From PostgreSQL Documentation - Row and Array Comparisons:
The
identifier_value = ANY($2)clause matches identifier_value against any element in the array, but without partition filtering, it matches across all partitions.Why has this bug gone undetected?
This bug has likely gone undetected because:
Single-tenant deployments: Many ServiceRadar deployments use only the default partition ("default"), meaning all devices are in the same partition. In this case, the missing partition filter doesn't cause incorrect behavior - it just returns results without filtering on a field that has the same value for all rows.
Rare collision scenario: The bug only manifests when:
In many real-world scenarios, even in multi-tenant deployments, different tenants rarely have devices with identical MAC addresses, especially if they're monitoring different physical networks.
Fallback behavior masks the issue: If the batch lookup returns a wrong device ID from partition A for an update in partition B, the system doesn't fail catastrophically. It just incorrectly merges two separate devices. The monitoring continues to work, just with incorrect device associations. This makes it harder to detect than a crash or error.
Single-item path works correctly: When testing with individual device updates or small batches where devices don't share identifiers, the system works correctly because the single-item lookup path (
lookupByStrongIdentifiers) has the correct partition filtering.Recent introduction: The Identity Engine appears to be a relatively recent consolidation (based on git history showing "dire rewrite" in commit
180ac4a1from December 2025). The batch optimization may not have been thoroughly tested in multi-tenant scenarios yet.Silent data corruption: Unlike functional errors that cause visible failures, this bug causes silent data corruption where devices are incorrectly merged. Users might notice symptoms (duplicate IPs, wrong device associations) but not connect them to this specific cause.
Recommended fix
The fix requires changes in two layers:
Database layer (
pkg/db/cnpg_identity_engine.go):BatchGetDeviceIDsByIdentifiersignature to accept a partition parameterIdentity Engine layer (
pkg/registry/identity_engine.go):Example fix for Option A: