Batch identifier lookup ignores partition, merging devices across tenants #678

Closed
opened 2026-03-28 04:27:20 +00:00 by mfreeman451 · 0 comments
Owner

Imported from GitHub.

Original GitHub issue: #2140
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2140
Original created: 2025-12-16T05:13:31Z


Summary

  • Context: The batchLookupByStrongIdentifiers function in the Identity Engine is responsible for efficiently resolving device identifiers to canonical device IDs for batch operations during device update processing.
  • Bug: The batch lookup function does not filter by partition when querying the database for device identifiers, causing it to return device IDs from any partition that matches the identifier.
  • Actual vs. expected: When multiple device updates with the same identifier value but different partitions are processed in a batch, the function returns the first matching device ID (from whichever partition is found first in the database) to ALL updates, regardless of their partition. Expected behavior is that each update should get the device ID from its own partition.
  • Impact: In multi-tenant deployments, this bug violates partition isolation by incorrectly merging devices from different tenants that happen to share the same MAC address or other identifier, potentially causing data leakage between tenants and breaking device identity integrity.

Code with bug

// batchLookupByStrongIdentifiers queries device_identifiers for multiple updates
func (e *IdentityEngine) batchLookupByStrongIdentifiers(
	ctx context.Context,
	updates []*models.DeviceUpdate,
	updateIdentifiers map[*models.DeviceUpdate]*StrongIdentifiers,
) map[*models.DeviceUpdate]string {
	// ... code omitted for brevity ...

	// Collect all identifiers by type
	armisIDs := make([]string, 0)
	integrationIDs := make([]string, 0)
	netboxIDs := make([]string, 0)
	macs := make([]string, 0)

	for _, update := range updates {
		ids := updateIdentifiers[update]
		if ids == nil {
			continue
		}
		if ids.ArmisID != "" {
			armisIDs = append(armisIDs, ids.ArmisID)
		}
		if ids.IntegrationID != "" {
			integrationIDs = append(integrationIDs, ids.IntegrationID)
		}
		if ids.NetboxID != "" {
			netboxIDs = append(netboxIDs, ids.NetboxID)
		}
		if ids.MAC != "" {
			macs = append(macs, ids.MAC)  // <-- BUG 🔴: Collects MACs from all partitions together
		}
	}

	// Batch query each identifier type
	identifierToDevice := make(map[string]string)

	if len(macs) > 0 {
		results, err := e.db.BatchGetDeviceIDsByIdentifier(ctx, IdentifierTypeMAC, macs)
		// <-- BUG 🔴: No partition parameter passed
		if err == nil {
			for idValue, deviceID := range results {
				identifierToDevice[IdentifierTypeMAC+":"+idValue] = deviceID
			}
		}
	}

	// ... similar code for other identifier types ...
}

The database function being called:

// BatchGetDeviceIDsByIdentifier looks up device IDs for multiple identifier values of the same type.
// Returns a map of identifier_value -> device_id.
func (db *DB) BatchGetDeviceIDsByIdentifier(ctx context.Context, identifierType string, identifierValues []string) (map[string]string, error) {
	result := make(map[string]string)

	if !db.useCNPGWrites() || identifierType == "" || len(identifierValues) == 0 {
		return result, nil
	}

	rows, err := db.conn().Query(ctx, batchGetDeviceIDsByIdentifierSQL, identifierType, identifierValues)
	// <-- BUG 🔴: SQL query doesn't filter by partition
	// ...
}

The SQL query used:

batchGetDeviceIDsByIdentifierSQL = `
SELECT identifier_value, device_id
FROM device_identifiers
WHERE identifier_type = $1
  AND identifier_value = ANY($2)`
// <-- BUG 🔴: Missing "AND partition = $3" filter

Evidence

Example

Consider a multi-tenant ServiceRadar deployment with two tenants:

Initial State:

  • Tenant A (partition "tenant-A") has a device with MAC AA:BB:CC:DD:EE:FF → Device ID sr:tenant-a-device-123
  • Tenant B (partition "tenant-B") has a device with MAC AA:BB:CC:DD:EE:FF → Device ID sr:tenant-b-device-456

These are legitimately different physical devices in different locations, managed by different tenants.

Batch Processing:
Two device updates arrive in the same batch:

  1. Update 1: {MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-A", IP: "10.1.1.1"}
  2. Update 2: {MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-B", IP: "10.2.1.1"}

Buggy Behavior:

  1. batchLookupByStrongIdentifiers collects both MACs: ["AA:BB:CC:DD:EE:FF"]
  2. Calls BatchGetDeviceIDsByIdentifier(ctx, "mac", ["AA:BB:CC:DD:EE:FF"])
  3. Database query returns: {"AA:BB:CC:DD:EE:FF": "sr:tenant-a-device-123"} (first match found)
  4. Both updates get assigned sr:tenant-a-device-123
  5. Tenant B's device is now incorrectly merged with Tenant A's device

Expected Behavior:

  • Update 1 should get device ID sr:tenant-a-device-123
  • Update 2 should get device ID sr:tenant-b-device-456
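The failure mode comes down to the shape of the batch path's result map: it is keyed by identifier type and value only, so partition never participates in the lookup. A minimal, self-contained sketch (with simplified, hypothetical types, not ServiceRadar's actual ones) reproduces the collision in memory:

```go
package main

import "fmt"

// buggyBatchKey mirrors how the batch path keys its result map:
// identifier type + value, with no partition component.
func buggyBatchKey(idType, idValue string) string {
	return idType + ":" + idValue
}

func main() {
	// Simulated result of one batch DB round trip. Whichever partition's
	// row the query happened to return first is the only entry.
	identifierToDevice := map[string]string{
		buggyBatchKey("mac", "AA:BB:CC:DD:EE:FF"): "sr:tenant-a-device-123",
	}

	for _, partition := range []string{"tenant-A", "tenant-B"} {
		// The lookup never consults partition, so both tenants get the same ID.
		deviceID := identifierToDevice[buggyBatchKey("mac", "AA:BB:CC:DD:EE:FF")]
		fmt.Printf("partition=%s -> %s\n", partition, deviceID)
	}
}
```

Both iterations print sr:tenant-a-device-123 because nothing in the map key distinguishes tenant-A from tenant-B.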

Inconsistency within the codebase

Reference code

pkg/registry/identity_engine.go:444-478

// lookupByStrongIdentifiers queries the device_identifiers table for a match
func (e *IdentityEngine) lookupByStrongIdentifiers(ctx context.Context, ids *StrongIdentifiers) string {
	if e == nil || e.db == nil {
		return ""
	}

	// Query in priority order
	for _, idType := range StrongIdentifierPriority {
		var idValue string
		switch idType {
		case IdentifierTypeArmis:
			idValue = ids.ArmisID
		case IdentifierTypeIntegration:
			idValue = ids.IntegrationID
		case IdentifierTypeNetbox:
			idValue = ids.NetboxID
		case IdentifierTypeMAC:
			idValue = ids.MAC
		}

		if idValue == "" {
			continue
		}

		deviceID, err := e.db.GetDeviceIDByIdentifier(ctx, idType, idValue, ids.Partition)
		// ^^^ CORRECT: Passes partition parameter
		if err != nil {
			continue // Try next identifier type
		}
		if deviceID != "" {
			return deviceID
		}
	}

	return ""
}

The single lookup version correctly passes the partition to the database layer:

pkg/db/cnpg_identity_engine.go:63-87

func (db *DB) GetDeviceIDByIdentifier(ctx context.Context, identifierType, identifierValue, partition string) (string, error) {
	if !db.useCNPGWrites() || identifierType == "" || identifierValue == "" {
		return "", nil
	}

	if partition == "" {
		partition = defaultPartitionValue
	}

	rows, err := db.conn().Query(ctx, getDeviceIDByIdentifierSQL, identifierType, identifierValue, partition)
	// ^^^ CORRECT: Uses SQL with partition filter
	// ...
}

The SQL for single lookup:

SELECT device_id
FROM device_identifiers
WHERE identifier_type = $1
  AND identifier_value = $2
  AND partition = $3  -- ✅ CORRECT: Includes partition filter
LIMIT 1

Current code

pkg/registry/identity_engine.go:480-573

func (e *IdentityEngine) batchLookupByStrongIdentifiers(
	ctx context.Context,
	updates []*models.DeviceUpdate,
	updateIdentifiers map[*models.DeviceUpdate]*StrongIdentifiers,
) map[*models.DeviceUpdate]string {
	// ... collects identifiers from all updates regardless of partition ...

	if len(armisIDs) > 0 {
		results, err := e.db.BatchGetDeviceIDsByIdentifier(ctx, IdentifierTypeArmis, armisIDs)
		// ❌ WRONG: No partition parameter
		// ...
	}
	// ... similar for integration IDs, netbox IDs, and MACs ...
}

pkg/db/cnpg_identity_engine.go:91-117

func (db *DB) BatchGetDeviceIDsByIdentifier(ctx context.Context, identifierType string, identifierValues []string) (map[string]string, error) {
	result := make(map[string]string)

	if !db.useCNPGWrites() || identifierType == "" || len(identifierValues) == 0 {
		return result, nil
	}

	rows, err := db.conn().Query(ctx, batchGetDeviceIDsByIdentifierSQL, identifierType, identifierValues)
	// ❌ WRONG: Query doesn't include partition
	// ...
}

The SQL for batch lookup:

SELECT identifier_value, device_id
FROM device_identifiers
WHERE identifier_type = $1
  AND identifier_value = ANY($2)
  -- ❌ MISSING: AND partition = $3

Contradiction

The single-lookup path correctly implements partition isolation by:

  1. Accepting partition as a parameter
  2. Including it in the SQL WHERE clause

The batch-lookup path violates partition isolation by:

  1. Not accepting partition as a parameter
  2. Not including it in the SQL WHERE clause
  3. Collecting identifiers from multiple partitions into a single query

This inconsistency means that the batch processing optimization introduces a correctness bug that doesn't exist in the single-item processing path.

Inconsistency with database schema

Database schema

pkg/db/cnpg/migrations/00000000000001_schema.up.sql

CREATE TABLE IF NOT EXISTS device_identifiers (
    id               SERIAL          PRIMARY KEY,
    device_id        TEXT            NOT NULL,
    identifier_type  TEXT            NOT NULL,
    identifier_value TEXT            NOT NULL,
    partition        TEXT            NOT NULL DEFAULT 'default',
    confidence       TEXT            NOT NULL DEFAULT 'strong',
    source           TEXT,
    first_seen       TIMESTAMPTZ     NOT NULL DEFAULT now(),
    last_seen        TIMESTAMPTZ     NOT NULL DEFAULT now(),
    verified         BOOLEAN         NOT NULL DEFAULT FALSE,
    metadata         JSONB           NOT NULL DEFAULT '{}'::jsonb,
    UNIQUE (identifier_type, identifier_value, partition)
    -- ^^^ IMPORTANT: Partition is part of the unique constraint
);

Contradiction

The database schema's unique constraint (identifier_type, identifier_value, partition) explicitly allows the same identifier value to exist in multiple partitions. This is by design to support multi-tenant isolation.

However, the BatchGetDeviceIDsByIdentifier function queries without considering partition, which means:

  1. The same identifier value can exist in multiple partitions, exactly as the schema allows
  2. The query returns only one device ID per identifier value (whichever matching row is found first)
  3. That device ID may belong to the wrong partition

The code violates the schema's design intent of partition-based isolation.
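The schema's intent can be modeled in memory by treating the unique constraint as a composite map key. This sketch (hypothetical types, not the real code) shows that both tenant rows legally coexist, and that any lookup omitting partition matches more than one row:

```go
package main

import "fmt"

// rowKey mirrors UNIQUE (identifier_type, identifier_value, partition):
// all three fields participate in row identity.
type rowKey struct{ Type, Value, Partition string }

// sampleTable holds two rows the schema's unique constraint both permits:
// same type and value, different partition.
var sampleTable = map[rowKey]string{
	{"mac", "AA:BB:CC:DD:EE:FF", "tenant-A"}: "sr:tenant-a-device-123",
	{"mac", "AA:BB:CC:DD:EE:FF", "tenant-B"}: "sr:tenant-b-device-456",
}

// lookupWithPartition is unambiguous: the full composite key selects one row.
func lookupWithPartition(t, v, p string) string {
	return sampleTable[rowKey{t, v, p}]
}

// countWithoutPartition shows why omitting partition is ambiguous:
// multiple rows can match the same (type, value) pair.
func countWithoutPartition(t, v string) int {
	n := 0
	for k := range sampleTable {
		if k.Type == t && k.Value == v {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(lookupWithPartition("mac", "AA:BB:CC:DD:EE:FF", "tenant-B")) // sr:tenant-b-device-456
	fmt.Println(countWithoutPartition("mac", "AA:BB:CC:DD:EE:FF"))           // 2
}
```

A query that ignores partition is forced to collapse those two legal rows into one answer, which is exactly the bug.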

Failing test

Test script

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/carverauto/serviceradar/pkg/db"
	"github.com/carverauto/serviceradar/pkg/logger"
	"github.com/carverauto/serviceradar/pkg/models"
	"github.com/carverauto/serviceradar/pkg/registry"
)

// This test demonstrates a bug in batchLookupByStrongIdentifiers where
// partition filtering is missing, causing device IDs from different partitions
// to be incorrectly matched.
func main() {
	// Create a mock DB that simulates the bug
	mockDB := &MockDBWithPartitionBug{}
	log := logger.NewTestLogger()

	engine := registry.NewIdentityEngine(mockDB, log)

	// Create two device updates with the same MAC but different partitions
	macValue := "AABBCCDDEEFF"
	update1 := &models.DeviceUpdate{
		DeviceID:  "",
		IP:        "10.0.1.1",
		MAC:       &macValue,
		Partition: "tenant-A",
		Source:    models.DiscoverySourceSNMP,
		Timestamp: time.Now(),
	}

	update2 := &models.DeviceUpdate{
		DeviceID:  "",
		IP:        "10.0.2.1",
		MAC:       &macValue,
		Partition: "tenant-B",
		Source:    models.DiscoverySourceSNMP,
		Timestamp: time.Now(),
	}

	// In the database:
	// - tenant-A has device "sr:tenant-a-device" with MAC AABBCCDDEEFF
	// - tenant-B has device "sr:tenant-b-device" with MAC AABBCCDDEEFF
	// These are DIFFERENT devices in DIFFERENT partitions

	ctx := context.Background()
	updates := []*models.DeviceUpdate{update1, update2}

	err := engine.ResolveDeviceIDs(ctx, updates)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Update 1 (partition: %s): DeviceID = %s\n", update1.Partition, update1.DeviceID)
	fmt.Printf("Update 2 (partition: %s): DeviceID = %s\n", update2.Partition, update2.DeviceID)

	// BUG: Both updates might get the same device ID (from the first partition found)
	// because batchLookupByStrongIdentifiers doesn't filter by partition
	if update1.DeviceID == update2.DeviceID {
		fmt.Printf("\n❌ BUG DETECTED: Both updates got the same device ID despite being in different partitions!\n")
		fmt.Printf("This violates partition isolation.\n")
	} else {
		fmt.Printf("\n✅ CORRECT: Updates have different device IDs as expected.\n")
	}
}

// MockDBWithPartitionBug simulates the bug where BatchGetDeviceIDsByIdentifier
// doesn't filter by partition
type MockDBWithPartitionBug struct {
	db.Service // Embed to satisfy interface
}

func (m *MockDBWithPartitionBug) BatchGetDeviceIDsByIdentifier(
	ctx context.Context,
	identifierType string,
	identifierValues []string,
) (map[string]string, error) {
	// BUG: This doesn't take partition into account!
	// It returns the FIRST match found across ALL partitions
	result := make(map[string]string)

	if identifierType == "mac" {
		for _, value := range identifierValues {
			if value == "AABBCCDDEEFF" {
				// Returns device from tenant-A for ALL queries
				// This is wrong when the query should be for tenant-B
				result[value] = "sr:tenant-a-device"
			}
		}
	}

	return result, nil
}

func (m *MockDBWithPartitionBug) Close() error { return nil }

// Other required methods - not used in this test
func (m *MockDBWithPartitionBug) GetDeviceIDByIdentifier(ctx context.Context, identifierType, identifierValue, partition string) (string, error) {
	// This single-lookup version DOES use partition correctly
	if identifierType == "mac" && identifierValue == "AABBCCDDEEFF" {
		if partition == "tenant-A" {
			return "sr:tenant-a-device", nil
		}
		if partition == "tenant-B" {
			return "sr:tenant-b-device", nil
		}
	}
	return "", nil
}

func (m *MockDBWithPartitionBug) GetUnifiedDevicesByIPsOrIDs(ctx context.Context, ips []string, ids []string) ([]*models.UnifiedDevice, error) {
	return nil, nil
}

func (m *MockDBWithPartitionBug) UpsertDeviceIdentifiers(ctx context.Context, identifiers []*models.DeviceIdentifier) error {
	return nil
}

Test output

Update 1 (partition: tenant-A): DeviceID = sr:tenant-a-device
Update 2 (partition: tenant-B): DeviceID = sr:tenant-a-device

❌ BUG DETECTED: Both updates got the same device ID despite being in different partitions!
This violates partition isolation.

The test clearly shows that when two updates with the same identifier but different partitions are processed in a batch, both incorrectly receive the same device ID from one partition, violating partition isolation.

Full context

The Identity Engine (IdentityEngine in pkg/registry/identity_engine.go) is ServiceRadar's central system for resolving device identities. It's called during device update processing in the Device Registry's ProcessBatchDeviceUpdates function. When multiple device updates arrive (e.g., from SNMP discovery, NetFlow, or other sources), the engine:

  1. Extracts strong identifiers (Armis ID, Integration ID, NetBox ID, MAC address) from each update
  2. Queries the device_identifiers database table to find existing device IDs
  3. Assigns canonical ServiceRadar device IDs (format sr:UUID) to each update

The engine has two query paths:

  • Single lookup: lookupByStrongIdentifiers - used for individual device lookups, correctly filters by partition
  • Batch lookup: batchLookupByStrongIdentifiers - used for performance optimization when processing many updates, MISSING partition filtering

The batch path is an optimization introduced to reduce database round trips. Instead of querying for each identifier individually, it collects all identifiers of the same type and queries them in a single database call using PostgreSQL's ANY operator.

The buggy batch lookup is called from:

  • ResolveDeviceIDs (line 318): The main batch processing function used by the Device Registry
  • This is invoked during normal device discovery and monitoring operations

ServiceRadar supports multi-tenant deployments where different customers/organizations are isolated by partition. The partition is a string field in device updates (e.g., "tenant-A", "tenant-B", "default"). The system relies on partition-aware lookups to maintain data isolation. The device_identifiers table enforces uniqueness per partition with the constraint UNIQUE (identifier_type, identifier_value, partition), explicitly allowing the same MAC address or other identifier to exist in multiple partitions.

When device updates from different partitions are processed in the same batch (which is normal - there's no partition-based batching in ProcessBatchDeviceUpdates), the bug causes cross-partition device ID assignment.

External documentation

PostgreSQL ANY operator

From PostgreSQL Documentation - Row and Array Comparisons:

expression operator ANY (array expression)
The right-hand side is a parenthesized expression, which must yield an array value.
The left-hand expression is evaluated and compared to each element of the array using
the given operator. The result is "true" if any true result is obtained.

The identifier_value = ANY($2) clause matches identifier_value against any element in the array, but without partition filtering, it matches across all partitions.
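The interaction between ANY and the missing partition predicate can be sketched in memory. In this hypothetical Go analogue (not the real query path), anyMatch stands in for = ANY($2), and an empty partition argument models the buggy WHERE clause:

```go
package main

import "fmt"

// anyMatch mirrors the semantics of `identifier_value = ANY($2)`:
// true if value equals any element of the array.
func anyMatch(value string, arr []string) bool {
	for _, a := range arr {
		if value == a {
			return true
		}
	}
	return false
}

// row is a hypothetical stand-in for a device_identifiers row.
type row struct {
	Value, Partition, DeviceID string
}

// matchRows applies the batch query's WHERE clause in memory. An empty
// partition argument models the buggy query (no partition predicate).
func matchRows(rows []row, values []string, partition string) []row {
	var out []row
	for _, r := range rows {
		if anyMatch(r.Value, values) && (partition == "" || r.Partition == partition) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []row{
		{"AA:BB:CC:DD:EE:FF", "tenant-A", "sr:tenant-a-device-123"},
		{"AA:BB:CC:DD:EE:FF", "tenant-B", "sr:tenant-b-device-456"},
	}
	values := []string{"AA:BB:CC:DD:EE:FF"}
	fmt.Println(len(matchRows(rows, values, "")))         // buggy query: 2 rows match
	fmt.Println(len(matchRows(rows, values, "tenant-B"))) // fixed query: 1 row matches
}
```

Without the partition predicate, the membership test alone matches rows from every partition; adding `AND partition = $3` narrows the result to the caller's tenant.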

Why has this bug gone undetected?

This bug has likely gone undetected because:

  1. Single-tenant deployments: Many ServiceRadar deployments use only the default partition ("default"), meaning all devices are in the same partition. In this case, the missing partition filter doesn't cause incorrect behavior - it just returns results without filtering on a field that has the same value for all rows.

  2. Rare collision scenario: The bug only manifests when:

    • Multiple partitions are actively in use
    • The same identifier (e.g., MAC address) exists in different partitions
    • Updates for these devices arrive in the same batch
    • The batch processing path is used (not the single-item path)

    In many real-world scenarios, even in multi-tenant deployments, different tenants rarely have devices with identical MAC addresses, especially if they're monitoring different physical networks.

  3. Graceful degradation masks the issue: If the batch lookup returns a wrong device ID from partition A for an update in partition B, the system doesn't fail catastrophically. It silently merges two separate devices; monitoring continues to work, just with incorrect device associations. This makes the bug harder to detect than a crash or an error.

  4. Single-item path works correctly: When testing with individual device updates or small batches where devices don't share identifiers, the system works correctly because the single-item lookup path (lookupByStrongIdentifiers) has the correct partition filtering.

  5. Recent introduction: The Identity Engine appears to be a relatively recent consolidation (based on git history showing "dire rewrite" in commit 180ac4a1 from December 2025). The batch optimization may not have been thoroughly tested in multi-tenant scenarios yet.

  6. Silent data corruption: Unlike functional errors that cause visible failures, this bug causes silent data corruption where devices are incorrectly merged. Users might notice symptoms (duplicate IPs, wrong device associations) but not connect them to this specific cause.

Recommended fix

The fix requires changes in two layers:

  1. Database layer (pkg/db/cnpg_identity_engine.go):

    • Update BatchGetDeviceIDsByIdentifier signature to accept a partition parameter
    • Modify the SQL query to include partition filtering
    • Update the SQL constant:
      batchGetDeviceIDsByIdentifierSQL = `
      SELECT identifier_value, device_id
      FROM device_identifiers
      WHERE identifier_type = $1
        AND identifier_value = ANY($2)
        AND partition = $3`  // <-- FIX 🟢
      
  2. Identity Engine layer (pkg/registry/identity_engine.go):

    • Since updates in a batch can span multiple partitions, a single query can no longer serve the whole batch
    • Either:
      • Option A: Group updates by partition first, then run one batch query per partition
      • Option B: Fall back to single lookups when mixed partitions are detected
    • Option A is recommended: it preserves the round-trip savings of the batch path while restoring correctness

Example fix for Option A:

// Group updates by partition
updatesByPartition := make(map[string][]*models.DeviceUpdate)
for _, update := range updates {
    ids := updateIdentifiers[update]
    if ids != nil {
        partition := ids.Partition
        if partition == "" {
            partition = "default"
        }
        updatesByPartition[partition] = append(updatesByPartition[partition], update)
    }
}

// Process each partition separately
for partition, partitionUpdates := range updatesByPartition {
    // Collect identifiers for this partition
    // Query with partition parameter  // <-- FIX 🟢
    // Map results back to updates
}
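Under the assumption that the fixed database method accepts a partition parameter (as sketched above), Option A can be fleshed out into a self-contained, runnable sketch. All types here are simplified, hypothetical stand-ins for the real models.DeviceUpdate and DB layer:

```go
package main

import "fmt"

// DeviceUpdate is a simplified stand-in for models.DeviceUpdate.
type DeviceUpdate struct {
	MAC, Partition string
	DeviceID       string
}

// fakeBatchGet stands in for the fixed BatchGetDeviceIDsByIdentifier:
// it filters on partition as well as identifier type and value.
func fakeBatchGet(idType string, values []string, partition string) map[string]string {
	db := map[[3]string]string{
		{"mac", "AA:BB:CC:DD:EE:FF", "tenant-A"}: "sr:tenant-a-device-123",
		{"mac", "AA:BB:CC:DD:EE:FF", "tenant-B"}: "sr:tenant-b-device-456",
	}
	out := make(map[string]string)
	for _, v := range values {
		if id, ok := db[[3]string{idType, v, partition}]; ok {
			out[v] = id
		}
	}
	return out
}

// resolveBatch implements Option A: group updates by partition, then run
// one batch query per partition and map results back within that group.
func resolveBatch(updates []*DeviceUpdate) {
	byPartition := make(map[string][]*DeviceUpdate)
	for _, u := range updates {
		p := u.Partition
		if p == "" {
			p = "default"
		}
		byPartition[p] = append(byPartition[p], u)
	}
	for partition, group := range byPartition {
		macs := make([]string, 0, len(group))
		for _, u := range group {
			macs = append(macs, u.MAC)
		}
		results := fakeBatchGet("mac", macs, partition)
		for _, u := range group {
			if id, ok := results[u.MAC]; ok {
				u.DeviceID = id
			}
		}
	}
}

func main() {
	u1 := &DeviceUpdate{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-A"}
	u2 := &DeviceUpdate{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-B"}
	resolveBatch([]*DeviceUpdate{u1, u2})
	fmt.Println(u1.DeviceID) // sr:tenant-a-device-123
	fmt.Println(u2.DeviceID) // sr:tenant-b-device-456
}
```

With per-partition grouping, each update resolves only against rows from its own partition, so u1 and u2 receive their respective device IDs while the number of queries stays at one per partition per identifier type.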
Imported from GitHub. Original GitHub issue: #2140 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/issues/2140 Original created: 2025-12-16T05:13:31Z --- # Summary - **Context**: The `batchLookupByStrongIdentifiers` function in the Identity Engine is responsible for efficiently resolving device identifiers to canonical device IDs for batch operations during device update processing. - **Bug**: The batch lookup function does not filter by partition when querying the database for device identifiers, causing it to return device IDs from any partition that matches the identifier. - **Actual vs. expected**: When multiple device updates with the same identifier value but different partitions are processed in a batch, the function returns the first matching device ID (from whichever partition is found first in the database) to ALL updates, regardless of their partition. Expected behavior is that each update should get the device ID from its own partition. - **Impact**: In multi-tenant deployments, this bug violates partition isolation by incorrectly merging devices from different tenants that happen to share the same MAC address or other identifier, potentially causing data leakage between tenants and breaking device identity integrity. # Code with bug ```go // batchLookupByStrongIdentifiers queries device_identifiers for multiple updates func (e *IdentityEngine) batchLookupByStrongIdentifiers( ctx context.Context, updates []*models.DeviceUpdate, updateIdentifiers map[*models.DeviceUpdate]*StrongIdentifiers, ) map[*models.DeviceUpdate]string { // ... code omitted for brevity ... 
// Collect all identifiers by type armisIDs := make([]string, 0) integrationIDs := make([]string, 0) netboxIDs := make([]string, 0) macs := make([]string, 0) for _, update := range updates { ids := updateIdentifiers[update] if ids == nil { continue } if ids.ArmisID != "" { armisIDs = append(armisIDs, ids.ArmisID) } if ids.IntegrationID != "" { integrationIDs = append(integrationIDs, ids.IntegrationID) } if ids.NetboxID != "" { netboxIDs = append(netboxIDs, ids.NetboxID) } if ids.MAC != "" { macs = append(macs, ids.MAC) // <-- BUG 🔴: Collects MACs from all partitions together } } // Batch query each identifier type identifierToDevice := make(map[string]string) if len(macs) > 0 { results, err := e.db.BatchGetDeviceIDsByIdentifier(ctx, IdentifierTypeMAC, macs) // <-- BUG 🔴: No partition parameter passed if err == nil { for idValue, deviceID := range results { identifierToDevice[IdentifierTypeMAC+":"+idValue] = deviceID } } } // ... similar code for other identifier types ... } ``` The database function being called: ```go // BatchGetDeviceIDsByIdentifier looks up device IDs for multiple identifier values of the same type. // Returns a map of identifier_value -> device_id. func (db *DB) BatchGetDeviceIDsByIdentifier(ctx context.Context, identifierType string, identifierValues []string) (map[string]string, error) { result := make(map[string]string) if !db.useCNPGWrites() || identifierType == "" || len(identifierValues) == 0 { return result, nil } rows, err := db.conn().Query(ctx, batchGetDeviceIDsByIdentifierSQL, identifierType, identifierValues) // <-- BUG 🔴: SQL query doesn't filter by partition // ... 
} ``` The SQL query used: ```go batchGetDeviceIDsByIdentifierSQL = ` SELECT identifier_value, device_id FROM device_identifiers WHERE identifier_type = $1 AND identifier_value = ANY($2)` // <-- BUG 🔴: Missing "AND partition = $3" filter ``` # Evidence ## Example Consider a multi-tenant ServiceRadar deployment with two tenants: **Initial State:** - Tenant A (partition "tenant-A") has a device with MAC `AA:BB:CC:DD:EE:FF` → Device ID `sr:tenant-a-device-123` - Tenant B (partition "tenant-B") has a device with MAC `AA:BB:CC:DD:EE:FF` → Device ID `sr:tenant-b-device-456` These are legitimately different physical devices in different locations, managed by different tenants. **Batch Processing:** Two device updates arrive in the same batch: 1. Update 1: `{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-A", IP: "10.1.1.1"}` 2. Update 2: `{MAC: "AA:BB:CC:DD:EE:FF", Partition: "tenant-B", IP: "10.2.1.1"}` **Buggy Behavior:** 1. `batchLookupByStrongIdentifiers` collects both MACs: `["AA:BB:CC:DD:EE:FF"]` 2. Calls `BatchGetDeviceIDsByIdentifier(ctx, "mac", ["AA:BB:CC:DD:EE:FF"])` 3. Database query returns: `{"AA:BB:CC:DD:EE:FF": "sr:tenant-a-device-123"}` (first match found) 4. Both updates get assigned `sr:tenant-a-device-123` 5. 
Tenant B's device is now incorrectly merged with Tenant A's device **Expected Behavior:** - Update 1 should get device ID `sr:tenant-a-device-123` - Update 2 should get device ID `sr:tenant-b-device-456` ## Inconsistency within the codebase ### Reference code `pkg/registry/identity_engine.go:444-478` ```go // lookupByStrongIdentifiers queries the device_identifiers table for a match func (e *IdentityEngine) lookupByStrongIdentifiers(ctx context.Context, ids *StrongIdentifiers) string { if e == nil || e.db == nil { return "" } // Query in priority order for _, idType := range StrongIdentifierPriority { var idValue string switch idType { case IdentifierTypeArmis: idValue = ids.ArmisID case IdentifierTypeIntegration: idValue = ids.IntegrationID case IdentifierTypeNetbox: idValue = ids.NetboxID case IdentifierTypeMAC: idValue = ids.MAC } if idValue == "" { continue } deviceID, err := e.db.GetDeviceIDByIdentifier(ctx, idType, idValue, ids.Partition) // ^^^ CORRECT: Passes partition parameter if err != nil { continue // Try next identifier type } if deviceID != "" { return deviceID } } return "" } ``` The single lookup version correctly passes the partition to the database layer: `pkg/db/cnpg_identity_engine.go:63-87` ```go func (db *DB) GetDeviceIDByIdentifier(ctx context.Context, identifierType, identifierValue, partition string) (string, error) { if !db.useCNPGWrites() || identifierType == "" || identifierValue == "" { return "", nil } if partition == "" { partition = defaultPartitionValue } rows, err := db.conn().Query(ctx, getDeviceIDByIdentifierSQL, identifierType, identifierValue, partition) // ^^^ CORRECT: Uses SQL with partition filter // ... 
} ``` The SQL for single lookup: ```sql SELECT device_id FROM device_identifiers WHERE identifier_type = $1 AND identifier_value = $2 AND partition = $3 -- ✅ CORRECT: Includes partition filter LIMIT 1 ``` ### Current code `pkg/registry/identity_engine.go:480-573` ```go func (e *IdentityEngine) batchLookupByStrongIdentifiers( ctx context.Context, updates []*models.DeviceUpdate, updateIdentifiers map[*models.DeviceUpdate]*StrongIdentifiers, ) map[*models.DeviceUpdate]string { // ... collects identifiers from all updates regardless of partition ... if len(armisIDs) > 0 { results, err := e.db.BatchGetDeviceIDsByIdentifier(ctx, IdentifierTypeArmis, armisIDs) // ❌ WRONG: No partition parameter // ... } // ... similar for integration IDs, netbox IDs, and MACs ... } ``` `pkg/db/cnpg_identity_engine.go:91-117` ```go func (db *DB) BatchGetDeviceIDsByIdentifier(ctx context.Context, identifierType string, identifierValues []string) (map[string]string, error) { result := make(map[string]string) if !db.useCNPGWrites() || identifierType == "" || len(identifierValues) == 0 { return result, nil } rows, err := db.conn().Query(ctx, batchGetDeviceIDsByIdentifierSQL, identifierType, identifierValues) // ❌ WRONG: Query doesn't include partition // ... } ``` The SQL for batch lookup: ```sql SELECT identifier_value, device_id FROM device_identifiers WHERE identifier_type = $1 AND identifier_value = ANY($2) -- ❌ MISSING: AND partition = $3 ``` ### Contradiction The single-lookup path correctly implements partition isolation by: 1. Accepting partition as a parameter 2. Including it in the SQL WHERE clause The batch-lookup path violates partition isolation by: 1. Not accepting partition as a parameter 2. Not including it in the SQL WHERE clause 3. Collecting identifiers from multiple partitions into a single query This inconsistency means that the batch processing optimization introduces a correctness bug that doesn't exist in the single-item processing path. 
## Inconsistency with database schema

### Database schema

`pkg/db/cnpg/migrations/00000000000001_schema.up.sql`

```sql
CREATE TABLE IF NOT EXISTS device_identifiers (
    id SERIAL PRIMARY KEY,
    device_id TEXT NOT NULL,
    identifier_type TEXT NOT NULL,
    identifier_value TEXT NOT NULL,
    partition TEXT NOT NULL DEFAULT 'default',
    confidence TEXT NOT NULL DEFAULT 'strong',
    source TEXT,
    first_seen TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_seen TIMESTAMPTZ NOT NULL DEFAULT now(),
    verified BOOLEAN NOT NULL DEFAULT FALSE,
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    UNIQUE (identifier_type, identifier_value, partition)
    -- ^^^ IMPORTANT: Partition is part of the unique constraint
);
```

### Contradiction

The database schema's unique constraint `(identifier_type, identifier_value, partition)` explicitly allows the same identifier value to exist in multiple partitions. This is by design, to support multi-tenant isolation.

However, the `BatchGetDeviceIDsByIdentifier` function queries without considering partition, which means:

1. When the same identifier exists in multiple partitions (as the schema allows),
2. the function returns only one device ID (typically the first one found), and
3. that device ID might be from the wrong partition.

The code violates the schema's design intent of partition-based isolation.

## Failing test

### Test script

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/carverauto/serviceradar/pkg/db"
	"github.com/carverauto/serviceradar/pkg/logger"
	"github.com/carverauto/serviceradar/pkg/models"
	"github.com/carverauto/serviceradar/pkg/registry"
)

// This test demonstrates a bug in batchLookupByStrongIdentifiers where
// partition filtering is missing, causing device IDs from different partitions
// to be incorrectly matched.
func main() {
	// Create a mock DB that simulates the bug
	mockDB := &MockDBWithPartitionBug{}
	log := logger.NewTestLogger()
	engine := registry.NewIdentityEngine(mockDB, log)

	// Create two device updates with the same MAC but different partitions
	macValue := "AABBCCDDEEFF"

	update1 := &models.DeviceUpdate{
		DeviceID:  "",
		IP:        "10.0.1.1",
		MAC:       &macValue,
		Partition: "tenant-A",
		Source:    models.DiscoverySourceSNMP,
		Timestamp: time.Now(),
	}

	update2 := &models.DeviceUpdate{
		DeviceID:  "",
		IP:        "10.0.2.1",
		MAC:       &macValue,
		Partition: "tenant-B",
		Source:    models.DiscoverySourceSNMP,
		Timestamp: time.Now(),
	}

	// In the database:
	//   - tenant-A has device "sr:tenant-a-device" with MAC AABBCCDDEEFF
	//   - tenant-B has device "sr:tenant-b-device" with MAC AABBCCDDEEFF
	// These are DIFFERENT devices in DIFFERENT partitions.

	ctx := context.Background()
	updates := []*models.DeviceUpdate{update1, update2}

	err := engine.ResolveDeviceIDs(ctx, updates)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Update 1 (partition: %s): DeviceID = %s\n", update1.Partition, update1.DeviceID)
	fmt.Printf("Update 2 (partition: %s): DeviceID = %s\n", update2.Partition, update2.DeviceID)

	// BUG: Both updates might get the same device ID (from the first partition found)
	// because batchLookupByStrongIdentifiers doesn't filter by partition.
	if update1.DeviceID == update2.DeviceID {
		fmt.Printf("\n❌ BUG DETECTED: Both updates got the same device ID despite being in different partitions!\n")
		fmt.Printf("This violates partition isolation.\n")
	} else {
		fmt.Printf("\n✅ CORRECT: Updates have different device IDs as expected.\n")
	}
}

// MockDBWithPartitionBug simulates the bug where BatchGetDeviceIDsByIdentifier
// doesn't filter by partition.
type MockDBWithPartitionBug struct {
	db.Service // Embed to satisfy interface
}

func (m *MockDBWithPartitionBug) BatchGetDeviceIDsByIdentifier(
	ctx context.Context,
	identifierType string,
	identifierValues []string,
) (map[string]string, error) {
	// BUG: This doesn't take partition into account!
	// It returns the FIRST match found across ALL partitions.
	result := make(map[string]string)

	if identifierType == "mac" {
		for _, value := range identifierValues {
			if value == "AABBCCDDEEFF" {
				// Returns device from tenant-A for ALL queries.
				// This is wrong when the query should be for tenant-B.
				result[value] = "sr:tenant-a-device"
			}
		}
	}

	return result, nil
}

func (m *MockDBWithPartitionBug) Close() error { return nil }

// Other required methods - not used in this test

func (m *MockDBWithPartitionBug) GetDeviceIDByIdentifier(ctx context.Context, identifierType, identifierValue, partition string) (string, error) {
	// This single-lookup version DOES use partition correctly
	if identifierType == "mac" && identifierValue == "AABBCCDDEEFF" {
		if partition == "tenant-A" {
			return "sr:tenant-a-device", nil
		}
		if partition == "tenant-B" {
			return "sr:tenant-b-device", nil
		}
	}
	return "", nil
}

func (m *MockDBWithPartitionBug) GetUnifiedDevicesByIPsOrIDs(ctx context.Context, ips []string, ids []string) ([]*models.UnifiedDevice, error) {
	return nil, nil
}

func (m *MockDBWithPartitionBug) UpsertDeviceIdentifiers(ctx context.Context, identifiers []*models.DeviceIdentifier) error {
	return nil
}
```

### Test output

```
Update 1 (partition: tenant-A): DeviceID = sr:tenant-a-device
Update 2 (partition: tenant-B): DeviceID = sr:tenant-a-device

❌ BUG DETECTED: Both updates got the same device ID despite being in different partitions!
This violates partition isolation.
```

The test clearly shows that when two updates with the same identifier but different partitions are processed in a batch, both incorrectly receive the same device ID from one partition, violating partition isolation.

# Full context

The Identity Engine (`IdentityEngine` in `pkg/registry/identity_engine.go`) is ServiceRadar's central system for resolving device identities. It is called during device update processing in the Device Registry's `ProcessBatchDeviceUpdates` function.
When multiple device updates arrive (e.g., from SNMP discovery, NetFlow, or other sources), the engine:

1. Extracts strong identifiers (Armis ID, Integration ID, NetBox ID, MAC address) from each update
2. Queries the `device_identifiers` database table to find existing device IDs
3. Assigns canonical ServiceRadar device IDs (format `sr:UUID`) to each update

The engine has two query paths:

- **Single lookup**: `lookupByStrongIdentifiers` - used for individual device lookups; correctly filters by partition
- **Batch lookup**: `batchLookupByStrongIdentifiers` - used as a performance optimization when processing many updates; MISSING partition filtering

The batch path is an optimization introduced to reduce database round trips. Instead of querying for each identifier individually, it collects all identifiers of the same type and queries them in a single database call using PostgreSQL's `ANY` operator.

The buggy batch lookup is called from:

- `ResolveDeviceIDs` (line 318): the main batch processing function used by the Device Registry
- This is invoked during normal device discovery and monitoring operations

ServiceRadar supports multi-tenant deployments where different customers/organizations are isolated by partition. The partition is a string field in device updates (e.g., "tenant-A", "tenant-B", "default"). The system relies on partition-aware lookups to maintain data isolation.

The `device_identifiers` table enforces uniqueness per partition with the constraint `UNIQUE (identifier_type, identifier_value, partition)`, explicitly allowing the same MAC address or other identifier to exist in multiple partitions.

When device updates from different partitions are processed in the same batch (which is normal - there is no partition-based batching in `ProcessBatchDeviceUpdates`), the bug causes cross-partition device ID assignment.
## External documentation

### PostgreSQL ANY operator

From [PostgreSQL Documentation - Row and Array Comparisons](https://www.postgresql.org/docs/current/functions-comparisons.html):

```
expression operator ANY (array expression)

The right-hand side is a parenthesized expression, which must yield an array
value. The left-hand expression is evaluated and compared to each element of
the array using the given operator. The result is "true" if any true result
is obtained.
```

The `identifier_value = ANY($2)` clause matches `identifier_value` against any element in the array, but without partition filtering it matches across all partitions.

# Why has this bug gone undetected?

This bug has likely gone undetected because:

1. **Single-tenant deployments**: Many ServiceRadar deployments use only the default partition ("default"), meaning all devices are in the same partition. In this case, the missing partition filter doesn't cause incorrect behavior - it just returns results without filtering on a field that has the same value for all rows.
2. **Rare collision scenario**: The bug only manifests when:
   - Multiple partitions are actively in use
   - The same identifier (e.g., MAC address) exists in different partitions
   - Updates for these devices arrive in the same batch
   - The batch processing path is used (not the single-item path)

   In many real-world scenarios, even in multi-tenant deployments, different tenants rarely have devices with identical MAC addresses, especially if they're monitoring different physical networks.
3. **Fallback behavior masks the issue**: If the batch lookup returns a wrong device ID from partition A for an update in partition B, the system doesn't fail catastrophically. It just incorrectly merges two separate devices. Monitoring continues to work, only with incorrect device associations, which makes the bug harder to detect than a crash or error.
4. **Single-item path works correctly**: When testing with individual device updates or small batches where devices don't share identifiers, the system works correctly because the single-item lookup path (`lookupByStrongIdentifiers`) has the correct partition filtering.
5. **Recent introduction**: The Identity Engine appears to be a relatively recent consolidation (based on git history showing "dire rewrite" in commit 180ac4a1 from December 2025). The batch optimization may not have been thoroughly tested in multi-tenant scenarios yet.
6. **Silent data corruption**: Unlike functional errors that cause visible failures, this bug causes silent data corruption where devices are incorrectly merged. Users might notice symptoms (duplicate IPs, wrong device associations) but not connect them to this specific cause.

# Recommended fix

The fix requires changes in two layers:

1. **Database layer** (`pkg/db/cnpg_identity_engine.go`):
   - Update the `BatchGetDeviceIDsByIdentifier` signature to accept a partition parameter
   - Modify the SQL query to include partition filtering
   - Update the SQL constant:

   ```go
   batchGetDeviceIDsByIdentifierSQL = `
       SELECT identifier_value, device_id
       FROM device_identifiers
       WHERE identifier_type = $1
         AND identifier_value = ANY($2)
         AND partition = $3` // <-- FIX 🟢
   ```
2. **Identity Engine layer** (`pkg/registry/identity_engine.go`):
   - Since updates in a batch can have different partitions, the batch can no longer be done with a single query
   - Either:
     - **Option A**: Group updates by partition first, then do batch queries per partition
     - **Option B**: Fall back to single lookups when mixed partitions are detected
   - **Option A is recommended**, as it preserves the performance optimization while maintaining correctness

Example fix for Option A:

```go
// Group updates by partition
updatesByPartition := make(map[string][]*models.DeviceUpdate)

for _, update := range updates {
	ids := updateIdentifiers[update]
	if ids != nil {
		partition := ids.Partition
		if partition == "" {
			partition = "default"
		}

		updatesByPartition[partition] = append(updatesByPartition[partition], update)
	}
}

// Process each partition separately
for partition, partitionUpdates := range updatesByPartition {
	// Collect identifiers for this partition
	// Query with partition parameter // <-- FIX 🟢
	// Map results back to updates
}
```