feat: agent job distribution #712

Open
opened 2026-03-28 04:27:43 +00:00 by mfreeman451 · 0 comments
Owner

Imported from GitHub.

Original GitHub issue: #2237
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2237
Original created: 2026-01-10T00:11:41Z


Is your feature request related to a problem?

If a tenant has multiple agents in the same partition, and jobs are configured to, say, run a network scan/sweep over a group of devices in that partition, exactly one agent should pick up each job, and that agent must release the job when it is finished.
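The "exactly one agent holds a job, and releases it when done" requirement is essentially a lease. A minimal in-memory sketch of that contract (names like `jobLeases` are illustrative; a real implementation would live in the agent-gateway or core and need persistence plus lease expiry for crashed agents):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// jobLeases tracks which agent currently holds each job, so a job is
// executed by at most one agent at a time and is freed on release.
type jobLeases struct {
	mu      sync.Mutex
	holders map[string]string // job ID -> agent ID
}

func newJobLeases() *jobLeases {
	return &jobLeases{holders: make(map[string]string)}
}

// acquire succeeds only if the job is unheld (or already held by the
// same agent, which makes the call idempotent on retries).
func (l *jobLeases) acquire(jobID, agentID string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if holder, held := l.holders[jobID]; held && holder != agentID {
		return fmt.Errorf("job %q already held by agent %q", jobID, holder)
	}
	l.holders[jobID] = agentID
	return nil
}

// release frees the job; only the current holder may release it.
func (l *jobLeases) release(jobID, agentID string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holders[jobID] != agentID {
		return errors.New("caller does not hold this lease")
	}
	delete(l.holders, jobID)
	return nil
}

func main() {
	leases := newJobLeases()
	fmt.Println(leases.acquire("sweep-42", "agent-a")) // <nil>
	fmt.Println(leases.acquire("sweep-42", "agent-b")) // conflict error
	fmt.Println(leases.release("sweep-42", "agent-a")) // <nil>
	fmt.Println(leases.acquire("sweep-42", "agent-b")) // <nil>
}
```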

Agents learn about work from configs we create for them. A config controls some functionality of the agent (run a ping check, port check, or process check, or call out to an external checker to perform a custom check). Agents use gRPC to ask the agent-gateway whether a config or config update is available; if so, the agent is expected to download the config and perform the work at the prescribed interval, and so on.
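For concreteness, a per-agent config of this shape could look roughly like the following. The field names (`agent_id`, `checks`, `interval_sec`, etc.) are illustrative assumptions, not the actual ServiceRadar schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkConfig is a hypothetical shape for one unit of configured work.
type checkConfig struct {
	Type        string `json:"type"`         // "ping", "port", "process", or "external"
	Target      string `json:"target"`       // host, port spec, process name, or checker address
	IntervalSec int    `json:"interval_sec"` // how often to run the check
}

type agentConfig struct {
	AgentID   string        `json:"agent_id"`
	Partition string        `json:"partition"`
	Checks    []checkConfig `json:"checks"`
}

func parseAgentConfig(raw []byte) (agentConfig, error) {
	var cfg agentConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{
	  "agent_id": "edge-1",
	  "partition": "dc-east",
	  "checks": [
	    {"type": "ping", "target": "10.0.0.5", "interval_sec": 30},
	    {"type": "external", "target": "localhost:50052", "interval_sec": 300}
	  ]
	}`)
	cfg, err := parseAgentConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.AgentID, len(cfg.Checks)) // edge-1 2
}
```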

What we are missing:

External Checkers are broken --

  • External checkers were designed to be polled by the agent
  • External checkers need to be updated to push their Status and Results to the (edge) agent
  • External checkers need an internal scheduler
  • The agent must have a robust server process able to handle thousands of simultaneous connections
  • External checkers must tell the agent when they start performing work, when they are in the middle of it, and when they are done
  • Agents forward job updates to the agent-gateway; the agent-gateway forwards them to core via RPC; core processes them with the Ash State Machine and creates events in the ocsf_events table
  • If a tenant has a job configured to do a network sweep over their inventory, only one agent should perform the job at a time. The tenant may choose to divide the job up and spread it across multiple agents, but one agent should be responsible for each slice of the work
  • We need a general framework for this: a system that only works for network sweeps doesn't help customers who are also running large network discovery tasks or SNMP (metrics) collection. We need to break tasks up, distribute them across multiple agents, and make sure the individual work units are unique. This may be simpler than it sounds if we always treat devices as the work units; the device is really our primary entity anyway, and we are always thinking in terms of devices, of data coming from a device, or of data for a device being updated (is the device available or not)
  • We also need to be able to support pinning groups of devices to a certain agent
  • All of this also needs to be partition aware. A tenant can have multiple partitions; partitions are how we deal with overlapping IP space. If a tenant had multiple 10.0.0.4/24 networks, they could put an agent on each of two multi-homed servers, with each agent set to a unique partition.
  • The sync ingestion service (the serviceradar-agent sync service) currently does not update discovery_sources for devices it discovers and injects into the system. This needs to be fixed so that the Armis integration sets discovery_sources to 'armis', NetBox does the same, and so on (a device can have multiple discovery sources, which is why this field is an array in the data structures)
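Treating devices as the universal work unit makes distribution mostly an assignment problem: every component must agree on the single owner of each (partition, device) pair, with explicit pins taking priority. A sketch under those assumptions (function name, pin-key format, and hashing scheme are all illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// assignDevice deterministically picks exactly one agent for a device.
// The partition is part of the key, so the same IP in two partitions
// is two distinct work units, and explicit pins override the hash.
func assignDevice(partition, deviceIP string, agents []string, pins map[string]string) string {
	key := partition + "/" + deviceIP
	if agent, pinned := pins[key]; pinned {
		return agent
	}
	sorted := append([]string(nil), agents...)
	sort.Strings(sorted) // stable result regardless of input order
	h := fnv.New32a()
	h.Write([]byte(key))
	return sorted[int(h.Sum32())%len(sorted)]
}

func main() {
	agents := []string{"agent-a", "agent-b", "agent-c"}
	pins := map[string]string{"p1/10.0.0.9": "agent-c"} // pinned device group member
	fmt.Println(assignDevice("p1", "10.0.0.9", agents, pins)) // agent-c
	// Unpinned devices hash to one consistent owner per (partition, IP).
	fmt.Println(assignDevice("p1", "10.0.0.5", agents, pins) ==
		assignDevice("p1", "10.0.0.5", agents, pins)) // true
}
```

Plain modulo hashing reshuffles most devices when the agent set changes; a production version would likely want consistent hashing (or rendezvous hashing) so adding or removing one agent only moves that agent's slice.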

Describe the solution you'd like

We need to rework serviceradar-agent so that it becomes a gRPC listener, and we need to start updating our gRPC-based external checkers so that, instead of being polled, they:

  • Perform a gRPC hello/registration exchange with the agent
  • The agent registers the external gRPC checker by pushing a message up to the agent-gateway
  • External checkers use local JSON/YAML config; it won't come from upstream/gRPC
  • External checkers run on a schedule based on their local config
  • External checkers report their status to the agent over gRPC
  • The agent forwards status to the agent-gateway
  • The agent-gateway forwards it to core-elx for processing
  • core-elx processes the messages (health checks / full result sets)
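Since checkers report when they start, while they work, and when they finish, the agent can validate the lifecycle before forwarding updates upstream. A small sketch of that state check (state names are illustrative; core's actual Ash State Machine defines the real ones):

```go
package main

import "fmt"

// jobState models the checker-reported lifecycle:
// pending -> started -> in_progress -> done.
type jobState string

const (
	statePending    jobState = "pending"
	stateStarted    jobState = "started"
	stateInProgress jobState = "in_progress"
	stateDone       jobState = "done"
)

// validNext lists the transitions the agent accepts; anything else is
// an out-of-order update and gets rejected instead of forwarded.
var validNext = map[jobState][]jobState{
	statePending:    {stateStarted},
	stateStarted:    {stateInProgress, stateDone},
	stateInProgress: {stateInProgress, stateDone}, // repeated progress updates are fine
}

func canTransition(from, to jobState) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(statePending, stateStarted)) // true
	fmt.Println(canTransition(statePending, stateDone))    // false
	fmt.Println(canTransition(stateDone, stateStarted))    // false
}
```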

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.
