Stabilize demo ingress after k3s upgrade (MetalLB/Calico LB route regression) #1094

Open
opened 2026-03-28 04:31:37 +00:00 by mfreeman451 · 1 comment
Owner

Imported from GitHub.

Original GitHub issue: #2991
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2991
Original created: 2026-03-04T08:43:02Z


Summary

After the k3s control-plane and worker node upgrade, demo.serviceradar.cloud became unreachable.

Observed behavior during the incident:

  • DNS initially returned NXDOMAIN because the ingress LB IP allocation failed (the pool annotation had drifted to k3s-ipv4).
  • DNS was then restored, but data-plane traffic to the ingress still failed from the router with No route to host.
  • ARP entries for 23.138.124.x on br104 remained <incomplete>.
  • The router used the connected 23.138.124.0/27 route because the service-specific /32 routes were absent.
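
The last two observations follow from longest-prefix matching: with only the connected /27 in the table, the router tries to ARP for the LB address directly on br104, and once a /32 for that address is learned it takes precedence. A minimal sketch of that selection rule, using only the Python standard library; the route labels are hypothetical, since the issue does not name the advertising node:

```python
import ipaddress


def best_route(destination, routes):
    """Return the most specific (longest-prefix) route containing destination.

    routes is a list of (cidr, label) tuples; returns None if nothing matches.
    """
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r[0])]
    if not candidates:
        return None
    return max(candidates, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


# Connected subnet from the issue; the label text is illustrative only.
CONNECTED = ("23.138.124.0/27", "connected on br104")
# Hypothetical label: the issue does not say which node advertises the /32.
BGP_HOST = ("23.138.124.2/32", "BGP via cluster node")

before = best_route("23.138.124.2", [CONNECTED])
after = best_route("23.138.124.2", [CONNECTED, BGP_HOST])
print("without /32:", before)
print("with /32:   ", after)
```

This only models route selection; it does not reproduce FRR's behavior or explain why the upgrade stopped the /32 advertisements.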

Root Cause

Control-plane configuration drift combined with a post-upgrade change in routing behavior:

  • The ingress service annotation pointed to a non-existent MetalLB pool name (k3s-ipv4).
  • With the current topology, routing became reliable only after explicit LB /32 Calico advertisements were added.
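
The annotation drift in the first point is the kind of mismatch a small pre-flight check could catch. The sketch below is illustrative only: it assumes the service uses the deprecated `metallb.universe.tf/address-pool` key (the issue mentions `metallb.universe.tf/*` but not the exact key), and the service records are hypothetical stand-ins for data that would come from the cluster API:

```python
# Assumed annotation key; the issue only refers to metallb.universe.tf/*.
POOL_ANNOTATION = "metallb.universe.tf/address-pool"


def services_with_unknown_pool(services, pool_names):
    """Return names of services whose pool annotation names a missing pool."""
    known = set(pool_names)
    broken = []
    for svc in services:
        pool = svc.get("annotations", {}).get(POOL_ANNOTATION)
        # Services without the annotation are left to MetalLB's default choice.
        if pool is not None and pool not in known:
            broken.append(svc["name"])
    return broken


# Hypothetical service records; only the pool names come from the issue.
services = [
    {"name": "ingress", "annotations": {POOL_ANNOTATION: "k3s-ipv4"}},
    {"name": "other-lb", "annotations": {POOL_ANNOTATION: "k3s-pool"}},
    {"name": "no-annotation", "annotations": {}},
]

print(services_with_unknown_pool(services, ["k3s-pool"]))
```

Run against the state described in the issue, such a check would flag the ingress service before the LB IP allocation fails.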

Changes Applied

  • Corrected the ingress MetalLB pool annotation to k3s-pool.
  • Added explicit Calico serviceLoadBalancerIPs /32 entries for the active LB services:
    • 23.138.124.2/32
    • 23.138.124.3/32
    • 23.138.124.18/32
    • 23.138.124.20/32
    • 23.138.124.22/32
    • 23.138.124.25/32
    • 23.138.124.26/32
    • 23.138.124.27/32
  • Persisted the change declaratively in the repo so it can be re-applied idempotently:
    • k8s/demo/prod/calico-bgpconfiguration.yaml
    • k8s/demo/prod/kustomization.yaml
  • Added session notes:
    • routing1.md
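
For reference, the /32 entries listed above can be checked against the connected subnet and rendered into a manifest fragment. This is a sketch under stated assumptions, not the contents of `k8s/demo/prod/calico-bgpconfiguration.yaml`: the `projectcalico.org/v3` API version and the `default` resource name are assumed, and the real file may carry additional fields that the issue does not show.

```python
import ipaddress

# Subnet and addresses are taken from the issue text.
LB_SUBNET = ipaddress.ip_network("23.138.124.0/27")
LB_ADDRESSES = [
    "23.138.124.2", "23.138.124.3", "23.138.124.18", "23.138.124.20",
    "23.138.124.22", "23.138.124.25", "23.138.124.26", "23.138.124.27",
]


def lb_cidrs(subnet, addresses):
    """Validate each address against subnet and return it as a /32 CIDR."""
    cidrs = []
    for text in addresses:
        addr = ipaddress.ip_address(text)
        if addr not in subnet:
            raise ValueError(f"{addr} is outside {subnet}")
        cidrs.append(f"{addr}/32")
    return cidrs


def render_bgp_configuration(cidrs):
    """Render a minimal BGPConfiguration fragment (assumed layout)."""
    lines = [
        "apiVersion: projectcalico.org/v3",  # assumption, not from the issue
        "kind: BGPConfiguration",
        "metadata:",
        "  name: default",  # assumption, not from the issue
        "spec:",
        "  serviceLoadBalancerIPs:",
    ]
    lines += [f"    - cidr: {cidr}" for cidr in cidrs]
    return "\n".join(lines) + "\n"


cidrs = lb_cidrs(LB_SUBNET, LB_ADDRESSES)
manifest = render_bgp_configuration(cidrs)
print(manifest)
```

Generating the list rather than hand-editing it keeps the entries inside the /27 by construction; an address outside the subnet raises instead of being silently advertised.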

Validation

  • FRR now shows 23.138.124.2/32 learned via BGP and installs it as the best route.
  • curl -kI https://demo.serviceradar.cloud returns an HTTP response.
  • dig demo.serviceradar.cloud resolves to the expected ingress IP.

Follow-ups

  • Add a runbook section documenting the LB /32 requirement for this network topology.
  • Optionally convert the deprecated metallb.universe.tf/* annotations to the current keys.
  • Optionally clean up stale or duplicate completed pods left behind by old MetalLB controller ReplicaSets.
Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2991#issuecomment-3996092846
Original created: 2026-03-04T08:43:16Z


Implemented and pushed the recovery hardening changes on branch fix/2991-demo-lb-route-regression.

Commit: cde0afbc8

Files:

  • k8s/demo/prod/calico-bgpconfiguration.yaml
  • k8s/demo/prod/kustomization.yaml
  • routing1.md

This makes the LB /32 route fix declarative and idempotent, so it can be re-applied after upgrades or drift.
