bug(core): cant save metrics #616

Closed
opened 2026-03-28 04:26:28 +00:00 by mfreeman451 · 2 comments
Owner

Imported from GitHub.

Original GitHub issue: #1882
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1882
Original created: 2025-10-24T19:19:49Z


Describe the bug

seeing these errors in the OTEL logs:

CRITICAL DB WRITE ERROR: Failed to flush/StoreMetrics

metric_count: 1
span_id: 025f55061320f800
poller_id: k8s-poller
trace_id: 208aac3b672cbd0677a2c8cfa1b5fc04
error: failed to prepare batch: code: 2529


Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1882#issuecomment-3444584950
Original created: 2025-10-24T19:20:34Z


Proton is out of disk again even though we gave it 1 TB in k8s??

2025.10.24 19:19:35.976522 [ 6262 ] {f9b6714b-57bf-45b7-9d64-2d5024c09651} <Error> AsynchronousMetrics: Disk default utilization is 0.9000221489764623, exceeds the max_disk_util 0.9
2025.10.24 19:19:35.976649 [ 6262 ] {f9b6714b-57bf-45b7-9d64-2d5024c09651} <Error> executeQuery: Code: 2529. DB::Exception: Disk default utilization is 0.9000221489764623, exceeds the max_disk_util 0.9. (DISK_USAGE_RATIO_THRESHOLD_EXCEEDED) (version 1.6.17) (from [::ffff:10.42.111.126]:35568) (in query: INSERT INTO pollers (* except _tp_time) VALUES)
2025.10.24 19:19:35.976711 [ 6262 ] {f9b6714b-57bf-45b7-9d64-2d5024c09651} <Error> TCPHandler: Code: 2529. DB::Exception: Disk default utilization is 0.9000221489764623, exceeds the max_disk_util 0.9. (DISK_USAGE_RATIO_THRESHOLD_EXCEEDED), Stack trace (when copying this message, always include the lines below):
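
The DISK_USAGE_RATIO_THRESHOLD_EXCEEDED error above fires because disk utilization (used/total) crossed the configured max_disk_util of 0.9, so Proton rejects the INSERT before preparing the batch. A minimal sketch of that kind of guard, purely illustrative (the path and function names are assumptions, not Proton's actual implementation):

```python
import shutil

# Threshold taken from the log line above ("exceeds the max_disk_util 0.9").
MAX_DISK_UTIL = 0.9

def disk_utilization(path: str) -> float:
    """Return the used/total ratio for the filesystem backing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def writes_allowed(path: str = "/") -> bool:
    """Mirror the guard: refuse writes once utilization exceeds the cap."""
    return disk_utilization(path) <= MAX_DISK_UTIL
```

With utilization at 0.90002, this check fails by a hair, which is why every flush/StoreMetrics call was erroring with code 2529.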
Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1882#issuecomment-3445753394
Original created: 2025-10-25T04:17:23Z


Latest updates on Proton disk pressure:

  • Found the PVC usage dominated by Proton's native logstore (/var/lib/proton/nativelog) rather than MergeTree data. The unified_devices stream alone was holding ~186 GiB of backlog despite 3-day table TTLs.
  • Tightened the packaged Proton config so nativelog retention aligns with the data TTLs (3 days) and caps at ~50 GiB; also reduced the segment size to 256 MiB so streams stop pre-allocating 4 GiB chunks.
  • Built and pushed ghcr.io/carverauto/serviceradar-proton:sha-385b06cbd38c with the new config and rolled it out to the demo (prod) namespace.
  • Scaled Proton down, cleared the stale nativelog folders on the PVC, and brought the deployment back up on the new image. Disk usage is now ~5 MB and climbing slowly under the enforced cap.

Next steps:

  • Monitor du -sh /var/lib/proton/nativelog/log/default over the next few days; the backlog should plateau well under 50 GiB.
  • (Optional) Add monitoring/alerts for the Proton PVC so we get early warnings if the backlog creeps again.

Docs/runbook now note that the nativelog purge is a recovery-only step; normal ops shouldn't need repeated manual cleanups.
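
The monitoring step above (du over the nativelog directory vs. the ~50 GiB cap) could be sketched like this; the path and cap come from the comment, while the function names and the alerting shape are hypothetical:

```python
import os

# ~50 GiB retention cap from the new Proton config described above.
CAP_BYTES = 50 * 1024**3

def dir_size_bytes(root: str) -> int:
    """Roughly what `du -s` reports: total file bytes under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # segment rotated/deleted mid-walk; skip it
    return total

def backlog_ok(root: str = "/var/lib/proton/nativelog/log/default") -> bool:
    """True while the nativelog backlog stays under the enforced cap."""
    return dir_size_bytes(root) < CAP_BYTES
```

Wired into a cron job or a Kubernetes liveness-style check against the PVC mount, this would give the early warning the comment asks for before utilization hits the 0.9 write cutoff again.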
