Sled-level resource metrics

As an operator, I want to understand the state of the physical resources on the rack. Is a physical CPU core heavily utilized, waiting on another resource, or queueing operations? Is a disk almost full, or saturating its IOPS? OxQL already covers some of these metrics for virtual machines (`virtual_machine:*`, `virtual_disk:*`), but has less coverage for physical resources on the sled.

As @rmustacc pointed out on a call yesterday, there's a lot of nuance to consider here. For example, [RFD 526](https://rfd.shared.oxide.computer/rfd/0526) goes into great detail on just a single resource type. But there's also low-hanging fruit that can deliver value to operators more quickly: we can add basic metrics before enumerating all the telemetry we eventually want to include.

One potential starting point is to identify the physical resources of interest and collect [USE metrics](https://www.brendangregg.com/usemethod.html) (utilization, saturation, errors), or analogously the [four "golden signals"](https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals), for each one. We could start with CPU, memory, disk, and network, noting that we already have some sled-level network metrics in `sled_data_link:*`.
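To make the scope of that starting point concrete, here is a minimal sketch that enumerates the resource-by-signal matrix. The target names `sled_cpu`, `sled_memory`, and `sled_disk`, and the metric names `utilization`, `saturation`, and `errors`, are hypothetical placeholders for illustration, not existing schema; only `sled_data_link` comes from the existing metrics mentioned above, and its real metric names may differ.

```rust
/// Physical resources proposed as a starting point.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Resource {
    Cpu,
    Memory,
    Disk,
    Network,
}

/// The three USE signals.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Signal {
    Utilization,
    Saturation,
    Errors,
}

impl Resource {
    pub const ALL: [Resource; 4] =
        [Resource::Cpu, Resource::Memory, Resource::Disk, Resource::Network];

    /// Target name for this resource.
    /// ASSUMPTION: all names except `sled_data_link` are hypothetical.
    pub fn target(self) -> &'static str {
        match self {
            Resource::Cpu => "sled_cpu",
            Resource::Memory => "sled_memory",
            Resource::Disk => "sled_disk",
            Resource::Network => "sled_data_link",
        }
    }
}

impl Signal {
    pub const ALL: [Signal; 3] =
        [Signal::Utilization, Signal::Saturation, Signal::Errors];

    /// Metric name for this signal.
    /// ASSUMPTION: placeholder names, not an existing schema.
    pub fn metric(self) -> &'static str {
        match self {
            Signal::Utilization => "utilization",
            Signal::Saturation => "saturation",
            Signal::Errors => "errors",
        }
    }
}

/// Build every `target:metric` pair, resource-major, as a coverage checklist.
pub fn use_matrix() -> Vec<String> {
    let mut names = Vec::new();
    for resource in Resource::ALL {
        for signal in Signal::ALL {
            names.push(format!("{}:{}", resource.target(), signal.metric()));
        }
    }
    names
}
```

Calling `use_matrix()` yields twelve candidate timeseries names, which could serve as a checklist for deciding which cells are already covered and which are the low-hanging fruit.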

Sled-level resource metrics #9559
