Description
As an operator, I want to be able to understand the state of various physical resources on the rack. Is a physical CPU core heavily utilized, waiting on another resource, or queueing operations? Is a disk almost full, or saturating IOPS? We have coverage for some of these metrics in OxQL for virtual machines (virtual_machine:, virtual_disk:), but less coverage for physical resources on the sled.
As @rmustacc pointed out on a call yesterday, there's a lot of nuance to consider here. For example, RFD 526 goes into great detail on just a single resource type. But there's also low-hanging fruit that can deliver value to operators more quickly: we can add basic metrics before enumerating all the telemetry we eventually want to include.
One potential starting point is to identify physical resources of interest and collect USE metrics (utilization, saturation, errors), or analogously the four "golden signals", for each one. We could start with CPU, memory, disk, and network (although note that we already have some sled-level network metrics in sled_data_link:*).
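To make the shape of this concrete, here is a minimal sketch of what one USE sample per physical resource might look like. It is only an illustration: `Resource`, `UseSample`, `utilization`, and `disk_sample` are hypothetical names, not existing oximeter or OxQL schema, and treating disk utilization as used capacity (rather than busy time) is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Resource(Enum):
    """Hypothetical set of physical resources to start with."""
    CPU = "cpu"
    MEMORY = "memory"
    DISK = "disk"
    NETWORK = "network"


@dataclass(frozen=True)
class UseSample:
    """One USE observation for a single physical resource (hypothetical shape)."""
    resource: Resource
    utilization: float  # fraction of capacity in use, clamped to 0.0..1.0
    saturation: float   # amount of queued work that could not be serviced yet
    errors: int         # error events observed during the sampling interval


def utilization(busy: float, capacity: float) -> float:
    """Return busy/capacity clamped to the range 0.0..1.0."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return min(max(busy / capacity, 0.0), 1.0)


def disk_sample(used_bytes: int, total_bytes: int,
                queued_ops: int, error_count: int) -> UseSample:
    """Build a disk sample, assuming utilization means used capacity."""
    return UseSample(
        resource=Resource.DISK,
        utilization=utilization(used_bytes, total_bytes),
        saturation=float(queued_ops),
        errors=error_count,
    )


if __name__ == "__main__":
    print(disk_sample(used_bytes=750, total_bytes=1000, queued_ops=4, error_count=0))
```

Keeping the same three fields for every resource would let an operator ask the same questions of a CPU core, a disk, or a link, even before the richer per-resource telemetry is enumerated.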