@hawkw hawkw commented Jan 6, 2026

Follow-up from #2343

PR #2343 removes the `hubris_archive_id` field from ereport metadata, as we have determined that it ought not be used to identify Hubris except in the case of firmware updates. With that field gone, though, we really ought to have other metadata fields identifying the Hubris image. Therefore, this commit adds fields from the caboose (in particular, the `BORD`, `VERS`, and `GITC` tags) to the ereport metadata message. These fields are read from the caboose every time metadata is refreshed rather than being buffered in packrat: buffering would duplicate data already in flash, and it didn't seem necessary, as metadata refreshes occur infrequently (on SP reset/MGS restart).

All of these fields are optional: if any one of them is not present or cannot be read successfully, we send a CBOR `null` in its place. Additionally, I've nested all of them under a `hubris_caboose` field, which is itself `null` if the image has no caboose whatsoever. This way, we can differentiate between images with no caboose and images that have a caboose in which none of the tags we read into the metadata message could be found. I'm open to being convinced this is unnecessary, but it seemed worthwhile, and since the metadata message doesn't compete for space in the ereport ringbuffer, we can be a bit more verbose here, provided it fits in a UDP datagram.
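To make the nesting concrete, here is a minimal sketch of how the value of the `hubris_caboose` field could be laid out in CBOR. It is not the packrat implementation: the `CabooseFields` type, the function names, the hand-rolled encoder, and the key order are all assumptions made for illustration. Only the key names (`board`, `commit`, `version`), the tag names, and the use of CBOR `null` come from the description above.

```rust
/// Hypothetical model of the caboose fields carried in the metadata message.
/// Each field is `None` when its tag is missing or could not be read.
#[derive(Default)]
pub struct CabooseFields<'a> {
    pub board: Option<&'a str>,   // from the BORD tag
    pub commit: Option<&'a str>,  // from the GITC tag
    pub version: Option<&'a str>, // from the VERS tag
}

/// CBOR simple value `null`.
const CBOR_NULL: u8 = 0xf6;

/// Appends a CBOR head: the major type in the top three bits, followed by
/// the length. This sketch only supports lengths below 65536.
fn push_head(out: &mut Vec<u8>, major: u8, len: usize) {
    assert!(len <= u16::MAX as usize, "sketch only handles short items");
    let m = major << 5;
    match len {
        0..=23 => out.push(m | len as u8),
        24..=255 => out.extend_from_slice(&[m | 24, len as u8]),
        _ => {
            out.push(m | 25);
            out.extend_from_slice(&(len as u16).to_be_bytes());
        }
    }
}

/// Appends a CBOR text string (major type 3).
fn push_text(out: &mut Vec<u8>, s: &str) {
    push_head(out, 3, s.len());
    out.extend_from_slice(s.as_bytes());
}

/// Appends a text string, or `null` when the value is unavailable.
fn push_opt_text(out: &mut Vec<u8>, s: Option<&str>) {
    match s {
        Some(s) => push_text(out, s),
        None => out.push(CBOR_NULL),
    }
}

/// Encodes the value of the `hubris_caboose` field:
/// `null` if there is no caboose at all, otherwise a three-entry map
/// whose values are individually nullable.
pub fn encode_caboose_value(caboose: Option<&CabooseFields<'_>>) -> Vec<u8> {
    let mut out = Vec::new();
    match caboose {
        None => out.push(CBOR_NULL),
        Some(c) => {
            push_head(&mut out, 5, 3); // map (major type 5) with 3 pairs
            push_text(&mut out, "board");
            push_opt_text(&mut out, c.board);
            push_text(&mut out, "commit");
            push_opt_text(&mut out, c.commit);
            push_text(&mut out, "version");
            push_opt_text(&mut out, c.version);
        }
    }
    out
}
```

Under these assumptions, `encode_caboose_value(None)` yields the single byte `0xf6`, while a caboose with no readable tags yields a map whose three values are each `0xf6`, so the two cases remain distinguishable on the wire.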

Naturally, every `app.toml` where packrat produces ereports needed to be updated to allow packrat to read from the caboose. Packrat also uses a bit more stack in order to do this.

For example, here's output from a Gimletlet with caboose fields (including a fake version tag) in its ereport metadata:

```console
eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 06 10:19:21.564 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 06 10:19:21.565 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: aecfcbd7-4637-8a9a-ed0b-0d5b60a884e8
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 1

ereports:
0x1: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("gimletlet-2"),
        "commit": String("51dac3ec71877d330981cd5167a4aef5fb48311c-dirty"),
        "version": String("42.69.420-eliza-test"),
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}
```
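On the receiving side, the nested layout lets a consumer tell "no caboose" apart from "caboose present, but none of the tags readable". The following is a rough sketch of that classification, not code from MGS or any other consumer: the `Value` enum is a hypothetical stand-in that mirrors the variants printed in the output above, and `CabooseStatus` and `classify_caboose` are invented names.

```rust
/// Hypothetical stand-in for a decoded metadata value, mirroring the
/// variants visible in the printed output (not a real library type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Number(u64),
    String(String),
    Object(Vec<(String, Value)>),
}

/// What the `hubris_caboose` field tells us about the image.
#[derive(Debug, PartialEq)]
pub enum CabooseStatus {
    /// `hubris_caboose` was `null`: the image has no caboose at all.
    NoCaboose,
    /// A caboose exists, but none of the three tags could be read.
    NoKnownTags,
    /// At least one tag was read; the others may still be missing.
    Tags {
        board: Option<String>,
        commit: Option<String>,
        version: Option<String>,
    },
    /// The field is absent or has an unexpected type.
    Malformed,
}

/// Finds a key in an object represented as a list of key/value pairs.
fn lookup<'a>(fields: &'a [(String, Value)], key: &str) -> Option<&'a Value> {
    fields.iter().find(|(k, _)| k == key).map(|(_, v)| v)
}

/// Returns the string stored under `key`, treating `null`, a missing key,
/// or a non-string value as "not available".
fn text(fields: &[(String, Value)], key: &str) -> Option<String> {
    match lookup(fields, key) {
        Some(Value::String(s)) => Some(s.clone()),
        _ => None,
    }
}

/// Classifies the `hubris_caboose` field of a decoded metadata message.
pub fn classify_caboose(metadata: &Value) -> CabooseStatus {
    let top = match metadata {
        Value::Object(fields) => fields,
        _ => return CabooseStatus::Malformed,
    };
    match lookup(top, "hubris_caboose") {
        Some(Value::Null) => CabooseStatus::NoCaboose,
        Some(Value::Object(c)) => {
            let board = text(c, "board");
            let commit = text(c, "commit");
            let version = text(c, "version");
            if board.is_none() && commit.is_none() && version.is_none() {
                CabooseStatus::NoKnownTags
            } else {
                CabooseStatus::Tags { board, commit, version }
            }
        }
        _ => CabooseStatus::Malformed,
    }
}
```

Fed a message shaped like the Gimletlet's, this sketch would return `CabooseStatus::Tags` with all three fields populated; a message whose `hubris_caboose` is `Null` would map to `NoCaboose`.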

@hawkw hawkw requested a review from cbiffle January 6, 2026 18:25
@hawkw hawkw added the service processor, psc, gimlet, cosmo, and fault-management labels Jan 6, 2026
Base automatically changed from eliza/archivent to master January 6, 2026 19:01
hawkw added 2 commits January 6, 2026 11:02
@hawkw hawkw force-pushed the eliza/ereport-caboose branch from 0c4e3d2 to a6403ed Compare January 6, 2026 19:03
@hawkw hawkw enabled auto-merge (squash) January 6, 2026 19:03
@hawkw hawkw merged commit 248c15e into master Jan 6, 2026
168 checks passed
@hawkw hawkw deleted the eliza/ereport-caboose branch January 6, 2026 19:15