Conversation

@hawkw (Member) commented Jan 5, 2026

Depends on #2313
Fixes #2309

It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.

This commit adds code to packrat for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in a comment on #2309.

eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 05 12:48:52.203 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 05 12:48:52.204 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 6a2def31-2dc0-4ab2-010a-b94f5ff1c627
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3

ereports:
0x1: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(26997),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}

0x3: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(32546),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}

hawkw and others added 20 commits December 3, 2025 12:27
Currently, there is no way to programmatically access the panic message
of a task which has faulted due to a Rust panic from within the Hubris
userspace. This branch adds a new `read_panic_message` kipc that copies
the contents of a panicked task's panic message buffer into the caller.
If the requested task has not panicked, this kipc returns an error
indicating this. This is intended for use by supervisor implementations
or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311
Co-authored-by: Matt Keeter <matt@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
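
A minimal standalone model of the semantics the commit message above describes for the new `read_panic_message` kipc; the type names, error shape, and truncation behavior here are illustrative assumptions, not the actual Hubris kernel or userlib API:

#[derive(Debug)]
enum TaskState {
    Healthy,
    Panicked { msg: &'static str },
}

#[derive(Debug)]
enum ReadPanicError {
    NotPanicked,
}

fn read_panic_message(
    task: &TaskState,
    buf: &mut [u8],
) -> Result<usize, ReadPanicError> {
    match task {
        TaskState::Panicked { msg } => {
            // Copy as much of the message as fits in the caller's buffer
            // and report how many bytes were written.
            let n = msg.len().min(buf.len());
            buf[..n].copy_from_slice(&msg.as_bytes()[..n]);
            Ok(n)
        }
        // The requested task has not panicked, so there is no message.
        TaskState::Healthy => Err(ReadPanicError::NotPanicked),
    }
}

fn main() {
    let task = TaskState::Panicked {
        msg: "panicked at main.rs:158: im dead lol",
    };
    let mut buf = [0u8; 64];
    let n = read_panic_message(&task, &mut buf).unwrap();
    println!("{}", core::str::from_utf8(&buf[..n]).unwrap());
}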
Base automatically changed from eliza/read-panic-message to master January 5, 2026 23:36
@hawkw force-pushed the eliza/fault-ereport branch from 0adecf4 to bc6268d on January 5, 2026 23:59
@hawkw added the service processor and psc labels on Jan 6, 2026
@hawkw added the gimlet, cosmo, SP5 Board, fault-management, and ⚠️ ereport labels on Jan 6, 2026
@hawkw (Member, Author) commented Jan 6, 2026

Thinking about things a bit more, there are some more changes I think I want to make here before it's really ready to land. In particular:

  • Currently, we've changed the fault cooldown behavior in Jefe to just always wait 50 ms between restarts, rather than only doing so when a task has not been running for at least that long between two subsequent faults (see https://github.com/oxidecomputer/hubris/blob/b31df4359ffc4f06136474fee952e32f9466b34e/task/jefe/src/main.rs). This means that there's now always 50 ms of latency for all task restarts. This is to give Packrat time to generate an ereport, but it feels a bit not great.

    I think we should be doing a somewhat more complex thing here. We should probably implement the approach that @cbiffle described in ereport: hubris task panicked/faulted #2309 (comment), and add a way for Packrat to let Jefe know it has finished generating an ereport for a fault. That way, we can possibly reduce the latency for restarts a bit by saying "we will always give Packrat up to 50ms to produce a fault report, but if it finishes before then and the task has already been running for a while, we will restart it sooner".

  • Currently, if all faulted tasks have already been restarted by the time Packrat actually processes the "task faulted" notification, Packrat just does nothing.[1] We should maybe fix this by having packrat do some kind of "some task probably faulted but I couldn't figure out which one" ereport, so that it's not totally lost.

Personally, I think we should definitely do the second point here (some kind of "task faults may have occurred" ereport) before merging this PR. I'm on the fence about whether the first point (reducing restart latency) is worth doing now or not. It's a bit more complexity in Jefe...

@cbiffle, any thoughts?

Footnotes

  [1] Well, it ringbufs about it, but in production, that's equivalent to "doing nothing".

Comment on lines +449 to +462
if faulted_tasks == 0 {
    // Okay, this is a bit weird. We got a notification saying tasks
    // have faulted, but by the time we scanned for faulted tasks, we
    // couldn't find any. This means one of two things:
    //
    // 1. The fault notification was spurious (in practice, this
    //    probably means someone is dorking around with Hiffy and sent
    //    a fake notification just to mess with us...)
    // 2. Tasks faulted, but we were not scheduled for at least 50ms
    //    after the faults occurred, and Jefe has already restarted
    //    them by the time we were permitted to run.
    //
    // We should probably record some kind of ereport about this.
}
@hawkw (Member, Author):

i still wanna figure out what i need to put in the ereport in this case --- what class should it be, etc. hubris.fault.maybe_faults or something weird like that.

It's also a bit strange because the function for recording an ereport in the ereport ringbuffer requires a task ID as part of the insert function. For all the other ereports, I've used the ID of the task that faulted for that field, rather than the ID of Packrat (who is actually generating the ereport) or Jefe (who is spiritually sort of responsible for reporting it in some vibes-based way); this felt like the right thing in general. However, when the ereport just says "some task may have faulted", I'm not totally sure what ID I want to put in here, since I don't want to incorrectly suggest that Jefe or Packrat has faulted...hmm...

A collaborator replied:

I think taskID of Packrat and a distinguishing class would be fine.

@cbiffle (Collaborator) commented Jan 6, 2026

So I think the "max of 50ms" interplay between Jefe and Packrat sounds promising, and doesn't seem like too much additional supervisor complexity -- particularly while crashdumps are still in Jefe. If we want to reduce complexity, I'd start there.

I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

That said, packrat is by nature typically just one priority level under Jefe, so it should be able to respond in a timely fashion in most cases. The thing most likely to starve it is ... crash dumps.
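
A minimal standalone sketch of the "max of 50 ms" interplay discussed above, assuming Jefe would know when the fault occurred, how long the task had been running, and whether Packrat has acknowledged the fault; all names and types here are hypothetical, not the actual Jefe code:

use std::time::{Duration, Instant};

const HOLDOFF: Duration = Duration::from_millis(50);

fn restart_at(
    faulted_at: Instant,
    ran_for: Duration,
    packrat_ack: Option<Instant>,
) -> Instant {
    let holdoff_deadline = faulted_at + HOLDOFF;
    match packrat_ack {
        // Packrat finished its ereport early and the task had been running
        // long enough that this isn't a crash loop: restart as soon as the
        // acknowledgement arrived.
        Some(acked) if ran_for >= HOLDOFF && acked < holdoff_deadline => acked,
        // Otherwise, wait out the full holdoff before restarting.
        _ => holdoff_deadline,
    }
}

fn main() {
    let t0 = Instant::now();
    // The task had run for 2 s before faulting and Packrat acked after
    // 10 ms, so the restart doesn't need to wait the full 50 ms.
    let acked = t0 + Duration::from_millis(10);
    assert_eq!(restart_at(t0, Duration::from_secs(2), Some(acked)), acked);
}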

@hawkw (Member, Author) commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

@hawkw (Member, Author) commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.
>
> Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

Here's my attempt at doing that, which is both conceptually quite elegant and implementationally somewhat disgusting: eliza/fault-ereport...eliza/fault-counts#diff-48cf874f5ac8432941e2ba390792b33a94f9aea18dd933bbdb105cd23b93c9ee
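
A minimal standalone model of that generation-tracking scheme (the names are hypothetical and this is not the code in the linked diff): Packrat remembers the last generation it saw for each task index and, when notified of faults, treats any task whose generation has advanced as having restarted since the last scan.

const NUM_TASKS: usize = 4; // stand-in for hubris_num_tasks::NUM_TASKS

struct GenTracker {
    last_seen: [u8; NUM_TASKS],
}

impl GenTracker {
    fn restarted_tasks(&mut self, current: &[u8; NUM_TASKS]) -> Vec<usize> {
        let mut restarted = Vec::new();
        for i in 0..NUM_TASKS {
            if current[i] != self.last_seen[i] {
                // The generation advanced, so the task restarted at least
                // once since the last scan (though, as noted in the reply
                // below, not necessarily because it faulted, and a u8 can
                // wrap back around to the same value).
                self.last_seen[i] = current[i];
                restarted.push(i);
            }
        }
        restarted
    }
}

fn main() {
    let mut tracker = GenTracker { last_seen: [0; NUM_TASKS] };
    // Task 2's generation has advanced since the last scan.
    let current = [0, 0, 1, 0];
    assert_eq!(tracker.restarted_tasks(&current), vec![2usize]);
}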

@cbiffle (Collaborator) commented Jan 6, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.

But for now it's basically equivalent I think?

@hawkw (Member, Author) commented Jan 7, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.
>
> I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.
>
> But for now it's basically equivalent I think?

After a bit more thought, I'm considering going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple of other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than NUM_TASKS syscalls), and it lets us use a bigger counter than the u8 generation number, which reduces the likelihood that the counter will wrap around and end up at the same value it was last time Packrat checked, missing the fault.

On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...
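
A minimal standalone sketch of that counter-based variant, assuming Jefe would hand back one u32 fault counter per task; the IPC shape and names here are hypothetical:

const NUM_TASKS: usize = 4;

fn faults_since_last_check(
    last_seen: &mut [u32; NUM_TASKS],
    current: &[u32; NUM_TASKS],
) -> [u32; NUM_TASKS] {
    let mut delta = [0u32; NUM_TASKS];
    for i in 0..NUM_TASKS {
        // Wrapping subtraction yields the number of faults since the last
        // check even if the counter wrapped, and a u32 is far less likely
        // than a u8 generation to wrap all the way back to the same value
        // between checks.
        delta[i] = current[i].wrapping_sub(last_seen[i]);
        last_seen[i] = current[i];
    }
    delta
}

fn main() {
    let mut last_seen = [0u32; NUM_TASKS];
    // Task 1 faulted twice since the last check; the others are unchanged.
    let current = [0u32, 2, 0, 0];
    assert_eq!(faults_since_last_check(&mut last_seen, &current), [0, 2, 0, 0]);
}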
