Conversation

@hawkw (Member) commented Jan 5, 2026

Depends on #2313
Fixes #2309

It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.

This commit adds code to packrat for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in a comment on #2309.

eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 05 12:48:52.203 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 05 12:48:52.204 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 6a2def31-2dc0-4ab2-010a-b94f5ff1c627
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3

ereports:
0x1: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(26997),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}

0x3: {
    "baseboard_part_number": String("LOLNO000000"),
    "baseboard_rev": Number(42),
    "baseboard_serial_number": String("69426661337"),
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("xXyXfvzbFUM"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(32546),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}

hawkw and others added 20 commits December 3, 2025 12:27
Currently, there is no way to programmatically access the panic message
of a task which has faulted due to a Rust panic from within the Hubris
userspace. This branch adds a new `read_panic_message` kipc that copies
the contents of a panicked task's panic message buffer into the caller.
If the requested task has not panicked, this kipc returns an error
indicating this. This is intended for use by supervisor implementations
or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311
Co-authored-by: Matt Keeter <matt@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
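
A minimal standalone model of the semantics the commit message above describes for the new `read_panic_message` kipc; the type names, error shape, and truncation behavior here are illustrative assumptions, not the actual Hubris kernel or userlib API:

#[derive(Debug)]
enum TaskState {
    Healthy,
    Panicked { msg: &'static str },
}

#[derive(Debug)]
enum ReadPanicError {
    NotPanicked,
}

fn read_panic_message(
    task: &TaskState,
    buf: &mut [u8],
) -> Result<usize, ReadPanicError> {
    match task {
        TaskState::Panicked { msg } => {
            // Copy as much of the message as fits in the caller's buffer
            // and report how many bytes were written.
            let n = msg.len().min(buf.len());
            buf[..n].copy_from_slice(&msg.as_bytes()[..n]);
            Ok(n)
        }
        // The requested task has not panicked, so there is no message.
        TaskState::Healthy => Err(ReadPanicError::NotPanicked),
    }
}

fn main() {
    let task = TaskState::Panicked {
        msg: "panicked at main.rs:158: im dead lol",
    };
    let mut buf = [0u8; 64];
    let n = read_panic_message(&task, &mut buf).unwrap();
    println!("{}", core::str::from_utf8(&buf[..n]).unwrap());
}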
Base automatically changed from eliza/read-panic-message to master January 5, 2026 23:36
@hawkw force-pushed the eliza/fault-ereport branch from 0adecf4 to bc6268d on January 5, 2026 23:59
@hawkw added the service processor and psc labels on Jan 6, 2026
@hawkw added the gimlet, cosmo, SP5 Board, fault-management, and ⚠️ ereport labels on Jan 6, 2026
@hawkw (Member, Author) commented Jan 6, 2026

Thinking about things a bit more, there are some more changes I think I want to make here before it's really ready to land. In particular:

  • Currently, we've changed the fault cooldown behavior in Jefe to just always wait 50 ms between restarts, rather than only doing so when a task has not been running for at least that long between two subsequent faults (see https://github.com/oxidecomputer/hubris/blob/b31df4359ffc4f06136474fee952e32f9466b34e/task/jefe/src/main.rs). This means that there's now always 50 ms of latency for all task restarts. This is to give Packrat time to generate an ereport, but it feels a bit not great.

    I think we should be doing a somewhat more complex thing here. We should probably implement the approach that @cbiffle described in ereport: hubris task panicked/faulted #2309 (comment), and add a way for Packrat to let Jefe know it has finished generating an ereport for a fault. That way, we can possibly reduce the latency for restarts a bit by saying "we will always give Packrat up to 50ms to produce a fault report, but if it finishes before then and the task has already been running for a while, we will restart it sooner".

  • Currently, if all faulted tasks have already been restarted by the time Packrat actually processes the "task faulted" notification, Packrat just does nothing.[1] We should maybe fix this by having packrat do some kind of "some task probably faulted but I couldn't figure out which one" ereport, so that it's not totally lost.

Personally, I think we should definitely do the second point here (some kind of "task faults may have occurred" ereport) before merging this PR. I'm on the fence about whether the first point (reducing restart latency) is worth doing now or not. It's a bit more complexity in Jefe...

@cbiffle, any thoughts?

Footnotes

  [1] Well, it ringbufs about it, but in production, that's equivalent to "doing nothing".

Comment on lines +449 to +462
if faulted_tasks == 0 {
    // Okay, this is a bit weird. We got a notification saying tasks
    // have faulted, but by the time we scanned for faulted tasks, we
    // couldn't find any. This means one of two things:
    //
    // 1. The fault notification was spurious (in practice, this
    //    probably means someone is dorking around with Hiffy and sent
    //    a fake notification just to mess with us...)
    // 2. Tasks faulted, but we were not scheduled for at least 50ms
    //    after the faults occurred, and Jefe has already restarted
    //    them by the time we were permitted to run.
    //
    // We should probably record some kind of ereport about this.
}
@hawkw (Member, Author):

i still wanna figure out what i need to put in the ereport in this case --- what class should it be, etc. hubris.fault.maybe_faults or something weird like that.

It's also a bit strange because the function for recording an ereport in the ereport ringbuffer requires a task ID as part of the insert function. For all the other ereports, I've used the ID of the task that faulted for that field, rather than the ID of Packrat (who is actually generating the ereport) or Jefe (who is spiritually sort of responsible for reporting it in some vibes-based way); this felt like the right thing in general. However, when the ereport just says "some task may have faulted", I'm not totally sure what ID I want to put in here, since I don't want to incorrectly suggest that Jefe or Packrat has faulted...hmm...

A collaborator replied:

I think taskID of Packrat and a distinguishing class would be fine.

@cbiffle (Collaborator) commented Jan 6, 2026

So I think the "max of 50ms" interplay between Jefe and Packrat sounds promising, and doesn't seem like too much additional supervisor complexity -- particularly while crashdumps are still in Jefe. If we want to reduce complexity, I'd start there.

I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

That said, packrat is by nature typically just one priority level under Jefe, so it should be able to respond in a timely fashion in most cases. The thing most likely to starve it is ... crash dumps.
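
A minimal standalone sketch of the "max of 50 ms" interplay discussed above, assuming Jefe would know when the fault occurred, how long the task had been running, and whether Packrat has acknowledged the fault; all names and types here are hypothetical, not the actual Jefe code:

use std::time::{Duration, Instant};

const HOLDOFF: Duration = Duration::from_millis(50);

fn restart_at(
    faulted_at: Instant,
    ran_for: Duration,
    packrat_ack: Option<Instant>,
) -> Instant {
    let holdoff_deadline = faulted_at + HOLDOFF;
    match packrat_ack {
        // Packrat finished its ereport early and the task had been running
        // long enough that this isn't a crash loop: restart as soon as the
        // acknowledgement arrived.
        Some(acked) if ran_for >= HOLDOFF && acked < holdoff_deadline => acked,
        // Otherwise, wait out the full holdoff before restarting.
        _ => holdoff_deadline,
    }
}

fn main() {
    let t0 = Instant::now();
    // The task had run for 2 s before faulting and Packrat acked after
    // 10 ms, so the restart doesn't need to wait the full 50 ms.
    let acked = t0 + Duration::from_millis(10);
    assert_eq!(restart_at(t0, Duration::from_secs(2), Some(acked)), acked);
}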

@hawkw (Member, Author) commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

@hawkw (Member, Author) commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.
>
> Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

Here's my attempt at doing that, which is both conceptually quite elegant and implementationally somewhat disgusting: eliza/fault-ereport...eliza/fault-counts#diff-48cf874f5ac8432941e2ba390792b33a94f9aea18dd933bbdb105cd23b93c9ee
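
A minimal standalone model of that generation-tracking scheme (the names are hypothetical and this is not the code in the linked diff): Packrat remembers the last generation it saw for each task index and, when notified of faults, treats any task whose generation has advanced as having restarted since the last scan.

const NUM_TASKS: usize = 4; // stand-in for hubris_num_tasks::NUM_TASKS

struct GenTracker {
    last_seen: [u8; NUM_TASKS],
}

impl GenTracker {
    fn restarted_tasks(&mut self, current: &[u8; NUM_TASKS]) -> Vec<usize> {
        let mut restarted = Vec::new();
        for i in 0..NUM_TASKS {
            if current[i] != self.last_seen[i] {
                // The generation advanced, so the task restarted at least
                // once since the last scan (though, as noted in the reply
                // below, not necessarily because it faulted, and a u8 can
                // wrap back around to the same value).
                self.last_seen[i] = current[i];
                restarted.push(i);
            }
        }
        restarted
    }
}

fn main() {
    let mut tracker = GenTracker { last_seen: [0; NUM_TASKS] };
    // Task 2's generation has advanced since the last scan.
    let current = [0, 0, 1, 0];
    assert_eq!(tracker.restarted_tasks(&current), vec![2usize]);
}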

@cbiffle (Collaborator) commented Jan 6, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.

But for now it's basically equivalent I think?

@hawkw (Member, Author) commented Jan 7, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.
>
> I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.
>
> But for now it's basically equivalent I think?

After a bit more thought, I'm considering going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple of other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than NUM_TASKS syscalls), and it lets us use a bigger counter than the u8 generation number, which reduces the likelihood that the counter will wrap around and end up at the same value it was last time Packrat checked, missing the fault.

On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...
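
A minimal standalone sketch of that counter-based variant, assuming Jefe would hand back one u32 fault counter per task; the IPC shape and names here are hypothetical:

const NUM_TASKS: usize = 4;

fn faults_since_last_check(
    last_seen: &mut [u32; NUM_TASKS],
    current: &[u32; NUM_TASKS],
) -> [u32; NUM_TASKS] {
    let mut delta = [0u32; NUM_TASKS];
    for i in 0..NUM_TASKS {
        // Wrapping subtraction yields the number of faults since the last
        // check even if the counter wrapped, and a u32 is far less likely
        // than a u8 generation to wrap all the way back to the same value
        // between checks.
        delta[i] = current[i].wrapping_sub(last_seen[i]);
        last_seen[i] = current[i];
    }
    delta
}

fn main() {
    let mut last_seen = [0u32; NUM_TASKS];
    // Task 1 faulted twice since the last check; the others are unchanged.
    let current = [0u32, 2, 0, 0];
    assert_eq!(faults_since_last_check(&mut last_seen, &current), [0, 2, 0, 0]);
}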
