Sidecar SP getting a NACK from Tofino is essentially fatal with poor error reporting #2326

@nathanaelhuffman

An example ringbuf:

nathanael@sam ~ $ pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/d/sp/build-sidecar-d-image-default-v1.0.52.zip ringbuf sequencer
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: ring buffer drv_oxide_vpd::__RINGBUF in sequencer:
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in sequencer:
humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
   2  904        1        1 MainboardControllerId(0x1de5bae)
   3  918        1        1 MainboardControllerChecksum(0x6407475e)
   4  950        1        1 MainboardControllerVersion(0x283)
   5  951        1        1 MainboardControllerSha(0x3c8d1c33)
   6  952        1        1 FpgaInitComplete
   7   31        1        1 LoadingClockConfiguration
   8  977        1        1 ClockConfigurationComplete
   9  228        1        1 FrontIOBoardPowerEnable(true)
  10  245        1        1 FrontIOBoardPowerGood
  11  982        1        1 FrontIOBoardPresent
  12   81        1        1 LoadingFrontIOControllerBitstream { fpga_id: 0x0 }
  13   91        1        1 FrontIOControllerIdent { fpga_id: 0x0, ident: 0x1deaa55 }
  14   98        1        1 FrontIOControllerChecksum { fpga_id: 0x0, checksum: [ 0xd4, 0xaa, 0x2a, 0x16 ], expected: [ 0xd4, 0xaa, 0x2a, 0x16 ] }
  15   81        1        1 LoadingFrontIOControllerBitstream { fpga_id: 0x1 }
  16   91        1        1 FrontIOControllerIdent { fpga_id: 0x1, ident: 0x1deaa55 }
  17   98        1        1 FrontIOControllerChecksum { fpga_id: 0x1, checksum: [ 0xd4, 0xaa, 0x2a, 0x16 ], expected: [ 0xd4, 0xaa, 0x2a, 0x16 ] }
  18  340        1        1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
  19  154        1        1 FanModuleLedUpdate(Zero, On)
  20  154        1        1 FanModuleLedUpdate(One, On)
  21  154        1        1 FanModuleLedUpdate(Two, On)
  22  154        1        1 FanModuleLedUpdate(Three, On)
  23  340        1        3 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
  24  245        1        1 FrontIOBoardPowerGood
  25  328        1        1 FrontIOBoardPhyPowerEnable(true)
  26  550        1        1 FrontIOBoardPhyOscGood
  27  340        1        1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
  28   81        1        1 TofinoPowerUp
  29   89        1        1 TofinoVidAttempt(0x0)
  30   50        1        1 SetVddCoreVout(Volts(0.79))
  31  107        1        1 TofinoVidAck
   0  796        2        1 TofinoSequencerError(FpgaError)
   1  340        2     2713 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })

Near the end we got an "FpgaError", yet the sequencer stayed up in A0. The Tofino is essentially unusable in this state because it hasn't been properly configured for SRIS and PCIe.

@mkeeter put some investigation into the internal ticket:

I inspected the system on sam, building a custom image with additional logging.

The failing call is this write_direct. It's failing because the TOFINO_DEBUG_PORT_STATE is invalid: it has a value of 0x24, which corresponds to receive_buffer_empty | address_nack_error. In write_direct, we also require write_buffer_empty (bit 0) to be set; this is not the case, so it exits with an error.
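
To make the register decoding above concrete, here is a minimal sketch of the precondition check. It is not the actual driver code. Only bit 0 (write_buffer_empty) is stated in this issue; the positions of receive_buffer_empty and address_nack_error are assumptions chosen to be consistent with 0x24, and `DebugPortError` and `check_ready_for_write` are hypothetical names.

```rust
// Bit 0 is stated in the issue. The other two positions are ASSUMPTIONS
// that merely add up to the observed value 0x24.
const WRITE_BUFFER_EMPTY: u8 = 1 << 0;
const RECEIVE_BUFFER_EMPTY: u8 = 1 << 2; // assumed position
const ADDRESS_NACK_ERROR: u8 = 1 << 5; // assumed position

/// Hypothetical error type that keeps the raw register value, so a ring
/// buffer entry could show more than a bare "FpgaError".
#[derive(Debug, PartialEq, Eq)]
enum DebugPortError {
    AddressNack { state: u8 },
    WriteBufferNotEmpty { state: u8 },
}

/// Checks the precondition described for `write_direct`: the write buffer
/// must be empty. A latched NACK is reported first because it is the more
/// specific failure.
fn check_ready_for_write(state: u8) -> Result<(), DebugPortError> {
    if (state & ADDRESS_NACK_ERROR) != 0 {
        return Err(DebugPortError::AddressNack { state });
    }
    if (state & WRITE_BUFFER_EMPTY) == 0 {
        return Err(DebugPortError::WriteBufferNotEmpty { state });
    }
    Ok(())
}
```

With the observed value, `check_ready_for_write(0x24)` returns `Err(DebugPortError::AddressNack { state: 0x24 })` under these assumed bit positions.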

I'm not sure why this is happening, though. Seeing address_nack_error seems suspicious, but I'm not sure what kind of hardware issue would cause this problem.

(See also #1763, where failing to reset this register caused issues on Sidecar hot-resets. I don't think this is relevant, because I see the failure when powering on the Sidecar from off.)

I have partially worked around this manifestation of the issue in PR #2325.

We should:

  • Increase the logging fidelity to show what is failing (not just "FpgaError").
  • Consider retrying or otherwise clearing the transaction that was NACKed.
  • Improve the overall system response to show that we're in an invalid Tofino state, probably by de-sequencing, since the Tofino is unusable in this state. Without this, we continue to allow up-stack software to keep trying and eventually tip over for confusing reasons.
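
One way the retry and de-sequence ideas could fit together is sketched below. This is not the sequencer's actual code: the `DebugPort` trait, `Outcome`, `write_with_retry`, and the `FlakyPort` mock are all hypothetical, and whether clearing the error state is enough to recover from a NACK on real hardware is an open question.

```rust
/// Hypothetical abstraction over the debug port; not the real driver API.
trait DebugPort {
    /// Attempts the write. On failure, returns the raw state register value.
    fn try_write(&mut self) -> Result<(), u8>;
    /// Clears any latched error bits before the next attempt.
    fn clear_errors(&mut self);
}

/// What the caller should do once the attempts are finished.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Configured { attempts: u32 },
    /// The Tofino was never configured: power it down instead of staying in A0.
    Desequence { last_state: u8 },
}

/// Retries the write up to `max_attempts` times, clearing errors in between.
fn write_with_retry<P: DebugPort>(port: &mut P, max_attempts: u32) -> Outcome {
    let mut last_state = 0;
    for attempt in 1..=max_attempts {
        match port.try_write() {
            Ok(()) => return Outcome::Configured { attempts: attempt },
            Err(state) => {
                last_state = state; // keep the raw value for logging
                port.clear_errors();
            }
        }
    }
    Outcome::Desequence { last_state }
}

/// Mock port that fails until its errors have been cleared enough times.
struct FlakyPort {
    failures_left: u32,
}

impl DebugPort for FlakyPort {
    fn try_write(&mut self) -> Result<(), u8> {
        if self.failures_left > 0 { Err(0x24) } else { Ok(()) }
    }
    fn clear_errors(&mut self) {
        self.failures_left = self.failures_left.saturating_sub(1);
    }
}
```

Returning the last raw state with `Desequence` lets the caller log why it gave up, which covers the first bullet as well.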
