lil plcbundle spec review #1

@uniphil

hello! there has been quite a time gap between when i first had a chance to actually look at the spec after you asked, and now when i'm writing it up, so apologies if some of this has already been addressed. and before jumping in i just want to say that it's very very good that we have more tooling up for consuming and monitoring plc.

with that, here are a few potential flags that jumped out to me:

1. plc json "raw data" is not deterministic.

the spec says implementations "must capture and store the exact, unmodified raw JSON byte string for each operation as it is originally received from the PLC directory's /export stream."

this is not sufficient for a reproducible hash of the log because the order of keys in the json returned by the directory is not guaranteed to be stable.

plc itself (like atproto more widely) uses CBOR (DAG-CBOR) to encode plc ops so they can be hashed deterministically; doing the same might be worth considering as an op transform step for plcbundle.
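
a minimal sketch of the problem, assuming made-up op fields: two byte strings that decode to the same JSON object hash differently when hashed raw, but identically after a canonicalizing transform. sorted-key JSON is used here only as a stdlib stand-in for a deterministic encoding; it is not the DAG-CBOR encoding plc uses, and `raw_hash` / `canonical_hash` are hypothetical names.

```python
import hashlib
import json


def raw_hash(raw: bytes) -> str:
    """Hash the bytes exactly as received (sensitive to key order)."""
    return hashlib.sha256(raw).hexdigest()


def canonical_hash(raw: bytes) -> str:
    """Decode, re-encode with sorted keys and fixed separators, then hash.

    Stand-in for a deterministic encoding step; NOT DAG-CBOR.
    """
    op = json.loads(raw)
    canonical = json.dumps(
        op, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Same logical op, different key order (field values are illustrative only).
a = b'{"did":"did:plc:example","cid":"bafyexample","nullified":false}'
b = b'{"cid":"bafyexample","nullified":false,"did":"did:plc:example"}'

raw_hashes_match = raw_hash(a) == raw_hash(b)
canonical_hashes_match = canonical_hash(a) == canonical_hash(b)
```

with raw bytes the two hashes disagree; after the transform they agree, which is the property a reproducible bundle hash needs.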

2. unbounded index file size growth

the current index file is 4.5MiB, and will grow linearly with PLC op count. if bluesky's ambition for 100x more growth pans out, this file could approach gigabytes in size.

this might just be fine! very future-problem kind of thing. but some things to watch out for

  • the index must be downloaded before actually working with any bundles, adding latency to tasks like backfilling
  • tools that analyze the index (especially browser-based) could take significant time to load and use a lot of memory

without significant re-architecting (solving another discovery problem if splitting the index into multiple files), the index size could be reduced by removing redundant information:

  • bundle_number is a nice affordance but is duplicated by index of the bundles array
  • operation_count is fixed in "V1"
  • did_count is an interesting statistic but it's not clear what purpose it serves in the index
  • hash, content_hash, parent, compressed_hash: together comprise 256 bytes (128 bytes of that incompressible) per entry
    • compressed_hash trades index size for client convenience: clients don't need it, since they can decompress + hash to verify. if keeping it is important, it might be worth adding a rationale for it somewhere.
    • the rest: i think you could bind the hash of the current bundle to its parent by putting the parent hash into the bundle itself (merkle-tree-ish, like plc ops) instead of doing all the chaining externally in the index. i think you can get away with just one index hash per bundle file with this approach, while retaining the chain integrity properties. (eg: specify the first line of a bundle file to be the hash of the previous bundle; remaining lines are the jsonl ops).
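
a rough sketch of the "parent hash as first line" idea from the last bullet, under stated assumptions: sha-256 over the uncompressed bundle bytes, hex-encoded hashes, newline-delimited lines, and an all-zero parent for the first bundle. all function names here are hypothetical and not part of plcbundle.

```python
import hashlib

GENESIS_PARENT = "0" * 64  # assumed placeholder parent for the first bundle


def build_bundle(parent_hash: str, op_lines: list) -> bytes:
    """First line is the parent bundle's hash; remaining lines are jsonl ops."""
    return ("\n".join([parent_hash, *op_lines]) + "\n").encode("utf-8")


def bundle_hash(bundle: bytes) -> str:
    return hashlib.sha256(bundle).hexdigest()


def verify_chain(bundles: list, index_hashes: list) -> bool:
    """Check each bundle against the single hash in the index and against
    the parent hash embedded in its own first line."""
    if len(bundles) != len(index_hashes):
        return False
    expected_parent = GENESIS_PARENT
    for bundle, expected_hash in zip(bundles, index_hashes):
        first_line = bundle.split(b"\n", 1)[0].decode("utf-8")
        if first_line != expected_parent:
            return False
        if bundle_hash(bundle) != expected_hash:
            return False
        expected_parent = expected_hash
    return True


b1 = build_bundle(GENESIS_PARENT, ['{"op":1}'])
h1 = bundle_hash(b1)
b2 = build_bundle(h1, ['{"op":2}'])
h2 = bundle_hash(b2)

chain_ok = verify_chain([b1, b2], [h1, h2])
tampered_ok = verify_chain([b1.replace(b'"op":1', b'"op":9'), b2], [h1, h2])
```

because each bundle's hash covers its parent's hash, the index only needs one hash per bundle, and changing any earlier bundle breaks verification of everything after it.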

3. recent op malleability

there is a 72 hour window after submitting a plc op, during which that op can be nullified using a higher-precedence rotation key. this will actually change the earlier op's representation in the current /export, toggling its nullified property to true.

since this nullified property is included with the ops from /export, you cannot generate immutable bundles with raw ops until they are outside this window -- plcbundle should lag the upstream plc directory by at least 72 hours (i think i made allegedly's bundling cut off at 73hrs)

as far as i'm aware, this nullified property is just a denormalization of the data to make queries fast and simple: you can always compute it given the audit log of a DID. so, alternatively, i think plcbundle could do immutable bundles without lag by omitting the nullified property from ops in the bundle. note that this could add some additional work for clients who expect it to be present.
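
both options can be sketched roughly as follows, assuming ops are decoded into dicts with an ISO 8601 `createdAt` ending in `Z`. the helper names are hypothetical, and the 72 hour constant comes from the window described above.

```python
from datetime import datetime, timedelta, timezone

NULLIFICATION_WINDOW = timedelta(hours=72)


def parse_created_at(value: str) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_settled(op: dict, now: datetime) -> bool:
    """Option 1: only bundle ops that are past the nullification window."""
    return now - parse_created_at(op["createdAt"]) > NULLIFICATION_WINDOW


def strip_nullified(op: dict) -> dict:
    """Option 2: drop the mutable property so the stored op never changes."""
    return {key: value for key, value in op.items() if key != "nullified"}


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
old_op = {"createdAt": "2025-01-01T00:00:00.000Z", "nullified": False}
recent_op = {"createdAt": "2025-01-09T00:00:00.000Z", "nullified": False}

old_is_settled = is_settled(old_op, now)
recent_is_settled = is_settled(recent_op, now)
stripped = strip_nullified(recent_op)
```

option 1 keeps the raw ops intact at the cost of lag; option 2 removes the lag but means clients have to recompute nullification from the audit log.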

4. createdAt is not guaranteed to be unique, so 10k ops per bundle introduces ambiguity

(actually, the order of ops is not even guaranteed to be stable yet, but i'm not getting into that here)

this is the reason for allegedly's boundary deduplication stuff (which it looks like plcbundle picked up): multiple ops could have the exact same createdAt, so it's actually a little complex to even break ops into reliably-sized pages.

plcbundle currently uses createdAt for bundles' start_time, end_time, and cursor. a series of ops at a 10k bundle boundary that share the same createdAt would be a problem for plcbundle: which ops go in which bundle?

in practice and at the current plc op frequency, i would expect this to be very rare -- i haven't checked to see whether it has happened at all. but it's not currently restricted by the plc spec, so it has the potential to be a problem for plcbundle.
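
checking whether this has happened could look something like the sketch below, assuming you have the `createdAt` values of all ops in export order. `ambiguous_boundaries` is a hypothetical helper, and the tiny bundle size is just for illustration (plcbundle uses 10k).

```python
def ambiguous_boundaries(created_ats: list, bundle_size: int) -> list:
    """Return the op indexes where a bundle boundary falls between two ops
    that share the same createdAt, so a timestamp cursor can't separate them."""
    return [
        i
        for i in range(bundle_size, len(created_ats), bundle_size)
        if created_ats[i - 1] == created_ats[i]
    ]


timestamps = [
    "2025-01-01T00:00:00.000Z",
    "2025-01-01T00:00:01.000Z",
    "2025-01-01T00:00:02.000Z",  # last op of bundle 0
    "2025-01-01T00:00:02.000Z",  # first op of bundle 1, same createdAt
    "2025-01-01T00:00:03.000Z",
    "2025-01-01T00:00:04.000Z",
    "2025-01-01T00:00:05.000Z",
]

conflicts = ambiguous_boundaries(timestamps, bundle_size=3)
```

here the boundary at index 3 is flagged because the ops on either side share a timestamp; the boundary at index 6 is fine.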


whew. just also noting that i have pretty limited time for atproto for the remainder of the year and am prioritizing microcosm ops stability with the time i have, so i can't commit to further follow-ups on these; feel free to take or leave whatever you find useful here or not.
