lil plcbundle spec review

hello! there has been quite a time gap between when i first had a chance to actually look at the spec after you asked, and now when i'm writing it up, so apologies if some of this has already been addressed. and before jumping in i just want to say that it's very very good that we have more tooling up for consuming and monitoring plc.

with that, here are a few potential flags that jumped out to me


## 1. plc json "raw data" is not deterministic.

> must capture and store the exact, unmodified raw JSON byte string for each operation as it is originally received from the PLC directory's /export stream.

this is not sufficient for a reproducible hash of the log because the order of keys in the json returned by the directory is not guaranteed to be stable.

plc itself (like atproto more widely) encodes ops as CBOR so that they can be hashed deterministically. that might be worth considering as an op transform step for plcbundle.
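to make the problem concrete, here's a minimal sketch. note that it uses sorted-key compact JSON as a stand-in for a canonical encoding, only so the example stays stdlib-only; it is *not* the CBOR encoding plc uses, and the function names and the sample entry are made up for illustration.

```python
import hashlib
import json


def raw_hash(raw: bytes) -> str:
    """Hash the bytes exactly as received (what the quoted spec text implies)."""
    return hashlib.sha256(raw).hexdigest()


def canonical_hash(raw: bytes) -> str:
    """Hash a canonical re-encoding: sorted keys, no insignificant whitespace.

    Assumption: canonical JSON stands in here for a deterministic encoding
    such as the CBOR form plc uses for ops.
    """
    obj = json.loads(raw)
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same logical entry, serialized with two different key orders.
a = b'{"did":"did:plc:example","nullified":false}'
b = b'{"nullified": false, "did": "did:plc:example"}'

assert raw_hash(a) != raw_hash(b)              # raw bytes: hashes diverge
assert canonical_hash(a) == canonical_hash(b)  # canonical form: hashes agree
```

the tradeoff is that the bundle no longer stores the bytes "as received", so the spec would need to pin down the canonical form instead.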


## 2. unbounded index file size growth

the current index file is 4.5MiB, and will grow linearly with PLC op count. if bluesky's ambition for 100x more growth pans out, this file could reach hundreds of MiB (roughly 450MiB at 100x) and start approaching gigabytes in size.

this might just be fine! very future-problem kind of thing. but some things to watch out for

- the index must be downloaded before actually working with any bundles, adding latency to tasks like backfilling
- tools that analyze the index (especially browser-based) could take significant time to load and use a lot of memory

without significant re-architecting (solving another discovery problem if splitting the index into multiple files), the index size could be reduced by removing redundant information:

- `bundle_number` is a nice affordance but is duplicated by each entry's position in the `bundles` array
- `operation_count` is fixed in "V1"
- `did_count` is an interesting statistic but it's not clear what purpose it serves in the index
- `hash`, `content_hash`, `parent`, `compressed_hash`: together comprise 256 bytes (128 bytes incompressible) per entry
    - `compressed_hash` trades index size for client convenience: clients don't *need* it, since they can decompress + hash to verify. if keeping it is important, it might be worth adding a rationale for it somewhere.
    - the rest: i think you could bind the hash of the current bundle to its parent by putting the parent hash into the bundle itself (merkle-tree-ish, like plc ops) instead of doing all the chaining externally in the index. i think you can get away with just one index hash per bundle file with this approach, while retaining the chain integrity properties. (eg: specify the first line of a bundle file to be the hash of the previous bundle; remaining lines are the jsonl ops).
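a rough sketch of the "parent hash as first line" idea, under some assumptions of mine that the spec would have to decide for real: sha-256 over the uncompressed bundle bytes, hex-encoded hashes, and an all-zero placeholder parent for the first bundle. all names here are hypothetical.

```python
import hashlib

# Assumption: the first bundle has no parent, so it uses a fixed placeholder.
GENESIS_PARENT = "0" * 64


def build_bundle(parent_hash: str, op_lines: list[str]) -> bytes:
    """First line is the parent bundle's hash; remaining lines are JSONL ops."""
    return ("\n".join([parent_hash, *op_lines]) + "\n").encode("utf-8")


def bundle_hash(bundle: bytes) -> str:
    """The single per-bundle hash the index would need to carry."""
    return hashlib.sha256(bundle).hexdigest()


def verify_chain(bundles: list[bytes], index_hashes: list[str]) -> bool:
    """Check each bundle against its index hash and its embedded parent hash."""
    if len(bundles) != len(index_hashes):
        return False
    expected_parent = GENESIS_PARENT
    for bundle, indexed in zip(bundles, index_hashes):
        first_line = bundle.split(b"\n", 1)[0].decode("utf-8")
        if first_line != expected_parent or bundle_hash(bundle) != indexed:
            return False
        expected_parent = indexed
    return True


b0 = build_bundle(GENESIS_PARENT, ['{"op": 1}'])
b1 = build_bundle(bundle_hash(b0), ['{"op": 2}'])
assert verify_chain([b0, b1], [bundle_hash(b0), bundle_hash(b1)])
```

because each bundle's hash covers its parent's hash, changing any earlier bundle breaks every later link, which is the same property the external `parent` field gives today.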


## 3. recent op malleability

there is a 72 hour window after submitting a plc op, during which that op can be *nullified* using a higher-precedence rotation key. this will actually *change* the earlier op's representation in the current `/export`, toggling its `nullified` property to true.

since this `nullified` property is included with the ops from `/export`, you *cannot* generate immutable bundles with raw ops until they are outside this window -- plcbundle should lag the upstream plc directory by at least 72 hours (i think i made allegedly's bundling cut off at 73hrs)

as far as i'm aware, this `nullified` property is just a denormalization of the data to make queries fast and simple: you can always compute it given the audit log of a DID. so, alternatively, i think plcbundle could do immutable bundles without lag by **omitting the `nullified` property from ops in the bundle**. note that this could add some additional work for clients who expect it to be present.
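both options are small on the bundling side. a minimal sketch, assuming each `/export` entry is a dict with `createdAt` as an ISO 8601 UTC timestamp and an optional `nullified` key; the helper names and the one-hour safety margin are my own choices, not anything from the plcbundle spec.

```python
from datetime import datetime, timedelta, timezone

# The nullification window described above.
NULLIFY_WINDOW = timedelta(hours=72)


def parse_created_at(value: str) -> datetime:
    """Parse timestamps like 2024-01-01T00:00:00.000Z into aware datetimes."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_settled(
    entry: dict, now: datetime, margin: timedelta = timedelta(hours=1)
) -> bool:
    """Option 1: only bundle entries that can no longer be nullified.

    Assumption: a small extra margin on top of the 72 hour window.
    """
    age = now - parse_created_at(entry["createdAt"])
    return age >= NULLIFY_WINDOW + margin


def without_nullified(entry: dict) -> dict:
    """Option 2: drop the mutable `nullified` flag before bundling."""
    return {k: v for k, v in entry.items() if k != "nullified"}


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
old = {"createdAt": "2024-01-01T00:00:00.000Z", "nullified": False}
new = {"createdAt": "2024-01-09T12:00:00.000Z", "nullified": False}
assert is_settled(old, now) and not is_settled(new, now)
assert "nullified" not in without_nullified(new)
```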


## 4. `createdAt` is not guaranteed to be unique, so 10k ops per bundle introduces ambiguity

(actually, the order of ops is not even guaranteed to be stable yet, but i'm not getting into that here)

this is the reason for allegedly's boundary deduplication stuff (which it looks like plcbundle picked up): multiple ops *could* have the exact same `createdAt`, so it's actually a little complex to even break ops into reliably-sized pages.

plcbundle currently uses `createdAt` for bundles' `start_time`, `end_time`, and `cursor`. a series of ops at a 10k bundle boundary that share the same `createdAt` would be a problem for plcbundle: which ops go in which bundle?

in practice and at the current plc op frequency, i would expect this to be very rare -- i haven't checked to see whether it has happened at all. but it's not currently restricted by the plc spec, so it has the potential to be a problem for plcbundle.
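checking whether it has happened is cheap. a minimal sketch, assuming the ops are already loaded in export order as dicts with a `createdAt` field; the function name is made up, and the bundle size is a parameter so the example can use tiny bundles.

```python
def ambiguous_boundaries(entries: list[dict], bundle_size: int) -> list[int]:
    """Return each boundary index i where entries[i - 1] (last op of one
    bundle) and entries[i] (first op of the next) share a createdAt.

    At such a boundary, a cursor equal to that timestamp cannot say which
    of the tied ops were already bundled.
    """
    return [
        i
        for i in range(bundle_size, len(entries), bundle_size)
        if entries[i - 1]["createdAt"] == entries[i]["createdAt"]
    ]


ops = [
    {"createdAt": "2024-01-01T00:00:00.000Z"},
    {"createdAt": "2024-01-01T00:00:01.000Z"},
    {"createdAt": "2024-01-01T00:00:01.000Z"},  # tie straddles the boundary
    {"createdAt": "2024-01-01T00:00:02.000Z"},
]
assert ambiguous_boundaries(ops, bundle_size=2) == [2]
```

running this over the existing bundles with `bundle_size=10_000` would show whether any current boundary is affected.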


---

whew. just also noting that i have pretty limited time for atproto for the remainder of the year and am prioritizing microcosm ops stability with the time i have, so i can't commit to further follow-ups on these. feel free to take or leave whatever you find useful here.
