Skip to content

Conversation

@mvandenburgh
Copy link
Member

This PR proposes a design for #2501, and provides "instance protection"-like safety guards for published assets as requested in #2367 (comment).

@mvandenburgh mvandenburgh marked this pull request as ready for review October 7, 2025 17:06
@mvandenburgh mvandenburgh added the documentation Changes only affect the documentation label Oct 7, 2025
@mvandenburgh mvandenburgh self-assigned this Oct 7, 2025
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments/thoughts

@mvandenburgh mvandenburgh requested a review from satra October 7, 2025 17:38
@mvandenburgh mvandenburgh force-pushed the published-object-lock branch from 2e49a28 to 7264248 Compare October 14, 2025 14:58
@waxlamp waxlamp added the design-doc Involves creating or discussing a design document label Nov 3, 2025
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor questions/comments


1. Enable S3 Object Lock on the bucket
2. Deploy code changes to apply legal holds during publish celery task
3. Run a one-time script to apply legal holds to all existing published asset blobs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: yet another opportunity for some kind of fsck style command to ensure consistency of the bucket with our assumptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I thought about this again, and wondered if

Suggested change
3. Run a one-time script to apply legal holds to all existing published asset blobs
3. Run a one-time script to apply legal holds to all existing published asset blobs
4. Implement and deploy for weekly run an admin-level maintenance script to verify consistency of legal holds on S3 to the information about published assets in DB

since that would allow to actually check that all is good in staging, otherwise a benefit from deploying in staging is kinda limited (e.g. how would you know that it does work correctly in staging?)

waxlamp and others added 2 commits November 25, 2025 10:28
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there -- some minor comments on various aspects. Some could potentially be ignored.


The DANDI Archive's core value proposition is hosting reliable, citable scientific data. Once a dandiset is published and assigned a DOI, the data it references MUST remain immutable to preserve the integrity of scientific citations and reproducibility. Users who cite a published dandiset expect that the data will remain exactly as it was at the time of publication.

Currently, while our application logic prevents modification of published assets, there is no infrastructure-level enforcement. Adding S3-level immutability provides stronger data safety.
Copy link
Member

@yarikoptic yarikoptic Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it is jumping over the different levels (S3: blobs, zarr prefixes; and SQL DB: assets associated with those) without due description. May be smth like

Suggested change
Currently, while our application logic prevents modification of published assets, there is no infrastructure-level enforcement. Adding S3-level immutability provides stronger data safety.
Dandiset is a collection of assets which are associated many-to-one to content on S3 of two data types: blobs and zarrs.
A blob is a representation of a single key ("file") under `blobs/` prefix, while zarr is a prefix ("folder") under `zarr/` prefix).
`blobs/` store is already immutable, as existing keys MUST NOT mutate, while zarrs are allowed to be updated under `zarr/` but are associated with different S3 key `versionId`'s.
Only a single blob or a single zarr is associated with a single asset.
Whenever zarr assets remain mutable and thus cannot be published ATM, published blob assets MUST remain immutable and our application logic already prevents modification of published assets.
But there is no infrastructure-level enforcement to guarantee immutability of published blobs on S3. Adding S3-level immutability provides stronger data safety.


## Proposed Solution

AWS S3 Object Lock with legal holds provides exactly the protection we need. A legal hold is a flag that can be applied to an S3 object version to prevent deletion or modification, regardless of any retention policies or IAM permissions. Unlike retention modes, legal holds have no expiration date and remain in effect until explicitly removed via the `s3:PutObjectLegalHold` S3 API. See the [documentation for S3 legal holds](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html#object-lock-legal-holds) for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although those are indeed two neighboring subsections, new reader would not know of such proximities reference would be handy IMHO, especially since "retention mode" keeps coming up.

Suggested change
AWS S3 Object Lock with legal holds provides exactly the protection we need. A legal hold is a flag that can be applied to an S3 object version to prevent deletion or modification, regardless of any retention policies or IAM permissions. Unlike retention modes, legal holds have no expiration date and remain in effect until explicitly removed via the `s3:PutObjectLegalHold` S3 API. See the [documentation for S3 legal holds](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html#object-lock-legal-holds) for more details.
AWS S3 Object Lock with legal holds provides exactly the protection we need. A legal hold is a flag that can be applied to an S3 object version to prevent deletion or modification, regardless of any retention policies or IAM permissions. Unlike [retention modes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html#object-lock-retention-modes), legal holds have no expiration date and remain in effect until explicitly removed via the `s3:PutObjectLegalHold` S3 API. See the [documentation for S3 legal holds](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html#object-lock-legal-holds) for more details.

- Only published assets need protection from deletion, other objects in the bucket MUST remain deletable
- Published assets MUST remain immutable permanently (no expiration)
- There MUST be the ability to remove the hold if absolutely necessary (per the "backdoor" requirement)
- Legal holds are independent of retention policies and work alongside bucket versioning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Legal holds are independent of retention policies and work alongside bucket versioning
- Legal holds are independent of retention policies and work alongside bucket versioning
- Legal holds apply to an object version thus MAY later be adopted for protection of published zarr assets

2. **Compliance mode**: No user can delete the object version until the retention period expires, including the root account
3. **Legal hold**: A separate flag independent of retention mode that prevents deletion until removed

Goverence and compliance modes operate at the bucket level, while legal holds operate at the object level.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Goverence and compliance modes operate at the bucket level, while legal holds operate at the object level.
Goverence and compliance modes operate at the bucket level, while legal holds operate at the object version level.

IMHO an important feature (if I got it right) since will allow later to protect Zarr versions. (added below as an item)


## Interaction with Garbage Collection

Legal holds MUST NOT interfere with garbage collection of unpublished assets, as only published asset blobs SHALL have legal holds applied. The garbage collection processes for orphaned uploads, unreferenced AssetBlobs, and unpublished assets MUST continue to work as designed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Legal holds MUST NOT interfere with garbage collection of unpublished assets, as only published asset blobs SHALL have legal holds applied. The garbage collection processes for orphaned uploads, unreferenced AssetBlobs, and unpublished assets MUST continue to work as designed.
Legal holds MUST NOT interfere with garbage collection of unpublished assets, as only published asset blobs SHALL have legal holds applied. The garbage collection processes (e.g., of orphaned uploads, unreferenced AssetBlobs, and unpublished assets) MUST continue to work as designed.

since this list might be complete currently but could change later (zarrs GC etc)


### Potential Interference Scenarios (and Why They Don't Occur)

The following scenarios could theoretically interfere with garbage collection, but do NOT occur with this design:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The following scenarios could theoretically interfere with garbage collection, but do NOT occur with this design:
The following scenarios MAY theoretically interfere with garbage collection, but SHOULD NOT occur with this design:

or should even be made

Suggested change
The following scenarios could theoretically interfere with garbage collection, but do NOT occur with this design:
The following scenarios MAY theoretically interfere with garbage collection, but MUST NOT occur with this design:

thus warranting ERROR reporting?


The following scenarios could theoretically interfere with garbage collection, but do NOT occur with this design:

1. **Overly broad legal holds**: If legal holds were accidentally applied to unpublished blobs, those blobs could not be deleted by garbage collection processes until the holds were manually removed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. **Overly broad legal holds**: If legal holds were accidentally applied to unpublished blobs, those blobs could not be deleted by garbage collection processes until the holds were manually removed.
1. **Overly broad legal holds**: If legal holds were accidentally applied to unpublished blobs, those blobs MUST NOT be deleted by garbage collection processes until the holds were manually removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I am thinking about deleting a published dandiset overall (e.g. per author's or regulatory organ instruction). We would need to implement a process to review if each blob has any other "publication" or even just a "draft" version and potentially be able to decide on blob-by-blob case on what to do.

I think it is worth adding some statement here that we are not aiming to design/implement that logic here but that it MAY be designed/implemented later

Suggested change
1. **Overly broad legal holds**: If legal holds were accidentally applied to unpublished blobs, those blobs could not be deleted by garbage collection processes until the holds were manually removed.
1. **Overly broad legal holds**: If legal holds were accidentally applied to unpublished blobs, those blobs could not be deleted by garbage collection processes until the holds were manually removed.
- Under force-major circumstances requiring deletion of a published dandiset, and thus its assets (all or some), legal holds might remain attached to assets associated with published or not blobs. Analysis and design of handling of such cases MAY be implemented later, but overall the mechanism of legal holds seems to be appropriate and will not become an obstacle.


**Why legal holds avoid these issues:**
- Unpublished assets never receive legal holds, so they remain fully deletable while the deletion permissions remain intact for object versions without legal holds
- Legal holds are applied at the **object level**, only to specific published asset blobs. i.e., they are **object-level retention policies** instead of **bucket-wide retention policies**.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Legal holds are applied at the **object level**, only to specific published asset blobs. i.e., they are **object-level retention policies** instead of **bucket-wide retention policies**.
- Legal holds are applied at the **object version level**, only to specific published asset blobs. i.e., they are **object-level retention policies** instead of **bucket-wide retention policies**.


1. Enable S3 Object Lock on the bucket
2. Deploy code changes to apply legal holds during publish celery task
3. Run a one-time script to apply legal holds to all existing published asset blobs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I thought about this again, and wondered if

Suggested change
3. Run a one-time script to apply legal holds to all existing published asset blobs
3. Run a one-time script to apply legal holds to all existing published asset blobs
4. Implement and deploy for weekly run an admin-level maintenance script to verify consistency of legal holds on S3 to the information about published assets in DB

since that would allow to actually check that all is good in staging, otherwise a benefit from deploying in staging is kinda limited (e.g. how would you know that it does work correctly in staging?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

design-doc Involves creating or discussing a design document documentation Changes only affect the documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants