Conversation

Member

@lsm5 lsm5 commented Dec 2, 2025

Rebased on #499 . Only review HEAD commit with image/storage/storage_dest.go changes here.

@github-actions github-actions bot added the image Related to "image" package label Dec 2, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Dec 2, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6561

Contributor

@mtrmac mtrmac left a comment


Thanks!

This serves well to highlight some of the design pain points (layer and image IDs)…

… but I’d also expect a lot of the complexity and thinking to center around the lookups in tryReusingBlobAsPending and while obtaining filename in createNewLayer, especially in mixed-algorithm scenarios. (BlobInfoCache might also be impacted and require a redesign.) I don’t know, maybe there is some simple logic that makes all of this simple to handle (“each storage.Layer only contains digest values all using the same algorithm????”), it’s not immediately obvious to me.

(“Request changes” at least for the .Encoded() calls in IDs.)

```diff
 defer decompressed.Close()

-diffID := digest.Canonical.Digester()
+diffID := supporteddigests.TmpDigestForNewObjects().Digester()
```
Contributor

This is more complex — the “Ensure that we always see the same “view” of a layer” code will compare the value to config’s RootFS.DiffIDs, and that might use a different digest.

Potentially, I guess we might need to compute two uncompressed digests: one to match against config, and one to identify the layer in c/storage at the user-wanted level of collision resistance.

```diff
 // ordinaryImageID is a digest of a config, which is a JSON value.
 // To avoid the risk of collisions, start the input with @ so that the input is not a valid JSON.
-tocImageID := digest.FromString("@With TOC:" + tocIDInput).Encoded()
+tocImageID := supporteddigests.TmpDigestForNewObjects().FromString("@With TOC:" + tocIDInput).Encoded()
```
Contributor

This strips the algorithm ID; that would not make the new implementation “agile” (we couldn’t switch to a different algorithm that produces digests of the same length).

“What should be the new format image IDs” is a significant design choice (also shows up in Podman UI).


If we are moving from sha256 (and presumably breaking users), that’s potentially a significant opportunity to entirely change the image ID design. There’s precedent in containerd using a manifest digest as an image ID … although that would not remove the need for tocIDInput being a factor in image deduplication.

Compare #508 / containers/skopeo#2750 : either image IDs will be digests, and then we need an ~explicit algorithm ID, or they will continue to use the ~existing format and then they must be ~random and not directly derived from sha256.


Above, m.ImageID typically (not always) just returns the config digest; that might not be using a strong enough algorithm. I suppose there’s an argument that a sha256-digested image is inherently sha256-collision prone, and that computing sha512-based digests does not remove the sha256 collisions but adds a risk of sha512 collisions on top. I don’t know what we want to promise here.

Contributor

Above, m.ImageID typically (not always) just returns the config digest;

Also, 2 different manifests with the same config digest, and layer digests using different algorithms?!

Member Author

If we are moving from sha256 (and presumably breaking users), that’s potentially a significant opportunity to entirely change the image ID design.

We need to ship something in v5.8 where (IIUC) we shouldn't break any existing sha256 workflows. Can we go with this for now, and then address this for podman v6?

Contributor

I agree the existing all-sha256 values need to stay unchanged (… in the situations where the values are predictable, i.e. = config digest; probably not in the TOC case here — that value has never been documented, and that code path is IIRC not reachable in the default configuration [with diffID checking] anyway).

What crypto experts want from us is not just “migrate to sha512”, but “give us crypto agility so that we don’t have to do a many-quarter project to move to another hash in the future”, and basing image IDs on .Encoded() for non-sha256 doesn’t achieve that.

```diff
 	return component
 }
-return digest.Canonical.FromString(parentID + "+" + component).Encoded()
+return supporteddigests.TmpDigestForNewObjects().FromString(parentID + "+" + component).Encoded()
```
Contributor

Similarly to image IDs, discarding the algorithm identifier would not be crypto-agile.

Layer IDs are ~easier in that they are not that prominent in the UI (although still visible), but, also, they are currently critical for layer sharing across images. Do we want to keep that design at all? Or should we store the “chain ID” values in a new field of a c/storage.Layer, and look up layers using that, maybe making layer IDs simply random 256-bit values?
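The random-ID-plus-stored-chain-ID alternative might look roughly like this sketch (all names hypothetical and not c/storage API; chain IDs use sha512 here purely for illustration):

```go
package main

import (
	"crypto/rand"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// layer IDs are random 256-bit values; layers are found via a separately
// stored chain ID rather than by deriving the ID from the chain itself.
type layer struct {
	ID      string // random, carries no digest semantics
	ChainID string // e.g. digest of parentChainID+"+"+diffID
}

type store struct {
	byChainID map[string]*layer
}

func chainID(parentChainID, component string) string {
	return fmt.Sprintf("sha512:%x", sha512.Sum512([]byte(parentChainID+"+"+component)))
}

func (s *store) createLayer(parentChainID, component string) *layer {
	id := make([]byte, 32)
	if _, err := rand.Read(id); err != nil {
		panic(err)
	}
	l := &layer{ID: hex.EncodeToString(id), ChainID: chainID(parentChainID, component)}
	s.byChainID[l.ChainID] = l
	return l
}

func (s *store) lookup(parentChainID, component string) (*layer, bool) {
	l, ok := s.byChainID[chainID(parentChainID, component)]
	return l, ok
}

func main() {
	s := &store{byChainID: map[string]*layer{}}
	l := s.createLayer("", "sha512:abc")
	found, ok := s.lookup("", "sha512:abc")
	fmt.Println(ok, found.ID == l.ID)
}
```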

```diff
 // the per-platform one, which is saved below.
 if !bytes.Equal(toplevelManifest, s.manifest) {
-	manifestDigest, err := manifest.Digest(toplevelManifest)
+	manifestDigest, err := manifest.DigestWithAlgorithm(toplevelManifest, supporteddigests.TmpDigestForNewObjects())
```
Contributor

These values end up in storage.Image.Digests and are used for lookups (see “Look for an image with the specified digest that has the same name,”); so, if the user pulls a repo@digest, we must allow lookup by exactly that digest.

I don’t immediately know what we do when pulling by tag. One option might be to pre-compute and allow lookup only by (probably, Canonical and) the preferred digest. Another might be to add support for lookups by an arbitrary digest algorithm to c/storage, where c/storage would actually digest all recorded manifests using the provided algorithm (and, I guess, cache the values once computed, although that might require RW locks where RO locks used to suffice).

```diff
 defer decompressed.Close()

-diffID := digest.Canonical.Digester()
+diffID := supporteddigests.TmpDigestForNewObjects().Digester()
```
Contributor

As a more generic version of this concern, note that all over the place layer.UncompressedDigest and layer.TOCDigest only contains one value per layer; and options.Cache.Cache.UncompressedDigest{,ForTOC} only records one value per blob.

I don’t know what we want to do there… at least for the BlobInfoCache, we might determine that a compressed-sha256=A matches uncompressed-sha512=B (if pulling an old image configured with …ForNewObjects=sha512). Do we want to use that knowledge? Maybe we do, for future pulls (a sha256 collision means a collision at the registry), or at least for pulls from the same registry, but not for future pushes (if we have a sha512 image, we don’t want to regress to sha256 strength on uploads)??
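One hypothetical shape for a multi-algorithm BlobInfoCache that distinguishes pull-safe from push-safe knowledge (all names and the trust flag are invented for illustration, not the existing BlobInfoCache API):

```go
package main

import "fmt"

// Instead of one uncompressed digest per compressed blob, record one per
// algorithm, plus whether the mapping is strong enough to rely on for pushes.
type uncompressedEntry struct {
	digest string // e.g. "sha512:..."
	// trustedForPush is false when the mapping was learned via a weaker
	// algorithm (e.g. a compressed-sha256 match), so pushes must not
	// regress to sha256-level collision resistance.
	trustedForPush bool
}

type blobInfoCache struct {
	// compressed digest -> algorithm name -> uncompressed digest entry
	uncompressed map[string]map[string]uncompressedEntry
}

func (c *blobInfoCache) record(compressed, algo string, e uncompressedEntry) {
	if c.uncompressed[compressed] == nil {
		c.uncompressed[compressed] = map[string]uncompressedEntry{}
	}
	c.uncompressed[compressed][algo] = e
}

func (c *blobInfoCache) uncompressedDigest(compressed, algo string, forPush bool) (string, bool) {
	e, ok := c.uncompressed[compressed][algo]
	if !ok || (forPush && !e.trustedForPush) {
		return "", false
	}
	return e.digest, true
}

func main() {
	c := &blobInfoCache{uncompressed: map[string]map[string]uncompressedEntry{}}
	// A sha256-identified compressed blob was observed to decompress to a sha512 value.
	c.record("sha256:aaaa", "sha512", uncompressedEntry{digest: "sha512:bbbb", trustedForPush: false})
	_, okPull := c.uncompressedDigest("sha256:aaaa", "sha512", false)
	_, okPush := c.uncompressedDigest("sha256:aaaa", "sha512", true)
	fmt.Println(okPull, okPush)
}
```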

```diff
 // PutManifest writes the manifest to the destination.
 func (s *storageImageDestination) PutManifest(ctx context.Context, manifestBlob []byte, instanceDigest *digest.Digest) error {
-	digest, err := manifest.Digest(manifestBlob)
+	digest, err := manifest.DigestWithAlgorithm(manifestBlob, supporteddigests.TmpDigestForNewObjects())
```
Contributor

See elsewhere about lookups by manifest digest.


… but, also, see what PutSignaturesWithFormat does with the value. I’m not sure that actually makes a difference, the reader seems to ignore the entry for the top-level manifest digest, but that needs tracking down.

@packit-as-a-service

Packit jobs failed. @containers/packit-build please check.

Allow the user to specify a non-Canonical digest algorithm via
`supporteddigests.TmpDigestForNewObjects()` instead of
hardcoded `digest.Canonical` references.

Without --digest or with --digest=sha256, behavior remains unchanged
(SHA256 is the default).

Tested with a scratch-built image.

Signed-off-by: Lokesh Mandvekar <lsm5@redhat.com>
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Dec 5, 2025
@lsm5
Member Author

lsm5 commented Dec 5, 2025

@mtrmac switched tocImageID and layerID to also show algorithm identifier.

RE: other comments, which of those would you consider hard blockers to even get this in as experimental?

@mtrmac
Contributor

mtrmac commented Dec 5, 2025

@mtrmac switched tocImageID and layerID to also show algorithm identifier.

I’m afraid I forgot to point at validateImageID; and look at the parsing logic in ParseStoreReference, the algorithm:hex values would not be usable as image IDs there. And stringid.ValidateID, currently used to interpret directory contents. So just starting to emit unexpected IDs is going to break.
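The breakage can be illustrated with a hex-only validator in the style of those checks (the regexp here is a stand-in for illustration, not the actual validateImageID or stringid.ValidateID implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// A hex-only image ID check: exactly 64 lowercase hex characters.
// An "algorithm:hex" value can never satisfy it because of the ':'.
var hexID = regexp.MustCompile(`^[a-f0-9]{64}$`)

func main() {
	fmt.Println(hexID.MatchString("0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef")) // true
	fmt.Println(hexID.MatchString("sha512:0123456789abcdef"))                                          // false
}
```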

RE: other comments, which of those would you consider hard blockers to even get this in as experimental?

I don’t know what we want to do but I think this does need a ~comprehensive analysis+plan; the c/storage data model, the UI-visible syntax, and the implementation need all to be updated consistently. I think starting down one implementation path without knowing whether the approach is viable overall… is work that should happen on an experimental branch, not something that should live in main already and might need to be expensively reverted if other work happens in the meantime.

  • A major design decision is whether to include algorithm names in IDs or whether to make IDs ~random and do lookups using some other mechanism. What do we want to do?
    • Deterministic IDs using the current design, adding an algorithm ID, seems easier to implement within c/storage, but we would end up with ~separate sha256/sha512 “universes” within c/storage. OTOH it would require fairly disruptive changes to UIs and syntax.
    • Random IDs doing lookups via separately-maintained metadata would be much less (but still somewhat) disruptive to UIs and syntax, but we would need to define and implement the new metadata; and I don’t know how equality of different digests works — can it ever happen that we pull “the same” layer via sha256, via sha512, as separate objects, and then we “discover” that the two digests denote exactly the same contents, and should be ~merged? Is “digest equivalence” even a transitive property?
  • (Although I guess the design should have ~the same shape for images and layers,) structurally the “layer ID / lookup / deduplication” and “image ID / lookup / deduplication” are mostly independent concerns, so maybe we can deal with designing one at a time.

@lsm5
Member Author

lsm5 commented Dec 5, 2025

is work that should happen on an experimental branch, not something that should live in main already

SGTM. Could you please create a branch that I could base this PR on?

When it comes to shipping in Fedora/RHEL, we could ship from these branches until things land in the official branches, assuming nothing breaks in the default workflows.

  * Deterministic IDs using the current design, adding an algorithm ID, seems easier to implement within c/storage, but we would end up with ~separate sha256/sha512 “universes” within c/storage. OTOH it would require fairly disruptive changes to UIs and syntax.

For now, I'm leaning towards this one: we continue as usual, avoiding breaking sha256 as much as possible, and accept whatever UI changes are needed for everything else. I think that would make it easier to ship in the podman v5.8 builds that we package in our distros. If users choose to use something other than sha256, they know they need to make many changes.

Do you have any particular preference?

@mtrmac
Contributor

mtrmac commented Dec 5, 2025

SGTM. Could you please create a branch that I could base this PR on?

When it comes to shipping in Fedora/RHEL, we could ship from these branches until things land in the official branches, assuming nothing breaks in the default workflows.

I don’t think we can ship from a branch like that, especially to RHEL — that would be a commitment to a c/storage on-disk format. This can only be a WIP / spike playground.


(I agree we must not change the image ID values for sha256 ; I don’t think that strictly implies we need the same design for non-sha256 digests — although the code might be easier if there is only one approach used.)

For image IDs, I’m guessing but I find the “random ID values using the existing format” approach attractive because we wouldn’t need to figure out new syntaxes (compare the ParseStoreReference issue and the other PR about filtering by ID prefix in c/common/libimage). [Users like https://github.com/containers/skopeo/pull/2751 are going to be broken either way.]

For layer IDs … I really haven’t yet investigated what implementing either of the two approaches would take.
