#20 containers: only publish manifest if all current arches present (#17)
Merged by jcline. Opened by adamwill.
adamwill/cloud-image-uploader container-check-current-arches  into  main

There's a problem with always updating the multi-arch container manifests, if the relevant image fails to build on some arches. Say in the last compose we built x86_64 and aarch64. The current manifest will have both arches. Now say in the current compose, aarch64 fails and we only get an x86_64 image. If we go ahead and generate a new manifest and publish it, the aarch64 image is essentially "lost". The one from the previous compose would still be available under the arch specific name, as "fedora:40-aarch64" or whatever, but most use cases don't use that name, they just request "fedora:40" and expect the manifest to point to an image. If someone did that on aarch64 in this scenario, it would fail.

On advice from @nalin in #17, this addresses that problem by refusing to publish an updated multi-arch manifest at all if any arch that is in the current manifest failed in the compose being processed. This runs the risk of our manifests going stale if we do not fix failing arches promptly, but nalin felt this is preferable to the alternatives:

  1. Just go ahead and "lose" failed arches
  2. Generate a "hybrid" manifest containing successful images from the new compose along with existing images from the current manifest for arches that failed

If this turns out to be a big problem, we ought to be able to tweak this code to alternative #2 quite easily.

This also includes an important fix as a precursor, which I worked out while working on it: we need to allow buildah rmi commands to fail, because they will if the image/manifest does not exist, but that's totally fine.

Note that what the current script for containers does is the "hybrid manifest" approach, but I think that's more or less by chance, I'm not sure it was designed that way. The current script finds the most recent successful Koji task for each type and arch of container and builds a manifest out of those. I think it's just assuming the most recent successful task will be the one from the most recent compose, but as it happens, if it isn't, it will naturally "fall back" to whatever the previous successful task of the same type for the same release was.

What's happening here isn't immediately obvious to me.

I initially thought it was to work around buildah not being able to list a remote manifest, but it seems happy to do that:

surface5 in ~
❯ buildah manifest inspect quay.io/fedora/fedora:40
{
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:99563158175e056de38d23bf074ea3e91861971c0a73e7664faae535af4ed9fa",
            "size": 504,
            "platform": {
                "architecture": "arm64",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:89b870b45e9e410e1a018ee777b50dc2cc2a2d8896414e34b5f1a51b4724fcbf",
            "size": 504,
            "platform": {
                "architecture": "ppc64le",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:70f7e514851a0836d94e9ef644d0abffa751f71e215520ca4fce32ca9b78e82d",
            "size": 504,
            "platform": {
                "architecture": "s390x",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:1415ea89e284dfd26bcc58c046819a46305215050a8bb48f8ea45ed50a55ffe6",
            "size": 504,
            "platform": {
                "architecture": "amd64",
                "os": "linux"
            }
        }
    ]
}

it was based on advice in https://pagure.io/cloud-image-uploader/issue/17#comment-913921 - I thought the implication of that comment was that doing so was the only way to get the contents of a remote manifest. if we can just buildah manifest inspect it directly, that is definitely much easier, I will test that a bit and switch to it if so.

well, I think we can do that, but this will need rebasing on https://pagure.io/cloud-image-uploader/pull-request/22 , so let's review/merge that first.

rebased onto 716e06190eb181f341a38482d74c142507f2f41b

OK, rebased on the other recent changes, and simplified how we inspect the current manifest.

When a new architecture gets added this will crash with a KeyError. An alternative could be to just default to darch, but there's no guarantee we'll line up with the manifest name for an architecture so maybe failing loudly is fine?

Yeah, on the whole, I think I'd rather fail. If we do the fallback-to-darch thing, then if a new arch gets added whose manifest name is different, we would then fail to publish every compose, logging an error, till someone noticed...I'm not sure that's better?

good catch, though!

Cool, sounds fine to me.

Pull-Request has been merged by jcline

awesome, thanks! so I guess you can rebase your messaging PR now?