When pushing containers to the registry, we should push them first under a tag that is the full compose version (41.20241115.0) and then update the release tag (41) to match.
We probably need to figure out a tag naming scheme for testing builds as well.
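The proposed scheme could be sketched as a tiny helper that derives the tags to push from a compose version. This is only an illustration of the tagging idea above, not existing code; the function name and the assumption that the release number is everything before the first dot are mine.

```python
def tags_for_compose(compose_version: str) -> list[str]:
    """Return the tags to push for a compose, in push order.

    The full compose version (e.g. "41.20241115.0") is pushed first,
    then the release tag ("41") is updated to point at the same image.
    """
    release = compose_version.split(".", 1)[0]
    return [compose_version, release]
```

For example, `tags_for_compose("41.20241115.0")` yields `["41.20241115.0", "41"]`.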
What container image specifically are you talking about here? What tag did it get pushed under? The current logic for tag handling intentionally replicates how the shell scripts were doing it, if we want it to do something different that needs to be planned out.
So if we're talking about atomic desktops containers: yes, AFAICS, this would be new behavior. The old sync-ostree-base-containers.sh did not do this. If you look at https://quay.io/repository/fedora/fedora-silverblue?tab=tags, there are no such tags.
If we're going to do this we would also have to implement garbage collection, wouldn't we? Otherwise we'd rapidly wind up with hundreds of tags, which probably wouldn't make quay very happy.
What's the use-case? We could provide version data as annotations. Either org.opencontainers.image.version or, if we want to set that to the major Fedora release version, org.fedoraproject.image.compose-id or something. Of course, if the reason is you want to be able to check out older builds that won't help you much.
And yeah, if we do this we'll need to decide on a cleanup policy.
Yes, while this is driven by the need for the Atomic Desktops, this would also probably be good for all containers. See an example in https://quay.io/repository/fedora-ostree-desktops/silverblue?tab=tags (where I also include the git shorthash but we probably don't want that).
For garbage collection, the policy should be:
- Keep the last three months of Rawhide
- Keep the branched release
- Keep all images for the current stable and old stable releases
- Keep only the latest build for EOL releases, to let users update older systems
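The retention rules above could be encoded roughly as follows. This is a sketch under my own assumptions: the `release_state` values and the 90-day reading of "three months" are illustrative, not an existing API or a decided cutoff.

```python
from datetime import date, timedelta

def should_keep(release_state: str, build_date: date, is_latest: bool,
                today: date) -> bool:
    """Sketch of the proposed retention policy.

    release_state is one of "rawhide", "branched", "stable", "eol"
    (names are illustrative, not an existing API).
    """
    if release_state == "rawhide":
        # "last three months of Rawhide", read here as 90 days
        return today - build_date <= timedelta(days=90)
    if release_state in ("branched", "stable"):
        # branched and current/old stable: keep everything
        return True
    if release_state == "eol":
        # EOL: keep only the latest build so older systems can update
        return is_latest
    return False
```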
Note that due to how Bootable Container images are built, a new tag does not necessarily add a full image's worth of storage, as identical layers are shared between builds. This is why, for example, the Silverblue repo has a size of 110 GB (https://quay.io/organization/fedora-ostree-desktops) even though we have 110 tags in the repo and each image is about 2 GB (https://quay.io/repository/fedora-ostree-desktops/silverblue?tab=tags).
What's the use-case?
Classic ostree repos include the full log and history of all the builds. We want to preserve the same feature for the new format. It's massively useful for development and debugging (for bisecting issues) and it will enable us to implement proper update channels in the future. It's also how we want users to use those containers for derivation: pin to a given release, refresh the pinned release regularly via a CI job.
See:
- https://github.com/coreos/fedora-coreos-tracker/issues/1367
- https://quay.io/repository/fedora/fedora-coreos?tab=tags
Fedora CoreOS is also implementing garbage collection for those images so we could share there: https://github.com/coreos/fedora-coreos-tracker/issues/99
See: https://discussion.fedoraproject.org/t/we-need-to-come-up-with-a-consistent-approach-for-generating-and-publishing-containers-both-traditional-and-atomic-desktop-containers-both-stable-and-unstable-releases/109213/14
Cool, thanks for the details.
I think this would be nice, and can devote some time to the implementation, but I also think I shouldn't be the person to yay/nay this. I'm not, however, certain what the appropriate workflow should be. Maybe we make a releng ticket and get sign-off there?
Beyond just agreeing to a tagging and cleanup policy it would be good to ensure it's uniform and documented. I like the idea of sharing garbage collection with CoreOS. My preference would be for that to be in place before we start pushing tags since otherwise it's possible it won't get done until it's an emergency, but again that feels like something releng people should decide.
I was pointed to this thread and want to add another use case where preventing images from garbage collection would be great.
Let's consider the Dockerfile of a pet project (e.g., https://github.com/vrothberg/fedora-bootc-workstation/blob/main/Dockerfile#L1). I am referencing the fedora-bootc:41 image via a digest. Using the digest allows Renovate Bot to open a PR as soon as the image is updated on the registry, while also allowing reproducibility. I am in full control of what goes in, which is exactly what I want for this use case.
However, aggressive GC can easily break the Dockerfile build once the referenced digest no longer exists on the registry. Having a three-month grace period or keeping images around until they go EOL would both work for me. I am mostly looking for a policy we can document and work with.
I kinda feel like we could probably do the garbage collection inline - when pushing images from a new compose, also wipe the corresponding images that are more than X months old (or whatever the relevant heuristic is). Seems like it'd be neatest that way. I can try and find some time to draft an implementation of this, maybe.
We don't really keep around old content for any other deliverable, I'm not sure why we should do it for containers, especially when it means we could easily fill up various storage quotas doing it.
Atomic Desktop container images are not small, and Fedora isn't paying for space on the various registries. We have no budget for it and no means to support it. Not to mention it creates a support problem as we can get people using old content forever, which we do not support as a project.
[...] we can get people using old content forever [...]
Can you elaborate on that?
The GC policy should prevent images from lingering around forever, which should address the issue of users potentially using outdated content forever. The storage issues may turn into a problem but I would wait and see.
The -bootc images are published on Quay, where we can work with Red Hat in case storage usage gets too big. It is a strategic direction for Red Hat, so I am sure we won't get in trouble with Quay. I cannot speak for other registries, but I personally care most about the -bootc images at the moment.
I'm not against doing that (we do it for Azure images) and it does have the advantage of not splitting the management across projects. I'd hoped we could reduce the duplication between CoreOS and everything else, but unless @siosm weighs in and can help co-ordinate that then just doing it ourselves is okay.
I don't want us making decisions based on invisible internal decision-making from a provider, because that can change at a drop of a hat in a fairly painful way. Our images are supposed to be available in multiple provider locations (Fedora hosted, Quay hosted, Docker Hub hosted), so it matters how we use other people's resources.
We keep lots and lots of "old" (but not EOL) content on AWS and Azure so I'm not sure what the issue is with containers on Quay/Fedora hosted/Docker Hub.
I haven't seen anyone suggesting we keep content forever and there's plenty of middle ground between "today's image" and "every image ever produced". We can define whatever retention policy we think is a good fit for users and hosts and adjust as necessary when it isn't perfect.
That sounds good to me. This is essentially what the job Fedora CoreOS pipeline does if I'm not mistaken.
We don't really keep around old content for any other deliverable, I'm not sure why we should do it for containers, especially when it means we could easily fill up various storage quotas doing it. Atomic Desktop container images are not small, and Fedora isn't paying for space on the various registries. We have no budget for it and no means to support it. Not to mention it creates a support problem as we can get people using old content forever, which we do not support as a project.
See https://pagure.io/cloud-image-uploader/issue/37#comment-944447 where most of this is addressed and where I make an initial garbage policy suggestion. I'm OK with deleting more images but please make a clear policy suggestion.
I like the policy suggested in https://pagure.io/cloud-image-uploader/issue/37#comment-944447 but I'd add:
This would allow you to, say, test something in f37 or whatever, or work around some issue that showed up in updates using the latest update image for that release.
I don't think we need to keep all the EOL releases; if we really need something for some reason, we still have them in koji.
Also, I made a https://quay.io/organization/fedora-testing org, which as time permits we should set up to handle testing/candidate/etc. things.
According to https://docs.projectquay.io/use_quay.html#setting-tag-expirations-v2-ui (section: Setting tag expirations by using the API), it is possible to set an expiration date via the API for images uploaded to Quay.io.
I'm planning to take a look at implementing this when I return from vacation in mid-April, if someone doesn't beat me to it.
it's in my backlog too, but didn't get to it yet :/
I started poking at this a little bit in https://pagure.io/cloud-image-uploader/c/dd10defbbe3f19bf262ecaeb8f815b8198f67103?branch=nightly-containers
I'm not sure if it's easier to use the annotation method to apply the expiration, or the REST API (I think we'd need new credentials with that approach). I also haven't really decided how to handle registry.fedoraproject.org since it doesn't have the fancy expiring-tag feature. Do we just not push the nightly tags there?
So coming back to this and thinking about it...one issue is that the strategy gets pretty complex. We have various complicating factors. We upload images of different types from all these different composes...
So we kinda have two dimensions possibly affecting the strategy: what type of image is this, and what type of compose is it from? Container/toolbox images don't necessarily want the same strategy as atomic desktop images (but maybe they do?). Images from a pre-release nightly compose don't necessarily want the same strategy as images from a pre-release candidate compose, which don't necessarily want the same strategy as images from a post-release nightly compose. And IoT is just weird (all IoT composes are run nightly but have similar metadata to 'candidate' composes, and pre-release/post-release composes are only distinguishable from each other by release number...)
I can kinda see two choices: we try to come up with a 'standard' strategy, massaging the differences between compose types and image types, and apply it all in code. Or we implement various strategies and allow some kinda config format for specifying which strategy to use per image and possibly per phase?
There's also a wrinkle with using the quay.io expiring tags feature like this, besides 'registry.fp.o doesn't have them': we can't easily do any of the policies proposed so far that way, because they all rely on knowing stuff we do not know at publish time. At the time we publish the images that will be the GA images for a given release, we do not know that yet. All we know is that this is a candidate compose - the decision to make it the GA compose is made later.
So we can't do stuff like "keep the GA compose forever" with this "apply an annotation at upload time" approach. We'd either have to have a way to add or modify annotations later (is this even possible?), or do the cleanup ourselves as initially suggested.
We could do something much simpler, like "keep candidate compose images till release EOL (or forever), keep all other images for X days". That would be easiest, I think. In theory we could do that and then refine it later; in practice I suspect if we do that it will be one of those "it's good enough so we just stick with it forever" things.
Oh hey, I missed ELN. Sigh.
Having thought that through, @jcline 's approach actually looks like it'd mostly work pretty well in practice. The only note I can see is that it would retain even pre-release images for branched until release date.
I'm not sure of the reason for excepting ELN, though?
What my thinking was a month ago is nearly a mystery to me as well, but I guess it was because they had initially asked for just a latest tag; there's no other reason.
Opened up https://pagure.io/cloud-image-uploader/pull-request/57 - the tests still need rewriting because they assume the manifest is the same regardless of the registry being pushed to, but I wanted at least @adamwill to okay the refactor work before I proceed further.
This is deployed so either things will fall over shortly, or you'll start seeing versioned tags on quay.io. Let me know if you spot anything unexpected.
Metadata Update from @jcline: - Issue status updated to: Closed (was: Open)
The way I'd think of this is that we should actually default to expiring all (container) images published after e.g. 1 month. GA images just have their expiry increased.
OK, so then we have to add some kind of mechanism to increase the expiry on the GA images once we decide they're the GA images, or remember to do it manually, which inevitably won't always happen.
This doesn't appear to be working, at least going by the Quay UI.
It seems that both the image index and image manifest objects can have annotations, so possibly putting them in the index alone is wrong. The Quay docs don't make it obvious one way or the other, or I am missing something.
Metadata Update from @jcline: - Issue status updated to: Open (was: Closed)
This doesn't appear to be working, at least going by the Quay UI. It seems that both the image index and image manifest objects can have annotations, so possibly putting them in the index alone is wrong. The Quay docs don't make it obvious one way or the other, or I am missing something.
Putting them in both doesn't work either. If annotations in the index don't work, another option is to do it via the API if the docs are to be believed, which means getting permissions/credentials set up.
Unfortunately, I've got a lot of other things on my plate so I'm inclined to revert this and return it to my backlog (pull requests welcome, of course).