I’ve been on a very technical deep-dive recently.
One of my actions from a GitHub outage we had was to back up our GitHub containers from GHCR to our own container registry.
The previous action was to back up GitHub repositories, so I started adding another command to mirror containers from ghcr.io.
Backing up a container registry when you don’t have access to the underlying data files is kind of difficult. For extra difficulty, I also wanted to run that task within a container that can be scheduled on demand.
Most tools for managing images are designed to run on a host and talk to the container runtime. It’s difficult to get those tools working inside a container without running the container as root, which sort of defeats the point of running it in a container in the first place.
What happened next was diving into how the HTTP API for container registries works.
A container image is a series of layers produced during the build step and a manifest that describes the image and the order of the layers.
Altogether, those layers form a full OS ready to run your app, with all the binaries, dependencies & source code in the correct places. When you run a container, the runtime overlays them into a single coherent filesystem.
Each layer is a compressed tarball of the filesystem changes produced by a step of the build from a Dockerfile. So if a step installs Node.js, that tarball has the node binary somewhere like /usr/bin/node. Layers are referenced by the hash of their contents, so if you had two Node.js images they could share the same layer.
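You can see content addressing in miniature with a few lines of Node.js. This is just a sketch with a toy buffer standing in for a real compressed layer; the helper name is mine, not from the app:

```typescript
import { createHash } from "node:crypto";

// Registries address every layer and manifest as "sha256:<hex>",
// the hash of the blob's bytes.
function digestOf(blob: Buffer): string {
  return "sha256:" + createHash("sha256").update(blob).digest("hex");
}

// Two identical blobs always produce the same digest, which is why
// two different images can share a layer without storing it twice.
const layerA = Buffer.from("pretend this is a compressed layer tarball");
const layerB = Buffer.from("pretend this is a compressed layer tarball");
console.log(digestOf(layerA) === digestOf(layerB)); // true
```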
That reuse is really useful: if you’re running versioned containers and only the source code or a configuration has changed, there is only one minimal layer of difference when you deploy the new version. So only the new layer needs to be downloaded, which could be just kilobytes!
An extra complication is multi-arch images. These are essentially an “index” manifest which points to other architecture-specific manifests that then have their own layers with platform-specific binaries and files.
There is also a “config” object referenced by the manifest that stores metadata like the build history and the container’s runtime settings.
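For a feel of what a manifest actually looks like, here’s a trimmed-down example in the OCI image manifest format (the digests are made up and truncated for illustration):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:aaaa…",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:bbbb…",
      "size": 32654
    }
  ]
}
```

A multi-arch “index” manifest looks similar, except instead of `layers` it has a `manifests` array pointing at per-architecture manifests like this one.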
Uploading and managing layers and manifests are purely HTTP operations and shouldn’t need a container runtime at all!
— naive me about a week ago
Fast forward a week, there is a new “registry” command for the github-org-backup app. It:
- Uses the GitHub API to list all container packages
- Processes that into an array of container image + tag combinations
- It filters out redundant semantic versions for each repository, i.e. it only keeps the latest patch version of each major.minor combo
- It downloads the manifests for those containers and the architecture-specific manifests too
- Then it enters the big loop over each layer of each architecture of each container
- If that layer doesn’t exist in the private registry, it starts a fetch() to stream the layer blob from ghcr.io and pipes it into another fetch() that uploads the blob to the private registry
- Once all the layers and architecture-specific manifests are uploaded, it uploads the top-level manifest to complete that upload
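The semver filtering step is the only purely self-contained bit of logic in that list, so here’s a rough sketch of the idea. The function name is mine, not the actual app’s, and it deliberately ignores non-semver tags:

```typescript
// Keep only the highest patch release for each major.minor pair,
// e.g. ["1.2.0", "1.2.3", "1.3.1", "2.0.0"] -> ["1.2.3", "1.3.1", "2.0.0"].
function keepLatestPatches(tags: string[]): string[] {
  const best = new Map<string, [number, string]>();
  for (const tag of tags) {
    const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
    if (!match) continue; // skip non-semver tags like "latest"
    const key = `${match[1]}.${match[2]}`; // the major.minor combo
    const patch = Number(match[3]);
    const current = best.get(key);
    if (!current || patch > current[0]) best.set(key, [patch, tag]);
  }
  return [...best.values()].map(([, tag]) => tag);
}
```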
After the initial run, it only has to upload any missing blobs, i.e. whatever is new since the last backup.
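That “only upload what’s missing” check maps directly onto endpoints from the OCI distribution spec: a HEAD request to test for the blob, then a two-step POST/PUT upload if it’s absent. Here’s a hedged sketch of the flow, not the app’s actual code; the registry URL, repo name, and function names are all placeholders:

```typescript
// Check-then-upload for a single blob, per the OCI distribution spec:
//   HEAD /v2/<name>/blobs/<digest>   -> 200 if the blob already exists
//   POST /v2/<name>/blobs/uploads/   -> 202 + Location header
//   PUT  <location>?digest=<digest>  -> 201 when the upload completes
async function uploadBlobIfMissing(
  registry: string, // e.g. "https://registry.example.com"
  name: string,     // e.g. "my-org/my-app"
  digest: string,   // e.g. "sha256:…"
  blob: () => Promise<BodyInit>, // lazily streamed from ghcr.io
): Promise<void> {
  const head = await fetch(`${registry}/v2/${name}/blobs/${digest}`, { method: "HEAD" });
  if (head.ok) return; // already backed up, nothing to do

  const start = await fetch(`${registry}/v2/${name}/blobs/uploads/`, { method: "POST" });
  const url = new URL(start.headers.get("location")!, registry);
  url.searchParams.set("digest", digest);
  await fetch(url, {
    method: "PUT",
    headers: { "content-type": "application/octet-stream" },
    body: await blob(), // stream straight through, no temp files
    duplex: "half",     // Node's fetch requires this for streamed bodies
  } as RequestInit);
}
```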
There was another optimisation: it keeps track of blobs it has uploaded and queries for previous uploads, so if it sees the same blob again (looked up by its sha256 digest) it can perform a “mount” operation instead, bypassing the re-upload entirely!
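The mount itself is just one request in the OCI distribution spec: a POST with two query parameters naming the blob and the repository it already lives in. A minimal sketch of building that request (function name is mine):

```typescript
// Cross-repo blob mount, per the OCI distribution spec:
//   POST /v2/<name>/blobs/uploads/?mount=<digest>&from=<source-repo>
// A 201 response means the registry linked the existing blob and
// zero bytes were transferred; a 202 means it fell back to a
// regular upload session.
function mountRequestUrl(registry: string, name: string, digest: string, from: string): URL {
  const url = new URL(`/v2/${name}/blobs/uploads/`, registry);
  url.searchParams.set("mount", digest);
  url.searchParams.set("from", from);
  return url;
}
```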
You can see this descent into madness in code form here:
Thank you for coming to my TED talk