What happens when the source code you use disappears? It happens more often than you might think. Examples: Google Code shut down in early 2016; Alioth (from Debian) shut down in 2018 and was replaced by a GitLab instance named Salsa; Gna! shut down in 2017 after 13 years of activity; Gitorious (the second most popular hosting service for Git in 2011) shut down in 2015; etc.
This is one of the initial motivations behind Software Heritage, which collects, preserves and shares software in source code form. In September 2016, for instance, they announced the long-term preservation of Google Code.
That's fine, though concretely, what do I do with my own stack? That's the aim of this post: to work through the details using Guix.
Alice is developing a tool for her research. This tool is hosted somewhere on a public forge, for instance GitHub, using Alice’s personal account. Nothing fancy, just the usual practice in many labs. Because Alice knows that Guix is awesome for reproducible computations, she packages the tool for Guix. She puts this content in a file:
(define-module (hello)
  #:use-module (guix packages)
  #:use-module (guix build-system gnu)
  #:use-module (guix git-download)
  #:use-module (guix licenses))

(define-public hi
  (package
    (name "hi")
    (version "2.10")
    (source (origin
              (method git-fetch)
              (uri (git-reference
                    (url "https://github.com/Alice/hello-example.git")
                    (commit "e1eefd033b8a2c4c81babc6fde08ebb116c6abb8")))
              (sha256
               (base32
                "1im1gglfm4k10bh4mdaqzmx3lm3kivnsmxrvl6vyvmfqqzljq75l"))))
    (build-system gnu-build-system)
    (synopsis "Hello, GNU world: An example GNU package")
    (description
     "GNU Hello prints the message \"Hello, world!\" and then exits. It serves as an example of standard GNU coding practices. As such, it supports command-line arguments, multiple languages, and so on.")
    (home-page "https://www.gnu.org/software/hello/")
    (license gpl3+)))
Then guix build -L path/to/directory hi, where the directory contains that file saved as hello.scm to match the module name (hello), just works. So far, so good!
On the one hand, Alice could have many personal packages or variants in her
software stack; on the other hand, she wants an easy means to exchange such
definitions with collaborators. Alice also knows that good practice implies
versioning this file. By doing so, she is creating a channel. It is just
another Git repository. Finally, this channel is hosted somewhere on a
public forge, for instance GitHub, again using Alice’s personal account.
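To make the exchange concrete, here is a minimal sketch of what a collaborator could put in ~/.config/guix/channels.scm to get Alice's definitions alongside the default Guix channel. The channel name example is the one that appears in the channels file later in this post; the URL is a placeholder assumption, since the post does not give the address of Alice's channel.

```scheme
;; ~/.config/guix/channels.scm (sketch)
;; The URL below is hypothetical: replace it with the real address
;; of the Git repository holding Alice's channel.
(cons (channel
        (name 'example)
        (url "https://github.com/Alice/channel-example.git"))
      %default-channels)
```

After the next guix pull, the packages defined in the channel, such as hi, can be used like any other Guix package.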
Alice publishes a paper. The paper relies on the software stack from Alice’s channel and mentions two key points:
the manifest.scm file, listing all the numerical tools required to complete the paper;
the channels.scm file, produced by guix describe -f channels, listing all the channels used and their revision (commit).
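As an illustration, a manifest for a paper relying only on Alice's tool could be as small as the sketch below. That the manifest contains only hi is an assumption made for the example; a real paper would list every tool it uses.

```scheme
;; manifest.scm (sketch)
;; Lists the packages the paper relies on; here, only Alice's tool.
;; The name "hi" is resolved against the channels in use, so Alice's
;; channel must be among them.
(specifications->manifest
 (list "hi"))
```

Keeping the manifest down to package names works because the channel revisions pin everything else, including the exact version and source of each package.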
Then, anyone reading the paper is able to redeploy the software stack by simply running,
$ guix time-machine -C channels.scm -- package -m manifest.scm
Months or years later, Bob tries to redeploy the software stack used by
the paper. Usually, a paper provides only a few URLs referring to the
source code of some tools. For instance, this analysis uses
These URLs are now down, as Google Code is, for example. No worries:
Software Heritage has backed them up. However, Bob hits a wall:
R@4.1.1 depends on so much other software that it
is impossible to resolve everything by hand.
In short, some URLs mentioned in the paper are now down: the source code of Alice’s tool and Alice’s channel – because Alice closed her account at the end of her postdoc or because the hosting service is down.
That’s where content-addressability matters! If the paper provides channel commits – their revision – and the names of packages – used by the manifest file – then Bob is able to redeploy.
Knowing the commit of each channel, for instance Guix’s revision and
Alice’s channel revision, Bob writes this
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit "cdea76a2fdaf7705583a02081a6468d436b8df05"))
      (channel
        (name 'example)
        (url "https://whatever-here.org/does-not-matter.git")
        (commit "67c9f2143aa6f545419ae913b4ae02af4cd3effc")))
and, since the paper mentions the tool hi, Bob now runs,
$ guix time-machine -C channels.scm -- build hi
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
guix time-machine: warning: channel authentication disabled
Updating channel 'example' from Git repository at 'https://whatever-here.org/does-not-matter.git'...
SWH: found revision 67c9f2143aa6f545419ae913b4ae02af4cd3effc with directory at 'https://archive.softwareheritage.org/api/1/directory/fe423e88ce277d3fc230c88d408e42b14a3a458c/'
SWH vault: requested bundle cooking, waiting for completion...
swh:1:rev:67c9f2143aa6f545419ae913b4ae02af4cd3effc.git/
swh:1:rev:67c9f2143aa6f545419ae913b4ae02af4cd3effc.git/HEAD
[...]
construction de /gnu/store/6g9qlysbbk7p4609xrv82j0wzbib1y4r-git-checkout.drv...
guile: warning: failed to install locale
environment variable `PATH' set to `/gnu/store/378zjf2kgajcfd7mfr98jn5xyc5wa3qv-gzip-1.10/bin:/gnu/store/sf3rbvb6iqcphgm1afbplcs72hsywg25-tar-1.32/bin'
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint:   git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint:   git branch -m <name>
Initialized empty Git repository in /gnu/store/884nsva9r8wkp40kbqyvpj1ad57jc5dd-git-checkout/.git/
fatal: could not read Username for 'https://github.com': No such device or address
Failed to do a shallow fetch; retrying a full fetch...
fatal: could not read Username for 'https://github.com': No such device or address
git-fetch: '/gnu/store/5vai7bfrfkzv22dx13bxpszjrqyi78x6-git-minimal-2.33.0/bin/git fetch origin' failed with exit code 128
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying to download from Software Heritage...
SWH: found revision e1eefd033b8a2c4c81babc6fde08ebb116c6abb8 with directory at 'https://archive.softwareheritage.org/api/1/directory/c3e538ed2de412d54c567ed7c8cfc46cbbc35d07/'
swh:1:dir:c3e538ed2de412d54c567ed7c8cfc46cbbc35d07/
[...]
construction de /gnu/store/jx1r7w8xaw768176pjl0j0q1l1529w75-hi-2.10.drv réussie
/gnu/store/jn8d031zx4znxy7s5zhj4dbr6xjsfq9v-hi-2.10
Great! But what does it mean? It means that a) Guix fetches the content of
Alice’s channel from Software Heritage, then b) Guix fetches the content of
hi, again from Software Heritage.
More awesomeness… wait for it… it works for any Guix command! In other words, reproducing Docker images at any time is no longer an issue!
On the Guix side, this content-addressed fallback is only implemented for Git
repositories. For instance, Alice sends Software Heritage a request to save
her tool versioned with Git and packaged for Guix by invoking
Then, Alice manually requests the archival of her channel by submitting it
through the Software Heritage web interface.
Currently, Guix does not support this fallback for other version control systems (VCS). Help is welcome to add Subversion support. Still, the current Guix implementation should be enough for most scientific practitioners' use cases.
However, a lot of tools in the complete Guix toolchain depend on tarballs, and for them the Software Heritage fallback mechanism is still not enough.
Disarchive and its database are promising for better tarball coverage. The main issue is the normalization of the archive. In short, Software Heritage strips some metadata and stores only the content, similarly to nars. Therefore, the tarball that Software Heritage returns does not necessarily match the checksum known at packaging time, because of the missing metadata. Disarchive builds a database containing this metadata, and thus a map from this checksum to the content stored in Software Heritage. It is improving daily.
Join the fun, join Guix!