Guix and long-term archiving in action

What happens when the source code you use disappears? It happens more often than you might think. Examples: Google Code shut down in early 2016; Alioth (from Debian) shut down in 2018 and was replaced by a GitLab instance named Salsa; Gna! shut down in 2017 after 13 years of activity; Gitorious (the second most popular hosting service for Git in 2011) shut down in 2015; and so on.

This is one of the initial motivations behind Software Heritage, which collects, preserves and shares software in source code form. In September 2016, for instance, they announced the long-term preservation of Google Code.

That's fine, though concretely, how do I deal with my own stack? That's the aim of this post: working through the details using Guix.

Alice publishes

Alice is developing a tool for her research. This tool is hosted somewhere on a public forge, for instance GitHub, using Alice’s personal account. Nothing fancy, just the regular practice in many labs. Because Alice knows that Guix is awesome for reproducible computations, she packages the tool for Guix. She puts the following content in a file,

(define-module (hello)
  #:use-module (guix packages)
  #:use-module (guix build-system gnu)
  #:use-module (guix git-download)
  #:use-module (guix licenses))

(define-public hi
  (package
    (name "hi")
    (version "2.10")
    (source (origin
              (method git-fetch)
              (uri (git-reference
                    (url "https://github.com/Alice/hello-example.git")
                    (commit "e1eefd033b8a2c4c81babc6fde08ebb116c6abb8")))
              (sha256
               (base32
                "1im1gglfm4k10bh4mdaqzmx3lm3kivnsmxrvl6vyvmfqqzljq75l"))))
    (build-system gnu-build-system)
    (synopsis "Hello, GNU world: An example GNU package")
    (description
     "GNU Hello prints the message \"Hello, world!\" and then exits.  It
serves as an example of standard GNU coding practices.  As such, it supports
command-line arguments, multiple languages, and so on.")
    (home-page "https://www.gnu.org/software/hello/")
    (license gpl3+)))

then, provided the file is named hello.scm to match the module name (hello), guix build -L path/to/its/directory hi just works. So far, so good!

On the one hand, Alice may have many personal packages or variants in her software stack; on the other hand, she wants an easy means of sharing such definitions with collaborators. Alice also knows that good practice implies versioning this file. By doing so, she is creating a channel, which is just another Git repository. Finally, this channel is hosted somewhere on a public forge, for instance GitHub, again using Alice’s personal account, say https://github.com/Alice/channel-that-rocks.git.
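
As a minimal sketch, a collaborator could make Alice’s channel available by adding it to their own ~/.config/guix/channels.scm. The channel name example is an assumption chosen for illustration; the URL is the one given above.

```scheme
;; ~/.config/guix/channels.scm (sketch)
;; Adds Alice's channel on top of the default Guix channel.
;; The name 'example is a hypothetical label, not taken from Alice's setup.
(cons (channel
        (name 'example)
        (url "https://github.com/Alice/channel-that-rocks.git"))
      %default-channels)
```

After guix pull, the packages defined in the channel, such as hi, become visible to commands like guix build hi.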

Alice publishes a paper. The paper relies on the software stack from Alice’s channel and it provides two key files,

  1. one manifest.scm file listing all the numerical tools required to complete the paper;
  2. one channels.scm file, produced by guix describe -f channels, listing all the channels used and their revisions (commits).

Then, anyone reading the paper is able to redeploy the software stack by simply running,

$ guix time-machine -C channels.scm -- shell -m manifest.scm
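
For illustration, a manifest.scm for such a paper might look like the sketch below. The package list is an assumption: it names the tool hi defined above, plus r as a stand-in for whatever numerical tools the paper actually needs.

```scheme
;; manifest.scm (sketch): the packages the analysis needs.
;; The list itself is hypothetical; replace it with the paper's real tools.
(specifications->manifest
 (list "hi"    ; Alice's tool, provided by her channel
       "r"))   ; assumed stand-in for the numerical tools
```

The matching channels.scm is simply what guix describe -f channels prints, so both files can be kept alongside the sources of the paper.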

Bob redoes

Months or years later, Bob tries to redeploy the software stack used by the paper. Usually, a paper provides only a few URLs referring to the source code of some tools; for instance, it states that the analysis uses R@4.1.1. Suppose these URLs are now down, as Google Code is. No worries, Software Heritage has backed them up. However, Bob hits a combinatorial problem: R@4.1.1 depends on so many other pieces of software with dead URLs that it is impossible to resolve them all by hand.

In short, some URLs mentioned in the paper are now down: for instance, the source code of Alice’s tool and Alice’s channel – because, for example, Alice closed her account at the end of her postdoc or because the hosting service is down.

That’s where content-addressability matters! If the paper provides channel commits – their revision – and the names of packages – used by the manifest file – then Bob is able to redeploy.

Even if the file channels.scm is not provided as supplementary material alongside the paper, knowing only the commit of each channel, for instance Guix’s revision and Alice’s channel revision, is enough for Bob to rewrite this channels.scm file,

(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit
          "cdea76a2fdaf7705583a02081a6468d436b8df05"))
      (channel
        (name 'example)
        (url "https://whatever-here.org/does-not-matter.git")
        (commit
          "67c9f2143aa6f545419ae913b4ae02af4cd3effc")))

and since the paper mentions the tool hi, Bob now runs,

$ guix time-machine -C channels.scm -- build hi

which returns the output,

Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
guix time-machine: warning: channel authentication disabled
Updating channel 'example' from Git repository at 'https://whatever-here.org/does-not-matter.git'...
SWH: found revision 67c9f2143aa6f545419ae913b4ae02af4cd3effc with directory at 'https://archive.softwareheritage.org/api/1/directory/fe423e88ce277d3fc230c88d408e42b14a3a458c/'
SWH vault: requested bundle cooking, waiting for completion...
swh:1:rev:67c9f2143aa6f545419ae913b4ae02af4cd3effc.git/
swh:1:rev:67c9f2143aa6f545419ae913b4ae02af4cd3effc.git/HEAD
[...]
building /gnu/store/6g9qlysbbk7p4609xrv82j0wzbib1y4r-git-checkout.drv...
guile: warning: failed to install locale
environment variable `PATH' set to `/gnu/store/378zjf2kgajcfd7mfr98jn5xyc5wa3qv-gzip-1.10/bin:/gnu/store/sf3rbvb6iqcphgm1afbplcs72hsywg25-tar-1.32/bin'
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint:   git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint:   git branch -m <name>
Initialized empty Git repository in /gnu/store/884nsva9r8wkp40kbqyvpj1ad57jc5dd-git-checkout/.git/
fatal: could not read Username for 'https://github.com': No such device or address
Failed to do a shallow fetch; retrying a full fetch...
fatal: could not read Username for 'https://github.com': No such device or address
git-fetch: '/gnu/store/5vai7bfrfkzv22dx13bxpszjrqyi78x6-git-minimal-2.33.0/bin/git fetch origin' failed with exit code 128
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying to download from Software Heritage...
SWH: found revision e1eefd033b8a2c4c81babc6fde08ebb116c6abb8 with directory at 'https://archive.softwareheritage.org/api/1/directory/c3e538ed2de412d54c567ed7c8cfc46cbbc35d07/'
swh:1:dir:c3e538ed2de412d54c567ed7c8cfc46cbbc35d07/
[...]
successfully built /gnu/store/jx1r7w8xaw768176pjl0j0q1l1529w75-hi-2.10.drv
/gnu/store/jn8d031zx4znxy7s5zhj4dbr6xjsfq9v-hi-2.10

Great! But what does it mean? It means that,

  a) Guix fetches the content of Alice’s channel from Software Heritage; then,
  b) Guix fetches the source code of hi, again from Software Heritage.

Thanks to step a), Guix knows how to build the package named hi, and thanks to step b), Guix builds this package definition using the source code of hi. Alice and Bob will have the exact same binary for hi, provided the build is fully reproducible, which is another story.

More awesomeness… wait for it… it works for any Guix command! In other words, reproducing Docker images at any time is not an issue anymore!

Ongoing work

On the Guix side, this content-addressed fallback is only implemented for Git repositories. For instance, by invoking guix lint, Alice sends a request to Software Heritage to save her tool, which is versioned with Git and packaged for Guix. For her channel, Alice sends a request by hand, submitting it via the Software Heritage web interface.

Currently, Guix does not support this fallback for other version control systems (VCS). Help is welcome for adding Subversion support. Still, the current Guix implementation should be enough for most scientific practitioners’ use cases.

However, many tools used by the complete Guix toolchain depend on tarballs, and for them the Software Heritage fallback mechanism is still not enough.

Disarchive and its database are promising for better tarball coverage. The main issue is the normalization of the archive. In short, Software Heritage removes some metadata to store only the content, similarly to nars. Therefore, the tarball that Software Heritage returns does not necessarily match the checksum known at packaging time, because of the missing metadata. Disarchive builds a database containing this metadata, and thus a map from this checksum to the content stored in Software Heritage. It is improving daily.

Join the fun, join Guix!


© 2014-2024 Simon Tournier <simon (at) tournier.info >

(last update: 2024-06-05 Wed 09:47)