Redoing our paper in Nature Scientific Data (Oct. 2022)
transparent, verifiable and long-term… one year later
A few days ago, I prepared a demo for my talk at the Pasteur Institute. Why not try out one example from our paper “Toward practical transparent verifiable and long-term reproducible research using Guix” (Nature Scientific Data, vol. 9, no. 597)? Well, I spent the morning just before the talk checking that example for the demo… and the reproduction of the computational environment failed. This post investigates the failure and draws counter-measures. The aim here is twofold:
- Expose the roadblocks to reproducing a computational environment; here, the reproduction of one computational environment three years distant.
- Show how the Guix project helps, from technical workarounds to work-in-progress features.
Our paper was published in October 2022, and for the demonstration we used a paper published in 2020. The Guix revision (channel file) we published in 2022 was therefore picked from an older Guix revision, mimicking the channel file that the 2020 paper would have published. To be precise, the Guix revision we published was 1971d11db9 (April 14th, 2020).
As explained in our paper, we chose this revision in order to get Bioconductor 3.10, as extrapolated from the published material.
In other words, the command line
guix time-machine -C channels.scm \
-- environment -C -L my-pkgs -m manifest.scm
builds the exact same computational environment as it was in April 2020. This command line worked out of the box in October 2022. Now, more than one year later, the very same command line fails.
Long story short: today, three years after publication, link rot defeats the ability to check this very published result. The failure happens very early in the reproduction attempt: at the step of fetching all the source code that composes the computational environment.
The devil is in the details. Let us review them!
Disclaimer. “Re-run and compare” would be another goal. I think it is a tangential corollary of the main objective: reproduce the exact same computational environment. If we run inside the exact same computational environment, then the experimental conditions of the computations are (almost) fixed, and “re-run and compare” merely repeats the same experiment, regenerating equivalent data. In addition, for many results, “re-run and compare” is not affordable: the cost (energy consumption, computing resource availability, CPU time, memory, etc.) is too high or not worth it. To me, the scientific method applied to the computational environment means its full source-to-binary transparency; in other words, the ability to deeply audit and verify everything, in the long term. That is the first main objective. Then, if it is completed and if required, we can apply variations to the computational environment in order to challenge the conclusion under study.
How to rebuild the past?
Before jumping into details, what do we mean by “computational environment” here? It is composed of the direct packages required by the analysis and also of all the indirect packages that are not explicitly listed. How many packages are we speaking about? More than 720! Dependencies of dependencies matter, perhaps.
Reproducing the exact same computational environment means composing it again from more than 720 pieces of source code, today in 2023 just as in 2020. This raises two important questions:
- How to identify a piece of source code?
- How to build the identical computational environment composed of more than 720 components?
My opinionated answer: a version label, as in “I used flowCore at version 1.52.1”, is not enough because,
- it identifies the source code poorly,
- it does not scale; more than 720 version labels would have to be provided.
Ready for the arcana of rebuilding the past? Let’s go!
First things first
The computational environment is described by the manifest file,
(specifications->manifest
 (list
  ;; Packages from the Guix collection
  "r"
  "r-rtsne"
  "r-pheatmap"
  "r-rcolorbrewer"
  "r-ncdfflow"
  "r-edger"
  "r-flowcore"
  "r-dplyr"
  "r-combinat"
  "r-rmarkdown"   ;render .Rmd files
  ;; Extend collection by defining in folder my-pkgs/my-pkgs.scm
  "r-cydar"))
and all the versions of all these packages are implicitly defined by the channel file, which in this case reads:
(list (channel
       (name 'guix)
       (url "https://git.savannah.gnu.org/git/guix.git")
       (commit "1971d11db9ed9683d5036cd4c62deb564842e1f6")))
Here, only the “Guix channel” is required. More channels could be added to extend the package collection. In our case, the package collection was locally extended with one package defined in the directory my-pkgs.
Complete transparency
The commitment that the Guix project tries to demonstrate is, from my point of view, unique and a real challenge: provide tools that produce the same computational environment at two distant points, in time and space. Concretely, running this command line
$ guix time-machine -C channels.scm \
-- environment -C -L my-pkgs -m manifest.scm
outputs this long list:
guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
substitute: updating substitutes from 'https://bordeaux.guix.gnu.org'... 100.0%
The following derivations will be built:
   /gnu/store/q0bb9n6blh4nbs6icmzfasymcwr5wkgd-r-3.6.3.drv
   /gnu/store/d52csrci11ngcxmzp04n9mdpjn207h7r-r-mass-7.3-51.5.drv
   /gnu/store/4xk7sl860dhf134in2gzfkbw6hc9z3an-MASS_7.3-51.5.tar.gz.drv
   /gnu/store/hl17dqn0wfc7wdf4c30daqvd5zgh5bky-r-codetools-0.2-16.drv
   /gnu/store/5257ay3c3qkiy51yv4frsvmjmh3hxk08-codetools_0.2-16.tar.gz.drv
   [...]
   /gnu/store/3sp4a864ax4cl8k7mpbmnxgbrvrcmvy8-gcc-7.4.0.drv
   /gnu/store/9b5swsrwd1z7lz6r9b1w3jdzyc75nvsx-ghc-8.0.2-src.tar.xz.drv
   /gnu/store/7k0qyy1s0clja7g1967ny8wsjlyy7izs-ghc-8.0.2.drv
   /gnu/store/03pbyq29ip4827h871y7p4bqsd8y0y1y-ghc-8.6.5.drv
   /gnu/store/31916d6jvgjwahvd28yipbpyrfrivmiq-ghc-pandoc-types-1.17.6.1.drv
   /gnu/store/2gwaaigspkzsa146ykfzdaingcx9kfjj-ghc-test-framework-hunit-0.3.0.2.drv
   /gnu/store/0h0c8pcbb34g8p6jxw09gw6km0frppd6-ghc-extensible-exceptions-0.1.1.4.drv
   /gnu/store/2srxxhp7dx4p286qk9rvrrywh4mkbgpy-ghc-pandoc-2.7.3.drv
   [...]
   /gnu/store/rd9yb2pci9xsr71dhf1k90gnmqhd513i-clisp-2.49-92.drv
   /gnu/store/ahi2sb681pz13a9sfv2hd8r77a5rb88v-clisp-2.49-92.tar.xz.drv
   /gnu/store/b03g73xpi16dyh762r2s39l2bvd40vif-sbcl-parse-js-0.0.0-1.fbadc6029.drv
   /gnu/store/fbz1kkihs51pgzmhfvqfw2xwawlycvcn-sbcl-iterate-1.5.drv
   /gnu/store/gmcvf18cpkdadz6h53l88nlnmh81b1r1-sbcl-rt-1990.12.19-1.a6a7503.drv
   [...]
building /gnu/store/y1bnskpk88qh1adw7hpvds125m35p8xp-r-minimal-3.6.3.drv...
- 'build' phase
What does it mean? It means that the computational environment is completely transparent and verifiable. Namely, these files ending with .drv describe how to build, or where to fetch, the source code. We have access to all the details needed to produce all the binary artifacts from source.
First, note that the Guix project provides pre-built substitutes, but these binary artifacts from 2020 are gone. In other words, we need to download the source code and build it locally. For instance, we see that we are building the package named r-minimal, and that we will download the source code of the R library MASS in order to build the corresponding Guix package r-mass. And so on.
Second, have you noticed these ghc items? They are Haskell libraries, required by Pandoc – a universal document converter. The Guix package ghc-pandoc is indirectly required by R libraries such as ncdfFlow. In addition, we can also see clisp and sbcl, which come from the Common Lisp ecosystem. Would you have guessed they were there beforehand? Not me! But I told you – more than 700 dependencies of dependencies.
Note. One might ask whether one of these dependencies of dependencies matters for the final result of the analysis, and it is a legitimate question. Without Guix, I would not even be able to start an answer – to confirm or contradict my intuition. In other words, the scientific method implies that we need to challenge the hypothesis – say, that it has no impact – by testing some variations. Therefore, we need two things: on the one hand, a reference point, and on the other hand, the capacity to generate fully controlled variations of that reference point. This is what experimenters do: they run experiments varying carefully selected parameters under controlled conditions, and check the outputs of these experiments against a control output.
The unexpected failure?
Ah, the previous command line quickly fails. The error is about the R library BiocNeighbors. It is important to notice that this R library is not explicitly required by the analysis but is an indirect dependency – required by the package r-cydar.
It fails because Guix tries to fetch the source code from the location encoded in the 2020 Guix package definition, and the content at that location is gone today in 2023. Yet another observation of the well-known link-rot phenomenon.
building /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv...
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.
building /gnu/store/smky0svf3gw5dz8ajj0im7kr7mzv12lr-BiocParallel_1.20.1.tar.gz.drv...
cannot build derivation `/gnu/store/xnmx8c5jgksv56g4qhsr17fsm62qclni-r-biocneighbors-1.4.2.drv': 1 dependencies couldn't be built
guix environment: error: build of `/gnu/store/xnmx8c5jgksv56g4qhsr17fsm62qclni-r-biocneighbors-1.4.2.drv' failed
What can be done? Let us first ask: what has already been tried? What are the attempts behind this error message? Opening the build log file as reported, it reads:
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz'...
download failed "https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/...
download failed "https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "NOT FOUND"
Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification
failed to download "/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz" from ("https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" "https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz")
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.
Before erroring out, Guix tries several locations: two official Bioconductor locations, one maintained by the Guix project, another maintained by the NixOS project, one from Software Heritage – the universal software archive – and finally the Internet Archive’s Wayback Machine.
Sadly, if a single item is missing, everything falls apart. For this specific case involving Bioconductor, the Guix project considers this a bug and tracks it as bug #39885.
Rewriting past origin fields
Part of Guix bug #39885 was fixed with b032d14ebd. Sadly, this fix is from late June 2020, hence not available in the Guix revision (April 2020) we are considering.
What can be done? We provided a channel file that totally freezes one specific state. All the packages described by that state are immutable. We are doomed, aren’t we?
Maybe not yet… Guix is flexible enough to allow rewriting the complete graph of dependencies. One could object that the result would not be the exact same computational environment. But why would it not be? To answer, we need to speak about the identification of source code.
How does Guix fetch the source code? The answer is the fixed-output derivation. A package such as r-biocneighbors defines, among many other fields, the source field, which describes how to download (url-fetch) and from where (uri).
(define-public r-biocneighbors
  (package
    (name "r-biocneighbors")
    (version "1.4.2")
    (source (origin
              (method url-fetch)
              (uri (bioconductor-uri "BiocNeighbors" version))
              (sha256
               (base32
                "1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6"))))
    (properties `((upstream-name . "BiocNeighbors")))
    (build-system r-build-system)
    (propagated-inputs
     `(("r-biocparallel" ,r-biocparallel)
       [...]
In addition, we see a checksum (sha256). This field allows Guix to verify that it downloads the expected content – the one that was packaged. We could provide any other URL location (uri) as long as the content checksum matches.
Note. To be radical about it: the version field is a label, but it does not – at all – describe the source code version. Only an identifier derived from the content itself allows us to know exactly which source code version we are really using.
What does revision b032d14ebd from Guix bug #39885 fix? One of the Bioconductor URL locations. Look, the content is still available at another location:
$ guix time-machine -C channels.scm \
   -- download https://bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz
guile: warning: failed to install locale
Starting download of /tmp/guix-file.kQt81F
From https://bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz'...
 …_1.4.2.tar.gz 882KiB 348KiB/s 00:03 [##################] 100.0%
/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6
The checksum matches. Now we have the source code in our local store. Problem solved for the missing source code of BiocNeighbors. Then, let us re-run the command line above to create the computational environment. Ah, it fails again because the source code of DelayedArray is missing. Then again for another one (GenomeInfoDb), and again.
Could we programmatically extend the locations from which Guix downloads the source code of Bioconductor packages? Let us rewrite the origin field of the packages coming from the Bioconductor project. First, we identify these packages by the fact that the URL of their source code contains the string "bioconductor.org". Let us define a predicate procedure that returns #true or #false depending on whether the uri field matches this string. For details about pattern matching (match), have a look at this post or this documentation.
(define (bioconductor? p)
  (match (package-source p)
    ((? origin? o)
     (match (origin-uri o)
       ((url rest ...)
        (string-contains url "bioconductor.org"))
       (_ #false)))
    (_ #false)))
Second, Guix provides the package-mapping procedure which, given a package, applies a user-defined procedure to all the packages it depends on and returns the resulting package. In other words, package-mapping allows us to customize the whole graph of dependencies of dependencies. Therefore, we need to define a procedure that takes a package and, if it is a package from Bioconductor, creates a new package with an extended list of URLs; otherwise it does nothing.
Wait: for instance, the packages r-flowcore and r-ncdfflow both depend on r-bh, and r-ncdfflow even depends on r-flowcore. Therefore, we need to rewrite r-flowcore or r-bh only once, and not traverse all their dependencies again. We need a predicate procedure package-seen? that returns true when the origin field of a package coming from Bioconductor has already been extended. A Guix package may have a properties field that can store arbitrary user-defined key/value pairs, as an association list. Each time we extend the list of URLs, we also add an element to the properties field. We then use it to detect whether the package has already been seen.
All in all, it reads,
(define (package-seen? pkg)
  (assq-ref (package-properties pkg) 'bioconductor))

(define (extend-url pkg)
  (cond
   ((package-seen? pkg) pkg)            ; Already processed
   ((bioconductor? pkg)                 ; Process if it comes from Bioconductor
    (let ((src (package-source pkg)))
      (package
        (inherit pkg)
        (source (origin
                  (inherit src)
                  (uri (append (origin-uri src)
                               (list (string-append
                                      "https://" bioconductor-url
                                      "/packages/" "/3.10/bioc/src/contrib/"
                                      (or (assq-ref (package-properties pkg)
                                                    'upstream-name)
                                          (package-name pkg))
                                      "_" (package-version pkg) ".tar.gz"))))))
        (properties `((bioconductor . #true)
                      ,@(package-properties pkg))))))
   (else pkg)))                          ; Do nothing for all the other packages
When pkg comes from the "bioconductor.org" servers, we create a new package (package) where all the fields are copied (inherit) except the source and properties fields. Since it is a Bioconductor package, it is marked as such using the properties field. All the elements of the origin field are also copied from the original package, except the uri field, to which the new URL is appended.
Here, the checksum field is not modified at all; therefore we are building the exact same computational environment. We are only implementing an ad hoc workaround to fetch from more locations.
Last, in order to avoid extra work when traversing the deep dependencies of dependencies, we assume that if a dependency of a package is not from Bioconductor, or has already been seen, it is not worth processing the dependencies of that dependency. This strategy is what the procedure cut? implements; see here for the details.
Launching the command line above to create the exact same computational environment, we now get:
building /gnu/store/0cs04zymfxpwh49z5da2ps2d4vinakhi-GenomeInfoDb_1.22.1.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/GenomeInfoDb_1.22.1.tar.gz...
building /gnu/store/0k520rcg3qa4bamkgrn1x8nd1nvxbbs2-Diff-0.3.4.tar.xz.drv...
building /gnu/store/8diprmghchxf62svbapmjd2nq4g3yhhn-GenomicRanges_1.38.0.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/GenomicRanges_1.38.0.tar.gz...
building /gnu/store/9i3c5vv9lmlkd0dr91252fmhzv7sa5vm-IRanges_2.20.2.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/IRanges_2.20.2.tar.gz...
[...]
Awesome! Guix rocks…
Another failure
…Guix rocks but Guix cannot fix all the issues of the world. The failure now reads,
building /gnu/store/5257ay3c3qkiy51yv4frsvmjmh3hxk08-codetools_0.2-16.tar.gz.drv...
downloading from http://cran.r-project.org/src/contrib/Archive/codetools/codetools_0.2-16.tar.gz...
sha256 hash mismatch for /gnu/store/9kd2dj46zy0m8ciz2m57f0rij9m3lj5c-codetools_0.2-16.tar.gz:
  expected hash: 00bmhzqprqfn3w6ghx7sakai6s7il8gbksfiawj8in5mbhbncypn
  actual hash:   1dklibnp747a0p41ggcf8fyw36xhj9c869gay80ggfns79y7axn2
hash mismatch for store item '/gnu/store/9kd2dj46zy0m8ciz2m57f0rij9m3lj5c-codetools_0.2-16.tar.gz'
We are doomed! Game over for reproducing the exact same computational environment as the one from our paper. The CRAN project did an in-place replacement. Again, a version label is not enough when we speak about the identification of source code. At packaging time back in 2020, codetools labelled 0.2-16 had one checksum, and now the very same label 0.2-16 has another checksum. Sadly, because we no longer have access to the past source code, we have no means to know the difference, or whether the difference matters. We have lost the ability to verify and audit how the computations were done.
Assume that I have just discovered the result – say one, two or three years after the publication. How can I trust the result if I cannot audit it? To stretch the point: what are the guarantees for trusting the result when the principles of the collective scientific method have not had the time to be applied? One, two or three years is not enough time to challenge a result, in my humble opinion.
That said, what are the options at hand? First, we need to create the best approximation of the computational environment, for example by using this other version labelled 0.2-16. Second, anyone skeptical of the result should use this approximate computational environment to re-run and compare. Let us focus on the first, since I am not competent at all for the second.
Again, we rely on package transformations, here package-input-rewriting, which allows us to replace dependencies. First, we need to locally define a new package, named r-codetools-bis. Second, we need to rewrite the dependencies of dependencies to replace the old r-codetools with the new r-codetools-bis. The definition of r-codetools-bis is straightforward:
(define-module (my-pkgs-fix)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module ((gnu packages statistics)
                #:select (r-codetools)))

(define-public r-codetools-bis
  (package
    (inherit r-codetools)
    (name "r-codetools-bis")
    (source (origin
              (method url-fetch)
              (uri "http://cran.r-project.org/src/contrib/Archive/codetools/codetools_0.2-16.tar.gz")
              (sha256
               (base32
                "1dklibnp747a0p41ggcf8fyw36xhj9c869gay80ggfns79y7axn2"))))))
From the definition of r-codetools, we just copy (inherit) all the fields except the source field, which we replace with the new source code version matching the new checksum. And the manifest file is updated with:
(define with-r-codetools-bis
  (package-input-rewriting
   `((,(specification->package "r-codetools")
      . ,(specification->package "r-codetools-bis")))))
Nothing more.
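As a sanity check – a hypothetical snippet, not part of the original setup – the rewriting procedure can be applied to a single package; the result is a variant whose whole dependency graph refers to r-codetools-bis instead of r-codetools:

```scheme
;; Hypothetical example: rewrite the dependency graph of r-cydar so that
;; every occurrence of r-codetools is replaced by r-codetools-bis.
(define r-cydar/fixed
  (with-r-codetools-bis (specification->package "r-cydar")))
```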
Go for it
Let us recap how all the pieces fit together. First, we have a list of package names (specifications) that we transform into the internal representation (packages). Second, for each one we rewrite the dependencies of dependencies to replace r-codetools with r-codetools-bis. Third, for each one we rewrite the dependencies of dependencies to extend the list of Bioconductor servers. It reads:
(packages->manifest
 (map (compose fix-bioconductor-url
               with-r-codetools-bis
               specification->package)
      (list
       ;; Packages from the Guix collection
       "r"
       "r-rtsne"
       "r-pheatmap"
       "r-rcolorbrewer"
       "r-ncdfflow"
       "r-edger"
       "r-flowcore"
       "r-dplyr"
       "r-combinat"
       "r-rmarkdown"   ;render .Rmd files
       ;; Extend collection by defining in folder my-pkgs/my-pkgs.scm
       "r-cydar")))
In summary, we compose the three transformations and apply this composition to each (map) listed package name. See the companion Git repository for all the details.
Last, be patient… then be more patient. Many packages need to be built from source, since most of the pre-built binary artifacts (substitutes) are gone. It takes many hours, depending on your hardware. In the end, we do not have the exact same computational environment as the one we described in our paper. Instead, we have an approximate one where the R library codetools has been replaced. Still, Guix provides all the tools to manipulate the computational environment, controlling all the details at a very fine grain.
Next steps, work in progress
Still reading? Quite a journey, isn’t it? My first opinionated and main conclusion is that Guix is very flexible and strict at the same time. The computational environment had been frozen – everything is immutable – and although the reproduction is impacted by the link-rot phenomenon, Guix still helps. Maybe it does not appear “easy” to you, but I cannot think of any other tool capable of such features: time travel, rewriting components on the fly, etc. If you are an expert in some other tool, let me know how you would do it. Any feedback is very welcome!
How could this reproduction be made even easier? I am, among many others, not satisfied with the current situation. We can collectively do better!
- If you are a paper’s author, please cite without any ambiguity the scripts and other material required by your work. Do not think that publishing the Git repository containing your scripts is enough for inspecting, verifying and auditing what you did.
- You need to pin a specific revision when publishing.
- You need an identifier that depends only on the content itself, not some version label such as “version 1.2.3”.
- If you are a paper’s author, please think about how to cite all the software required by your work. Do not get me wrong: publishing your scripts and more is a very good practice toward better scientific production. In addition, please also capture the information about your computational environment, as guix describe does, and publish it.
In summary,
- The way the source code is uniquely identified matters.
- The way source code is transformed into binaries also matters.
How to look up inside an archive?
As we have seen above, when the content is missing from all the expected locations, Guix tries to fetch it from the Software Heritage archive. Currently, 75% of the source code that the Guix collection provides is archived. To put it in precise terms: for 75% of the source code that the Guix collection provides, Guix is able to automatically fetch the content back from Software Heritage. In other words, the remaining 25% may already be archived, but Guix does not implement the capability to download it from Software Heritage.
One example concerns the Subversion (svn) version control system. Most CTAN TeX packages are versioned using Subversion and may be archived in Software Heritage. However, in these cases the lookup mechanism is not implemented by Guix (see bug#43442). Help is very welcome!
Another example concerns “compressed tarballs”. In short, a compressed tarball can be split into two parts: the content itself and the metadata around it. Metadata is, for instance, the compression level, the compression algorithm parameters, or specifics of file or directory structures. Software Heritage archives only the content itself – and that’s already a lot! – but drops the surrounding metadata. Without this metadata, it is impossible to verify (via checksum) that the content exactly matches the version at packaging time.
That is the purpose of the Disarchive database. For all the compressed tarballs that the Guix collection relies on, on the one hand the metadata is automatically extracted and stored in a database, and on the other hand the content is archived by Software Heritage. Awesome, isn’t it?
However, as we have seen previously,
Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification
the Disarchive database has holes. For instance, it does not contain the package r-biocneighbors for the Guix revision 1971d11db9. Help is very welcome, from discussions to examples of holes to infrastructure improvements.
Last but not least: the choice of the unique identifier. It is not a detail, it is the centerpiece. And the situation resembles the well-known xkcd strip about standards. Guix was initiated in 2012 and reuses principles pioneered by Eelco Dolstra’s PhD thesis from 2006. One of them is the normalized archive, or nar, format – comparable in spirit to a tarball. Later, Software Heritage designed the SoftWare Heritage persistent IDentifiers (SWHIDs), adapted to their archiving purpose. For the interested reader, an entry point to the difference reads:
guix hash --serializer=nar --format=nix-base32 --hash=sha256
guix hash --serializer=git --format=hex --hash=sha1
The first one matches the sha256 field in a Guix package definition. The second matches the swh:1:dir: identifier. Compare, for instance:
$ guix time-machine -C channels.scm -- edit r-catterplots
$ guix hash -S nar -f nix-base32 -H sha256 \
   $(guix time-machine -C channels.scm -- build r-catterplots --source)
0qa8liylffpxgdg8xcgjar5dsvczqhn3akd4w35113hnyg1m4xyg
$ guix hash -S git -f hex -H sha1 \
   $(guix time-machine -C channels.scm -- build r-catterplots --source)
98315f49b5f8a6bd0c537de92449d5a5ce8ff35a
And visit https://archive.softwareheritage.org/swh:1:dir:98315f49b5f8a6bd0c537de92449d5a5ce8ff35a.
In a way, the Disarchive database provides such a mapping from nar identifiers to SWHID identifiers. It would be nice if Software Heritage could directly integrate such a map. Guess what? Work is in progress to bridge both and thus ease the lookup. For instance, SWH ticket#4979 introduces some mapping components. Stay tuned for more…
Rendez-vous next year! I cannot wait. Let us see whether we will collectively have made progress on reproducing what we just did in this post.
Join the fun, join the initiative Guix for Science!