Talk at the bioinformatics seminar of the Pasteur Institute

Note: Thanks to Frédéric Le Moine from the Pasteur Institute for the invitation to talk about Guix there. The experience was great. Going to the Pasteur Institute changed my routine; I always enjoy meeting new people and learning about topics I had barely heard of before.

Although my talk (PDF) introducing Guix is now a bit polished, it does not yet run as smoothly as I would like. First, because I still need to practise my speech delivery. Second, because explaining what Guix is is difficult; it is not straightforward to clearly distinguish the “products”. Well, let me give a quick recap of the message, and extend it with a short example that I did not have the opportunity to demo.

The talk’s main message is also expressed in the paper:

Toward practical transparent verifiable and long-term reproducible research using Guix, Nature Scientific Data, vol. 9, num. 597, Oct. 2022

A quick opinionated summary for the impatient

  • Producing a scientific result implies being transparent at every stage, so that everything can be collectively verified. Both components are essential guarantees ensuring that we build knowledge on rock-solid foundations. They implicitly ask: how do we redo later and elsewhere what has been done today and here?

  • Any scientific activity is by definition open. Reproducible research is a means – not an end – for strengthening trust in a scientific result.

  • The redo crisis in the scientific method has many causes and no single solution. Part of the crisis stems from the lack of transparency of computational environments; from my totally partial point of view, that is where Guix helps. What I took away from the workshop Recherche reproductible: état des lieux is that computational environment troubles touch all fields – biology, chemistry, physics, etc., even the social sciences! And I leave aside the whole software stack required for producing the raw experimental data; that is another story.

  • In case it was not obvious: Guix does not solve the redo crisis, nor most of the non-reproducibility problems scientific practitioners face. Guix makes the computational environment transparent and collectively verifiable. Nothing more. That’s already one step toward better scientific research: Guix frees the scientific practitioner’s mind from the computational environment part, so they can focus on numerical methodology and on how to compose their computations.

  • When considering the computational environment, Guix is one answer to the question: how do we redo later and elsewhere what has been done today and here?

What makes Guix different

  • When I need the software samtools for manipulating nucleotide sequence alignments, it implicitly means I also need htslib – a C library for reading/writing nucleotide sequencing data – which itself needs bzip2 or zlib for compressing/decompressing, or htscodecs for CRAM codecs, etc. In other words, I need a graph linking all the dependency relationships.

  • Assuming Alice says “I use samtools at version 1.14”, are we using the exact same samtools, whatever the versions of its dependencies? Is it the same samtools if it is linked with htslib at version 1.16 or with htslib at version 1.12? And so on, recursively, for the dependencies of dependencies…

  • A version label such as “I use samtools at version 1.14” is a handy shorthand for identifying source code, but it does not capture all the information required for:

    • Checking that everything is correct. A version label is not fully transparent for collective verification:
      1. Based on a version label, how can we verify that the source code Alice used is exactly the same as the one we fetch now? What if two different source codes are identified by the same label “version 1.14”?
      2. Maybe a bug has been discovered in one specific version of htscodecs; how can we know whether the scientific result produced by Alice is affected by that bug if we do not know which version of htscodecs Alice used?
    • Redoing if necessary. Relying later on the incomplete information provided by a version label, we have no guarantee that we run inside the same computational environment as Alice. Then, if we observe a difference that leads to another conclusion:
      1. Is it because of some methodological flaw in Alice’s paper?
      2. Is it because some experimental parameters were poorly captured?
      3. Is it because of an effect of the computational environment?

    A version label such as “I use samtools at version 1.14” is not enough when applying the scientific method: it does not allow us to control the sources of variation. Instead, the complete graph – the required tools, their dependencies and the dependencies of dependencies – must be captured.

  • This graph is what any package manager builds. The question is how to build it unambiguously. When relying on a dependency solver, satisfying all the constraints is not “easy”. Guix does not rely on any dependency solver but builds the graph from an explicit specification (state). In addition, for flexibility, Guix allows this graph to be manipulated: from one specification (state), you can declare how to replace one or more nodes in order to customize the computational environment (see the short sketch at the end of this section).

  • With Guix there is no dependency resolution, contrary to Conda, APT, Yum, etc. Instead, the user specifies the state, and this state provides some packages at some versions. Everything is captured, from the exact identification of the source code to the compilation options, recursively.

  • The state of this graph is described by guix describe. It provides a pin that captures the state, i.e., the whole graph.

  • This feature allows the exact same software stack to be reproduced from one machine to another.

Note. Container images such as Docker or Singularity are a common solution for freezing this graph. The main drawback: the container embarks binaries only. The way these binaries were produced is lost – and even when it is not, auditing how the computational environment is composed is a very hard task. The source-to-binary (graph) is not designed to be deeply verifiable. Containers such as Docker or Singularity lack the transparency required by the scientific method.

  • In addition, Guix is able to exploit the Software Heritage archive. If the URL of some source code mentioned in or required by a publication vanishes, then Guix falls back to SWH in order to check out the missing source code.

  • This fallback feature mitigates the troubles coming from link rot as time passes. It leaves a chance to redo later.
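
As an illustration of the graph manipulation mentioned above, here is a minimal sketch using Guix’s package transformation options. It is hypothetical in that it assumes a package matching htslib@1.12 is available in the channels you are using; the option --with-input rewrites the dependency graph of samtools so that every occurrence of the htslib node is replaced by the variant named on the right-hand side.

# Sketch: replace one node (htslib) of the samtools dependency graph by
# another variant, without touching the rest of the graph.  This assumes
# a package matching `htslib@1.12' exists in your channels.
guix shell samtools --with-input=htslib=htslib@1.12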

The 4 essentials for working with Guix

  • Capture your current state:

    guix describe -f channels > state.scm
    

    where an example of the channel file state.scm is displayed on the last slide of the PDF.

  • Create an isolated environment:

    guix shell --container -m some-tools.scm
    

    where examples of the manifest file some-tools.scm are displayed on slides p.14, p.16 and p.19, and on the second-to-last slide of the PDF.

  • Collaborating or publishing means sharing two files: state.scm and some-tools.scm.
  • Re-create the exact same isolated environment, whenever and wherever:

    guix time-machine -C state.scm \
         -- shell --container -m some-tools.scm
    
  • Share the exact same computational environment via a pack – Docker is one format among many others – if your collaborator does not run Guix (yet!):

    guix time-machine -C state.scm \
         -- pack -f docker -m some-tools.scm
    

    Distribute the resulting Docker image as you wish.
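
For instance – a hedged sketch, assuming Docker is installed on your collaborator’s machine – guix pack prints the store path of the generated archive, which can be fed directly to docker load:

# Sketch: `guix pack' prints the store path of the generated image;
# `docker load' reads that archive from standard input.
image=$(guix time-machine -C state.scm \
             -- pack -f docker -m some-tools.scm)
docker load < "$image"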

Summary: the whole computational environment is captured by these two files, the manifest and the channel file. They specify one particular graph; they describe all the nodes, from the identification of the source code to how to build and compose it (see the third-to-last slide of the PDF for an example of a package definition, i.e., the definition of one node).

How to work with Guix?

Do not miss the A Guide to Reproducible Research papers to get started.

Consider that we have already worked on a project, but without Guix. We are going to add the two files – channels and manifest. This will help with redoing.

We consider a very simple case:

  1. One R script for analysing the data.
  2. One data set.

The example here analyses flow cytometry data, but it adapts to any other field using any other programming language.

As usual: a project = scripts + data

Assume that the source code for analysing the data is tracked in a Git repository. Let us clone it.

git clone https://github.com/MarioniLab/SignallingMassCytoStimStrength src

Note. When speaking about redoing, we need an unambiguous identifier. In the digital world, the easiest is to use a data-dependent identifier such as a hash fingerprint. For a Git repository, we do not have to worry much: Git is designed around this concept of content-addressable identifiers.
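
To make this concrete – a small sketch, not part of the original demo – the identifier of a Git object is derived from its content, so the same content always yields the same identifier, on any machine:

# Sketch: the object id is computed from the content itself (plus a small
# header), so anyone hashing the same content obtains the same identifier.
printf 'hello\n' | git hash-object --stdin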

Well, let us check out one specific revision of the scripts:

git -C src \
    checkout d85402f3d951edf2c51281e3d09ea96a5c7da612

We are still missing the data to analyse, so we need to fetch them.

Digression. For me, the right methodology is still an open question: how do we identify data sets? How do we share data sets? How do we fight against link rot? etc. But I am drifting, right? Let us focus on the computational environment and get back to Guix!

For the sake of the message, I heavily simplify: 1. I downsampled the data, and 2. I stored the truncated data set in a location that is easy to download from.

Warning. The fact that this data set is stored in a Git repository is not relevant here; it only makes sharing with you easier – we are demoing Guix, not data management, after all. The data set could be located on any server and downloaded via whatever API such a server provides.

Let us download some data to analyse.

git clone https://gitlab.com/zimoun/tiny-data src/data
cd src/data && gunzip *.gz && cd ..

Now, we are ready!

Computational environment: Guix

Entering the project directory src/, we see the R script files. Consider the file Timecourse_peptides_analysis.Rmd: it tells us that it requires the R library named ncdfFlow. Let us search for this package.

guix search ncdfFlow

In Guix, R libraries are packaged under a name using the prefix r- followed by the downcased upstream name, e.g. r-ncdfflow. All in all, we identify the requirements:

"r"
"r-rtsne"
"r-pheatmap"
"r-rcolorbrewer"
"r-ncdfflow"
"r-edger"
"r-flowcore"
"r-dplyr"
"r-combinat"
"r-rmarkdown"

However, the R library named cydar is still missing: Guix does not provide it, and it is not part of any known scientific Guix channel (browse). Well, cydar is part of the Bioconductor collection, so let us import it from there:

guix import cran -a bioconductor cydar

This command line fetches metadata from the Bioconductor servers; based on this metadata, the importer returns a Guix recipe (package definition). The next step is explained in “How to get started writing Guix packages” – or see the Café Guix talk “Comment avoir plus de paquets pour Guix ?” (in French) or this post. The end result defining r-cydar can be seen here.
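
For readers who have not seen a package definition – the definition of one node of the graph – here is a rough sketch of what the module stored under my-pkgs/ looks like. The version, hash and metadata below are placeholders, not the values actually returned by the importer; the real definition is the one linked above.

;; my-pkgs/my-pkgs.scm -- sketch of a module extending the Guix collection.
;; All field values below are placeholders; use what `guix import' reports.
(define-module (my-pkgs)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system r)
  #:use-module ((guix licenses) #:prefix license:))

(define-public r-cydar
  (package
    (name "r-cydar")
    (version "0.0.0")                     ;placeholder
    (source (origin
              (method url-fetch)
              (uri (bioconductor-uri "cydar" version))
              (sha256
               (base32
                ;; Placeholder hash; copy the one printed by the importer.
                "0000000000000000000000000000000000000000000000000000"))))
    (properties `((upstream-name . "cydar")))
    (build-system r-build-system)
    ;; The R dependencies reported by the importer go here, e.g.:
    ;; (propagated-inputs (list r-flowcore ...))
    (home-page "https://bioconductor.org/packages/cydar")
    (synopsis "Placeholder synopsis")
    (description "Placeholder description.")
    (license license:gpl3)))              ;check the actual license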

Let us write the manifest file. A starting point can be the command line:

guix shell r r-ncdfflow r-rtsne r-pheatmap r-flowcore --export-manifest

which displays a manifest. In the end, the complete manifest.scm file looks like:

(specifications->manifest
 (list
  ;; Packages from Guix collection
  "r"
  "r-rtsne"
  "r-pheatmap"
  "r-rcolorbrewer"
  "r-ncdfflow"
  "r-edger"
  "r-flowcore"
  "r-dplyr"
  "r-combinat"
  "r-rmarkdown"                         ;render .Rmd files

  ;; Extension of the collection, defined in the folder my-pkgs/my-pkgs.scm
  "r-cydar"
  ))

where I edited it to add comments – the lines starting with a semicolon (;).

Then it is easy to launch an isolated computational environment:

guix shell --container --load-path=my-pkgs -m manifest.scm

where --load-path points to a directory containing Guix package definitions that extend the built-in Guix package collection.

Note that the environment is isolated. Try some command other than R or Rscript, for instance ls or cd.

What is missing to be able to redo later and/or elsewhere? The specification of the state.

guix describe -f channels > channels.scm

Obviously, depending on when you run this command, you will potentially get another Guix revision. The one I used when writing this post is 8e61e63, and it is left as an exercise for the reader how to run this revision.
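
For reference, a channels file produced by guix describe -f channels roughly looks like the following sketch; the commit shown is a truncated placeholder, and your own file will contain the full 40-character hash (recent Guix also adds a channel introduction field).

;; channels.scm -- sketch of the output of `guix describe -f channels'.
;; The commit is a placeholder; your file carries the full hash of the
;; Guix revision you are actually running.
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (branch "master")
        (commit "8e61e63...")))           ;full 40-character hash here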

Now, if both files channels.scm and manifest.scm are stored with all the other project files, it becomes easy to redo:

guix time-machine -C channels.scm                     \
     -- shell --container -L my-pkgs -m manifest.scm  \
     -- Rscript -e "rmarkdown::render('Timecourse_peptides_analysis.Rmd')"

Cool, isn’t it?

When time passes

Maybe you have noticed that the scripts are not from one of my own projects: they are the source code associated with a paper published in 2020. Together with Nicolas Vallet and David Michonneau, we redid part of that paper in 2021 as the demo for our own paper (Nature Scientific Data, vol. 9, num. 597, Oct. 2022). Therefore, let us take the former state file (see channels.scm here) that we described two years ago.

Bang!

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz'...
download failed "https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/...
download failed "https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "NOT FOUND"
Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification
failed to download "/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz" from ("https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" "https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz")
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.

Wait, let us analyse the failure.

  1. The package BiocNeighbors is not part of the requirements that we specified; this R library is not explicitly imported by the scripts. It enters the graph as a dependency of one of the packages we did request.
  2. We are seeing yet another example of the link-rot issue. Back in 2021, the source code of BiocNeighbors was downloadable at the mentioned URL; now the content at that very same URL is gone.
  3. Guix falls back to other locations when the initially expected one fails, including the Software Heritage archive. That’s awesome! Currently about 75% of the source code provided by the Guix collection is archived. I will not dive into the details of Software Heritage coverage and why it fails here. Keep in mind (see also the sketch below):
    • Guix automatically exploits the Software Heritage archive to fight against link rot.
    • The way the source code is uniquely identified matters.
    • Stay tuned, the coverage is improving…

Sadly, if only one node of the graph is missing, everything falls down. For this specific case involving Bioconductor, the Guix project considers it a bug and tracks it as #39885… For the fix, stay tuned!
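
As an aside – a sketch, not something shown in the talk – Guix ships a lint checker that reports whether the source of a given package is known to the Software Heritage archive; for instance, assuming the Guix revision in use provides r-biocneighbors:

# Sketch: ask whether the source of r-biocneighbors is archived by
# Software Heritage (one checker among those of `guix lint').
guix lint --checkers=archival r-biocneighbors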

Not convinced that Guix rocks? Have a look at the many use cases describing how scientific practitioners run Guix to make their research more reproducible.

Opinionated closing

Well, the computational environment and Guix are a tiny part of the large picture of reproducible research. Transparency and the ability to collectively audit the whole stack of any computation is simply the scientific method. For instance, it is not affordable to redo some intensive computations that require days or weeks on very large clusters. Complete transparency and a careful audit of all the stages – the complete software stack – are the guardians of trust in such cases, where redoing the numerical experiment is not possible.

Moreover, most modern analyses imply various chained steps. All these chained steps are often called a workflow, and they too lead to processing a graph – from the venerable GNU Make to Snakemake, CWL or Nextflow. The composition of these different steps (nodes) needs to make sense, so how can we check that everything is correct? It matters when extracting some steps from one workflow to create another workflow by composing them differently. Tools such as bistro or funflow are trying to tackle this topic. Then, how do we connect the workflow with software deployment? The Guix Workflow Language is an attempt, but it lacks a strong checker. Therefore, how do we join tools such as bistro or funflow with fine control of the underlying computational environment? Reproducible research in the digital world still has a lot on its plate…

Join the fun, join the Guix for Science initiative!


© 2014-2024 Simon Tournier <simon (at) tournier.info >

(last update: 2024-04-12 Fri 20:39)