This talk was given in French, in a slot of 5-7 minutes, questions included. It took place during a full-day satellite event of the Paris Open Source Summit. The initiative was led by Bastien Guerry from EtaLab. More information about the programme here.
The slot was very short and the audience very heterogeneous, especially in its day-to-day concerns. As an engineer working in a biology research institute, I tried to explain what the «Reproducible Science» challenge is in the modern age of data.
In short, a scientific result today is an experiment producing data, plus the numerical processing of that data. From what I see, the experimental part is more or less well described, or let us say that people in labs are aware of its importance because they already have several decades (even more) of collective learning behind them.
However, not enough people take care of the numerical processing, mainly, in my opinion, because we are living through a scientific paradigm shift. From what I see, it is more often than not overlooked that more scientific value lies in the numerical processing than in the data itself (or in how it is produced). Admittedly, I am fully biased, because computing is my job and I understand nothing about labs.
To guarantee «Reproducible Science» in the modern age of data, we need to guarantee several items, especially:
- Open Articles
- Open Data
- Open Source
- Controlled computing environment (open, too)
However, what about the fourth item, the controlled computing environment?
To fix ideas, let us consider some examples I encounter every day.
- Alice uses the tools foo-1.2, bar-3.4 and baz-5.6,
- Carole works with Alice but also works on another project with the tools foo-7.8 and bar-9.0,
- Charlie upgrades their system and then nothing works anymore,
- Bob uses the same versions as Alice but gets different results,
- Dan wants to replay the same numerical processing several months (or years) later, but he cannot reinstall the same versions of the tools because they have been updated in ways that break backward compatibility.
With these scenarios, the idea is to pinpoint concrete issues from the daily life of researchers.
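Bob's case is the least intuitive of these scenarios, so here is a minimal Python sketch of one way it can happen. It assumes that an environment is more than the tools explicitly requested: it also contains dependencies nobody listed. The package names, versions and the `diff_environments` helper are all hypothetical and only serve the illustration.

```python
# Hypothetical environments: package name -> version, including a dependency
# ("libqux") that neither Alice nor Bob asked for explicitly.
alice_env = {"foo": "1.2", "bar": "3.4", "baz": "5.6", "libqux": "2.0"}
bob_env = {"foo": "1.2", "bar": "3.4", "baz": "5.6", "libqux": "2.1"}


def diff_environments(left, right):
    """Return {package: (left_version, right_version)} for every mismatch."""
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


# The tools that were explicitly requested are identical on both sides...
requested = ["foo", "bar", "baz"]
same_requested = all(alice_env[p] == bob_env[p] for p in requested)

# ...but the complete environments are not.
differences = diff_environments(alice_env, bob_env)

print(same_requested)
print(differences)
```

In this toy setting, comparing only the requested tools hides the mismatch; comparing the complete environments reveals it.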
Each issue is fixable separately:
- package managers fix dependency hell,
- virtual environments fix the coexistence of several versions,
- containers fix the need for the exact same versions (and the coexistence).
But now the nightmare is working with all these layers together. Wait: Guix already provides everything we need.
Guix allows fine-grained control of the toolchain, and this control is the centerpiece of «Reproducible Science». At least in my opinion.
The two keys are binary transparency, which makes it possible to track down what might be wrong, and bootstrapping, which is the seed ingredient of the former.
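To picture why control over the whole toolchain matters, one can imagine identifying each package by a hash that covers its own name and version and, recursively, the identifiers of everything it depends on. The Python sketch below is a toy model of that idea only; it is not Guix's actual algorithm, and the dependency graph, the names and the `identifier` function are hypothetical.

```python
import hashlib

# Hypothetical dependency graph: package -> (version, direct dependencies).
GRAPH = {
    "toolchain": ("9.0", []),
    "libqux": ("2.0", ["toolchain"]),
    "foo": ("1.2", ["libqux", "toolchain"]),
}


def identifier(name, graph):
    """Hash a package's name, version and the identifiers of its dependencies."""
    version, deps = graph[name]
    digest = hashlib.sha256()
    digest.update(f"{name}-{version}".encode())
    for dep in sorted(deps):
        # Recursing here makes the identifier depend on the whole closure.
        digest.update(identifier(dep, graph).encode())
    return digest.hexdigest()[:16]


# Same foo-1.2, but the toolchain underneath has been upgraded.
upgraded = dict(GRAPH, toolchain=("9.1", []))

before = identifier("foo", GRAPH)
after = identifier("foo", upgraded)

print(before != after)
```

With such identifiers, "foo-1.2 built on one toolchain" and "foo-1.2 built on another" are visibly different objects, which is the kind of distinction needed to track down where two results diverge.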
Then, the presentation quickly shows how Guix works, first from an end-user's point of view for each scenario, and second with some of the plumbing, which is presented at length elsewhere.
HPC here means all scientific computing, not only clusters. :-)