Back from Software Heritage anniversary at UNESCO

Wow, I was impressed to be invited to the Software Heritage annual community event, not to mention the venue: UNESCO. It was amazing to see La Chute d'Icare by Pablo Picasso, or the view from the refectory. Enough for the tourist part. :-)

I am very grateful to Roberto Di Cosmo and the Software Heritage team. The day was so fruitful. Right after FOSDEM, it was intense. The morning was focused on “the-big-picture” talks and the afternoon on community discussion; the dinner was a great moment for socializing.

Among the many very informative talks, I was surprised by « DNA storage and the future for long term archival » and astonished by « Large scale compression of software source code ». Last, I was happy to discover Intact Digital, whose mission is to mitigate risks arising from the obsolescence of computing technologies.

Again, I had the opportunity to talk with the Software Heritage (SWH) team. Always a delight! At dinner, I was at the same table as John Sullivan (ex-FSF) and I appreciated the quick chat about some other sides of Free Software. Last, still during dinner, Sabrina Granger pointed me to many references about the status of error and other philosophy of science topics. Now, I have a lot of homework.

Workshop about FAQ

In the afternoon, we had small group discussions about various topics. I chose the group that tried “to debunk some misconceptions” (my words) about some frequently asked questions (FAQ). The session was led by two maestros: Sabrina Granger and Pierre Poulain (a colleague from Université Paris Cité).

The format was very helpful for encouraging interaction: first, think about arguments on your own; second, share and compare with your direct neighbors; last, report to the whole group (~10 people) and discuss the ideas. At the very end, summarize the main idea in one hurray sentence. Sadly, we dealt with only 3 items. Here is the complete list with my own arguments (step 1). Again, these are my own personal opinions, and only mine.

If a colleague raised one of these items, it would mean we do not share the same understanding of topics such as the scientific method or the activities of scientific researchers. Therefore, to me, the best approach is to first ask a question instead of making a hurray claim. In my humble opinion, convincing peers is about bringing their understanding toward our idea, not pushing a point onto them. Well, it probably depends on the people.

That said, let’s go!

In my field/domain, we never look back on our source code.

Question: What is the aim of publishing and/or using this source code if no one looks at it?

The scientific method relies on two key components: transparency of the arguments supporting a conclusion and sustainability of the evidence. In other words, consensus is reached because we, as the scientific community of a field/domain, all draw the same conclusion. By design of the scientific method, the field/domain must look back on the arguments, conclusion and evidence.

In the scientific debate of such a field/domain, what is the value of such source code? Is it part of the argument? Is it just an example? Is such source code involved in providing some evidence?

All in all, if the source code is part of the scientific debate, in the broad sense, it must be archived. If the source code is not part of the scientific process, why is it produced and/or published by scientific researchers as an outcome of their activities?

We don't need to share our source code outside our team.

Question: Why do you need source code to produce this scientific result?

The scientific method relies on two key components: transparency of the arguments supporting a conclusion and sustainability of the evidence. In other words, consensus is reached because we, as a scientific community, all draw the same conclusion. By design of the scientific method, all the arguments and evidence must be shared. Otherwise, the conclusion cannot hold outside the team.

In the scientific debate, what is the value of such source code? Is it part of the argument? Is it just an example? Is such source code involved in providing some evidence?

Sharing source code is fine, but only if the source code is well written. How does SWH deal with source code quality?

Question: What does “well written” mean?

In some cases, “well written” source code means as obfuscated as possible. There are even contests where the best entry is the least human-readable one. In a scientific context, the relevant frame is not “well written” but correctness.

If you feel that the code is not well written and you are hesitant to share it because the source code could contain flaws, then this is precisely where the scientific method requires sharing. The less confident you are, the more you must share.

By sharing the source code, even if poorly written by your own standards, you allow peers to access it and thus check its correctness, or reuse part or all of it. This is what the FAIR principles are about.

Software Heritage does not deal with source code quality because it is an archive. Similarly, the Archives Nationales do not consider whether handwritten paperwork is well written; they just archive it. Being an archive raises other challenges (collecting, classifying/sorting, preserving, retrieving).

I'd rather save/find the executable version of a software.

Question: Can an executable run totally alone?

The preservation of an executable version is a very short term solution.

First, an executable is specific to a given hardware, and without that hardware it is totally useless. For instance, what would you do if you had at hand only the executable version of the Apollo mission software? Probably nothing. However, having the source code can be helpful for exploring the algorithms and numerical methods. And, with enough motivation, it would even be possible to write a compiler targeting modern hardware.

Second, the executable version does not tell anything about how it was produced. It is easy to find examples where the exact same source code produces two different results depending on the compiler options.
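One well-known cause is floating-point arithmetic: addition is not associative, so an option that allows the compiler to reorder operations (GCC's `-ffast-math`, for instance) may change the last digits of a result. Here is a minimal sketch in Python, where two functions stand in for two builds of the same expression; the function names are only illustrative.

```python
def sum_as_written(a, b, c):
    # Evaluation order as written in the source: (a + b) + c
    return (a + b) + c


def sum_reassociated(a, b, c):
    # Order a compiler may choose when allowed to reassociate: a + (b + c)
    return a + (b + c)


x = sum_as_written(0.1, 0.2, 0.3)
y = sum_reassociated(0.1, 0.2, 0.3)
print(x, y, x == y)  # 0.6000000000000001 0.6 False
```

The difference is tiny here, but it can be amplified by a long numerical computation, and the executable alone does not say which order was used.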

The scientific method relies on two key components: transparency of the arguments supporting a conclusion and sustainability of the evidence. In other words, consensus is reached because we, as a scientific community, all draw the same conclusion. By design of the scientific method, all the arguments and evidence must be sustainably transparent for the whole community.

Therefore, preferring to save/find an executable version of a piece of software goes against the scientific method. The scientific method requires source-to-binary transparency.

The mission of Software Heritage is to archive source code, and that is already a lot. Archiving executables as well is not technically affordable. Instead, we must rely on tools able to redeploy the executable. Guix helps here.

Why should I use a `SWHID`? DOIs are just fine.

Question: Do you know the difference between SWHID and DOI?

It is about intrinsic vs extrinsic identifiers. Extrinsic identifiers rely on an external registry to keep the correspondence between the object itself and its identifying reference. Consider the well-known passport: it identifies one person but it depends on the government which issued it. There is no connection between the person and the passport number, other than a database registering this correspondence.

Intrinsic identifiers rely only on the object itself for the identification. Again, consider the passport, this time its biometric data: a fingerprint is unique to the person. There is no external authority.

SWHID is an intrinsic identifier and DOI is an extrinsic one. It means that DOIs require a resolver (registry) that maps the identifier to the content. And with the content alone, there is no means to know its DOI. DOIs work very well for some specific content such as what is found in libraries. However, they scale poorly when dealing with source code.

A SWHID provides a fingerprint of the content: with the content and the SWHID at hand, it is possible to check, bit for bit, that the content is what we expect. SWHID removes one level of indirection in the chain of trust.
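The contrast can be sketched in a few lines of Python. This is only an illustration: the registry is a plain dictionary, the identifier `doi:10.0000/example.1` and the content are made up, and SHA-256 stands in for whatever hash a real intrinsic scheme uses.

```python
import hashlib

# Extrinsic: a registry maps an assigned name to the content.
# Identifier and content are hypothetical, for illustration only.
registry = {"doi:10.0000/example.1": b"print('hello')\n"}


def resolve(identifier):
    # Without access to the registry, the identifier says nothing about the content.
    return registry[identifier]


def intrinsic_id(content):
    # Intrinsic: the identifier is computed from the bytes themselves.
    return hashlib.sha256(content).hexdigest()


content = resolve("doi:10.0000/example.1")
# Anyone holding the same bytes recomputes the same identifier, no registry needed.
same = intrinsic_id(content) == intrinsic_id(b"print('hello')\n")
# Changing a single byte changes the identifier.
different = intrinsic_id(content) != intrinsic_id(b"print('hello!')\n")
print(same, different)  # True True
```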
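For a single file, the check can be sketched as follows. A content SWHID (`swh:1:cnt:...`) uses the same hash as a Git blob: the SHA-1 of a small header followed by the bytes. The function names below are my own, and the sketch covers file contents only, not directories, revisions or snapshots.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    # Same construction as a Git blob: header "blob <size>\0" followed by the bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches(data: bytes, swhid: str) -> bool:
    # Recompute the identifier from the bytes and compare with the expected one.
    return swhid_for_content(data) == swhid


empty = swhid_for_content(b"")
print(empty)  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(matches(b"tampered", empty))  # False
```

The hash of the empty content is the one Git users may recognize from `git hash-object` on an empty file.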

My source code doesn't deserve to be archived. It's just a tiny script.

Question: What is the aim of this tiny script?

See the three entries:

In my field/domain, we never look back on our source code
We don't need to share our source code outside our team
Sharing source code is fine, but only if the source code is well written. How does SWH deal with source code quality?

It's already on Zenodo.

Question: Can you compare the pre-publication version and the published one?

First, Zenodo is a general-purpose open repository. This means its resources cannot be optimized the way a special-purpose warehouse can. An example: when dealing only with source code and its history, it becomes possible to design dedicated data structures that are efficient for an archive of source code, i.e., that burn fewer physical resources.

Second, Zenodo collects heterogeneous content and assigns it DOIs (extrinsic identifiers). Who do we trust? CERN, which supports the platform? The authors themselves, who push content there? What happens if the authors erase a previous but published version?
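To give an idea of what such a data structure buys, here is a toy content-addressed store in Python where identical file contents are kept only once across versions. It is a sketch of the general idea, not of how Software Heritage is actually implemented; the class name and the file contents are made up.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical file contents are kept once."""

    def __init__(self):
        self.blobs = {}      # hash -> bytes
        self.snapshots = []  # one {path: hash} mapping per archived version

    def add_snapshot(self, files):
        snapshot = {}
        for path, data in files.items():
            key = hashlib.sha1(data).hexdigest()
            self.blobs.setdefault(key, data)  # stored only if not already present
            snapshot[path] = key
        self.snapshots.append(snapshot)


store = ContentStore()
store.add_snapshot({"main.py": b"print(1)\n", "README": b"demo\n"})
store.add_snapshot({"main.py": b"print(2)\n", "README": b"demo\n"})  # README unchanged
print(len(store.blobs))  # 3 blobs stored for 4 file entries
```

A general-purpose repository that stores each deposit as an opaque archive cannot exploit this kind of sharing between versions.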

Computing a `SWHID` may be nice. But many people are not able to do it. Therefore, why does SWH provide this type of PID?

Question: What is the aim of a SWHID?

Let us first dismiss the flaw in the argument: cutting trees with an axe may be nice, but many people are not able to do it; therefore, why does X provide this type of tool? Maybe because an axe is the right tool for cutting trees. :-)

A Persistent Identifier (PID) is a long-lasting reference to a document, file, web page, or other object. Yes, we could bring up the well-known xkcd about standards: situation, there are 14 competing standards; wait, we need to develop one universal standard that covers everyone’s use cases; soon, situation, there are 15 competing standards. For instance, before the creation of SWHID, Nix developed the NAR format (Normalized ARchive), later adopted by Guix.

Here, the key is to rely on intrinsic identifiers for PIDs. If SWHID becomes the standard, it would still be possible to compute bridges to other intrinsic identifiers such as NAR.

When using Git as a version control system, or a forge such as GitLab for collaborating, you are in a way already computing SWHIDs.

Whether people are able to do something is more often rooted in collective practices, and thus in the available tools, than in any concrete individual ability.

SWH deals only with dead source code. It's an archive.

Question: Is it an issue?

Yes, Software Heritage is an archive and not a forge. And?

Why archiving the development history if commits are poorly written?

Question: What does “poorly written” mean?

Once archived, the development history becomes immutable. Therefore, it could help settle a scientific controversy or establish who was first.

Moreover, consider that one version produced some results. Then time passes, the source code receives many improvements, and the new version no longer produces the same results. Knowing the development history allows you to track down the offending change. Once you know that, it helps in evaluating the impact: how many publications are reusing this incorrect code?

The scientific method relies on two key components: transparency of the arguments supporting a conclusion and sustainability of the evidence. In other words, consensus is reached because we, as a scientific community, all draw the same conclusion. By design of the scientific method, controlling sources of variation, such as the development history, helps on both sides: transparency and sustainability.

Producing scientific results is not frozen in time; it is a continuous activity. The development history reflects that, much like a good ol’ lab notebook.
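Tracking the offending change is typically a binary search over the history, which is what `git bisect` automates. A minimal sketch in Python, assuming the history is ordered from oldest to newest, the oldest version is good, the newest is bad, and the result never becomes good again once it has changed; the commit names and the `is_good` test are hypothetical.

```python
def first_bad_version(versions, is_good):
    """Return the first version for which is_good() is False."""
    low, high = 0, len(versions) - 1  # versions[low] is good, versions[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(versions[mid]):
            low = mid
        else:
            high = mid
    return versions[high]


# Hypothetical history: the result changed starting with commit "c5".
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
offending = first_bad_version(history, lambda v: v < "c5")
print(offending)  # c5
```

With seven versions, only two of them had to be rebuilt and rerun here; without the history, there is nothing to search.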

Nobody correctly cites software in academic papers. Why should I take care of the metadata of my software if people only refer to it via a link to my repo?

Question: How do you bootstrap the correct practice if not by doing it correctly yourself?

Citing software is a complex topic and, to my knowledge, there is no consensus about what the correct way would be. Running a piece of software rarely means that it runs totally alone; instead, it belongs to a complex chain of dependencies involving much other software. For instance, running the programming language R means that you need, under the hood, a shell such as Bash and a C compiler, since R is mainly written in the C programming language. Therefore, what do you cite? Only the version label of the leaf, here R? What about the options used for compiling this R? Or the dependencies of R, e.g., linear algebra?

It is a twofold issue: we audit the source code but we run a binary. How do we know the correspondence between the two? It is about transparency of the supply chain producing the binaries.

Metadata will not fix the issue of citing software because, to my knowledge, it does not capture the supply chain. However, it can be part of the solution. And the most promising tool for citing software is Guix which, by design, controls the supply chain exactly.
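To make the point concrete, here is a small Python sketch that collects everything reachable from one package in a dependency graph. The graph is hypothetical and heavily simplified (the package names and edges are mine, not an actual description of how R is built); real graphs are far larger.

```python
# Hypothetical, heavily simplified dependency graph.
DEPENDS_ON = {
    "R": ["c-compiler", "bash", "linear-algebra"],
    "c-compiler": ["libc"],
    "bash": ["libc"],
    "linear-algebra": ["c-compiler"],
    "libc": [],
}


def closure(package, graph):
    """Return every package reachable from `package`, itself included."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


print(sorted(closure("R", DEPENDS_ON)))
# ['R', 'bash', 'c-compiler', 'libc', 'linear-algebra']
```

Citing only "R" names one node out of five in this toy graph, and says nothing about the versions or build options of the other four.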

That’s all for now.

Join the fun, join Guix for scientific research! Join SWH!