2020-06-29

Better science through better software

Written by Arne Vollertsen

The CodeRefinery project promotes writing better code for computer-aided science

Computers are everywhere, and just like so many other parts of our lives, academia is becoming increasingly digitalized and computer-aided. Scientists rely heavily on computation, and studies show that more than half of all scientists across science domains write their own code, ranging from short scripts to highly advanced and complex protocols.

However, being a scientist does not necessarily mean that you are an accomplished programmer as well. In fact, many scientists are self-taught software developers, and they could benefit significantly from learning stringent and methodical approaches to coding. Not only does better software produce better science, it also enhances the reproducibility of research results, and makes it easier for others to reuse and refine the code you’ve written.

700+ workshop participants

Since 2016, NeIC’s CodeRefinery teaching program has helped hundreds of Nordic scientists refine their coding skills. As of May 2020, 700+ people have participated in the 3-day workshops on sustainable scientific software development that are the bread and butter of the CodeRefinery project. The workshops teach best practices and modern tools for reproducible and sustainable research code.

Meet Max Emil Schön, PhD student at Uppsala University identifying single cell organisms with gene sequencing techniques.

“I work in bioinformatics, combining biology and computer science, and my PhD is about evolutionary biology. For instance, we study how genomes of organisms change, when they adapt to new environments over evolutionary time scales. It involves a lot of sequence data analysis, and a lot of custom scripts, because much of what we do is not established yet.

I signed up for a CodeRefinery workshop, because in my master’s program I had learned a lot about theory and biological applications, but not so much about how to write better code. I’ve learned that through CodeRefinery. It has helped me become better at modular code development, thinking about making the analysis pattern more usable, thinking about how to share code, where to host it etc.

CodeRefinery has changed the way I work with coding. The quality of my coding work has gone up, and it’s more fun! On top of that, becoming a better coder has improved my research as well. Since I can reuse things more easily now I get to look more at details than I could before.”

According to Max Emil Schön, writing code is not an integral part of the biology curriculum. However, nowadays all biological researchers need to write code to do their research. That means that many researchers use and write software without any formal training.

CodeRefinery is addressing that knowledge gap by teaching basic-to-advanced research computing skills, thus helping researchers become confident in using state-of-the-art tools and practices from modern collaborative software engineering.

Avoiding “the tar pit”

For instance, to write good and efficient code you need to know how to avoid “the tar pit” of programming: Over time software becomes more and more complex and harder to fix, with bugs appearing in unexpected places, and more time is spent on debugging than on developing. To evade the tar pit you have to think LEGO: Modular code development that allows you to build complex behaviour from simple building blocks.

Another cornerstone of good code is testing it as you go along, so CodeRefinery students are introduced to workflows including automated testing, to detect errors, and to make it easier for users to verify that they have installed the code correctly.
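
To make the LEGO idea and automated testing concrete, here is a minimal sketch in Python. It is a hypothetical example, not taken from the CodeRefinery lessons: the function names and the small sequence-analysis task are assumptions chosen for illustration. A small building block is written and tested on its own, and a second function is then built on top of it.

```python
# Hypothetical illustration: names and task are assumptions, not CodeRefinery material.

def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


def mean_gc_content(sequences):
    """Build on the small block above: average GC content of several sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(gc_content(s) for s in sequences) / len(sequences)


# Automated tests: each one checks a single building block against a known answer.
def test_gc_content():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("atgc") == 0.5  # lower-case input is handled


def test_mean_gc_content():
    assert mean_gc_content(["GGCC", "ATAT"]) == 0.5


if __name__ == "__main__":
    test_gc_content()
    test_mean_gc_content()
    print("all tests passed")
```

A test runner such as pytest would pick up the functions whose names start with test_ automatically, so a user who has just installed the code can run them to check that everything works on their machine.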

Like calibrating an instrument

Testing is all about making sure that your software yields correct results. Yet many researchers are unaware of how software can and should be tested, and therefore do not test software the same way they would test a scientific instrument before taking measurements. While they would never trust a detector that hasn’t been tested and calibrated, their approach to computers is less stringent, although software is the most widespread of all instruments used in modern science.

So, one of the pillars of the CodeRefinery teaching program is testing. Furthermore, version control is important. With version control the system records snapshots of a project and thus makes it possible to return to a working version of the code if anything goes wrong. Also, version control enables different people to work on the same code without interference. Importantly, it also contributes to the reproducibility of research results: Knowing which version of the code has been used for a specific scientific experiment makes it much easier to reproduce the experiment.
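
One way to tie a result to the code version that produced it is sketched below, assuming the analysis lives in a Git repository. The helper names current_commit and build_run_record are hypothetical, and the fallback value "unknown" is a choice made for this example rather than an established convention.

```python
# Hypothetical sketch: helper names and record layout are assumptions for illustration.
import json
import subprocess


def current_commit():
    """Return the Git commit hash of the working directory, or "unknown"."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not a repository.
        return "unknown"
    return completed.stdout.strip()


def build_run_record(parameters, commit=None):
    """Bundle the input parameters with the code version that was used."""
    return {
        "code_version": commit if commit is not None else current_commit(),
        "parameters": parameters,
    }


if __name__ == "__main__":
    record = build_run_record({"threshold": 0.05})
    # Storing this next to the results tells a later reader which snapshot to check out.
    print(json.dumps(record, indent=2))
```

Saving such a record alongside every output file means that anyone trying to reproduce the experiment later knows exactly which snapshot of the code to return to.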

No re-inventing

CodeRefinery also offers researchers an online repository hosting platform where they can store their code for others to access, and teaches how such platforms can be used to collaborate effectively on writing software. Rather than having people write code that someone else has already written, this service helps researchers avoid reinventing software. In this way people make things easier both for themselves and for the research community.

Meet Dina Babushkina, philosopher and Post-Doc at the University of Helsinki Department of Social Sciences, Unit of Practical Philosophy, working with ethics and artificial intelligence.

“Coding is not something philosophers usually do. But in my opinion, if you really want to make a difference in the field of ethics of AI, you have to understand how the algorithms work. Or else you will find it hard to cooperate with developers and decision makers in that field. Right now I’m working on a project called “Towards Responsible Artificial Intelligence”. We look at the use of AI in society and in the medical sphere, for instance social care, surgery, therapy, diagnostics, and other areas where AI could potentially harm people. We want to help find ways to mitigate that risk by establishing general principles for the ethical use of AI. We also aim to determine what capacities an autonomous AI system should have when it operates in such social environments.

When you know how to code with the methods software engineers use, you get a completely different perspective on the type of problems they face. You learn to look behind all the hype surrounding AI, and you focus on what the algorithms are and what they can or cannot do.

Originally I taught myself coding, learning Python from scratch. I signed up for a CodeRefinery workshop to improve my coding skills and to get a better understanding of how collaborative work in data science actually happens. That understanding turned out to be very helpful in my work and especially in interdisciplinary projects where I co-develop code with researchers from other sciences. The CodeRefinery workshops are great for that because you meet data scientists from various fields. I even met a philosopher turned computer scientist! Generally speaking, the coding community is very open. There is a tendency to share knowledge and be open about the techniques people are using. It is a very creative discipline, and I find that inspiring.”

To understand the CodeRefinery teaching program in a broader context you need to look at what has been labelled “The Reproducibility Crisis in Science”. One of the cornerstones of research is reproducibility. For a scientific finding to be believable it has to be reproducible, meaning that an independent researcher should be able to replicate an experiment and obtain the same results under the same conditions as the original experiment.

A growing problem

However, a survey published in 2016 in the leading scientific journal Nature showed that a majority of researchers consider irreproducible experiments a growing problem across all domains of science. One of the reasons for that is the increasing use of computation. For instance, the software used to generate the original results may be unavailable, it may be difficult to recreate libraries, versions and other parts of the software environment used to generate the results, and it may be difficult to rerun the exact computational steps that led to the original results.

Meet Pradeep Eranti, PhD student and Early Stage Researcher with MLFPM ITN (Machine Learning Frontiers in Precision Medicine) at the University of Paris. His PhD project studies the integration of multi-omics data and disease-related phenotypes for better disease risk prediction.

“I am a Bioinformatician and I mainly write code for analysing biological data. Before moving to Paris, I used to work at Aalto University where my project involved analysing huge datasets. Sometimes, the analysis required heavy computations, which made my computer crash and made me lose code. To avoid that, I wanted to learn how to use Git, to put my code in the cloud. So, I signed up for a CodeRefinery workshop.

CodeRefinery has made me a better coder, definitely. I’m not a computer science student. I work in bioinformatics, so I’m not a perfect coder. But through CodeRefinery I now know about principles like developing code in the form of small modules, to reuse code, tracking changes using Git, and so on. Now I write cleaner and more efficient code, and not least, now my code can be reused by other researchers.

For me, the main advantage of participating in the CodeRefinery workshop came from it being a live event. You get to know everything. For instance, experienced UNIX users use all kinds of gimmicks and tricks, which are impossible to follow for a novice coder. But in the workshop they didn’t use the shortcuts. They typed everything in front of us, so we could see what they were doing. That was tremendously helpful.

I liked the workshop so much I volunteered afterwards as a teacher. I think that was a nice way of giving something back.”

The CodeRefinery project is helping to address this reproducibility crisis in science by promoting better and more transparent research code. It is also committed to spreading the word and to having a large, long-term impact, even beyond the project’s scheduled end in late October 2021.

To that end, it is collaborating with The Carpentries movement, a worldwide initiative to teach foundational coding and data science skills to researchers. While inspired by the teaching style of The Carpentries, the topics taught in the CodeRefinery workshops are a step more advanced, aimed at researchers who already write code or scripts. CodeRefinery nevertheless uses similar lesson formats, so its lessons can easily be adapted by the Carpentries community.

Furthermore, to anchor the initiative locally, CodeRefinery is partnering with Nordic research institutions to form training hubs with local instructors. Current hubs are located at Aalto University, Helsinki, at KTH in Stockholm, at NTNU, Trondheim, and at the University of Oslo.

Network for research software engineers

CodeRefinery is also promoting a network for people with one foot in research and one foot in software development. Nordic-RSE, the network of Nordic research software engineers, aims to bring together the growing number of people writing and contributing to research software at Nordic universities, research institutes, companies and other organizations. These people combine expertise in programming with an in-depth understanding of research, but they often work alone, with little connection to colleagues. Although their contribution is valuable, they do not follow the traditional career path from PhD to postdoc to professor, and thus have no formal place in the academic system. To remedy this, Nordic-RSE is organising meetings and raising awareness of the need for scientific recognition of research software.

Because better code leads to better science, and it’s time to focus on the most popular scientific instrument there is – software.