The story of single-cell trajectories begins in 2014, with two seminal papers from the labs of Cole Trapnell (Monocle) and Dana Pe'er (Wanderlust). With these, the field of pseudotemporal ordering, or trajectory inference (TI), quickly took off: in 2016 we reviewed 10 TI methods, but today there are already more than 70.
This may be surprising at first: why are so many methods needed to solve the same problem? Finding dynamics in a high-dimensional -omics dataset is not easy, and a method has to make choices about dimensionality reduction, time complexity and the final ordering step, choices whose best values may differ between datasets. Indeed, when we tested these methods on some datasets, we often found contradictory results between them. Based on earlier experience with benchmarking and with developing a TI method of our own, we therefore set out on a large-scale benchmarking effort.
Figure 1. A 3D illustration of the two main figures of the paper, summarising the main results regarding accuracy, scalability, stability and usability
When we talked with users at our institution and at conferences, four recurring concerns emerged: accuracy, scalability, stability and usability.
Foremost is accuracy: how well the trajectory represents the underlying cellular dynamics, which we assessed using four accuracy metrics. It took us quite some time to settle on these metrics; between our preprint and the final revision we even replaced three of the four.
A second common concern was scalability. With the cost of profiling individual cells dropping rapidly in recent years, it's not surprising that methods lag behind in this respect. Nonetheless, some methods really did well on this criterion, and we're excited to highlight them in our benchmark. Nothing makes the analysis of single-cell data more frustrating than waiting hours for a method to finish, only to find out that something was wrong with your input data and having to wait again, and again, and again.
Third, we analysed the stability of each method, i.e. how similar the trajectories are when you run a method on the same (or very similar) input data. While reviewing source code, we found that quite a few methods feign stability by forcing the pseudorandom number generator to the same state on every execution. While it is good practice to set a seed at the start of an analysis, developers should beware that setting a seed only makes a method deterministic (it generates the same output for the same input data), not necessarily stable (small differences in the input may still result in large differences in the output). To assess the stability of each method, we ran it multiple times on slightly perturbed versions of the same dataset and computed the similarity between the different predictions.
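To make the distinction between determinism and stability concrete, here is a minimal sketch (in Python for illustration; our actual pipeline is in R, and the toy method below is hypothetical, not one of the benchmarked tools). The seeded method is perfectly deterministic, yet its stability only shows up when you rerun it on perturbed copies of the data:

```python
import numpy as np

def infer_pseudotime(expr, seed=42):
    """Toy stand-in for a TI method: orders cells along the first
    principal component, found by power iteration. The fixed seed makes
    the random start vector reproducible, hence the method deterministic,
    but that alone says nothing about its stability."""
    rng = np.random.default_rng(seed)
    centered = expr - expr.mean(axis=0)
    v = rng.normal(size=expr.shape[1])       # seeded random start vector
    for _ in range(100):                     # power iteration on X^T X
        v = centered.T @ (centered @ v)
        v /= np.linalg.norm(v)
    return np.argsort(centered @ v)          # cell ordering along PC1

def stability(expr, n_runs=5, noise=0.01, seed=0):
    """Stability protocol: rerun the method on slightly perturbed copies
    of the data and report the mean absolute rank correlation between
    the resulting cell orderings (1.0 = perfectly stable)."""
    rng = np.random.default_rng(seed)
    rank_vectors = []
    for _ in range(n_runs):
        perturbed = expr + rng.normal(scale=noise, size=expr.shape)
        order = infer_pseudotime(perturbed)
        ranks = np.empty(len(order))
        ranks[order] = np.arange(len(order))  # rank of each cell
        rank_vectors.append(ranks)
    corrs = [abs(np.corrcoef(rank_vectors[i], rank_vectors[j])[0, 1])
             for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean(corrs))
```

On data with a clear linear trajectory the score stays close to 1; a method that is merely deterministic can still score poorly here, because each perturbed input is a genuinely different dataset.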
Last but not least, we analysed the usability of each method. While reviewing source code and manuscripts, we often ran into similar issues: seed-setting, limited documentation, bad coding practices, or simply getting the software to run at all. This led us to create a 'quality control' checklist of good practices in software and science development, which we used to score the usability of each method. The usefulness of parts of the usability score is certainly debatable; the usability of a tool is often in the eye of the beholder. Still, we hope that these kinds of usability scores will encourage better software practices within bioinformatics. Our small experiment supports this idea: when we asked authors for feedback on our analysis, many updated their method to include better documentation and automated testing of the code.
Doing large-scale benchmarking is not only a question of science, but also of technology. To keep such a study maintainable in the long run, you are almost forced to adhere to good software standards and to use the tools that modern software development has to offer. From the very beginning of this project, we made heavy use of unit testing, version control and continuous integration to make sure each part of the pipeline was functioning correctly. This allowed us to collaborate better, but also acted as an early warning system for bugs – you do not want to be wrapping up your results only to find a crucial bug that forces you to rerun all of your experiments. Adopting these practices can have a considerable learning curve, especially for people coming from a different scientific field, but they pay off in the long run. If you find yourself at the helm of such a project, do not be overwhelmed by any of these terms; look up a few tutorials from experts in the field and browse through the source code of packages you frequently use – you learn a lot simply by observing how experts tackle various issues.
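As a concrete (and hypothetical) illustration of the kind of unit test we mean, sketched in Python rather than the R our pipeline actually uses: small, hand-checkable cases for a pipeline component, which continuous integration can run on every commit so that a broken metric is caught immediately rather than after the benchmark has run.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman correlation between two pseudotime vectors, computed as
    the Pearson correlation of their ranks. (Hypothetical helper, not
    the actual pipeline code.)"""
    ra = np.argsort(np.argsort(np.asarray(a))).astype(float)
    rb = np.argsort(np.argsort(np.asarray(b))).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def test_identical_orderings_score_one():
    # Two vectors with the same cell ordering must score exactly 1.
    assert abs(rank_correlation([0.1, 0.4, 0.5, 0.9],
                                [0.2, 0.3, 0.7, 0.8]) - 1.0) < 1e-9

def test_reversed_orderings_score_minus_one():
    # A fully reversed ordering must score exactly -1.
    assert abs(rank_correlation([0.1, 0.4, 0.5, 0.9],
                                [0.9, 0.5, 0.4, 0.1]) + 1.0) < 1e-9
```

A test runner picks up the `test_*` functions automatically; the point is that each test pins down behaviour on a case you can verify by hand.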
This effort is not immediately obvious when reading the paper. To make sure it does not go to waste, we invested heavily in making the underlying tools easily accessible to others as a set of R packages. Check them out at dynverse.org!