Sunday, 20 October 2013

My bioinformatics interests and plans

I intended to keep this blog focused on technical topics but in this post I'll extend the scope a bit and will write about my current research  interests in bioinformatics and how they relate to software engineering. They are not necessarily the ones I spend most of my working time as my first priority is to provide computational and statistical support to the ongoing research of other scientists. But they are closely related and are the ones that keep me busy when I find some time free of other commitments. There are two topics that I am particularly interested – robustness and reproducibility of bioinformatics analysis.

Robustness

Bioinformatics has experienced dramatic growth over the last 15 years. The rapidly evolving experimental data, the adoption of new technologies drives the rapid evolution of the computational methods. The fast speed of development on many occasions comes at the expense of software engineering rigour. As result, bugs creep in, code that used to work a day ago no longer does. The results of bioinformatics analysis can change significantly from one version of a tool to the next. I have experienced this myself on a number of occasions while using third-party tools to analyze data from genomic experiments. The consequence is increased uncertainty about the validity of results.

I believe that this situation can be significantly improved if we borrow from the field of software engineering tools and techniques which have been developed over the last decade in order to maintain the quality and robustness of the software code over its entire life cycle (e.g. unit tests [1][2], static code analysis [3][4], continuous integration[5][6]). A fortunate circumstance is that over the same period public repositories of biological data (e.g. at EBI and NCBI) accumulated a vast amount of experimental data. This opens up exciting opportunities. As an example, data from a large number of independently produced studies can be re-used to populate unit tests for many of the popular bioinformatics tools. Executing such unit tests either ad-hoc or as a part of an automatic routine would help us identify situations where different versions of the same tool produce discrepant results over large number of studies. Such automatic meta-analysis would increase the reliability of the biological studies and would allow us to focus on studies and results where re-interpretation may be necessary.

Reproducibility

Bioinformatics workflows are becoming increasingly sophisticated and varied. They tend to apply in succession a growing number of tools with each tool requiring multiple input paramaters each of which could modify its behaviour. A tool over the course of its lifetime may evolve into having multiple versions each leading to some variation in its output. Each tool may also require reference data (e.g. genomic or proteomic sequences) which itself evolves to have multiple releases and versions. Thus, in order to reproduce a certain bioinformatics analysis one needs to capture all the associated metadata (tool versions and parameters, reference versions) which is not always provided in the corresponding publications. The field (and EBI in particular) has worked on establishing standards requiring minimal metadata for reporting a number of experiments (MIAME [7], MIAPE [8]) but we need to go further and cover entire bioinformatics workflows. What is needed, in short, is a publicly accessible system for executable workflows which would capture all the relevant metadata and allow straightforward replication of the bioinformatics analysis.

There has been extensive work on systems for composing re-usable workflows and capturing the associated metadata (e.g. Taverna [9], Galaxy [10]) but technical limitations in the corresponding implementations have so far restricted their adoption. In particular, it is non-trivial to set up such a system on a local infrastructure. Furthermore, in the case of NGS the large data transfers required limit the usability of the most visible public service which supports creating and storing of such workflows (the public Galaxy service [11]). Thus, a system is needed that allows both straight-forward local deployment and public sharing and execution. It shall make it possible to attach a persistent DOI [12] to a bioinformatics workflow and refer to it in a publication so that other scientists are able to reproduce the results using either locally or a globally accessible computational resource. Fortunately, recent advances in system virtualisation [13] and in utility computing [14] make both goals feasible.

Plans

My immediate plans (when I have more time) would be to focus on two ideas:

(1) Use data at the EBI and NCBI (e.g. ENA/SRA, EGA, ArrayExpress) to produce test suites for automatic meta-analysis. I am particularly interested in functional genomics and epigenetics studies involving NGS data (e.g. RNA-Seq, Chip-seq) but the tools which would be developed are likely to be useful also for other types of studies.

(2) Develop in collaboration with other research institutions in the EU and elsewhere a platform for composing and sharing executable workflows which builds upon the latest advances of utility computing and improves upon existing projects such as Taverna and Galaxy. The focus of this system would be, again, on NGS data but the tools would likely be useful for other types of biomedical data.

Both ideas are, actually, synergistic. The automated meta-analysis would require building of a re-usable testware system which would exhibit the main features required by the platform for re-usable and share-able workflows. I imagine that both projects would build upon the same core software module.


[14] Armbrust, Michael, et al. "A view of cloud computing." Communications of the ACM 53.4 (2010): 50-58.