Saturday 28 December 2013

@BeforeClass and Parameterized JUnit tests

I have recently become aware that @BeforeClass does not work as expected in Parameterized JUnit tests. Namely, it is not executed before the parameterized tests are instantiated - in particular, the @Parameters method runs before it. A quick search on SO made me aware that others have noticed the issue too:

http://stackoverflow.com/questions/11430859/parameters-method-is-executed-before-beforeclass-method

http://stackoverflow.com/questions/11163890/with-junit-4-can-i-parameterize-beforeclass

What surprises me is that the solutions proposed on SO seem to miss the most obvious workaround - embedding the @BeforeClass logic in the @Parameters method. The latter is static and is executed only once, before any of the tests.

Here is an example.

I needed a JUnit test that validates all XML files in a particular directory against a schema stored in a particular XSD file. Ideally, the schema would be instantiated once and re-used for all of the individual tests. I tried to encapsulate the schema instantiation in a doSetup() method which I annotated with @BeforeClass. Unfortunately, I got a NullPointerException in each of the tests as the @BeforeClass method was, apparently, not called and the schema was therefore not instantiated. Calling the doSetup() method from the @Parameters method data() did the job:
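Here is a minimal sketch of the approach (the schema file name and the XML directory are placeholders, not my actual paths):

import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class XmlValidationTest {

    private static Schema schema;
    private final File xmlFile;

    public XmlValidationTest(File xmlFile) {
        this.xmlFile = xmlFile;
    }

    // the set-up originally annotated with @BeforeClass
    private static void doSetup() throws Exception {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schema = factory.newSchema(new File("schema.xsd"));
    }

    // @Parameters is static and runs once, before any test is created,
    // so calling doSetup() here guarantees the schema is ready for all tests
    @Parameters
    public static Collection<Object[]> data() throws Exception {
        doSetup();
        List<Object[]> params = new ArrayList<Object[]>();
        // assumes the directory exists and contains the XML files to validate
        for (File f : new File("xml-dir").listFiles()) {
            params.add(new Object[] { f });
        }
        return params;
    }

    @Test
    public void validatesAgainstSchema() throws Exception {
        schema.newValidator().validate(new StreamSource(xmlFile));
    }
}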

Sunday 20 October 2013

My bioinformatics interests and plans

I intended to keep this blog focused on technical topics, but in this post I'll extend the scope a bit and write about my current research interests in bioinformatics and how they relate to software engineering. They are not necessarily the ones I spend most of my working time on, as my first priority is to provide computational and statistical support to the ongoing research of other scientists. But they are closely related and are the ones that keep me busy when I find some time free of other commitments. There are two topics that I am particularly interested in: robustness and reproducibility of bioinformatics analysis.

Robustness

Bioinformatics has experienced dramatic growth over the last 15 years. Rapidly evolving experimental data and the adoption of new technologies drive an equally rapid evolution of the computational methods. On many occasions this speed of development comes at the expense of software engineering rigour. As a result, bugs creep in and code that used to work a day ago no longer does. The results of a bioinformatics analysis can change significantly from one version of a tool to the next. I have experienced this myself on a number of occasions while using third-party tools to analyze data from genomic experiments. The consequence is increased uncertainty about the validity of results.

I believe that this situation can be significantly improved if we borrow from the field of software engineering the tools and techniques which have been developed over the last decade to maintain the quality and robustness of software over its entire life cycle (e.g. unit tests [1][2], static code analysis [3][4], continuous integration [5][6]). A fortunate circumstance is that over the same period public repositories of biological data (e.g. at EBI and NCBI) have accumulated a vast amount of experimental data. This opens up exciting opportunities. As an example, data from a large number of independently produced studies can be re-used to populate unit tests for many of the popular bioinformatics tools. Executing such unit tests, either ad hoc or as part of an automatic routine, would help us identify situations where different versions of the same tool produce discrepant results over a large number of studies. Such automatic meta-analysis would increase the reliability of biological studies and would allow us to focus on studies and results where re-interpretation may be necessary.

Reproducibility

Bioinformatics workflows are becoming increasingly sophisticated and varied. They tend to apply in succession a growing number of tools, with each tool requiring multiple input parameters, each of which can modify its behaviour. Over the course of its lifetime a tool may evolve into multiple versions, each leading to some variation in its output. Each tool may also require reference data (e.g. genomic or proteomic sequences) which itself evolves through multiple releases and versions. Thus, in order to reproduce a certain bioinformatics analysis one needs to capture all the associated metadata (tool versions and parameters, reference versions), which is not always provided in the corresponding publications. The field (and EBI in particular) has worked on establishing standards specifying the minimal metadata required for reporting a number of experiment types (MIAME [7], MIAPE [8]), but we need to go further and cover entire bioinformatics workflows. What is needed, in short, is a publicly accessible system for executable workflows which would capture all the relevant metadata and allow straightforward replication of the bioinformatics analysis.

There has been extensive work on systems for composing re-usable workflows and capturing the associated metadata (e.g. Taverna [9], Galaxy [10]), but technical limitations in the corresponding implementations have so far restricted their adoption. In particular, it is non-trivial to set up such a system on a local infrastructure. Furthermore, in the case of NGS the large data transfers required limit the usability of the most visible public service which supports creating and storing such workflows (the public Galaxy service [11]). Thus, a system is needed that allows both straightforward local deployment and public sharing and execution. It should make it possible to attach a persistent DOI [12] to a bioinformatics workflow and refer to it in a publication, so that other scientists are able to reproduce the results using either a local or a globally accessible computational resource. Fortunately, recent advances in system virtualisation [13] and in utility computing [14] make both goals feasible.

Plans

My immediate plans (when I have more time) are to focus on two ideas:

(1) Use data at the EBI and NCBI (e.g. ENA/SRA, EGA, ArrayExpress) to produce test suites for automatic meta-analysis. I am particularly interested in functional genomics and epigenetics studies involving NGS data (e.g. RNA-Seq, ChIP-seq), but the tools which would be developed are likely to be useful for other types of studies as well.

(2) Develop, in collaboration with other research institutions in the EU and elsewhere, a platform for composing and sharing executable workflows which builds upon the latest advances in utility computing and improves upon existing projects such as Taverna and Galaxy. The focus of this system would be, again, on NGS data, but the tools would likely be useful for other types of biomedical data.

The two ideas are, in fact, synergistic. The automated meta-analysis would require building a re-usable testware system which would exhibit the main features required by the platform for re-usable and shareable workflows. I imagine that both projects would build upon the same core software module.


[14] Armbrust, Michael, et al. "A view of cloud computing." Communications of the ACM 53.4 (2010): 50-58.

Thursday 26 September 2013

Generating an XSD schema from an XML file using the free community version of IntelliJ

I have been working with XML files produced by a third-party tool which had almost no documentation. I wanted to find a way to parse them automatically in Java, and it seemed that this involves getting or re-creating the XSD schema. There is a thread on Stackoverflow where multiple possible solutions are suggested. Unfortunately, the thread is closed for further contributions, so I am going to describe my solution here.

I stumbled almost by chance on the XML toolset in IntelliJ which includes such functionality. My solution involved the following simple steps:

1. Open a (hopefully representative) XML file in IntelliJ 

I used the free community version but I presume that the same functionality is included in the commercial version, too.

2. Go to "Tools"/"XML actions"/"Generate XSD schema from XML file"


I used the default values for all parameters apart from "Detect enumerations limit" which I changed from 10 to 2:



3. (optional) Verify the resulting XSD schema

The resulting XSD schema seems OK, but optional elements may cause issues: if the XML file you have opened is missing them, then the XSD schema will miss them, too. A better solution would be to create the XSD schema based on a set of files (I have dozens) rather than just on a single file. Unfortunately, I know of no tool that does that. Instead, I decided to verify the schema using the following method (based on Grzegorz Szpetkowski's excellent post on Stackoverflow):
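Here is a minimal sketch of such a validation check, using the standard javax.xml.validation API (the schema file name and the directory with the XML files are placeholders):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class XsdValidator {

    public static void main(String[] args) throws Exception {
        // placeholder file names - replace with your own schema and XML directory
        File xsd = new File("schema.xsd");
        File[] xmlFiles = new File("xml-dir").listFiles();

        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd);
        Validator validator = schema.newValidator();

        for (File xml : xmlFiles) {
            try {
                validator.validate(new StreamSource(xml));
                System.out.println(xml.getName() + " is valid");
            } catch (SAXException e) {
                System.out.println(xml.getName() + " is NOT valid: " + e.getMessage());
            }
        }
    }
}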


4. (optional) Add optional elements to the XSD schema and adjust the data types

It turned out that a few of the other XML files did, indeed, contain optional elements that were not part of the XSD schema (i.e. they failed the XML validation), so I added them manually. Another manual change was setting the data types of the elements. I initially tried running the IntelliJ wizard with automatic data type detection enabled ("Detect simple content type" set to "smart") but it did not work well. So in step 2 I used the "dumb" option of setting all data types to "string". Once I had the XSD schema I then set the data types manually.
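For illustration, the manual tweaks amount to something like this (the element names are made up, not from my actual schema):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <!-- added manually: an optional element missing from the sample XML file -->
        <xs:element name="comment" type="xs:string" minOccurs="0"/>
        <!-- data type adjusted manually from xs:string to xs:integer -->
        <xs:element name="pageCount" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>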

Conclusion

The above is a relatively simple recipe - it takes about 20 minutes. Still, I am somewhat disappointed that currently no tool seems to be able to generate an XSD schema based on multiple files - such functionality would cut the manual effort by half and would make the recipe more robust.

Tuesday 24 September 2013

Storing JPQL queries separately from the code of the JPA2 entities

I was recently interested in visually creating JPA2 entities, in a way similar to creating database tables from an ER diagram. It turns out this is not well supported by the tools I tried, but I found a workaround which involved exporting all the tables from a database. The solution had, however, the unintended side effect of overwriting those existing JPA2 entities which matched a table, and this removed, among other things, the named queries I had already written. This happened because I had defined the queries in the same files in which I had stored the code of the JPA2 entities. This is how it is mostly done and I had simply followed the mainstream. I remember finding this solution rather inelegant, as the queries tend to involve more than one JPA2 entity - so which one do you assign a query to?

It turns out I was not the only one asking this question, and there is a very nice solution:

Storing the query text in a separate XML mapping file(s)

I base my solution with minor changes on Arjan's excellent post. There are, essentially, two steps:

1. Define a separate XML mapping file(s) in persistence.xml

The relevant line is

<mapping-file>META-INF/jpql_queries.xml</mapping-file>



It refers to the file containing the named queries. It is placed in the META-INF folder, which is where persistence.xml is located, too. I initially thought that since both persistence.xml and the query file (jpql_queries.xml) are in the same folder, I could use

<mapping-file>jpql_queries.xml</mapping-file>

but that is wrong - then the query file is not picked up at build time.
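For reference, here is a minimal sketch of where the element sits in persistence.xml (the persistence-unit name is a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence" version="2.0">
    <persistence-unit name="myPU">
        <!-- the mapping file with the named queries, resolved relative to the classpath root -->
        <mapping-file>META-INF/jpql_queries.xml</mapping-file>
        <!-- entity classes, data source and provider properties omitted -->
    </persistence-unit>
</persistence>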

2. Store the named queries in the respective file

The syntax of the queries can be the same as when the named queries are stored in the file containing the JPA2 entity. In my case, I decided to follow the advice in Mkyong's excellent post, namely to wrap the query text in CDATA so that the XML parser does not raise errors for special XML characters like '>' and '<'. For completeness' sake I am including the query file as well:
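Here is a minimal sketch of such a mapping file (the query name, entity and fields are placeholders, not my actual ones):

<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings xmlns="http://java.sun.com/xml/ns/persistence/orm"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://java.sun.com/xml/ns/persistence/orm
                                     http://java.sun.com/xml/ns/persistence/orm_2_0.xsd"
                 version="2.0">

    <!-- the CDATA wrapper protects the '>' character from the XML parser -->
    <named-query name="Document.findLongerThan">
        <query><![CDATA[
            SELECT d FROM Document d WHERE d.pageCount > :minPages
        ]]></query>
    </named-query>

</entity-mappings>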

Monday 23 September 2013

Generating JPA2 entities from database tables

Problem: How to design JPA2 entities in a visual way

I have been looking for a tool which allows defining JPA2 entities visually in a way similar to the various tools which generate SQL DDL script from a diagram of an entity-relationship model. I've tried various free tools including the Diagram Editor which is part of the Eclipse Dali project but all were rather flaky. So I decided to go a slightly indirect way.

Solution: Design the entities in an ERD tool, export them to the database and import them back from the database as JPA2 entities

Pre-requisites: MySQL Workbench, MySQL database, Eclipse (this solution was tested using Eclipse Java EE IDE for Web Developers, Juno Service Release 2), Eclipse Database connection, Eclipse JPA facet, Hibernate

Step 1. Creating the database tables


I created an ERD diagram, exported the DDL script and created the tables, all within MySQL Workbench. The details (as well as the pre-requisites, such as setting up a MySQL database) are beyond the scope of this post - I would just mention that no particular tweaking or coding is required. Also, it should be possible to use any other JPA2-compliant database (e.g. Oracle).

Step 2. Importing the JPA2 entities

Pre-requisites for this step are the Eclipse database connection, the Eclipse JPA facet and Hibernate (see the list above).


I imported the tables as JPA2 entities using the functionality included in the JPA facet of Eclipse:

2.1. Click on the project and go to "JPA Tools"/"Generate Entities from Tables":


2.2. Set up the name of the output package (in my case "iw.pdfEx.persistence"):


Clicking on "Finish" starts the import from the database - this may take some time depending on the connection.

2.3. Remove the catalog entries from the generated Java code

The previous step generates a Java class (JPA2 entity) for each database table in the chosen package. It also automatically adds all the classes to persistence.xml.

It all sounded well, but in my case I was getting, for each JPA entity Y, an error "catalog X cannot be resolved for table Y", where X is the name of the database schema. I have no idea why, as both the database connection and the persistence unit seem to be set up correctly. I found a quick workaround - I manually removed the catalog entry from each entity. The error disappeared - voila!
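In other words, the change amounts to something like this (the entity, table and catalog names here are just placeholders):

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.Table;

// The generated code contained @Table(name = "document", catalog = "mydb");
// removing the catalog attribute made the error go away.
@Entity
@Table(name = "document")
public class Document implements Serializable {
    private static final long serialVersionUID = 1L;
    // fields, getters and setters omitted
}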

Some more troubleshooting

There is one more gotcha, however. Eclipse overwrites all JPA entities which already exist (I could not find a way of importing only some of the tables). In my case I already had two JPA2 entities. I had added some simple logic to some of the getter/setter methods, and I had to recover it from the older git versions of the corresponding entities. Not nice. Also, all the named queries which I had defined were gone. Again, it was relatively easy to re-create them from the older git version. This prompted me to look for a way to define the JPQL queries separately from the code of the JPA2 entities. Fortunately, I found an easy way to do it - it is the subject of a separate post.

Conclusion

All in all, the above is a relatively easy and quick (generating the Java code from the ER diagram takes a few minutes) but somewhat dirty way of visually designing JPA2 entities. It has been possible to visually design database models for more than 15 years, so I was really surprised that such basic functionality is not readily provided by any of the tools I tried. Hopefully, Eclipse and the other JPA tools will catch up soon.

Monday 1 July 2013

How much code documentation is sufficient? Can we unit-test the documentation?


I stumbled today upon a blog post which states that one page of external documentation (i.e. not counting in-code comments) is sufficient. I can see the appeal of such an idea but I find it somewhat simplistic. I recently had a project at the end of which we ended up writing over 30 A4 pages of documentation covering 10 man-months of coding. There were some graphs and diagrams, so the text was probably only slightly more than 20 pages or, roughly, about a page for every two man-weeks of coding.

Is this too much or too little? 


It depends. For most of the developers who wrote it this was a rather tedious task. But they happened to be internship students who were about to leave our team. The quality of the code was not stellar, as can be expected from programmers with little experience, but, more importantly, there was discontinuity - anyone working on the codebase later would not be able to ask the authors directly. Thus, in my opinion, the volume of the documentation - although it took several man-days to produce - was not really excessive.

But the question regarding the quality and quantity of documentation is a rather deep one and cannot be easily answered with a simplistic metric like the number of pages. The problem is that currently we have no objective metrics regarding the completeness of the documentation (i.e. its "code coverage"), nor regarding its quality (i.e. how easy it is to understand and whether it correctly describes what the code does). These are rather involved issues, and I don't expect the community to come up anytime soon with a widely agreed metric (e.g. number of words per line of code) that would resolve them in an objective manner (*) and that could be verified automatically, similarly to the way static code analysis is done by various tools such as those included in SonarQube.

Why not test the documentation?


These days we assume that we should test all or most of the code we deliver. Why not do the same for the documentation? I suggest that we combine the concept of testing with another familiar concept - that of pair programming. What I mean is that we could go for peer-documenting. For example, programmer A writes the documentation for the code written by programmer B. Then the documentation produced by A is tested by giving it to another team member, C, who has to assess its completeness and clarity. If C needs to ask any questions, then the documentation is not sufficient in quantity or quality (i.e. it does not pass the test), so A would have to amend it until C is happy with it.

I reckon there will be some resistance, as developers find writing documentation boring. But I remember that a while ago not everyone was happy with the idea that we should unit-test most of the code. Nowadays unit-testing is considered good practice and most programmers do it. I think that documentation testing is likely to prove its worth, too. I also think that the combination of testing and peer-documenting would inject even more rigour into the process.


===========

(*) The problem is somewhat related to the completeness or clarity of a mathematical proof. At university, some professors thoroughly wrote down every step of a derivation, while others gave just proof sketches, leaving everything else "as an exercise to the reader". Needless to say, some of these exercises were anything but trivial, and I found such hand-waving occasionally quite frustrating.

Thursday 2 May 2013

Including (or referring to) the git revision in your maven build

Problem

I am writing customized testware for one of my personal projects (more on it in another post). The testware does a few things and at the end stores the test results in a relational database. I wanted to be able to link each database record to the specific git commit from which the code under test was produced. I have used a similar technique with svn in the past but did not know how to do this with git. I also wanted to automate it using my CI pipeline (Maven, Jenkins, Sonar).

Solution

After a bit of googling I found Mattias Severson's excellent post. He explains how to inject the git SHA-1 hash into the Maven ${buildNumber} variable. He then lists a few ways in which this variable can be propagated in the build (e.g. manifest entry, property file, static method call, etc.). I am not going to repeat any of these implementations here - you can check out the "Advertise the Build Number" section of his post. I would just like to add to his list another implementation, which I decided to use in this particular case:

Injecting the git revision in a VM parameter for the maven surefire plugin

The JUnit tests are run by the Maven surefire plugin. It spawns its own Java VM, and you can pass a custom parameter to that VM in the form -Dparam.name=param.value. The "trick" is to link the ${buildNumber} variable to the VM parameter. This can be done in the pom.xml in this way:
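A minimal sketch of the relevant surefire fragment, assuming ${buildNumber} has already been populated (e.g. by the buildnumber-maven-plugin, as described in Mattias Severson's post):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <!-- pass the git SHA-1 (stored in ${buildNumber}) to the forked test VM -->
        <argLine>-Dgit.version=${buildNumber}</argLine>
    </configuration>
</plugin>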

The VM parameter can then be read from the JUnit tests at execution time:

System.getProperty("git.version")
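For illustration, a minimal sketch of how a test could pick it up (the default value is just a convenience for running the test outside Maven, e.g. from the IDE):

import org.junit.Test;

public class GitRevisionTest {

    @Test
    public void recordsGitRevision() {
        // "unknown" is returned when the property has not been set by surefire
        String gitRevision = System.getProperty("git.version", "unknown");
        // ... store gitRevision together with the test results ...
        System.out.println("Code under test built from commit: " + gitRevision);
    }
}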

That's it. For completeness' sake I am listing below my entire pom.xml: