Friday 24 July 2015

Moving to Jekyll & Github

I'm in the process of moving this blog to a new site. I felt somewhat constrained by the layout options at Blogspot and decided to move to Jekyll and host my blog on Github. This blog will stay online but won't be updated anymore - just click on the link above and check out the new version of this blog!

Tuesday 13 January 2015

Handling and Deploying credentials in Rails 4

Every web application uses credentials of some sort - e.g. to access a database or third-party services such as Amazon Web Services, or email-processing applications such as Mandrill, MailChimp or SendGrid. The consensus is that it is bad practice to check these into code repositories. All workflows are variations on storing these credentials in the local file system (either as configuration files or in files used to set up environment variables). In Rails there is an ecosystem of solutions created to make this process as smooth as possible. Just to mention a few:

- the dotenv gem
- the figaro gem
- the rbenv-vars plugin

I looked at each of these and at a number of other ad-hoc workflows based on, more or less, the same ideas.

They would solve the problem but I found each to be somewhat inelegant. I was looking for a solution that makes both reading in the credentials and their deployment as smooth as possible. So I came up with the following workflow:

1. Create config/secrets.yml

Since Rails 4.1, secrets.yml has been the "official" container for sensitive data.
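
For illustration, such a file could look roughly like this - Rails generates the secret_key_base entries for you, while the service keys below are just placeholder examples of the kind of credentials one would add:

development:
  secret_key_base: 09a8f...
  aws_access_key_id: my-dev-key
  aws_secret_access_key: my-dev-secret

production:
  secret_key_base: 77bc1...
  aws_access_key_id: my-production-key
  aws_secret_access_key: my-production-secret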

2. Put it in .gitignore

Surprisingly, this is not yet done by default when you start from a scaffold.

3. Use econfig to read in the credentials in the application

The econfig link above explains the changes that need to be made in the code base - and they are really minimal.

In essence, econfig reads in by default a number of files that might contain credentials - including config/secrets.yml. It makes it possible to refer to these in the application as
MyApp.config.credential
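
The wiring is roughly the following - a sketch based on my reading of the econfig README (the gem and constant names are econfig's, the application name is made up), so double-check it against the current documentation:

# Gemfile
gem "econfig", require: "econfig/rails"

# config/application.rb
module MyApp
  class Application < Rails::Application
    # ...
  end

  # adds MyApp.config.some_key, which looks the key up in
  # config/secrets.yml (among the other locations econfig checks)
  extend Econfig::Shortcut
end

With that in place, MyApp.config.some_key should return the corresponding value from secrets.yml (see the econfig README for the exact lookup rules).
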
4. Use capistrano-secrets-yml to deploy config/secrets.yml in production

This workflow is really minimal - the two gems complement each other nicely and make the process smooth. I tested it and I am happy with it.

Monday 29 December 2014

"cap production deploy" can erroneously deploy in development

After running "cap production deploy" I had a mysterious case of "Action Controller Exception: Mysql2::Error Can't connect to MySQL server on '127.0.0.1' (111)". The settings in database.yml were correct. I could verify that the database was, indeed, accessible by using these settings in MySQL Workbench to ssh tunnel to the database on the VPS from my laptop. So what was going on?

It turned out that the application had been deployed in development, not production, despite the Capistrano command and the settings in deploy/production.rb. And in database.yml I had incorrect settings in the development block - they were supposed to work only on my laptop, not on the VPS.

It was a great relief to find the explanation of that mysterious error. But how could it be that "cap production deploy" produced a deployment in the development environment rather than in production? I am not 100% certain, as I changed a number of things while trying to find the culprit, but the most likely reason was the following line in /etc/nginx/nginx.conf:


server {
    rack_env           development;

Once I changed that setting and restarted nginx, the problem disappeared. I am relieved but I still find the whole thing a bit troubling. Somehow I cannot escape the feeling that the Rails stack is really messy and fragile - you can specify the same setting (in this case the environment) in a number of different places (in fact in different layers of the stack), which invites inconsistency and unexpected behaviour. God knows what other settings are currently overridden but haven't yet manifested themselves in an error. This is a bit frightening.

Anyway - you have to live with it if you want to use Rails. For those who suspect a similar problem in their deployment:

The best way to tell which environment your application is running in is to look at the logs (/home/deploy_user/my_app/shared/log). In my case the file development.log had a current timestamp while production.log was stale (a few weeks old). That sort of nailed it. However, the situation could be a bit more complicated if you are running different instances of the application on different ports - they could run in different environments. Then you need to look at the logs and find which one contains the error message that you are getting via the browser.
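
Assuming the directory layout mentioned above, something along these lines shows quickly which log is being written to:

$ ls -lt /home/deploy_user/my_app/shared/log/
$ tail -f /home/deploy_user/my_app/shared/log/development.log

The first command sorts the logs by modification time; the second lets you watch the live one while you reproduce the error in the browser.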

Saturday 20 September 2014

Sorting out 403 and "directory index of * is forbidden" in an nginx deployment of a Rails application

I deployed a Rails 4 application on a Digital Ocean Ubuntu instance using Capistrano, Nginx and Passenger by following this recipe.

I set up the server block to listen on a non-standard port:


server {

        listen                        3001;
        server_name            178.62.17.94;
        root                          /home/deploy/appmate/current/public;
        passenger_enabled  on;
        passenger_ruby       /home/deploy/.rvm/gems/ruby-2.1.2/wrappers/ruby;
        rails_env production;

       }


This worked okay.

However, when I tried to set the same application on port 80 I stumbled upon HTTP 403.

The server block for port 80 was set up automatically and I just added/changed the root and the passenger/rails lines. I left the other original content there:

server {
        listen 80 default_server;
        listen [::]:80 default_server ipv6only=on;

        passenger_enabled on;
        passenger_ruby     /home/deploy/.rvm/gems/ruby-2.1.2/wrappers/ruby;
        rails_env production;

        root   /home/deploy/appmate/current/public;
        ##root /usr/share/nginx/html;
        ##index index.html index.htm;

        # Make site accessible from http://localhost/
        server_name localhost;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
               try_files $uri $uri/ =404;
                # Uncomment to enable naxsi on this location
                # include /etc/nginx/naxsi.rules
        }

}

The error message I found in

/var/log/nginx/error.log

was the same one I had found earlier: "directory index of "/home/deploy/appmate/current/public/" is forbidden". Back then I resolved the issue by changing the privileges of the parent directories. So there was no reason I should have to do that again!

With a bit of googling I found the culprit in the location block. Following this recipe, I removed $uri/ from the try_files directive and started getting 404 instead of 403. When I removed the whole location block, all error messages disappeared.
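
Putting it together, the port 80 server block ended up essentially as the sketch below - the same Passenger directives as the port 3001 block, with the location block and the commented-out defaults gone:

server {
        listen 80 default_server;
        listen [::]:80 default_server ipv6only=on;

        server_name localhost;
        root /home/deploy/appmate/current/public;

        passenger_enabled on;
        passenger_ruby /home/deploy/.rvm/gems/ruby-2.1.2/wrappers/ruby;
        rails_env production;
}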

Monday 11 August 2014

Deploying Rails application on Digital Ocean Ubuntu VPS using Capistrano, Passenger and Nginx

As part of a New Year's resolution I started learning Ruby on Rails (it had been more than 5 years since I last picked up a new programming language!). Ruby and Rails are really fun to learn, especially if you start with a toy application. But I found Rails deployment to be a bit more involved than Java deployment. I don't mean using a platform such as Heroku - that really takes all the grunt work out of it. I mean deploying Rails while keeping control over the entire stack - i.e. starting from bare metal.


Deploying Rails


There are tons of tutorials on this subject but the clearest I could find is the one written by Chris Oliver. I followed it quite literally. I chose Ubuntu 12.04 and rvm, and used the most recent versions of the other packages available from the Ubuntu 12.04 repositories:

nginx/1.6.0
mysql/5.5.38-0ubuntu0.12.04.1
Phusion Passenger version 4.0.48
*** LOCAL GEMS ***
bigdecimal (1.2.4)
bundler (1.6.2)
bundler-unload (1.0.2)
executable-hooks (1.3.2)
gem-wrappers (1.2.4)
io-console (0.4.2)
json (1.8.1)
minitest (4.7.5)
psych (2.0.5)
rake (10.1.0)
rdoc (4.1.0)
rubygems-bundler (1.4.4)
rvm (1.11.3.9)
test-unit (2.1.2.0)

On the dev machine I installed, among others,

capistrano (3.2.1)

Despite following the tutorial to the letter I was getting HTTP 500 after deployment.


Setting the correct path for passenger_ruby


I posted a question on Chris's blog and he kindly suggested that Nginx was the likely culprit. After inspecting

/var/log/nginx/error.log

I found the following error message

You've set the `PassengerRuby` (Apache) or `passenger_ruby` (Nginx) option to '/home/deploy/.rvm/rubies/ruby-2.1.2/bin/ruby'. However, because you are using RVM, this is not allowed: the option must point to an RVM wrapper script, not a raw Ruby binary. This is because RVM is implemented through various environment variables, which are set through the wrapper script. 

To find out the correct value for `PassengerRuby`/`passenger_ruby`, please read:
  https://www.phusionpassenger.com/documentation/Users%20guide%20Apache.html#PassengerRuby
  https://www.phusionpassenger.com/documentation/Users%20guide%20Nginx.html#PassengerRuby

The error message is handily descriptive and checking the above links makes it clear that I had to execute

$ which passenger-config
to find the path to the passenger-config executable. Running it with --ruby-command produced:
deploy@drop1:~$ /usr/bin/passenger-config --ruby-command

passenger-config was invoked through the following Ruby interpreter:
  Command: /home/deploy/.rvm/gems/ruby-2.1.2/wrappers/ruby
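
So the fix was to point the passenger_ruby directive in the nginx server block at that wrapper script:

passenger_ruby /home/deploy/.rvm/gems/ruby-2.1.2/wrappers/ruby;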

Relaxing the privileges for the parent directories of the app 

Fixing the ruby path made a difference - I started getting HTTP 404 instead of HTTP 500. Checking the log again I found
directory index of {public directory of the app} is forbidden
After a bit of googling it turned out that relaxing the privileges on ALL parent directories is what is needed. I executed
$ sudo chmod g+x,o+x <directory>
for each of the parent directories. Then restarting nginx eliminated the problem. Voila!
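
If you don't fancy doing that by hand, a small loop walks up from the public directory and relaxes each parent in turn (the path here is just an example - use your own deployment root):

$ d=/home/deploy/appmate/current/public
$ while [ "$d" != "/" ]; do sudo chmod g+x,o+x "$d"; d=$(dirname "$d"); done

The loop stops before touching / itself.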

Saturday 28 December 2013

@BeforeClass and Parameterized JUnit tests

I have recently become aware that @BeforeClass does not work as expected in parameterized JUnit tests. Namely, it is not executed before the parameterized test instances are created - the @Parameters method runs first. A quick search on SO made me aware that others have taken notice of the issue, too:

http://stackoverflow.com/questions/11430859/parameters-method-is-executed-before-beforeclass-method

http://stackoverflow.com/questions/11163890/with-junit-4-can-i-parameterize-beforeclass.

What surprises me is that the solutions proposed on SO seem to miss the most obvious workaround - embedding the @BeforeClass logic in the @Parameters method. The latter is static and is executed only once, before any of the tests.

Here is an example.

I needed a JUnit test that validates all XML files in a particular directory against a schema stored in a particular XSD file. It would be best if the schema were instantiated once and re-used for all of the individual tests. I tried to encapsulate the schema instantiation in a doSetup() method which I annotated as @BeforeClass. Unfortunately, I got a NullPointerException in each of the tests, as the @BeforeClass method was apparently not called and the schema was therefore not instantiated. Calling the doSetup() method from the @Parameters method data() did the job:
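
A sketch of the idea (class, method and path names here are illustrative rather than my exact ones):

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

@RunWith(Parameterized.class)
public class XmlCorpusValidationTest {

    // instantiated once, shared by all parameterized test instances
    private static Schema schema;

    private final File xmlFile;

    public XmlCorpusValidationTest(File xmlFile) {
        this.xmlFile = xmlFile;
    }

    // the logic that would normally live in a @BeforeClass method
    private static void doSetup() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schema = factory.newSchema(new File("src/test/resources/my-schema.xsd"));
    }

    @Parameters
    public static Collection<Object[]> data() throws Exception {
        doSetup(); // static and executed exactly once, before any test instance exists

        // collect one parameter set per XML file (assumes the directory exists)
        List<Object[]> files = new ArrayList<Object[]>();
        for (File f : new File("src/test/resources/xml").listFiles()) {
            if (f.getName().endsWith(".xml")) {
                files.add(new Object[] { f });
            }
        }
        return files;
    }

    @Test
    public void validatesAgainstSchema() throws Exception {
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(xmlFile)); // throws SAXException for invalid documents
    }
}

The schema lives in a static field, gets created exactly once inside data(), and every generated test instance simply validates its own file against it.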

Sunday 20 October 2013

My bioinformatics interests and plans

I intended to keep this blog focused on technical topics but in this post I'll extend the scope a bit and write about my current research interests in bioinformatics and how they relate to software engineering. They are not necessarily the ones I spend most of my working time on, as my first priority is to provide computational and statistical support to the ongoing research of other scientists. But they are closely related and are the ones that keep me busy when I find some time free of other commitments. There are two topics that I am particularly interested in – robustness and reproducibility of bioinformatics analysis.

Robustness

Bioinformatics has experienced dramatic growth over the last 15 years. The rapidly evolving experimental data and the adoption of new technologies drive the rapid evolution of the computational methods. The fast pace of development often comes at the expense of software engineering rigour. As a result, bugs creep in and code that used to work a day ago no longer does. The results of a bioinformatics analysis can change significantly from one version of a tool to the next. I have experienced this myself on a number of occasions while using third-party tools to analyze data from genomic experiments. The consequence is increased uncertainty about the validity of results.

I believe that this situation can be significantly improved if we borrow from software engineering the tools and techniques which have been developed over the last decade to maintain the quality and robustness of code over its entire life cycle (e.g. unit tests [1][2], static code analysis [3][4], continuous integration [5][6]). A fortunate circumstance is that over the same period the public repositories of biological data (e.g. at EBI and NCBI) accumulated a vast amount of experimental data. This opens up exciting opportunities. As an example, data from a large number of independently produced studies can be re-used to populate unit tests for many of the popular bioinformatics tools. Executing such unit tests, either ad hoc or as part of an automatic routine, would help us identify situations where different versions of the same tool produce discrepant results over a large number of studies. Such automatic meta-analysis would increase the reliability of biological studies and would allow us to focus on studies and results where re-interpretation may be necessary.

Reproducibility

Bioinformatics workflows are becoming increasingly sophisticated and varied. They tend to apply in succession a growing number of tools, with each tool requiring multiple input parameters, each of which can modify its behaviour. Over the course of its lifetime a tool may evolve into multiple versions, each leading to some variation in its output. Each tool may also require reference data (e.g. genomic or proteomic sequences) which itself evolves through multiple releases and versions. Thus, in order to reproduce a certain bioinformatics analysis one needs to capture all the associated metadata (tool versions and parameters, reference versions), which is not always provided in the corresponding publications. The field (and EBI in particular) has worked on establishing standards that define the minimal metadata for reporting a number of experiment types (MIAME [7], MIAPE [8]), but we need to go further and cover entire bioinformatics workflows. What is needed, in short, is a publicly accessible system for executable workflows which would capture all the relevant metadata and allow straightforward replication of the bioinformatics analysis.

There has been extensive work on systems for composing re-usable workflows and capturing the associated metadata (e.g. Taverna [9], Galaxy [10]), but technical limitations in the corresponding implementations have so far restricted their adoption. In particular, it is non-trivial to set up such a system on local infrastructure. Furthermore, in the case of NGS the large data transfers required limit the usability of the most visible public service which supports creating and storing such workflows (the public Galaxy service [11]). Thus, a system is needed that allows both straightforward local deployment and public sharing and execution. It should make it possible to attach a persistent DOI [12] to a bioinformatics workflow and refer to it in a publication, so that other scientists are able to reproduce the results using either a local or a globally accessible computational resource. Fortunately, recent advances in system virtualisation [13] and in utility computing [14] make both goals feasible.

Plans

My immediate plans (when I have more time) would be to focus on two ideas:

(1) Use data at the EBI and NCBI (e.g. ENA/SRA, EGA, ArrayExpress) to produce test suites for automatic meta-analysis. I am particularly interested in functional genomics and epigenetics studies involving NGS data (e.g. RNA-Seq, ChIP-seq), but the tools developed are likely to be useful for other types of studies as well.

(2) Develop, in collaboration with other research institutions in the EU and elsewhere, a platform for composing and sharing executable workflows which builds upon the latest advances in utility computing and improves upon existing projects such as Taverna and Galaxy. The focus of this system would be, again, on NGS data, but the tools would likely be useful for other types of biomedical data.

Both ideas are, actually, synergistic. The automated meta-analysis would require building a re-usable testware system which would exhibit the main features required by the platform for re-usable and shareable workflows. I imagine that both projects would build upon the same core software module.


[14] Armbrust, Michael, et al. "A view of cloud computing." Communications of the ACM 53.4 (2010): 50-58.