Wednesday 27 July 2011

Batch extracting GenBank data from journal articles

Repeatability, one of the central tenets of science. In theory, any published study should be repeatable. But, if anyone actually wants to do this, it's not always as straightforward as it sounds.

Molecular people do have it easy in comparison to say ecologists—we use GenBank, the Web based repository of all things DNA. Simple, just want to re-analyse some data, grab it from GenBank. Not so easy*, and here's an example.

Take the recent cypriniform relationships proposed by Mayden and Chen (2010). Being as polite as possible, their results are "interesting" to say the least. Lets assume we want to get our grubby hands on this data and see whether their conclusions have any meaningful support. They used six genes, and a table listing the accession numbers is presented:


These data are not all generated in this study though, so it makes it tricky to access them from the GenBank Web site. Copying and pasting each of them into GenBank is just not an option, so here's a hack that might just save you some time:

(1) Copy and paste the whole table from the pdf into a text file. Save the text file (e.g. as "input.txt").

(2) Open a terminal session, cd to the directory, and copy this command:

grep -o '[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]' input.txt | sed -e ':a;N;$!ba;s/\n/", "/g' -e 's/^/acc <- c("/g' -e 's/$/")/g' > output.txt

Eugh, that looks horrible, but what it does is create a ready made vector (called "acc") of accession numbers, which can be copied straight into R. Now, we assume a few things first though: that the table copied perfectly, and the fonts are compatible between pdf and txt; that the accessions are common eight digit GenBank accessions—some of the older GenBank accessions may be six digits (the regex can be modified though).

(3) Now, copy the output straight into R. We could use ape's read.GenBank function, but we would like to access the gene names too**, so we will use Samuel Brown's read.GB function instead.

Run the following R code to download the data and add taxon labels:

dat <- read.GB(acc)
names(dat) <- paste(acc, "|", attr(dat, "gene"), sep="")

(4) Now, this dumps us with all of the data in one vector, but we really want to analyse the seperate loci (e.g. rhodopsin). No worries, grep comes to the rescue again by grabbing all the sequences with "rhodopsin" in the gene attribute:

rho <- grep("rhodopsin", names(dat))
rhoSeqs <- dat[rho,]

The names ascribed to genes in GenBank from different studies may not be consistent, so make sure you check that your identifying phrase will work as expected. Now all you need to do is write this into a fasta file, align it, and you're away.

write.dna(rhoSeqs, file="mayden2010_rho.fas", format="fasta", colw=10000)

---------------------------------------------------------------------------------

* It really should be mandatory that researchers use services such as TreeBASE to upload their alignments.

** read.GenBank in ape 2.7 now has an optional "gene" attribute, but I couldn't get it to work ...

Thursday 21 July 2011

Importing pdf R plots into Inkscape: the "q" problem

I know some people like to do absolutely everything in R (it's like an admission of failure if they can't), but my approach is somewhat more pragmatic; I like to get things done as quickly and as easily as possible.

As is my usual way, I get most of my plot basics done in R, then touch them up and move things around afterwards in Inkscape. This really does make life simple, but recently I came unstuck with this method.

R was exporting my plot just fine, and the pdf looked okay, but when I imported it into Inkscape, the points had all turned into the letter "q". I think this a a fairly well known problem, and is caused by R using letters (i.e. "o") as circles in the plot, but Inkscape not being able to deal with the same fonts.


There are multitude of fixes out there on the Web, but here is what worked for me, and took about 30 seconds. Instead of using pdf as output, try postscript. The code is as follows:

postscript(file="plot.eps")
plot(object)
dev.off()


Inkscape can now import the eps file, and the points are nicely rendered as circles. I would be interested to know if this works across other platforms using other font packages ...