NSG's databases blog: 2012

Thursday 15 November 2012

VoSeq: delete voucher button (new feature)

Dear VoSeq users,

Little by little we are doing some progress on VoSeq's TODO list.

We have released a new micro-version of VoSeq: 1.3.1
In this version, we have included a feature to delete records. You will find a "Delete me" button in voucher pages under the "Administrator" interface. If you click the button, VoSeq will issue a dialog asking for confirmation to delete all traces of that voucher record (including its associated sequences, primers and will remove them from taxon sets).

Use the button with care!

You can download VoSeq from github: https://github.com/carlosp420/VoSeq/downloads

Peña C, & Malm T (2012). VoSeq: a voucher and DNA sequence web application. PloS one, 7 (6) PMID: 22720030

Thursday 18 October 2012

Plotting ages of phylogenetic trees in R

It appears that different genes might estimate very different age estimates for your phylogenetic trees. This seems to be the case with the COI gene that tends to pull your timings towards the past. Saturation in the 3rd codon position might be accused for responsibility (you can inspect the saturation level with some cool plots).

Brandley et al (2011) use a nice way to figure out whether some of your genes might be giving very different time estimates for your trees. You can select 1000 random trees from your BEAST run and plot the distributions of the ages for the crown group of different genes, different codon positions and the combined analyses.

Something like this plot consisting on a simulation of a gen1 estimating a crown age of 30Mya, gen2 estimating an age of 50Mya and the combined analysis giving an age of 40Mya.

This can be done in the ubiquitous statistical software R, and here is the code: http://dx.doi.org/10.6084/m9.figshare.96636


library("ape")
library("Hmisc")
library("TreeSim")

# simulate three sets of trees with different ages
# you will skip this step and use your own tree files instead
trees_gen1 <- sim.bd.taxa.age(n=50, numbsim=100, lambda=0.03, mu=0.001, age=30);
for(i in 1:length(trees_gen1)) {
  write.tree(trees_gen1[[i]], file="trees_gen1.nwk", append=TRUE)
}

trees_gen2 <- sim.bd.taxa.age(n=50, numbsim=100, lambda=0.03, mu=0.001, age=50);
for(i in 1:length(trees_gen2)) {
  write.tree(trees_gen2[[i]], file="trees_gen2.nwk", append=TRUE)
}

trees_combined <- sim.bd.taxa.age(n=50, numbsim=100, lambda=0.03, mu=0.001, age=40);
for(i in 1:length(trees_combined)) {
  write.tree(trees_combined[[i]], file="trees_combined.nwk", append=TRUE)
}

# you might want to plot the ages for an ingroup or a particular clade
# then you might want to remove certain taxa or the outgroups
# use this variable and replace with your real tip names
outgroup_tips <- c(); # add your tips as a vector, for example: c("Aus", "Bus", "Cus");

##-----------------------
## Function to get distribution of ages for root from 1000 tree file
get_crown_age_distribution <- function(file, outgroup_tips) {
  tips <- outgroup_tips;
  phys <- read.tree(file)

  # get the branching times for the crown
  branching_times <- c();
  for( i in 1:length(phys)  ) {
    phy <- phys[[i]]
    if( length(tips) > 0) {
      phy <- drop.tip(phys[[i]], tips);
    }
  
    # the the node number for the root
    nodes <- length(phy$tip.label) +  1
  
    x <- branching.times(phy)
  
    # get the branching time for the root
    branching_times <- c(branching_times, x[names(x) == nodes])
    }
  return(branching_times);
}


## get data
gen1 <- get_crown_age_distribution(file="trees_gen1.nwk", outgroup_tips=outgroup_tips)
gen2 <- get_crown_age_distribution(file="trees_gen2.nwk", outgroup_tips=outgroup_tips)
combined <- get_crown_age_distribution(file="trees_combined.nwk", outgroup_tips=outgroup_tips)

hist(gen1, freq=FALSE, axes=F, xlim=c(80,0), 
     border="white", main="Age posterior probability distributions\nof simulated trees",
     ylab="", xlab="Million years ago")
axis(4, at=seq(0.0,0.6, 0.1), las=1, font=2)
axis(1, font=2)

lines(density(combined), lwd=2, col="black")
lines(density(gen1), lwd=2, col="red")
lines(density(gen2), lwd=2, col="blue")

minor.tick(ny=1)

legend.txt <- c("Data combined", "gen1", "gen2")
legend.colors <- c("black",      "red", "blue")
legend(cex=0.8,"topleft", legend.txt, pch=22, lwd=0, pt.bg=legend.colors, title="Locus", pt.cex=2)

Sunday 14 October 2012

Plots de saturación para secuencias moleculares en R

Estuve buscando un programa que realice gráficos mostrando los niveles de transiciones y tranversiones en secuencias de ADN. Se supone que el tercer codon de una secuencia de ADN tiende a llegar a "saturación" debido a que el código genético es degenerado (varios tripletes codifican el mismo aminoácido). Esta saturación ocurre cuando las mutaciones en el tercer codón han sido tan frecuentes que ya no llevan información filogenética. Es decir, se llega al grado en que dos secuencias saturadas son parecidas simplemente por chance. El programa DAMBE realiza este tipo de gráficos al plotear la relación de transiciones y transversiones versus distancia genética, pero no pude hacer funcionar la versión para Linux. Encontré una función escrita para el programa estadístico R que sí me funcionó. La versión original la pueden encontrar aquí: http://the-praise-of-insects.blogspot.fi/2010/04/transitions-in-r-redux.html

Download sample data here: wg_sample_data.zip

library(ape)

# modified from
# http://the-praise-of-insects.blogspot.fi/2010/04/transitions-in-r-redux.html

par(mfcol=c(1,1)); par(mar=c())
par(xpd=F, mar=c(5,4,4,2)+0.2, family="Palatino")

#Input: dat---an object of class 'DNAbin'

titv<-function(dat){
  mat<-as.matrix(dat)
  res<-matrix(NA, ncol=dim(mat)[1], nrow=dim(mat)[1], dimnames=list(x=names(dat), y=names(dat)))
  for(i in 1:(dim(mat)[1] - 1)){
    for(j in (i+1):dim(mat)[1]){
      vec<-as.numeric(mat[i,])+as.numeric(mat[j,])-8
      res[j,i]<-sum(!is.na(match(vec,c(200,56))))#Transitions
      res[i,j]<-sum(!is.na(match(vec,c(152,168,88,104))))#Transversions
    }
  }
  res
}

create_plots<-function(files) {
  for( i in 1:length(files)) {
    codon <- as.DNAbin(read.nexus.data(files[i]))
    
    ti<-titv(codon)
    tv<-t(ti)
        
    if (i == 1) {
      R1 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist1 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    }
    else if( i == 2) {
      R2 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist2 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    }
    else if( i == 3) {
      R3 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist3 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    }
    else {
      print("error")
    }
  }
  # print plot
  xname = expression("JC+" * Gamma * "corrected distances")
  yname = "Transition/Tranverstion ratio"
  plot(NA, xlim=c(0,.6), ylim=c(0,40), xlab=xname, ylab=yname, main="Saturation plot for Wingless gene")
  points(R3~dist3, xlim=c(0,.6), ylim=c(0,40), col="red", pch=20, xlab="", ylab="")
  points(R1~dist1, xlim=c(0,.6), ylim=c(0,40), col="grey", pch=20, xlab="", ylab="")
  points(R2~dist2, xlim=c(0,.6), ylim=c(0,40), col="cyan", pch=20, xlab="", ylab="")

  legend.txt <- c("1st position", "2nd position", "3rd position")
  legend.colors <- c("grey",  "cyan", "red")
  legend("topright", legend.txt, pch=19, col=legend.colors, title="wingless", cex=0.9)
}


files <- c("wg_1.nex", "wg_2.nex", "wg_3.nex");
create_plots(files);

Y este es el resultado final para el gen wingless:

Es importante el orden al plotear los puntos (primero los más abundantes).

Así es posible ver todos.

Monday 30 July 2012

Morphology of immature stages of the butterfly genus Junea and its implications

Fredy Montero was very kind to send me his recent publication (with Maira Ortiz) describing morphological characters of eggs, larvae and pupae of the satyrine butterfly genus Junea.

They raised the specimens on the hostplant Chusquea in Colombia. What I found most interesting is the morphology of the last instars of the larva. They have very long "horns" and long bifid "tails".

Head of Junea doraete larva. http://bit.ly/N1d0wu

It was believed that the only Neotropical butterfly with these traits was the satyrine butterfly Eteona tisiphone from southeast Brazil:

Eteona tisiphone larva (from Freitas, 2002), head on the left.

These similarities explain in part why Eteona and Junea appear as closely related taxa in the Satyrinae phylogeny that we published in 2006:

Satyrinae phylogeny from Peña et al., 2006: doi: 10.1016/j.ympev.2006.02.007

At the time of writing that manuscript, it was very strange to find the Andean genera Junea and Pronophila close to the genera Eteona and Foetterleia (that are mainly distributed in Southeastern Brazil, Paraguay and northern Argentina). It is nice to see that these relationships can be supported by morphological characters of immature stages. It would be very interesting to see whether the larvae and pupae of Pronophila and Foetterleia share the same characters with Eteona and Junea. This could require interesting biogeographical explanations for such close relationships of among currently disjunct taxa.

Pupae of Junea doraete (left) and Eteona tisiphone (right).

References:

Freitas, A. V. L. Immature stages of Eteona tisiphone (Nymphalidae Satyrinae). Journal of the Lepidopterists’ Society 56, 286–288 (2002). http://bit.ly/NReImD
Montero Abril, F. & Perez, M. O. Estados inmaduros e historia natural de algunas especies de la subtribu Pronophilina (Nymphalidae: Satyrinae) presentes en el paramo del Tablazo, Colombia. I. Junea doraete doraete (Hewitson 1858). Tropical Lepidoptera Research 22, 32–41 (2012). http://bit.ly/N1d0wu
Carlos Peña, Niklas Wahlberg, Elisabet Weingartner, Ullasa Kodandaramaiah, Sören Nylin, André V.L. Freitas, Andrew V.Z. Brower (2006). Higher level phylogeny of Satyrinae butterflies (Lepidoptera: Nymphalidae) based on DNA sequence data Molecular Phylogenetics and Evolution, 40 (1), 29-49 DOI: 10.1016/j.ympev.2006.02.007

Wednesday 18 July 2012

PLoS Altmetric API will change soon

Just got an email from PLoS ALM Team saying that they are updating their API for their Article Level Metrics (ALM; Altmetric) tools.

The Almetric software shows "citation" data on scientific papers harvested from social networks such as Twitter, Scientific Blogs, Citeulike and Mendeley. They deliver this content via their very easy API.

You could also visit their website http://altmetric.com/ and enter a DOI number for your favorite paper and see how many citations from social network it has. Also you can see the "hot" papers that have the most number of citations so are the one that "everybody" is reading right now.

I am using the altmetric API for my website and noticed that there seems to be a mix up in the data that was harvested for one of our recently published papers:

Peña C, Malm T (2012) VoSeq: A Voucher and DNA Sequence Web Application. PLoS ONE 7: e39071. doi:10.1371/journal.pone.0039071

For some reason Altmetric started to collect data from their announcement when their released Altmetric:

You can see the Altmetric API and changes in their github profile:
https://github.com/articlemetrics/alm/wiki/API

Tuesday 10 July 2012

Voseq, web database for molecular phylogenetics

Are you working in molecular phylogenetics? Do you and your lab produce lots of DNA sequences.
Are you tired of trying to find your sequences among several text files and Excel sheets?
Do you wish there was a easy-to-use database to keep track of sequences and their associated voucher specimens?
Did you ever wished there was a system to create molecular datasets for analysis in PAUP or MrBayes by just a few clicks of a mouse?

If the answers are "yes", then VoSeq might be for you. VoSeq is a voucher and DNA sequence web application database aimed for people working in molecular phylogenetics.

Main features of VoSeq:

Keep track of your sequences and associated voucher specimens.
You upload your data to VoSeq and the back-end relational database will allow you to easily search, fetch, update, etc your DNA sequences or voucher data.
With a few clicks, you can create ready-to-run datasets in NEXUS, Phylip, TNT formats, as well as FASTA files for submission to GenBank.
Use the BLAST capabilities to find similar sequences among those you have, or BLAST against GenBank.
Automated integration with public web services such as Flickr (for posting your voucher photos) and Yahoo Maps (for plotting voucher localities).
You can install it in your computer for private use, or set it up in a shared server for collaborative work via the internet.
and more.

The publication describing VoSeq came out recently:

Peña, C. & Malm, T. (2012). VoSeq: a Voucher and DNA Sequence Web Application PLOS ONE, 7 (6) DOI: 10.1371/journal.pone.0039071

You can check out a test installation of VoSeq with sample data here: http://www.nymphalidae.net/VoSeq/
The full documentation is here: http://nymphalidae.utu.fi/cpena/VoSeq_docu.html
Download from here: https://github.com/carlosp420/VoSeq
VoSeq is actively being developed and if you have questions or need help using it. Let us know!

Carlos Peña: mycalesis@gmail.com
Tobias Malm: tobemalm@gmail.com

Friday 18 May 2012

Voucher pages in JSON format

I made a quick addition to our public NSG database. Voucher pages will output the specimen's data in JSON format for easy and automated harvesting of our data.

JSON is becoming a commonly used format for transfer of data over the Internet because it can be easily integrated into Javascript. Nowadays, there are even database systems that keep all data in JSON format (e.g. couchdb, mongodb, etc).

This is how it works in our database:

If you go to this voucher page: http://nymphalidae.utu.fi/story.php?code=NW85-8 you will see an interface created for humans:

If you add to the URL the option "&format=json" (http://nymphalidae.utu.fi/story.php?code=NW85-8&format=json), you will get all the data in JSON format:

{
  "institutionCode": "NSG",
  "catalogNumber": "NW85-8",
  "voucher_code": "NW85-8",
  "recordNumber": "NW85-8",
  "family": "Hesperiidae",
  "subfamily": "",
  "tribe": "",
  "subtribe": "",
  "genus": "Achlyodes",
  "specificEpithet": "busiris",
  "infraspecificEpithet": "",
  "country": "PERU",
  "locality": "Km 28, road to Yurimaguas",
  "decimalLatitude": "-6.412590",
  "decimalLongitude": "-76.315900",
  "verbatimElevation": "750m",
  "collector": "St\u00e9phanie Gallusser",
  "eventDate": "2001-11-02",
  "voucherLocality": "NSG coll.",
  "sex": "",
  "voucherImage": "http:\/\/flickr.com\/photos\/37256239@N03\/3429238255\/",
  "associatedSequences": "GQ864726;GQ864820;GQ864414;GQ865378;GQ865050;GQ864915;GQ864593;GQ864507;GQ865158;GQ865279"
}


If you use the function getJSON of jQuery to call this web service, you will need to use the field jsoncallback=? or callback=? and, to avoid confusions, the field format=jsonp.
Example: Calls the NSG database for data about specimen code NW85-8 using jQuery's function getJSON.

<!DOCTYPE html>
<html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>test
</title>
<script src="http://code.jquery.com/jquery-latest.js"></script>


</head>
<body>

<script>
$(document).ready(function() {
       $.getJSON("http://nymphalidae.utu.fi/story.php?callback=?",
            {
                code: "NW85-8",
                format: "jsonp"
            },
        function(data) {
            var output = "";
            output += "voucher code: " + data.voucher_code + "<br />";
            output += "genus: " + data.genus + "<br />";
            output += "species: " + data.specificEpithet + "<br />";

            $("div").html(output);
        });
});

</script>

<div></div>
</body>
</html>

And the output will be:

Tuesday 28 February 2012

Integration with EOL

Our database now asks EOL for author and year of description for species names. It is using EOL's search API to pull the authority and link to the corresponding species page in EOL. If there is a positive response from EOL the authority and link will appear under the voucher code:

EOL gets the authority information from several sources (including Ubio and GenBank). However their taxonomy is far from complete and needs urgent updates.