TreeDyn Fasta Plug-in

Convert fasta files of sequences

There are two main ways to retrieve sequences of interest:

  1. Annotation files from NCBI BLAST searches
  2. Annotation files from ACNUC downloads


Sequence retrieval using BLAST

Imagine that you BLAST a sequence at NCBI (or on any other server) and want to analyze the results phylogenetically. The usual steps are:

1/  Retrieve sequences

You choose to download the five most similar sequences in fasta format (see this file).
The sequences are saved in a fasta file that should look like this:
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGGT
...
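
The header line of such a record is where the readable information lives. As a minimal sketch, assuming the header layout shown above (gi|number|database|accession| followed by free text) and assuming that the first two words of the description are genus and species, the fields can be separated as follows. The function name parse_ncbi_header is hypothetical and is not part of the plug-in.

```python
def parse_ncbi_header(header):
    """Split an NCBI-style fasta header into its fields."""
    header = header.lstrip(">").strip()
    # The identifier is the first whitespace-delimited token; the rest is free text.
    identifier, _, description = header.partition(" ")
    fields = {"id": identifier, "description": description}
    # Assumed identifier layout: gi|<gi number>|<database>|<accession>|
    parts = identifier.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        fields["gi"] = parts[1]
        fields["db"] = parts[2]
        fields["accession"] = parts[3]
    # Assumption: the first two words of the description are genus and species.
    fields["organism"] = " ".join(description.split()[:2])
    return fields


header = (">gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata "
          "18S ribosomal RNA gene, partial sequence")
info = parse_ncbi_header(header)
print(info["gi"], info["accession"], info["organism"])
```

Only the identifier (the first token) survives in most alignment programs, which is why the description has to be kept somewhere else if it is to be shown on the tree later.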

2/ Align sequences

Your next step is to align these sequences with Clustal.
The output of Clustal looks like this:
CLUSTAL W (1.81) multiple sequence alignment

gi|31044174|gb|AY143560.1|      GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAA
gi|31044185|gb|AY143571.1|      GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAA
gi|31044180|gb|AY143566.1|      GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAA
gi|94494524|gb|DQ487193.1|      GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAA
gi|31044182|gb|AY143568.1|      GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAA
                                **************************************************
The first obvious observation is that the information a human can easily understand, i.e. Tintinnopsis fimbriata, has been lost!

3/ Calculate distance matrix and tree.

We will use dnadist from the Phylip package (after converting the .aln format to Phylip format, for example with SeaView).
We obtain a distance matrix, which we process with neighbor.exe to obtain a tree file.
This tree file looks like this:
((gi|3104418:0.00443,gi|3104418:0.00214):0.01679,(gi|3104418:0.01106, gi|9449452:0.01326):0.00269,gi|3104417:0.01069);

Using njplot, for example, you can view this tree as a graphic.



The situation is even worse than before: you have lost part of the gi number! If the gi numbers are consecutive and differ only in their last digit, the truncated names become identical (here gi|3104418 appears three times) and you cannot use this tree! In any case, the tree is very difficult to read and understand.
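
The loss of digits comes from the classic Phylip format, which keeps only the first 10 characters of each sequence name. The sketch below applies that cut to the five Clustal labels shown earlier and reports the names that are no longer unique. The function names truncate and find_collisions are hypothetical helpers written for this illustration.

```python
from collections import Counter

PHYLIP_NAME_WIDTH = 10  # classic Phylip keeps only the first 10 characters

# The five labels from the Clustal alignment above.
labels = [
    "gi|31044174|gb|AY143560.1|",
    "gi|31044185|gb|AY143571.1|",
    "gi|31044180|gb|AY143566.1|",
    "gi|94494524|gb|DQ487193.1|",
    "gi|31044182|gb|AY143568.1|",
]


def truncate(label, width=PHYLIP_NAME_WIDTH):
    """Cut a label the way a fixed-width name field would."""
    return label[:width]


def find_collisions(labels, width=PHYLIP_NAME_WIDTH):
    """Return the truncated names shared by more than one label."""
    counts = Counter(truncate(label, width) for label in labels)
    return {name: n for name, n in counts.items() if n > 1}


collisions = find_collisions(labels)
print(collisions)  # prints {'gi|3104418': 3}
```

Running such a check before building the tree shows at once whether the leaves will still be distinguishable.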

The solution is genbank2treedyn.exe.py, a Python program that interacts with TreeDyn as an external plug-in. It is available as source code and as a Windows exe. No installation is required: simply download and unzip. Then go to the directory genbank2treedyn and run:

TreeDynFastaPlugin takes:
  1. Any fasta file
  2. A fasta file downloaded from NCBI (better)
  3. A fasta file + the corresponding GenBank file downloaded from NCBI (even better)
  4. A fasta file + the corresponding GenBank and ACNUC files downloaded from pbil (best).
Situation 1 is for your own sequences or older data sets.
Situation 2 is for lazy people.
Situation 3 provides the full set of tools for similar sequences retrieved with an NCBI BLAST.
Situation 4 provides the full set of tools for sequences retrieved by keyword with the most powerful query engine, i.e. ACNUC.
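
For situation 1, where nothing but a fasta file is available, a general workaround for over-long names is to give every sequence a short unique name before alignment and to keep a lookup table for restoring the full headers on the final tree. The sketch below illustrates that idea only: it is not the plug-in's own method or file format, the helper names are hypothetical, and the toy tree and its branch lengths are invented for the example.

```python
import re


def shorten_names(headers, prefix="seq"):
    """Map short unique names (seq0001, seq0002, ...) to the full headers."""
    lookup = {}
    for index, header in enumerate(headers, start=1):
        short = f"{prefix}{index:04d}"  # 7 characters, safe for a 10-character field
        lookup[short] = header.lstrip(">").strip()
    return lookup


def relabel_newick(newick, lookup):
    """Replace each short name in a Newick string by its quoted full header."""
    def full_label(match):
        # Newick labels may contain spaces and punctuation when single-quoted;
        # a quote inside the label is doubled.
        return "'" + lookup[match.group(0)].replace("'", "''") + "'"

    pattern = r"\b(?:" + "|".join(re.escape(name) for name in lookup) + r")\b"
    return re.sub(pattern, full_label, newick)


headers = [
    ">gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata "
    "18S ribosomal RNA gene, partial sequence",
    ">gi|31044185|gb|AY143571.1|",
]
lookup = shorten_names(headers)

# Toy tree with invented branch lengths, for illustration only.
toy_tree = "(seq0001:0.1,seq0002:0.2);"
relabeled = relabel_newick(toy_tree, lookup)
print(relabeled)
```

The short names pass unchanged through the alignment and distance programs, and the lookup table plays the role of a very simple annotation file.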

The main TreeDynFastaPlugin_Glue interface is as follows:



Button "fasta files manipulations" lets you manipulate fasta files of sequences as explained here.
Button "extract source" lets you extract source annotations as explained here.
Button "extract gene" lets you extract gene annotations as explained here.
Button "extract CDS" lets you extract CDS annotations as explained here.

Right button "various tools" provides some exotic tools (none yet, available upon request).

End result

These are two pictures of the trees you can get using TreeDyn and the generated annotation file.
Biodiversity study:
Source extraction: contains gi number, organism, accession number, location, taxonomy (extracts)


Comparison of human genes
CDS extraction: contains gi number, organism, accession number, GO terms, OMIM number, note


Note that this is only a quick overview and that many more annotations, such as GO terms, can be displayed on the tree.


Download

NOTE: For Linux and BSD (Mac), Python and Biopython are required. Python is usually present by default (try "python -V" and "which python" in a console), but the default Python on the Mac does not seem to include the Tk library! Under Linux, the full Python is installed if you use the KDE environment (not checked for Gnome), so you only need to install Biopython (see the official site and, for example: installation of Python, as well as Biopython).
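
A quick way to check these requirements is to try the imports from Python itself. The sketch below assumes a Python 3 interpreter, where the Tk binding is named tkinter (it was Tkinter in the Python 2 releases current when this page was written); Bio is the import name of Biopython. The helper can_import is hypothetical.

```python
import importlib
import sys


def can_import(name):
    """Return True if the module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# tkinter: Tk binding (Python 3 name); Bio: Biopython.
required = ["tkinter", "Bio"]
status = {name: can_import(name) for name in required}

print("Python", sys.version.split()[0])
for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

Any module reported as MISSING has to be installed before the Python version of the program can run.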
Python code or exe for Windows: here
Run it as a usual Python module or, on Windows, go to the directory "genbank2treedyn" and run genbank2treedyn.exe.
Tutorial
It is available online or as a zipped directory to download.
I have stolen the icon for the Windows exe from PHYLIP!


Richard Christen & François Chevenet.   Last modification: May 2007