Make Annotation files from ACNUC download

There are two main ways to retrieve interesting sequences: annotation files from NCBI BLAST searches, and annotation files from ACNUC downloads. This page covers the second one, in four parts: using ACNUC, building the required files, using fasta2treedyn, and using TreeDyn.

Case study:

I want to retrieve pathogenicity genes for the Shiga toxin (see a more difficult case study, in French but understandable for English speakers). At NCBI Entrez, a search for stx1 returns 158 hits:



Some of these hits are irrelevant, as the listing above shows. Even the relevant ones are difficult to deal with, because the two genes stx1A and stx1B are often sequenced together!



Entrez (to my knowledge) does not provide a good tool to extract only the stx1A sequences, for example. SRS could do the job, but is less powerful than ACNUC. I will therefore present the use of ACNUC. Let's use the graphical interface client (here for MS Windows, but clients exist for any OS):


Using ACNUC




We choose the GenBank database. Using stx1A as a keyword retrieves only 27 entries.




Let's use the keyword browser: when asking for stx, we get the complete list of keywords starting with stx. I will not try to retrieve every stx1A gene sequence, because this is not a tutorial on how to use keywords to retrieve sequences, but on how to use ACNUC in conjunction with TreeDyn, through fasta2treedyn.py!

Building required ACNUC files

1st step: extract features

Here we use the command "list1 et t=cds" to extract the information linked to the CDS feature (another option would be t=rrna, to extract rRNA sequences for example).



Now we will extract these sequences in fasta format to a file: stx1a.fasta
1/ Select "list2", click on the button "extract seqs. to file", and in the selector, choose "fasta file", "simple" and file: stx1a.fasta




2/ Select "list2", click on the button "extract seqs. to file", and in the selector, choose "text file", "simple" and file: stx1a.cds, as described above.
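Before going further, it can be worth checking that stx1a.fasta was written completely. The sketch below is not part of ACNUC or fasta2treedyn: it is a small stand-alone reader that assumes the header layout seen in ACNUC "simple" fasta output (">NAME   N residues"), and the function names read_acnuc_fasta and length_mismatches are made up for this illustration.

```python
import re

# Assumed header layout (not an official specification):
#   >AE005174.STX1A          948 residues
HEADER_RE = re.compile(r"^>(\S*)(?:\s+(\d+)\s+residues)?")


def read_acnuc_fasta(lines):
    """Return a list of (name, declared_length, sequence) tuples.

    declared_length is None when the header carries no "N residues" part.
    """
    records = []
    name, declared, chunks = None, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if name is not None:
                records.append((name, declared, "".join(chunks)))
            match = HEADER_RE.match(line)
            name = match.group(1)
            declared = int(match.group(2)) if match.group(2) else None
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, declared, "".join(chunks)))
    return records


def length_mismatches(records):
    """Names of records whose sequence length differs from the header."""
    return [name for name, declared, seq in records
            if declared is not None and declared != len(seq)]
```

Passing an open file handle on stx1a.fasta to read_acnuc_fasta, then the result to length_mismatches, would list any record whose header and sequence disagree; an empty list means every declared length matches.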

2nd step: retrieve complete entries

Now we want to replace these entries by their parents, and extract the sequences to a file in GenBank format.
3/ Select "list1" and click on the button "replace by parents".

4/ Select "list3" and click on "extract seqs. to file", choose "genbank format", extraction type: "simple" and file: "stx1a.gb"
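A quick way to check the stx1a.gb file is to count its records. The sketch below relies only on two conventions of the GenBank flat-file format: each record starts with a LOCUS line and ends with a line containing only "//". The function name summarize_genbank is hypothetical, and this is only a sanity check, not a parser.

```python
def summarize_genbank(lines):
    """Return (locus_names, terminator_count) for a GenBank flat file.

    locus_names holds the second field of each LOCUS line, in file order;
    terminator_count is the number of "//" end-of-record lines.
    """
    loci = []
    terminators = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("LOCUS"):
            fields = line.split()
            loci.append(fields[1] if len(fields) > 1 else "")
        elif line.strip() == "//":
            terminators += 1
    return loci, terminators
```

If the number of LOCUS names differs from the number of terminators, the file was probably truncated during extraction. A LOCUS name that appears twice would suggest that the same parent entry was written more than once.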



Using fasta2treedyn

Run fasta2treedyn.



Click on the button "extract all" and confirm that you want to extract ACNUC files:



Then, using the file selector, select the .fasta file you previously downloaded from PBIL.
NOTE: you may see warnings such as "WARNING - Ignoring an unknown line type,  found:". They result from the Biopython parser encountering lines that were newly added to the GenBank format, and they may be ignored.

Three new files were created.
In the fst file, sequence identifiers are composed of the LOCUS identifier and a number, which is usually 1, except if the entry had more than one feature extracted by ACNUC.

New fasta seq (first block below) and old fasta seq (second block below):
>AE005174.1
ATGAAAATAATAATTTTTAGAGTGCTAACTTTTTTCTTTGTTATCTTTTCTGTTAATGTG
GTTGCGAAGGAATTTACGTTAGATTTCTCGACAGCAAAGACGTATGTAGATTCGCTGAAT
GTCATTCGCTCTGCAATAGGTACTCCATTACAGACTATTTCATCAGGAGGTACGTCTTTA
CTGATGATTGATAGTGGCACAGGGGATAATTTGTTTGCAGTTGATGTCAGAGGGATAGAT
CCAGAGGAAGGGCGGTTTAATAATCTACGGCTTATTGTTGAACGAAATAATTTATATGTG
ACAGGATTTGTTAACAGGACAAATAATGTTTTTTATCGCTTTGCTGATTTTTCACATGTT
ACCTTTCCTGGTACAACTGCGGGTACATTGTCTGGTGACAGTAGCTATACCACGTTACAG
CGTGTTGCGGGGATCAGTCGTACGGGGATGCAGATAAATCGCCATTCGTTGACTACTCCT
TATCTGGATTTAATGTCGCATAGCGGAACCTCACTGACGCAGTCTGTGGCAAGAGCGATG
TTACCGTTTGTTACTGTGACAGCTGAAGCTTTACGTTTTCGGCAAATTCAGAGGGGATTT
CGTACAACACTTGATGATCTCAGTGGGCGTTCTTATGTAATGACTGCTGAAGATGTTGAT
CTTACGTTGAACTGGGGAAGGTTGAGTAGTGTCCTGCCTGACTATCATGGACAAGACTCT
GTTCGTGTTGGAAGAATTTCTTTTGGAAGTGTTAATGCAATTCTGGGTAGCGTGGCATTA
ATACTGAATTGTCCTCATCATGCATCGCGAGTTGCCAGAATTGTACCTAATGAGTTTCCT
TCTATGTGCCCGGTAGATGGAAGAGTGCGTGGGATTACGCACAATAAAATATTGTGGGAC
TCATCCACTCTGGGGGCAATTTTGATACGCAGGGCTATTAGCAGTTGA

>AE005174.STX1A          948 residues
ATGAAAATAATTATTTTTAGAGTGCTAACTTTTTTCTTTGTTATCTTTTCAGTTAATGTG
GTTGCGAAGGAATTTACCTTAGACTTCTCGACTGCAAAGACGTATGTAGATTCGCTGAAT
GTCATTCGCTCTGCAATAGGTACTCCATTACAGACTATTTCATCAGGAGGTACGTCTTTA
CTGATGATTGATAGTGGCACAGGGGATAATTTGTTTGCAGTTGATGTCAGAGGGATAGAT
CCACAGGAAGGGCGGTTTAATAATCTACGGCTTATTGTTGAACGAAATAATTTATATGTG
ACAGGATTTGTTAACAGGACAAATAATGTTTTTTATCGCTTTGCTGATTTTTCACATGTT
ACCTTTCCTGGTACAACTGCGGTTACATTGTCTGGTGACAGTAGCTATACCACGTTACAG
CGTGTTGCGGGGATCAGTCGTACGGGGATGCAGATAAATCGCCATTCGTTGACTACTTCT
TATCTGGATTTAATGTCGCATAGTGGAACCTCACTGACGCAGTCTGTGGCAAGAGCGATG
TTACGGTTTGTTACTGTTACAGCTGAAGCTTTACGTTTTCGGCAAATTCAGAGGGGATTT
CGTACAACACTTGATGATCTCAGTGGGCGTTCTTATGTAATGACTGCTGAAGATGTTGAT
CTTACATTGAACTGGGGAAGGTTGAGTAGTGTCCTGCCTGACTATCATGGACAAGACTCT
GTTCGTGTAGGAAGAATTTCTTTTGGAAGTGTTAATGCAATTCTGGGTAGCGTGGCATTA
ATACTGAATTGTCATCATCATGCATCGCGAGTTGCCAGAATGGCATCTGATGAGTTTCCT
TCTATGTGTCCGGCAGATGGAAGAGGCCGTGGGATTACGCACAATAAAATATTGTGGGAT
TCATCCACTCTGGGGGCAATTCTGATGCGCAGAACTATTAGCAGTTGA
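
The identifier scheme described above (LOCUS identifier plus a feature number) can be sketched as follows. This is not the actual fasta2treedyn code: the function name renumber_ids is hypothetical, and the sketch assumes that the LOCUS identifier is the part of the ACNUC name before the first dot, and that features are numbered in their order of appearance.

```python
from collections import defaultdict


def renumber_ids(acnuc_names):
    """Map ACNUC subsequence names to LOCUS.n identifiers.

    Assumptions (illustration only): the LOCUS identifier is the text
    before the first dot (AE005174.STX1A -> AE005174), and features of
    the same entry are numbered 1, 2, ... in order of appearance.
    """
    counters = defaultdict(int)
    new_ids = []
    for name in acnuc_names:
        locus = name.split(".", 1)[0]
        counters[locus] += 1
        new_ids.append(f"{locus}.{counters[locus]}")
    return new_ids
```

With this scheme, an entry with a single extracted feature always gets the suffix 1, and a higher number only appears when ACNUC extracted several features from the same entry.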

Results: Using TreeDyn


Using TreeDyn and the annotation files, it is easy to produce trees such as:





This immediately shows that stx1a is a name used for two very different genes. Visual inspection of the alignments demonstrates that they cannot be aligned, despite the fact that ClustalW returned an alignment...
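
Visual inspection can be complemented by a rough numerical check. The sketch below computes the percent identity between two rows of an alignment, counting only the columns where neither sequence has a gap. It is a generic measure, not something produced by ClustalW or TreeDyn, and the function name percent_identity is made up for this illustration. No threshold is proposed here: what counts as "too low" is left to the reader.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both arguments are rows of the same alignment, so they must have
    the same length. Returns 0.0 when no column can be compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Pairs taken from the two groups of the tree should score markedly lower than pairs taken from within one group, if the two groups really are different genes.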