Running Blast on this sequence to retrieve the most similar published
sequences is not so simple, because in this case we want to retrieve
mostly (if not only) sequences from cultured strains (and, if possible,
sequences from type species). To do that, it is possible to blast for
example:
Both Blast searches were run without a filter and set to retrieve the 100 most similar sequences.
A first look at each result shows that the EBI Blast returns more cultured strains than Blast at NCBI.
This is because the standard EBI Blast is done on a modified "EMBL"
database that excludes sequences from the ENV division (mostly
sequences from PCR and cloning).
As a result, the EBI Blast may be more efficient; we will check that.
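One rough way to check this is to count how many hit descriptions look like they come from cultured strains. The sketch below is only an illustration: the keyword list is an assumption (not an official vocabulary), the hit descriptions are made up, and a keyword such as "clone" can also appear in entries from cultured strains, so the result should be read as an estimate.

```python
import re

# Assumed keywords that usually mark environmental or clone-derived entries.
ENV_KEYWORDS = {"uncultured", "unidentified", "environmental", "clone"}


def is_cultured(description):
    """Return True when no environmental keyword appears in the description."""
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    return tokens.isdisjoint(ENV_KEYWORDS)


def cultured_fraction(descriptions):
    """Fraction of hit descriptions that look like cultured strains."""
    if not descriptions:
        return 0.0
    kept = [d for d in descriptions if is_cultured(d)]
    return len(kept) / len(descriptions)


# Hypothetical hit descriptions, for illustration only.
example_hits = [
    "Pseudoalteromonas sp. strain EXAMPLE-1 16S ribosomal RNA gene",
    "Uncultured bacterium clone EXAMPLE-2 16S ribosomal RNA gene",
    "Pseudoalteromonas sp. strain EXAMPLE-3 16S ribosomal RNA gene",
    "Unidentified marine bacterium EXAMPLE-4 16S ribosomal RNA gene",
]

print(cultured_fraction(example_hits))
```

Running the same function on the description lines of each Blast report gives a quick comparison of the two servers.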
Align sequences.
The next step is to align these sets of sequences (alternatively, we could first remove every uncultured strain).
In my case, I use a proprietary automatic aligner to align new sequences to a database of already aligned sequences.
In your case, you will probably wish to use Clustal (or maybe MUSCLE, DIALIGN, or T-Coffee...) to align each dataset.
I strongly suggest that after running
Clustal, you use a manual aligner (such as SeaView) to check your
automatic alignments. This has several major advantages:
You can immediately detect if some sequences (wrongly annotated)
are on the wrong strand (an automatic aligner will return an alignment,
no matter which sequences you submit; Clustal may respond with a
warning when a sequence is too distant).
You can immediately detect if some sequences are the result of very poor sequencing.
You can immediately detect if some sequences are really too short (very difficult to assess rapidly and easily with Blast).
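Some of these checks can also be pre-screened before opening the alignment editor. The sketch below is a minimal illustration, not a replacement for visual inspection: it guesses the strand by counting k-mers shared with a trusted reference in both orientations, and flags short or ambiguous sequences. The function names, the k-mer size and the thresholds are all assumptions chosen for the example, and the reference sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (other symbols are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Set of all k-mers found in the sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def looks_reversed(query, reference, k=8):
    """True when the reverse complement shares more k-mers with the reference."""
    ref = kmers(reference, k)
    forward = len(kmers(query, k) & ref)
    reverse = len(kmers(reverse_complement(query), k) & ref)
    return reverse > forward


def ambiguous_fraction(seq):
    """Fraction of symbols that are not A, C, G or T."""
    if not seq:
        return 1.0
    return sum(1 for base in seq.upper() if base not in "ACGT") / len(seq)


def screen(seq, reference, min_length, max_ambiguous=0.01):
    """Return a list of likely problems (thresholds are assumptions)."""
    problems = []
    if len(seq) < min_length:
        problems.append("too short")
    if ambiguous_fraction(seq) > max_ambiguous:
        problems.append("many ambiguous bases")
    if looks_reversed(seq, reference):
        problems.append("probably reverse strand")
    return problems


# Made-up reference fragment, for illustration only.
reference = "ATGGCTAGCTTACGGATTCAGGCTAACGTTGCA"
print(screen(reverse_complement(reference), reference, min_length=20))
```

A sequence flagged as reversed can be passed through reverse_complement before it is submitted to the aligner.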
The major advantage of using already aligned and checked alignments is
that they allow for rapid identification of local sequencing problems
and errors in some sequences. See for example these snapshots:
Left picture
Sequence 156983 has "AAA" where every other sequence has only two "A"s, a very likely error.
Sequence 153933 has an "A" where every other similar sequence has a "G", a likely error.
Right picture
Sequence 47250 has many insertions of very likely false readings.
Such problems are more difficult to detect when aligning sequences "de novo".
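The kind of error shown in the snapshots (one sequence disagreeing with all the others at a given position) can be listed automatically once the sequences are aligned. The sketch below is an assumption-laden illustration: the toy alignment and sequence names are made up, and a position is flagged only when exactly one sequence differs from a state shared by at least min_others sequences. A flagged position may of course be a genuine difference, so the output is a list of places to inspect, not a list of errors.

```python
from collections import Counter


def singleton_differences(alignment, min_others=3):
    """Map each sequence name to the 1-based columns where it alone differs."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    flagged = {name: [] for name in names}
    for col in range(length):
        column = [alignment[name][col].upper() for name in names]
        counts = Counter(column)
        if len(counts) != 2:
            continue  # column is either constant or too variable
        (major, n_major), (minor, n_minor) = counts.most_common(2)
        if n_minor == 1 and n_major >= min_others:
            odd_one = names[column.index(minor)]
            flagged[odd_one].append(col + 1)
    return {name: cols for name, cols in flagged.items() if cols}


# Made-up toy alignment: seq_d carries an extra "A", seq_e an "A" for a "G".
toy = {
    "seq_a": "ACGTAA-GCT",
    "seq_b": "ACGTAA-GCT",
    "seq_c": "ACGTAA-GCT",
    "seq_d": "ACGTAAAGCT",
    "seq_e": "ACATAA-GCT",
}

print(singleton_differences(toy))
```

Sequences that accumulate many flagged columns are good candidates for removal before the phylogenetic analysis.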
Since we want to establish which genus
the new sequence belongs to (or whether it could represent a new
genus), we need to do a phylogenetic analysis. IMPORTANT REMARKS:
You cannot simply take the tree given by Clustal: this is a guide tree, not a phylogenetic tree.
You cannot run a phylogenetic analysis on the entire sequences:
Some domains are too divergent to be aligned for the entire dataset (first figure below).
Some sequences are simply too short (second figure below).
Some sequences are really bad (not shown).
Divergent domain
Two sequences are much shorter!
We will therefore remove sequences that are too short or too bad, and
extract a domain (in the aligned sequences provided above, we keep
positions 308-2060, as numbered in these alignments).
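This trimming step can be sketched in a few lines, assuming the alignment is held as a dictionary of equal-length gapped strings. The function name and the min_bases threshold are assumptions for illustration; the text above does not state what length counts as "too short". Positions are 1-based and inclusive, so the call for the domain used here would be extract_domain(alignment, 308, 2060, min_bases=...).

```python
def extract_domain(alignment, start, end, min_bases):
    """Keep columns start..end (1-based, inclusive) and drop short sequences.

    A sequence is dropped when it has fewer than min_bases non-gap
    symbols inside the extracted domain.
    """
    kept = {}
    for name, seq in alignment.items():
        region = seq[start - 1:end]
        bases = sum(1 for symbol in region if symbol not in "-.")
        if bases >= min_bases:
            kept[name] = region
    return kept


# Made-up toy alignment: "short" covers only two positions of the domain.
toy = {
    "full": "--ACGTACGT--",
    "short": "----GT------",
}

print(extract_domain(toy, 3, 10, min_bases=6))
```

Counting bases inside the domain, not over the whole sequence, matters here: a long sequence can still be nearly empty in the region that is kept.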
Preliminary phylogenetic analysis.
We will simply use DNADIST and BIONJ to produce the two trees:
Let's have a look at the local position of the new sequence in the tree
(after using the swap tool to move the new sequence to the top of the
tree).
EBI tree, zoom on a subtree.
NCBI tree, zoom on a subtree.
Both analyses largely agree:
In both trees, the new sequence fits well into the Pseudoalteromonas genus.
Both identify P. peptidolytica, P. piscicida, P. maricaloris and P. flavipulchra as close relatives.
In addition, P. citrea is retrieved in the NCBI tree.
Conclusions. Everything seems to be working quite well; the small
differences are due to the different databases on which Blast operates,
depending upon the server you use. We will now:
A consensus tree that summarizes all methods (see image below).
Final remarks:
Legend:
Topology shown is that of the NJ analysis (distances calculated using the Kimura two-parameter correction).
* : branches also found in the maximum likelihood analysis, P<0.01.
+ : branches also found in each of the 45 most parsimonious trees.
% : bootstrap results.
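For readers who want to see what the Kimura two-parameter correction does, here is a minimal sketch for one pair of aligned sequences. With P the proportion of transitions and Q the proportion of transversions among the compared sites, the distance is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The function name and the choice to skip any column containing a gap or ambiguous symbol are assumptions; programs such as DNADIST handle these cases with their own rules, so values may differ slightly.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous symbols
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Made-up pair with one transition (A<->G) in ten sites.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected distance is slightly larger than the raw proportion of differences, because the correction accounts for multiple substitutions at the same site.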
Although only a few branches are strongly supported by all methods and by the bootstrap, the position of the new sequence is very clear and robust. It is clearly a Pseudoalteromonas species, and it forms a very robust clade with P. peptidolytica F12-50-A1T.
Since the three "classical" methods used agree on the
position of the new sequence, there is no need to use a more
sophisticated approach.
Conclusion.
The new sequence is either a strain of P. peptidolytica or a new species of Pseudoalteromonas closely related to P. peptidolytica. Measurements of DNA-DNA hybridization between these two genomic DNAs are required to decide.
Richard Christen. Data obtained on May 29th 2006.