EMBL Extractor (EmblEx) is a software tool for molecular biologists and bioinformaticians. It consists of a series of Perl scripts and uses a MySQL database. EmblEx can be used through a graphical interface in a web browser or from the command line, under Linux or Windows.
EmblEx extracts data from the EMBL entries contained in a file, according to keywords found in each entry. The fields analyzed for each entry are the subdivisions of the FT fields (http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html). Complete entries, or the subsections that contain these keywords, can be saved as a web page, saved as an ASCII file, or stored directly in a BioSQL database.
Biological databases are growing constantly (more than 90,000 16S rRNA sequences for bacteria alone), which makes searches increasingly long and difficult. Web interfaces make it fairly easy to look up data for a given gene or transcript. The two principal search engines behind these interfaces are ACNUC and SRS. These tools, which can be used through a web interface or a command line, retrieve information from a list of keywords or from properties such as sequence length.
Before setting out to design our own data retrieval tool, we first tried to use these two tools, which did not give us satisfaction.
Keyword searches work differently in EmblEx. EmblEx presently searches only in the "features" fields, but it has a higher resolving power than either SRS or ACNUC (an example will be provided at the end of this document). It can be downloaded and installed on a regular PC in a matter of minutes. As it is written in Perl, a widely used language, it can be tailored to fit specific requirements. Inputs and outputs can be located on the same machine or distributed over the network. As in ACNUC, subsequences defined by specific keywords can be extracted automatically but, unlike in ACNUC, the full information of the entry can be retained. Finally, EmblEx can automatically check public repositories for new releases of the EMBL data files.
Before running EmblEx, you need MySQL, Apache, Perl and BioPerl. Most of these are already installed on most Linux systems. Under Microsoft Windows, Apache and MySQL are easily installed using EasyPHP. Perl and BioPerl are also very easy to install. A checklist is provided for both environments. For Mandrake Linux, the dependencies are listed in order; they can be retrieved from well-known servers or from our own server.
To start EmblEx, simply enter http://localhost/emblex/index.html in your browser. The following window appears.
Figure 1. The main interface.
The main interface is composed of two parts: "Output" and "Extract Options".
Figure 2. Choose one or more output formats.
Select the "Web Page" radio button, as shown in Figure 3. A new window appears, as shown in Figure 4.
Figure 4. Choose an output directory.
The following fields can be extracted as a web page: id, accession, release, classification, species, total sequence, total sequence length, extracted sequence, extracted sequence length. The result appears in a browser window.
This procedure can be used, for example, as a first step to check whether you have chosen the proper keywords before extracting to a file or to BioSQL.
Select the "Text file" radio button.
Figure 5. Choose a text file format.
You first need to provide the directory in which the file will be written. The directory is created if it does not exist.
Then indicate a file name for the extraction. If this file already exists, it will be replaced.
The file is located at:
'apache_default_dir'/cgi-bin/partiseq_cgi/projects/EmblEx/'your_directory_name'/'your_file_name'
You may then choose between two formats:
- EMBL format: the output file uses the EMBL line tags, and all lines of each EMBL entry are written.
- user format: a window opens in which you select which information will be written to the file.
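The directory and file handling described above (create the directory if it is missing, replace an existing file) can be pictured with a small sketch. It is written in Python purely for illustration, since EmblEx itself is made of Perl scripts; the function name `write_text_output` and its parameters are hypothetical and are not part of EmblEx.

```python
from pathlib import Path


def write_text_output(base_dir, directory_name, file_name, text):
    """Write `text` to base_dir/directory_name/file_name and return the path."""
    out_dir = Path(base_dir) / directory_name
    # The directory is created if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / file_name
    # Writing in "w" mode replaces an existing file of the same name.
    out_path.write_text(text)
    return out_path
```

Calling the function twice with the same directory and file name leaves only the content of the second call, which mirrors the "it will be replaced" behaviour.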
Select the "Database" radio button.
Figure 6. Choose the BioSQL Database format.
You first need to indicate the 'spacename'.
Imported sequences are registered under a reference (a namespace). This reference makes it possible to retrieve a particular dataset. For example, in a BioSQL database, one may want to store information for different genes (16S rRNA, rpoB, recA, 5.8S, etc.) and to retrieve the information for one particular gene easily.
EMBL entries that contain the selected keywords will be extracted into the BioSQL database. If this database does not exist, it will be created.
Import into BioSQL relies on an "entries_update" script taken from BioPerl. If an entry does not yet exist in BioSQL, it is inserted; if it already exists but is older, it is updated (see the documentation of bioperl::BD::bioperl_db for more information). The first time you run the script, it is advisable to also import the taxonomy into the BioSQL database (check "with taxonomy"). Note that this script can take a long time to run (from 10 minutes to many hours).
This window contains three parts: the file to be parsed, the keywords, and the fields in which the keywords will be searched for.
EmblEx works on an ASCII file from EMBL. This file should be in the same directory as the EmblEx project, for example: "apache_directory"/cgi-bin/embl_cgi/projects/emblex/. It must have a ".dat" extension to be recognized by EmblEx. The drop-down list then lets you select one such file.
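As an illustration of the ".dat" rule, the sketch below lists the files that would qualify for the drop-down list in a given project directory. It is a Python sketch, not the actual EmblEx code (which is written in Perl), and the function name `list_embl_files` is hypothetical.

```python
from pathlib import Path


def list_embl_files(project_dir):
    """Return the sorted names of the regular files ending in .dat."""
    return sorted(
        path.name
        for path in Path(project_dir).glob("*.dat")
        if path.is_file()
    )
```

A file named `release.txt` or `release.dat.gz` in the same directory would not be returned, which is why downloaded files must be uncompressed and given the ".dat" extension.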
There are two ways of adding an EMBL file to an EmblEx project: simple or automatic. Simple is when you manually download an EMBL file by FTP onto your computer. Automatic is when you ask EmblEx to check whether there is a new release on your favorite FTP site (see chapter « Adding a file of entries »).
The "Organism" field is optional and is active only when checked. You can type a word or an expression that will be searched for in the "OC" field of each EMBL entry. The "OC" field contains the taxonomy of the organism, so you can restrict your search to a specific taxonomic rank.
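The idea of the "Organism" filter can be sketched as follows. This is an illustrative Python sketch and not the EmblEx implementation: the function names are hypothetical, the case-insensitive substring match is an assumption, and the entry shown is a made-up example.

```python
def entry_taxonomy(entry_text):
    """Join the content of all OC lines of one EMBL entry into one string."""
    parts = []
    for line in entry_text.splitlines():
        if line[:2] == "OC":
            parts.append(line[2:].strip())
    return " ".join(parts)


def matches_organism(entry_text, expression):
    """Assumed behaviour: case-insensitive substring test on the OC lines."""
    return expression.lower() in entry_taxonomy(entry_text).lower()


# Made-up entry, reduced to the lines that matter here.
entry = """ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 1500 BP.
OS   Escherichia coli
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Enterobacteriaceae; Escherichia.
//"""

print(matches_organism(entry, "Enterobacteriaceae"))
print(matches_organism(entry, "Firmicutes"))
```

Because the OC lines list the whole lineage, typing a family or phylum name selects every entry below that level.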
This is the section in which you indicate which key-words will be used to select entries or sub-entries. There are three possibilities.
Figure 9.
Figure 10.
After having defined your keyword(s), you have to define where to search for them.
An EMBL entry is made up of lines identified by line codes (ID, AC, SV, DT, DE, KW, OS, OC, FT, etc.). The FT (feature table) lines are themselves organized by feature keys (FT-Keys). (A description of the feature table format is available at: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html)
The available feature keys (FT-Keys) are shown in the list on the left. The arrows in the middle move the selection to the list on the right (or back). The search is done sequentially, in the order of the selected fields: if a keyword is found in the first field, the remaining fields are not searched. A proper ordering of the fields can therefore increase extraction speed. In addition, the boundaries given for the matching feature are used to extract the subsequence.
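The sequential search and the subsequence extraction can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the EmblEx code: all function names are hypothetical, the entry and sequence are made up, the keyword is matched as a case-insensitive substring of the feature's qualifiers, and only simple locations of the form `start..end` are handled (no `complement(...)`, `join(...)` or multi-line locations).

```python
import re


def parse_features(entry_text):
    """Return a list of (key, location, qualifier_text) from the FT lines."""
    features = []
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[5:]
        if body and not body[0].isspace():
            # A feature key in the key column starts a new feature.
            key, _, location = body.partition(" ")
            features.append([key, location.strip(), ""])
        elif features:
            # Any other FT line is attached to the current feature.
            features[-1][2] += body.strip() + " "
    return [tuple(feature) for feature in features]


def first_match(entry_text, keyword, ordered_keys):
    """Search the FT-Keys in the given order and stop at the first hit."""
    features = parse_features(entry_text)
    for wanted in ordered_keys:
        for key, location, qualifiers in features:
            if key.lower() == wanted.lower() and keyword.lower() in qualifiers.lower():
                return key, location
    return None


def extract_subsequence(sequence, location):
    """Cut out a simple 1-based, inclusive `start..end` location."""
    match = re.match(r"^(\d+)\.\.(\d+)$", location)
    if match is None:
        return None  # complex locations are outside this sketch
    start, end = int(match.group(1)), int(match.group(2))
    return sequence[start - 1:end]


# Made-up entry and sequence.
entry = """ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 20 BP.
FT   source          1..20
FT                   /organism="Escherichia coli"
FT   rRNA            3..10
FT                   /product="16S ribosomal RNA"
//"""
sequence = "acgtacgtacgtacgtacgt"

hit = first_match(entry, "16S", ["rrna", "source"])
if hit is not None:
    print(hit, extract_subsequence(sequence, hit[1]))
```

Putting the most likely FT-Key first means that most entries are resolved on the first pass, which is the speed gain referred to above.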
EmblEx requires a file containing entries in the EMBL format. This file must be located in the same directory as the EmblEx project (e.g. "C:\Program Files\EasyPHP1-7\cgi-bin\partiseq_cgi\projects\emblex") and must have the ".dat" extension.
There are two methods for adding an EMBL file to an EmblEx project.
The user connects by FTP to EMBL or to a mirror site and downloads a file (do not forget to check that you are using text mode). Files must be in ASCII and must be uncompressed if necessary.
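For readers who prefer to script the manual download, a text-mode transfer can be sketched with Python's standard `ftplib`, whose `retrlines` method retrieves a file in ASCII mode. This is only an illustration and not part of EmblEx: the function names are hypothetical, the anonymous login is an assumption, and the host, directory and file name must be supplied by the caller.

```python
from ftplib import FTP


def download_text(ftp, remote_name, local_path):
    """Fetch `remote_name` in ASCII (text) mode and write it to `local_path`."""
    with open(local_path, "w", newline="\n") as out:
        # retrlines strips the line ending, so it is written back explicitly.
        ftp.retrlines(f"RETR {remote_name}", lambda line: out.write(line + "\n"))
    return local_path


def fetch(host, directory, remote_name, local_path):
    """Connect, log in anonymously (assumption) and download one file."""
    with FTP(host) as ftp:
        ftp.login()
        ftp.cwd(directory)
        return download_text(ftp, remote_name, local_path)
```

Text mode is only appropriate for uncompressed ASCII files; a compressed file would have to be transferred in binary mode and uncompressed afterwards.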
If you wish to use more than one file, the files need to be concatenated into a single file.
Once this is done, every .dat file located in the proper directory appears in the drop-down list described in section « EMBL file to be parsed » (Figure 8).
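The uncompress and concatenate steps of the manual method can be combined in a short script. The following Python sketch is one possible way of doing it and is not shipped with EmblEx; the function name `concatenate_embl_files` is hypothetical, and it assumes that compressed files are gzip files with a `.gz` suffix and that every source file ends with a newline.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_embl_files(sources, target):
    """Append each source, in order, to `target`; .gz sources are decompressed."""
    target = Path(target)
    with target.open("wb") as out:
        for source in map(Path, sources):
            # Assumption: a .gz suffix means a gzip-compressed file.
            opener = gzip.open if source.suffix == ".gz" else open
            with opener(source, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return target
```

Giving the target a name ending in ".dat" and writing it into the project directory makes the combined file selectable in the drop-down list.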
This method helps to download files when new versions appear.
This requires that the addresses of the FTP servers be known to EmblEx. As explained previously, this information is provided by the user and is stored in a local database.
Simply click on the link "manage FTP servers" (Figure 8) on the EmblEx main page to open the window in which one or several FTP servers can be defined. You can also open this window by entering "http://localhost/cgi-bin/partiseq_cgi/scripts/cgi/tablemanager.cgi" in your browser.
Figure 11.1. Configuration window for automated ftp.
There are three possibilities: add or modify a server ("Add / modify ftp server"), or delete one ("Delete ftp server").
Figure 11.2. Choose between add, modify or delete ftp server from list.
If no FTP server has been defined yet, you will obviously have to choose "Add ftp server".
The list of FTP servers can be seen in the drop-down list on the left.
Figure 11.3. Consult the list of available servers.
If the list is empty you have to add at least one server in order to use the automated process.
Figure 11.4. Define server's parameters.
Once this is done, simply click on "OK".
In the window, it is then possible to select a server.
Figure 11.5. Select an available server.
Similarly, you may choose "Delete ftp server" to remove a server from the list (then click "OK").
NOTE 1: If the new server does not appear in the list, please refresh your browser window (Shift-refresh).
NOTE 2: If configuration parameters do not appear, please toggle on the radio button "Modify partiseq database".
If this directory is at the root of the Apache directory, EmblEx is opened simply with "http://localhost/emblex/index.html". If it is located in a subdirectory, the path has to be adapted accordingly.
EmblEx requires MySQL (v3.23 or later).
One or two databases will be created.
This is a very lightweight database that stores the EmblEx parameters; it is made of two tables.
This database is transparent to the user, since it is accessed when the user chooses the different options in the graphical interface. It would be rather unwise to try to modify values in these tables in any way other than through the graphical interface :-(.
This is the standard BioSQL database (http://cvs.open-bio.org/). We used this format since it then allows reuse of code developed within the bio-(Python|Java|Perl) community. If you are under Windows, EmblEx requires ActivePerl 5.6.1 (do not use a later package).
The script can be launched from the command line. You need to specify several options.
Example: file_parser.pl --option1 value1 --option2 value2
Options marked with * are required.
| name | description | value |
| basename | database name | by default : emblex |
| biosql_result | namespace for Biosql | by default : new |
| biosql_scheme | location of the BioSQL database schema | by default : ../../sql_DB/biosqldb-mysql.sql |
| biosql_taxo | insert or update taxonomy in bioSQL | 0 or 1 (by default : 0) |
| filename* | name of the EMBL file to parse (must be located in basename directory) | for example : toto.dat |
| FT_search | FT-Keys to search in, from the most important to the least important, separated by spaces | -, -10_signal, -35_signal, 3'clip, 3'utr, 5'clip, 5'utr, attenuator, c_region, caat_signal, cds, conflict, d-loop, d_segment, enhancer, exon, gap, gc_signal, gene, idna, intron, j_segment, ltr, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_rna, misc_signal, misc_structure, modified_base, mrna, n_region, old_sequence, operon, orit, polya_signal, polya_site, precursor_rna, prim_transcript, primer_bind, promoter, protein_bind, rbs, rep_origin, repeat_region, repeat_unit, rrna, s_region, satellite, scrna, sig_peptide, snorna, snrna, source, stem_loop, sts, tata_signal, terminator, transit_peptide, trna, unsure, v_region, v_segment, variation (by default : source) |
| ip | address of the mysql server | by default : 127.0.0.1 |
| keywords | keywords or regular expression | |
| login | login to connect to mysql | by default : root |
| mysql | leave this value at 1 | |
| namespace | it is used to retrieve data in bioSQL database | |
| output | name of file for generated data | by default : if no name is specified, results are displayed in the console |
| output_format | output format for text file and web page | embl_format, user_format (by default : embl_format) |
| partiseq | leave this value at 1 | |
| password | password to connect to mysql | empty by default |
| project_dir | location of EmblEx projects | by default : ../../projects/"basename_value" |
| resultstodb | save result into bioSQL database | 0 or 1 (by default : 0) |
| tablename | directory for generated data | by default : tmp |
| words | type of keywords to search | nokeywords, words, regexp (by default : nokeywords) |
| AC | accession number | 0 or 1 (by default : 1) |
| classification | classification | 0 or 1 (by default : 1) |
| id | id number | 0 or 1 (by default : 1) |
| release | version | 0 or 1 (by default : 1) |
| seq | sub sequence | 0 or 1 (by default : 1) |
| seq_length | sub sequence length | 0 or 1 (by default : 1) |
| sequence | complete sequence | 0 or 1 (by default : 1) |
| sequence_length | complete sequence length | 0 or 1 (by default : 1) |
| specie | species | 0 or 1 (by default : 1) |
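To show how the options in the table combine, the sketch below assembles a command line in the `--option value` form given earlier. It is a Python helper written for illustration only: `build_command` is a hypothetical name, the option values (`toto.dat` aside, which comes from the table) are invented, and passing the space-separated FT-Keys as a single quoted argument is an assumption about how the script reads them.

```python
import shlex


def build_command(script="file_parser.pl", **options):
    """Turn keyword arguments into `--option value` pairs."""
    if "filename" not in options:
        raise ValueError("filename is required (marked with * in the table)")
    command = [script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


command = build_command(
    filename="toto.dat",
    words="words",
    keywords="16S",
    FT_search="rrna source",  # assumption: FT-Keys passed as one argument
    output="results.txt",
    output_format="embl_format",
)
# shlex.join quotes arguments that contain spaces (Python 3.8 or later).
print(shlex.join(command))
```

Options that are left out keep the defaults listed in the table, for example `source` for FT_search and `embl_format` for output_format.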