Introduction

What is EmblEx ?

EMBL Extractor (EmblEx) is a software tool for molecular biologists and bioinformaticians. It consists of a series of Perl scripts and makes use of a MySQL database. EmblEx can be used through a graphical interface in a web browser or from the command line, under Linux or Windows.

Purpose of EmblEx.

EmblEx extracts data from the EMBL entries contained in a file, according to keywords found in each entry. The fields analyzed for each entry are the subdivisions of the FT fields (http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html). Complete entries, or the subsections that contain these keywords, can be saved as a web page, written to an ASCII file or stored directly in a BioSQL database.

Why EmblEx ?

Biological databases are growing constantly (more than 90,000 16S rRNA sequences for bacteria alone), which makes searches increasingly long and difficult. Web interfaces make it fairly easy to look up data for a given gene or transcript. The two principal search engines behind these interfaces are ACNUC and SRS. Both can be used through a web interface or from the command line, and both retrieve information from a list of keywords or from properties such as sequence length.

Before designing our own data retrieval tool, we first tried to use these two tools, but neither gave us satisfaction.

Differences between EmblEx, ACNUC and SRS.

Keyword searches are different. EmblEx presently searches only in the "features" fields, but it has a higher resolving power than either SRS or ACNUC (an example will be provided at the end of this document). It can be downloaded and installed on a regular PC in a matter of minutes. As it is written in Perl, a widely used language, it can be tailored to fit specific requirements. Inputs and outputs can be located on the same machine or distributed over the network. As in ACNUC, subsequences defined by specific keywords can be extracted automatically but, unlike in ACNUC, the full information of the entry can be retained. Finally, EmblEx can automatically check public repositories for new releases of the EMBL data files.

Requirements for EmblEx.

Before running EmblEx, you need MySQL, Apache, Perl and BioPerl. Most of these are already installed on most Linux systems. Under Microsoft Windows, Apache and MySQL are easily installed using EasyPHP. Perl and BioPerl are also very easy to install. A checklist is provided for both environments. For Mandrake Linux, the ordered list of dependencies is given; they can be retrieved from well-known servers or from our own server.

Hands on.

EmblEx is then started simply by entering http://localhost/emblex/index.html in your browser. The following window appears.


Figure 1. The main interface.

The main interface is composed of two parts: "Output" and "Extract Options".

  1. Output: lets you choose the output format (as explained above).
  2. Extract Options: lets you define the keywords you want to use.

The Output window.

EmblEx can write results in up to three output formats simultaneously.


Figure 2. Choose one or more output formats.

Web page

Select the "Web Page" radio button, as shown in Figure 3. A new window then appears, as shown in Figure 4.

Figure 4. Choose an output directory.

The following fields can be extracted as a web page: id, accession, release, classification, species, total sequence, total sequence length, extracted sequence, extracted sequence length. The result appears in a browser window.

This procedure can be used, for example, as a first step to check whether you have chosen the proper keywords, before extracting to a file or to BioSQL.

Text file.

Select the "Text file" radio button.

Figure 5. Choose a text file format.

You first need to provide the directory where this file will be created. It will be created if it does not exist.

Then indicate a file name for extraction. If this file already exists, it will be replaced.

The file is located at :
'apache_default_dir'/cgi-bin/partiseq_cgi/projects/EmblEx/'your_directory_name'/'your_file_name'

You may then choose between two different formats:
- EMBL format: the output file uses EMBL tagged lines, and all lines of the EMBL entry are written.
- user format: a window opens that lets you select which information will be written to the file.

BioSQL database.

Select the "Database" radio button.

Figure 6. Choose the BioSQL Database format.

You first need to indicate the 'spacename'.
Imported sequences are registered under a reference (a namespace). This reference makes it possible to retrieve a particular dataset. For example, one may want to store information for different genes in a BioSQL database and easily retrieve the information for one particular gene (for example 16S rRNA, rpoB, recA, 5.8S, etc.).

EMBL entries that contain the selected keywords will be extracted into the BioSQL database. If this database does not exist, it will be created.

Import into BioSQL relies on an "entries_update" script taken from BioPerl. If the entry does not exist in BioSQL, a new one is inserted; if the entry already exists but is older, it is updated (see the documentation of bioperl::BD::bioperl_db for more information). The first time you run the script, it is advisable to import the taxonomy into the BioSQL database (check "with taxonomy"). Note that this script can take a long time (from 10 minutes to many hours).

The "Extract Options" window.

Figure 7. Extract options.

This window contains three parts: the choice of the file to be parsed, the definition of the keywords, and the fields in which the keywords will be searched for.

EMBL file to be parsed.


Figure 8. Choose an input file.

EmblEx works on an ASCII file from EMBL. This file should be in the same directory as the EmblEx project, for example: "apache_directory"/cgi-bin/embl_cgi/projects/emblex/. It must have a ".dat" extension to be recognized by EmblEx. The drop-down list then lets you select one such file.

There are two ways of adding an EMBL file to an EmblEx project: simple or automatic. Simple is when you manually download an EMBL file to your computer by FTP. Automatic is when you ask EmblEx to check whether there is a new release on your favorite FTP site (see chapter « Adding a file of entries »).

  1. The "update file" button automates the update (or creation) of the entry file. This works only once available FTP servers have been declared, but by default the EMBL server is used (see chapter « Adding a file of entries »).
  2. The "manage FTP servers" link opens a new interface in which you can define your own FTP servers (see chapter « Adding a file of entries »).

The "Organism" field is optional and is active only if checked. You can type a word or an expression that will be searched for in the "OC" field of each EMBL entry. The "OC" field contains the taxonomy of the organism, so you can restrict your search to a specific taxonomic group.
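The "Organism" filter described above can be pictured with a short Python sketch. It is not EmblEx code (EmblEx itself is written in Perl): the function names and the sample entry are invented for illustration, and the case-insensitive comparison is an assumption, since the text only states case insensitivity for keyword searches.

```python
def organism_classification(entry_text):
    """Join the OC lines of one EMBL entry into a single lineage string."""
    parts = []
    for line in entry_text.splitlines():
        if line.startswith("OC"):
            parts.append(line[2:].strip())
    return " ".join(parts)


def matches_organism(entry_text, expression):
    """Substring test against the OC lineage (case-insensitive by assumption)."""
    return expression.lower() in organism_classification(entry_text).lower()


# Invented sample entry, reduced to the lines that matter here.
ENTRY = """ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 120 BP.
OS   Escherichia coli
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Enterobacteriaceae; Escherichia.
//"""

print(matches_organism(ENTRY, "enterobacteriaceae"))  # True
print(matches_organism(ENTRY, "Archaea"))             # False
```

Because the lineage is spread over several OC lines, the sketch joins them before searching, so that an expression is matched against the whole classification.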

Keyword selection.

This is the section in which you indicate which keywords will be used to select entries or sub-entries. There are three possibilities.

Figure 9. Keyword selection.

  1. "No keywords". Every entry of the entry file will be extracted.

  2. "Keywords". Enter a single keyword or a series of keywords separated by "|" or "&". Because these characters act as separators, a keyword may itself contain spaces. Each entry is then scanned, with "|" meaning OR and "&" meaning AND when combining keywords. The search is case-insensitive. You can use parentheses to group parts of your expression.

    For example : bacteria&(16s rna gene|16s rna)

  3. "Upload a keywords file". You can specify an ASCII file containing keywords. On each line, the keywords (single words or groups of words separated by spaces) must be separated by "&" for an AND search; each line is one alternative of the OR search.

    For example :
    16s rna gene&bacteria
    16s rna&bacteria

    The file is chosen by clicking on the "Select file" button.
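The keyword syntax described above can be made concrete with a minimal Python sketch. It is an illustration only, not the EmblEx implementation: the function names are hypothetical, a keyword is treated as a plain substring, and "&" is assumed to bind more tightly than "|" when no parentheses are given (the text does not state the precedence).

```python
import re


def tokenize(expression):
    """Split an expression into keywords and the operators ( ) | &."""
    tokens = []
    for piece in re.split(r"([()|&])", expression):
        piece = piece.strip()
        if piece:
            tokens.append(piece)
    return tokens


def matches(expression, text):
    """Return True if text satisfies the keyword expression (case-insensitive)."""
    tokens = tokenize(expression)
    haystack = text.lower()
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse, so the position advances
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in (")", "|", "&"):
            raise ValueError(f"unexpected {token!r}")
        return token.lower() in haystack

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def matches_keyword_file(lines, text):
    """Keyword file: '&' inside a line means AND, separate lines mean OR."""
    return any(matches(line, text) for line in lines if line.strip())


print(matches("bacteria&(16s rna gene|16s rna)", "16S rna gene, Bacteria"))  # True
print(matches_keyword_file(["16s rna gene&bacteria", "16s rna&bacteria"],
                           "23S rna, Archaea"))  # False
```

Under these assumptions, the two-line keyword file shown in the example is equivalent to the single expression "(16s rna gene&bacteria)|(16s rna&bacteria)".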

Searching keywords in a field.

Figure 10. Searching keywords in a field.

After having defined your keyword(s), you have to define where you want to search for them.

An EMBL entry is made up of lines identified by two-letter codes (ID, AC, SV, DT, DE, KW, OS, OC, FT, etc.). The FT (feature table) lines are in turn organized by feature keys (FT-Keys). (A description of the feature table is available at: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html)

The available feature keys (the list of FT-Keys) are shown in the list on the left. The arrows in the middle move the selection to the list on the right (or back). The search is sequential: if a keyword is found in the first field, the remaining fields are not searched. A well-chosen order of the fields can therefore increase extraction speed. In addition, the boundaries given for the matching feature define the subsequence that is extracted.
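The sequential search and the extraction of the subsequence can be sketched as follows in Python. This is a simplified illustration, not EmblEx code: the feature dictionaries, their field names and the sample sequence are invented, and only simple "start..end" locations are handled (no complement or join).

```python
def first_matching_feature(features, ordered_keys, keyword):
    """Return the first feature, in field priority order, that mentions keyword."""
    needle = keyword.lower()
    for key in ordered_keys:                      # fields are tried in the chosen order
        for feature in features:
            if feature["key"].lower() == key.lower() and needle in feature["text"].lower():
                return feature                    # stop at the first hit
    return None


def extract_subsequence(sequence, feature):
    """Cut out the 1-based, inclusive range given by the feature boundaries."""
    return sequence[feature["start"] - 1:feature["end"]]


# Invented, already-parsed features of one entry.
FEATURES = [
    {"key": "source", "start": 1, "end": 20, "text": "organism=Escherichia coli"},
    {"key": "rRNA", "start": 5, "end": 12, "text": "product=16S ribosomal RNA"},
    {"key": "gene", "start": 3, "end": 14, "text": "gene=rrsA; note=16S ribosomal RNA gene"},
]
SEQUENCE = "aaccggttaaccggttaacc"

hit = first_matching_feature(FEATURES, ["rrna", "gene"], "16s ribosomal rna")
print(hit["key"], extract_subsequence(SEQUENCE, hit))  # rRNA ggttaacc
```

With the order reversed (["gene", "rrna"]), the "gene" feature would be found first and its wider boundaries would define the extracted subsequence, which is why the order of the fields matters.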

Adding a file of entries.

EmblEx requires a file containing entries in the EMBL format. This file must be located in the same directory as the EmblEx project (e.g. "C:\Program Files\EasyPHP1-7\cgi-bin\partiseq_cgi\projects\emblex"), and must have the ".dat" extension.
There are two methods for adding an EMBL file to an EmblEx project.

Manual method.

The user connects by FTP to EMBL or to a mirror site and downloads a file (do not forget to check that you are using text mode). Files must be in ASCII (uncompress them if necessary).

If you wish to use more than one file, they need to be concatenated into a single file.
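Concatenation can be done with any tool; as one possible approach, here is a small Python sketch. The function name and the file names are illustrative only, and the destination must not be one of the source files.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_embl_files(sources, destination):
    """Append each source file, in order, to one destination .dat file."""
    destination = Path(destination)
    with destination.open("wb") as out:
        for source in sources:
            with Path(source).open("rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return destination


# Small demonstration on two invented one-entry files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "pro01.dat").write_text("ID   A\n//\n")
    (tmp / "pro02.dat").write_text("ID   B\n//\n")
    merged = concatenate_embl_files([tmp / "pro01.dat", tmp / "pro02.dat"],
                                    tmp / "all_pro.dat")
    print(merged.read_text().count("//"))  # 2
```

The sketch copies bytes unchanged, so each source file is expected to end with the usual "//" terminator line followed by a newline.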

Once this is done, every .dat file located in the proper directory appears in the drop-down list described in section « EMBL file to be parsed » (Figure 8).

Automated method.

This method helps to download files when new versions appear.
It requires that the addresses of the FTP servers be known to EmblEx. As explained previously, this information is provided by the user and is stored in a local database.

Simply click on the "manage FTP servers" link (Figure 8) on the EmblEx main page to open the window in which you can define one or several FTP servers. You can also reach this window by entering "http://localhost/cgi-bin/partiseq_cgi/scripts/cgi/tablemanager.cgi" in your browser.

Figure 11.1. Configuration window for automated ftp.

There are three possibilities (add, modify or delete), offered as two choices: "Add / modify ftp server" and "Delete ftp server".

Figure 11.2. Choose between add, modify or delete ftp server from list.

If no FTP server has been defined yet, you will obviously have to choose "Add ftp server".

The list of FTP servers can be seen in the drop-down list on the left.

Figure 11.3. Consult the list of available servers.

If the list is empty you have to add at least one server in order to use the automated process.


Figure 11.4. Define server's parameters.

  1. "ftp server url". Address of the FTP server.
    For example : if the file is in ftp://ftp.infobiogen.fr/pub/db/embl/EB/, the server name is ftp.infobiogen.fr
  2. "ftp location for files". Path, on the FTP server, of the files to be downloaded.
    For example : if the file is in ftp://ftp.infobiogen.fr/pub/db/embl/EB/, the location is /pub/db/embl/EB/
  3. "list of files to download". List of the files you wish to download. If you wish to download several files, separate the file names with spaces (example: « pro01.dat pro02.dat pro03.dat pro04.dat pro05.dat pro06.dat » or « fun.dat »).
  4. "name of EMBL file". Name of the local file.
  5. "anonymous ftp". If you choose "yes", you do not have to enter a login and password (6 and 7). If the FTP server requires a user name, you must choose "no" and enter a valid login and password.
  6. "login to access ftp server". It is "anonymous" if you chose anonymous access; otherwise a real login can be specified.
  7. "password to access ftp server". With anonymous access, it cannot be changed.
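The split between the server name (field 1) and the location (field 2) can be checked with Python's standard urllib.parse module. This is only a convenience sketch for preparing the values to type into the form; the function name is hypothetical and EmblEx does not require it.

```python
from urllib.parse import urlsplit


def split_ftp_url(url):
    """Split an FTP URL into (server name, location of the files)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path


server, location = split_ftp_url("ftp://ftp.infobiogen.fr/pub/db/embl/EB/")
print(server)    # ftp.infobiogen.fr
print(location)  # /pub/db/embl/EB/
```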

Once this is done, simply click on "OK".

In the window, it is then possible to select a server.


Figure 11.5. Select an available server.

Similarly, you may choose "Delete ftp server" to remove a server from the list (then click "OK").

NOTE 1: If the new server does not appear in the list, please refresh your browser window (Shift-refresh).
NOTE 2: If the configuration parameters do not appear, please select the radio button "Modify partiseq database".

EmblEx components

Description of files and directories.

In the Apache directory that contains web pages (www or http).

  1. index.html: main page of EmblEx.
  2. INSTALL: detailed instructions for installing EmblEx.
  3. other: dependencies required for installation.
  4. help: various help files.

If this directory is at the root of the Apache directory, EmblEx is opened simply with "http://localhost/emblex/index.html". If it is located in a subdirectory, the path has to be adapted accordingly.

In the "cgi-bin" directory of Apache.


MySQL

EmblEx requires MySQL (v3.23 or later).

One or two databases will be created.

The first is a very lightweight database that stores the EmblEx parameters; it is made of two tables:

  1. DB_list. Contains the list of the EmblEx projects (as defined above).
  2. embl_files_list. Contains the parameters for accessing FTP servers, required for automatic updates from anonymous public servers.

This database is transparent to users, as it is accessed when the user chooses the different options in the graphical interface. It would be rather unwise to try to modify values in these tables by any means other than the graphical interface :-(.

The second is the standard BioSQL database (http://cvs.open-bio.org/). We used this format since it allows reuse of code developed within the bio-(Python|Java|Perl) community. If you are under Windows, EmblEx requires ActivePerl 5.6.1 (do not use a later package).

Command line

The script can be launched from the command line. You need to specify several options.
Example : file_parser.pl --option1 value1 --option2 value2

Options :

Options marked with * are required.

name             description                                     value
basename         database name                                   by default: emblex
biosql_result    namespace for BioSQL                            by default: new
biosql_scheme    location of the BioSQL database schema          by default: ../../sql_DB/biosqldb-mysql.sql
biosql_taxo      insert or update taxonomy in BioSQL             0 or 1 (by default: 0)
filename*        name of the EMBL file to parse                  for example: toto.dat (must be located in the basename directory)
FT_search        FT-Keys to search in, by decreasing importance  space-separated FT-Keys from the list below (by default: source)
ip               address of the MySQL server                     by default: 127.0.0.1
keywords         keywords or regular expression
login            login to connect to MySQL                       by default: root
mysql            leave this value at 1
namespace        used to retrieve data in the BioSQL database
output           name of the file for generated data             by default: if no name is specified, results are displayed in the console
output_format    output format for text file and web page        embl_format, user_format (by default: embl_format)
partiseq         leave this value at 1
password         password to connect to MySQL                    empty by default
project_dir      location of EmblEx projects                     by default: ../../projects/"basename_value"
resultstodb      save results into the BioSQL database           0 or 1 (by default: 0)
tablename        directory for generated data                    by default: tmp
words            type of keywords to search                      nokeywords, words, regexp (by default: nokeywords)
AC               accession number                                0 or 1 (by default: 1)
classification   classification                                  0 or 1 (by default: 1)
id               id number                                       0 or 1 (by default: 1)
release          version                                         0 or 1 (by default: 1)
seq              subsequence                                     0 or 1 (by default: 1)
seq_length       subsequence length                              0 or 1 (by default: 1)
sequence         complete sequence                               0 or 1 (by default: 1)
sequence_length  complete sequence length                        0 or 1 (by default: 1)
specie           species                                         0 or 1 (by default: 1)

FT-Keys accepted by FT_search:
-, -10_signal, -35_signal, 3'clip, 3'utr, 5'clip, 5'utr, attenuator, c_region, caat_signal, cds, conflict,
d-loop, d_segment, enhancer, exon, gap, gc_signal, gene, idna, intron, j_segment, ltr, mat_peptide,
misc_binding, misc_difference, misc_feature, misc_recomb, misc_rna, misc_signal, misc_structure,
modified_base, mrna, n_region, old_sequence, operon, orit, polya_signal, polya_site, precursor_rna,
prim_transcript, primer_bind, promoter, protein_bind, rbs, rep_origin, repeat_region, repeat_unit, rrna,
s_region, satellite, scrna, sig_peptide, snorna, snrna, source, stem_loop, sts, tata_signal, terminator,
transit_peptide, trna, unsure, v_region, v_segment, variation
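As an illustration of how these options combine, the following Python sketch assembles a command line in the "--option value" form shown above. It is a hedged example, not part of EmblEx: the helper name and the output file name are hypothetical, calling the script through "perl" is an assumption, and so is passing several FT-Keys as one space-separated argument.

```python
import shlex


def build_command(options, script="file_parser.pl"):
    """Build the argument list for file_parser.pl from a dict of options."""
    if "filename" not in options:
        raise ValueError("the filename option is required")
    argv = ["perl", script]            # assumption: the script is run through perl
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_command({
    "filename": "toto.dat",
    "words": "words",
    "keywords": "bacteria&(16s rna gene|16s rna)",
    "FT_search": "rrna gene",          # assumption: one argument, keys separated by spaces
    "output": "results.txt",           # hypothetical output file name
})
print(shlex.join(argv))                # shell-quoted form, safe to paste in a terminal
```

Quoting matters here because "&", "|" and parentheses are special characters for most shells; building the argument list first and quoting it with shlex.join avoids having the shell interpret the keyword expression.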