The Main Sequence Databases

EBI
NCBI
DDBJ
PBIL

At EBI

EMBL


The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis. The current database release (Release 78, March 2004), together with its release notes and user manual, is available from the EBI servers. A sample database entry is shown here.


UniProt


UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.


Swiss-Prot


The UniProt/Swiss-Prot Protein Knowledgebase is an annotated protein sequence database established in 1986.

The UniProt/Swiss-Prot Protein Knowledgebase is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. It is part of UniProt, a "one-stop shop" that allows easy access to all publicly available protein sequence annotation.

It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).


InterPro


InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.

Further information on InterPro can be found in its documentation.


TrEMBL


UniProt/TrEMBL is a computer-annotated protein sequence database complementing the UniProt/Swiss-Prot Protein Knowledgebase.

UniProt/TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein sequences extracted from the literature or submitted to UniProt/Swiss-Prot.
The database is enriched with automated classification and annotation.
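
To make the idea of "translating a coding sequence" concrete, here is a minimal Python sketch that translates a CDS with the standard genetic code. It is only an illustration, not the TrEMBL production pipeline: the function name `translate_cds` and the example sequence are hypothetical, and real entries may need a different genetic code or special handling of partial codons and multi-segment feature locations.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration; RNA input (U) is accepted.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons containing ambiguous bases are reported as X.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # prints MAK
```

The table is built from the 64-character amino acid string rather than typed out codon by codon, which keeps the sketch short and easy to check against the published table.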


Ensembl



Ensembl is a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Access to all the software and data is free and without constraints of any kind. The project is primarily funded by the Wellcome Trust. It is a comprehensive source of stable annotation with confirmed gene predictions that have been integrated from external data sources. Ensembl annotates known genes and predicts new ones, with functional annotation from InterPro, OMIM, SAGE and gene families.


Genome Reviews



The EBI Genome Reviews are curated versions of entries in the EMBL/GenBank/DDBJ nucleotide sequence databases representing the complete sequences of chromosomes and plasmids. Each Genome Review represents an enhanced version of the original sequence, with additional annotation imported from other data sources such as the UniProt Knowledgebase, the GOA (GO Annotation) project, InterPro, etc. In addition, annotations used inconsistently among the original submissions have been standardised, or deleted in cases where the coverage is low.

Genome Reviews v1.0 was released on May 10th, 2004.


ASD


The Alternative Splicing Database (ASD) Project aims to understand the mechanism of alternative splicing on a genome-wide scale by creating a database of alternatively spliced exons from human and other model species.

At the moment three databases are available: AltSplice, AltExtron and AEdb. AltExtron is a computer-generated, high-quality data set of alternatively spliced human genes and their properties; AEdb is its manually curated (from the literature) equivalent. The long-term plan is to solicit web submission of data to AEdb from laboratory scientists. AltSplice implements a computational pipeline (for detailed detection and characterisation of splice variants) to production standards.

Other satellite databases generated by members of the ASD consortium will be posted in due course. Currently, the computationally generated AltSplice database has been integrated with the manually curated AEdb database. This integration adds the value of literature evidence to computationally predicted isoform splice events.



GOA



GOA is a project run by the European Bioinformatics Institute that aims to provide assignments of gene products to the Gene Ontology (GO) resource.

The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms, even while knowledge of gene and protein roles in cells is still accumulating and changing. In the GOA project, this vocabulary will be applied to a non-redundant set of proteins described in the UniProt Resource (Swiss-Prot/TrEMBL/PIR-PSD) and Ensembl databases that collectively provide complete proteomes for Homo sapiens and other organisms.

In the first stage of this project, GO assignments have been applied to a data set representing the human proteome by a combination of electronic mappings and manual curation. Subsequently, GO assignments for all complete and incomplete proteomes that exist in UniProt have been provided. GOA will be updated monthly in accordance with the latest data released by the primary data sources.



IntEnz


IntEnz (the Integrated relational Enzyme database) is the most up-to-date version of the Enzyme Nomenclature. The Enzyme Nomenclature consists of the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) on the nomenclature and classification of enzyme-catalysed reactions.
IntEnz is supported by NC-IUBMB and contains enzyme data curated and approved by this committee.

Further information on IntEnz can be found in the documentation, which links to:
Classification and Nomenclature of Enzyme-Catalysed Reactions
Sample Entry
About IntEnz



Pandit


PANDIT is a collection of multiple sequence alignments and phylogenetic trees covering many common protein domains. It contains:
the seed protein sequence alignments from the Pfam-A (curated families) database (version 12.0)
nucleotide sequence alignments derived from sequences available for the above and using the protein alignments as ‘templates’
protein sequence alignments restricted to the family members for which nucleotide sequences are available
inferred phylogenetic trees for each alignment
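
The second item in the list above, using a protein alignment as a template for a nucleotide alignment, can be sketched as follows. This is a simplified illustration under stated assumptions, not PANDIT's actual procedure: the function name `thread_codons` and the toy sequences are hypothetical, and each coding sequence is assumed to start in frame with exactly one codon per aligned residue.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an unaligned CDS onto a gapped protein sequence, codon by codon.

    Each residue consumes the next codon; each protein gap becomes '---'.
    """
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    out = []
    pos = 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy data (hypothetical): two aligned proteins and their coding sequences.
protein_alignment = {"seq1": "MA-K", "seq2": "M-GK"}
coding_sequences = {"seq1": "ATGGCCAAA", "seq2": "ATGGGCAAA"}

nucleotide_alignment = {
    name: thread_codons(protein_alignment[name], coding_sequences[name])
    for name in protein_alignment
}
for name, row in nucleotide_alignment.items():
    print(name, row)
```

Because gaps are inserted in whole codons, the resulting nucleotide alignment keeps the reading frame of every sequence intact.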



At NCBI

RefSeq


The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. RefSeqs are used as a reagent for the functional annotation of some genome sequencing projects, including those of human and mouse.

Protein


The protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including Swiss-Prot, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

GenBank


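Identical in content to EMBL: the two databases, together with DDBJ, exchange all new and updated entries daily (see the EMBL entry above).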

UniGene


UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

Homologene


HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.

UniSTS


UniSTS is an NCBI resource that reports information about markers, or Sequence Tagged Sites (STS).

For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to LocusLink, dbSNP, RHdb, GDB, MGD, and the Entrez Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Marker data, e-PCR and mapping data are available from the FTP site.
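
The idea behind an electronic PCR search can be sketched in a few lines of Python: look for the forward primer and the reverse complement of the reverse primer in a sequence, and keep the pairs whose spacing matches the expected product size. This is a minimal sketch with exact matching on one strand only, and it is not NCBI's e-PCR program; the function names, the `margin` parameter and the toy sequences are all hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[b] for b in reversed(seq.upper()))


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, product_size) for each exact-match hit on the plus strand.

    A hit is a forward primer followed by the reverse primer's binding site,
    with a product size within `margin` bases of `expected_size`.
    """
    sequence = sequence.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = sequence.find(reverse_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


# Toy template (hypothetical): forward primer, 8-base spacer, reverse site.
template = "GG" + "ACGTAC" + "T" * 8 + "TGCCAA" + "GG"
print(find_sts(template, "ACGTAC", "TTGGCA", expected_size=20, margin=2))
```

A practical tool would also search the minus strand and tolerate a few mismatches or gaps in the primers; both are left out here to keep the sketch short.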

UniSTS integrates marker and mapping data from public resources including GenBank, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson Laboratory's MGD map).

dbEST



dbEST (Nature Genetics 4:332-3;1993) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

TPA


A TPA sequence is derived or assembled from primary sequence data currently found in the DDBJ/EMBL/GenBank International Nucleotide Sequence Collaboration Databases. It can be genomic or mRNA sequence, and can be assembled or derived from primary genomic and/or mRNA sequences. These sequences are submitted to DDBJ/EMBL/GenBank as part of the process of publishing biological experiments that include the annotation of existing nucleotide sequences in the primary sequence database. Thus, a publicly accessible TPA record will be linked to a publication that documents that the data are supported by biological experimentation.

Examples of TPA sequences are:
mRNA assembled from overlapping EST sequences.
mRNA derived from an unannotated section of genomic sequence by comparison with another known mRNA from a different organism.
mRNA assembled from overlapping EST sequences, other partial mRNAs, and/or genomic sequences.
previously unannotated genomic sequence now described with the exons, introns, and coding region information (CDS) of a new gene.
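
The first example above, an mRNA assembled from overlapping EST sequences, can be illustrated with a toy overlap merge. The sketch assumes error-free reads that are already ordered 5' to 3' and overlap exactly; real EST assembly has to cope with sequencing errors, unknown order and both strands. The function names and the three toy reads are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads, min_overlap: int = 3) -> str:
    """Chain ordered reads into one contig by merging exact overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError("no sufficient overlap between consecutive reads")
        # Append only the part of the read that extends the contig.
        contig += read[size:]
    return contig


# Toy ESTs (hypothetical), each overlapping the next by five bases.
ests = ["ATGGCCAAAG", "CAAAGTTCGA", "TTCGATTAA"]
print(merge_reads(ests))  # prints ATGGCCAAAGTTCGATTAA
```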

PopSet


A PopSet is a set of DNA sequences that have been collected to analyse the evolutionary relatedness of a population. The population could originate from different members of the same species, or from organisms from different species. They are submitted to GenBank via Sequin, often as a sequence alignment.

GSS


The GSS division of GenBank is similar to the EST division, with the exception that most of the sequences are genomic in origin, rather than cDNA (mRNA). It should be noted that two classes (exon trapped products and gene trapped products) may be derived via a cDNA intermediate. Care should be taken when analyzing sequences from either of these classes, as a splicing event could have occurred and the sequence represented in the record may be interrupted when compared to genomic sequence.

SNP


SNP stands for "single nucleotide polymorphism". SNPs are the most common genetic variations and occur once every 100 to 300 bases. A key aspect of research in genetics is the association of sequence variation with heritable phenotypes. It is expected that SNPs will accelerate the identification of disease genes by allowing researchers to look for associations between a disease and specific differences (SNPs) in a population. This differs from the more typical approach of pedigree analysis, which tracks transmission of a disease through a family. It is much easier to obtain DNA samples from a random set of individuals in a population than it is to obtain them from every member of a family over several generations. Once discovered, these polymorphisms can be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions.

dbSTS


dbSTS is an NCBI resource that contains sequence and mapping data on short genomic landmark sequences, or Sequence Tagged Sites (STS).

Genomes


The whole genomes of over 1000 viruses and over 100 microbes can be found in Entrez Genome. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses and organelles.

Gene


Gene provides a unified query environment for genes defined by sequence and/or in NCBI's Map Viewer. You can query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode.

Because Gene is now an Entrez database, all the familiar and useful functions are now available, including Preview/Index, History, and LinkOut.

LocusLink


LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.
NOTE: LocusLink is being replaced by Gene.

COG


Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
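
The "at least 3 lineages" criterion can be expressed as a simple filter over candidate clusters. The sketch below only illustrates that final criterion, not how COGs are actually delineated from genome comparisons; the record layout, function names, cluster identifiers and lineage labels are all hypothetical.

```python
from collections import defaultdict


def lineages_per_cluster(assignments):
    """Group (cluster_id, protein_id, lineage) records by cluster."""
    clusters = defaultdict(set)
    for cluster_id, _protein_id, lineage in assignments:
        clusters[cluster_id].add(lineage)
    return clusters


def qualifying_clusters(assignments, min_lineages=3):
    """Return the sorted ids of clusters spanning at least `min_lineages` lineages."""
    clusters = lineages_per_cluster(assignments)
    return sorted(
        cid for cid, lineages in clusters.items() if len(lineages) >= min_lineages
    )


# Hypothetical candidate clusters: paralogs count once per lineage.
records = [
    ("cluster_A", "protA1", "lineage_1"),
    ("cluster_A", "protA2", "lineage_2"),
    ("cluster_A", "protA3", "lineage_3"),
    ("cluster_A", "protA4", "lineage_3"),  # paralog within the same lineage
    ("cluster_B", "protB1", "lineage_1"),
    ("cluster_B", "protB2", "lineage_2"),
]
print(qualifying_clusters(records))  # prints ['cluster_A']
```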


MGC


The goal of the Mammalian Gene Collection (MGC), a trans-NIH initiative, is to provide full-length open reading frame (FL-ORF) clones for human, mouse, and rat genes. All MGC sequences are deposited in GenBank, and the clones can be purchased from distributors of the IMAGE consortium.


PBIL

HOVERGEN

HOVERGEN is a database of homologous vertebrate genes, structured under the ACNUC sequence database management system. It allows one to select sets of homologous genes among vertebrate species, and to visualize multiple alignments and phylogenetic trees. Thus HOVERGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOVERGEN gives an overall view of what is known about a particular gene family.

HOBACGEN

HOBACGEN is a database system that contains all the protein sequences of bacteria organized into families. It allows one to select sets of homologous genes from bacterial species and to visualize multiple alignments and phylogenetic trees. Thus HOBACGEN is particularly useful for comparative genomics, phylogeny and molecular evolution studies on bacteria.

HOGENOM

HOGENOM is a database of homologous genes from fully sequenced organisms, structured under the ACNUC sequence database management system. It allows one to select sets of homologous genes among species, and to visualize multiple alignments and phylogenetic trees. Thus HOGENOM is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOGENOM gives an overall view of what is known about a particular gene family.

NUREBASE

NUREBASE is a reference database on Nuclear Hormone Receptors.

RTKdb

RTKdb is a database dedicated to receptor tyrosine kinases. The work is shared between the 'Centre de Génétique Moléculaire et Cellulaire' (CGMC) and the laboratory of 'Biométrie et Biologie Evolutive' (BBE). The site is hosted by the 'Pôle Bio-Informatique Lyonnais' (PBIL).

HCVDB

The Hepatitis C Virus DataBase (HCVDB) is a project of the "Réseau National Hépatites" (RNH).
The aim of HCVDB is to establish correlations between virus sequences and pathology.

EMGLib

EMGLib is a database devoted to the completely sequenced bacterial genomes and the yeast genome. Starting from the sequences available in the "genome" division of GenBank, its authors have improved and corrected the annotations and structured the flat files using the ACNUC database management system.


DDBJ

GIB

GIB (Genome Information Broker) is a comprehensive data repository of complete microbial genomes.

GTOP


GTOP is a database built by the Laboratory of Gene-product Informatics at the National Institute of Genetics consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and features extensive utilization of information on three-dimensional structures.

Human Genomics Studio


The "DDBJ/CIB Human Genomics Studio" project, started in April 2000, developed an original method for assembling human genome sequence data into contigs, and used it to create more accurate chromosome sequences from the genome sequence data registered in, and released by, the international DNA databases DDBJ/EMBL/GenBank.