THEA: ontology-driven analysis of microarray data

Pasquier C, Girardot F, Jevardat de Fombelle K, Christen R.
Bioinformatics. 2004 Nov 1;20(16):2636-43

Supplementary materials

Data pre-processing

The CEL files were downloaded from the "Genome-wide Expression Patterns of Drosophila in Response to Immune Challenge Homepage" (http://www.fruitfly.org/expression/immunity/) and processed with Bioconductor's affy package. Bioconductor is open source software that can be downloaded from http://www.bioconductor.org/; the affy package is dedicated to the analysis of Affymetrix data. The releases used were Bioconductor 1.2 and affy 1.2.30 (for a review of Bioconductor's features, see for example Dudoit et al., 2003). Expression indexes were calculated with the 'rma' method, using the default options as described by Irizarry et al. (2003a). This method is arguably one of the most pertinent ways to process Affymetrix array measurements so far (for comparisons, see Irizarry et al., 2003b, and http://affycomp.biostat.jhsph.edu/).

All possible treated vs control (i.e. infected vs uninfected) ratios were computed. These results were then submitted to a SAM multiclass analysis in order to select the genes showing statistically significant variation of expression across experimental conditions (Tusher et al., 2001; software downloadable from http://www-stat.stanford.edu/~tibs/SAM/). The chosen parameters ensured that fewer than 1% false positives were selected and led to the selection of 1623 probe-sets, of which only those showing a mean fold-change of at least 1.3 in at least one comparison were retained for further analysis. This further reduced the data to 1290 probe-sets, corresponding to 1277 independent gene products. For each of these selected probe-sets and each experimental condition, the mean of the logged expression values across replicates was calculated; these means were then used to calculate the different infected/uninfected logged expression ratios. This dataset was then uploaded to the GEPAS web site (http://gepas.bioinfo.cnio.es/, Herrero et al., 2003a), where the preprocessor (Herrero et al., 2003b) was used to merge the replicates (using the median of the ratios) and to generate input files for the different classification programs available on the same server. We chose to perform a SOTA analysis (Herrero et al., 2001) using the 'correlation coefficient (linear)' metric and '90% variability' as the end-of-training condition. The generated tree was loaded into THEA and analyzed using the program's features.
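
The ratio calculation and fold-change filter described above can be illustrated with a minimal Python sketch. It is not the code used for the paper, which relied on Bioconductor, SAM and GEPAS. The function names, condition labels and probe-set values are hypothetical. The sketch assumes that expression values are on a log2 scale (as produced by rma) and that the 1.3 fold-change threshold applies in either direction (induction or repression).

```python
import math
from statistics import mean


def log_ratios(log_expr, pairs):
    """Compute infected/control log2 ratios for one probe-set.

    log_expr: {condition: [log2 expression values across replicates]}
    pairs:    list of (infected_condition, control_condition) names
    A ratio on the log scale is a difference of the per-condition means.
    """
    means = {cond: mean(values) for cond, values in log_expr.items()}
    return {(inf, ctl): means[inf] - means[ctl] for inf, ctl in pairs}


def passes_fold_change(ratios, min_fold=1.3):
    """True if at least one comparison reaches the fold-change threshold.

    Assumption: the threshold is applied to the absolute log2 ratio,
    so both up- and down-regulation count.
    """
    threshold = math.log2(min_fold)
    return any(abs(r) >= threshold for r in ratios.values())


# Hypothetical data: two probe-sets, one control and one infected condition.
data = {
    "probe_A": {"control": [7.0, 7.2], "infected_6h": [8.1, 8.3]},
    "probe_B": {"control": [9.0, 9.1], "infected_6h": [9.1, 9.2]},
}
pairs = [("infected_6h", "control")]

# Keep only the probe-sets that pass the filter in at least one comparison.
selected = [
    name for name, expr in data.items()
    if passes_fold_change(log_ratios(expr, pairs))
]
```

In this toy example only probe_A is retained: its log2 ratio is about 1.1 (roughly a 2.1-fold change), whereas probe_B changes by about 0.1 on the log2 scale, below log2(1.3).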

Dudoit S, Gentleman RC, Quackenbush J. (2003) Open source software for the analysis of microarray data. Biotechniques. Suppl:45-51.

Herrero J, Valencia A, Dopazo J. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics. 17(2):126-36.

Herrero J, Al-Shahrour F, Diaz-Uriarte R, Mateos A, Vaquerizas JM, Santoyo J, Dopazo J. (2003a) GEPAS: A web-based resource for microarray gene expression data analysis. Nucleic Acids Res. 31(13):3461-7.

Herrero J, Diaz-Uriarte R, Dopazo J. (2003b) Gene expression data preprocessing. Bioinformatics. 19(5):655-6.

Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. (2003a) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31(4):e15.

Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003b) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 4(2):249-64.

Tusher VG, Tibshirani R, Chu G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 98(9):5116-21. Erratum: Proc Natl Acad Sci U S A. 2001 98(18):10515.

Available Data files

File: Description
RMA data: Expression levels calculated with the RMA method.
Expression ratios: Infected/uninfected logged expression ratios of the replicates merged by GEPAS (using the median of the ratios).
SOTA tree: Result of the SOTA analysis (Herrero et al., 2001) using the 'correlation coefficient (linear)' metric and '90% variability' as the end-of-training condition.
Table_S1.rtf: Statistics of the namings made by THEA using the cutoffs described in the legend of Figure 3 of the paper (MF: Molecular Function, CC: Cellular Component, BF: Biological Function, DGA: Drosophila Gross Anatomy, DDS: Drosophila Developmental Stage).
Table_S2.rtf: Comparison of THEA and other related programs.