Problems with Gene Expression Data and Solutions under the Bayesian Framework

Abstract: Microarray technologies allow one to simultaneously measure the expression levels of thousands of genes in a biological sample. Microarrays have been widely used over the past few years and have thereby tremendously advanced our biological knowledge at a genomic scale. However, producing good-quality expression data and analyzing large data sets are themselves challenging, as one has to account for the various sources of variability and error that may arise during sample preparation, hybridization, and scanning.

Analyzing expression data is usually treated as a multistep process in which the output of one step serves as the input to the next. The steps involved are correcting intensities for background noise, normalizing both within and across arrays, assessing which genes are differentially expressed, and clustering genes or conditions with similar expression profiles or patterns. A drawback of splitting the analysis of gene expression data into steps that are dealt with independently is that the error associated with each step is ignored in the downstream analysis. An integrated analysis replaces the multistep analysis while accounting for the uncertainties in the data generation and analysis process.
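
As a rough illustration of the multistep workflow described above, the following sketch chains background correction, across-array normalization, and a per-gene test on simulated intensities. Everything in it is an illustrative assumption rather than the method used in this work: the function names, the simulated data, the floor value, median centering as the normalization, and the Welch t-statistic as the test are all hypothetical choices. Within-array normalization and clustering are omitted for brevity. The point is only that each step hands a point estimate to the next, so the uncertainty of the earlier steps never reaches the final gene ranking.

```python
import numpy as np

def background_correct(foreground, background, floor=1.0):
    """Subtract background; floor at a small positive value (hypothetical choice)."""
    return np.maximum(foreground - background, floor)

def normalize_across_arrays(log_intensity):
    """Center each array (column) on its own median log intensity."""
    return log_intensity - np.median(log_intensity, axis=0, keepdims=True)

def welch_t(treated, control):
    """Per-gene Welch t-statistic; rows are genes, columns are arrays."""
    mean_diff = treated.mean(axis=1) - control.mean(axis=1)
    var_t = treated.var(axis=1, ddof=1) / treated.shape[1]
    var_c = control.var(axis=1, ddof=1) / control.shape[1]
    return mean_diff / np.sqrt(var_t + var_c)

def multistep_pipeline(foreground, background, n_control):
    """Run the steps in sequence; the first n_control columns are controls."""
    corrected = background_correct(foreground, background)
    normalized = normalize_across_arrays(np.log2(corrected))
    # Only point estimates are passed on: the uncertainty in the background
    # and array-effect estimates is dropped before the test is computed.
    return welch_t(normalized[:, n_control:], normalized[:, :n_control])

# Simulated data (assumed for illustration): 200 genes, 4 arrays per group.
rng = np.random.default_rng(0)
n_genes, n_per_group = 200, 4
gene_level = rng.normal(8.0, 1.0, size=(n_genes, 1))
true_log = gene_level + rng.normal(0.0, 0.2, size=(n_genes, 2 * n_per_group))
true_log[:10, n_per_group:] += 2.0  # first 10 genes are up-regulated
background = rng.uniform(20.0, 40.0, size=true_log.shape)
foreground = 2.0 ** true_log + background

t_stats = multistep_pipeline(foreground, background, n_per_group)
candidate_genes = np.argsort(-np.abs(t_stats))[:10]  # ranking by |t|
```

An integrated model would instead estimate the background, array effects, and differential expression jointly, so that the final gene list reflects all of these uncertainties together.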

We focus on the errors that arise at the time of scanning and introduce a Bayesian integrated model for analyzing expression data that improves data quality, estimates array effects, and finally suggests how to choose a list of genes for further investigation.