Bayesian non-negative factor analysis for reconstructing transcription factor mediated regulatory networks
© Meng et al; licensee BioMed Central Ltd. 2011
Published: 14 October 2011
Transcriptional regulation by transcription factor (TF) controls the time and abundance of mRNA transcription. Due to the limitation of current proteomics technologies, large scale measurements of protein level activities of TFs is usually infeasible, making computational reconstruction of transcriptional regulatory network a difficult task.
We proposed here a novel Bayesian non-negative factor model for TF mediated regulatory networks. Particularly, the non-negative TF activities and sample clustering effect are modeled as the factors from a Dirichlet process mixture of rectified Gaussian distributions, and the sparse regulatory coefficients are modeled as the loadings from a sparse distribution that constrains its sparsity using knowledge from database; meantime, a Gibbs sampling solution was developed to infer the underlying network structure and the unknown TF activities simultaneously. The developed approach has been applied to simulated system and breast cancer gene expression data. Result shows that, the proposed method was able to systematically uncover TF mediated transcriptional regulatory network structure, the regulatory coefficients, the TF protein level activities and the sample clustering effect. The regulation target prediction result is highly coordinated with the prior knowledge, and sample clustering result shows superior performance over previous molecular based clustering method.
The results demonstrated the validity and effectiveness of the proposed approach in reconstructing transcriptional networks mediated by TFs through simulated systems and real data.
Transcription factor is one major gene regulator that governs the response of cells to changing endogenous or exogenous conditions . Understanding how transcriptional regulatory networks (TRNs) induce cellular states and eventually define the phenotypes represents a major challenge facing systems biologists. So far, numerous models have been proposed to infer the transcriptional regulations by TFs including, ordinary differential equations, (probabilistic) Boolean networks, Bayesian networks, and information theory and association models, etc . Ideally, the TF protein level activities are needed for exact modeling; however, due to low protein coverage and poor quantification accuracy of high throughput proteomics technologies such as protein array and liquid chromatography-mass spectrometry (LC-MS), the measurements of TF protein activities are currently hardly available. As a compromise, most of the aforementioned models conveniently yet inappropriately assume the TF’s mRNA expression as its protein activity. Given the fact that gene mRNA expression and its protein abundance are poorly correlated [3, 4], these models cannot accurately model the transcriptional cis-regulation or reveal at the best TF trans-regulation.
In contrast, works based on factor models [5–10] point to a natural and promising direction for modeling the TF mediated regulations, where the microarray gene expression is modeled as a linear combination of unknown TF activities, and the loading matrix in this model indicates the strength and the type (up- or down- regulation) of regulation. However, due to distinct features of TF regulation, conventional FA model is not readily applicable. First, due to various reasons (normal and disease, cancer grade, subtypes, etc), the samples are usually not independent with each other but show some clustering effect; while in the existing FA models, factors are typically assumed independent, which, although true in many applications, is not a realistic assumption for TF medicated regulation. Secondly, since a TF only regulates a small subset of genes, the loading matrix should be sparse. While with constructions of TF regulation databases, such as TRANSFAC , the knowledge of TF regulated genes becomes increasingly available, and should be included in the model so as to boost signal-to-noise and improve performance . The inclusion of prior information and sparsity constraint naturally call for a Bayesian solution. As an added advantage, having this prior knowledge actually resolves the factor order ambiguity of the conventional factor analysis. Thirdly, as suggested in [13–15], the non-negative assumption on TF activities be imposed.
In a response to these requirements of modeling TF mediated regulatory networks, we propose here a novel Bayesian non-negative factor model (BNFM). Different from conventional factor analysis models, BNFM consists of a sparse loading matrix and a set of correlated non-negative factors. The sparsity of the loading matrix is constrained by a sparse prior  that directly reflects our existing knowledge of TF regulation. That is if a gene is known to be regulated by a TF, then the prior probability that this regulation exists is high, and otherwise, very low due to the generic sparse nature of TF regulation (A TF only regulates a small number of genes in the whole genome). Because of clustering effect on the data samples, the factors in this BNFM model are considered to be correlated and modeled by a Dirichlet process mixture (DPM) prior . DPM imposes a natural non-parametric clustering effect  among samples of the same TF and can automatically determine the optimal number of clusters. Moreover, since the activities of TFs are non-negative, they are assumed to follow a (non-negative) rectified Gaussian distribution . Due to the complex nonlinear structure of the BNFM, the estimation of the model becomes analytically infeasible and highly complicated numerically. A Gibbs sampling solution is developed to infer all the relevant unknown variables.
Bayesian non-negative factor model
where, denotes the rectified Gaussian distribution . Since and γ n are still defined in (4) by the DP, x n is hence modeled by the DPM of the rectified Gaussian distributions and the elements of x n are accordingly correlated. In contrast to the conventional mixture model, the DPM model enables the number of clusters to be learnt adaptively from the data instead of being predefined.
In most cases, π g , l are likely to be smaller than 10%. In practice, databases such as TRANSFAC  and DBD  provide information of experimentally validated or predicted target genes of TFs, and this knowledge can be incorporated in the model by setting, for instance, π g , l = 0. 9, if TF l is known to regulate gene g; or otherwise π g , l = 0. 025. The variable defines how much the target genes are loaded on the corresponding TF and with prior distribution .
c- a vector of constant, which can be considered as the constant term retained when linearizing the general relationship y n = f(x n ) as y n = Ax n + c. It may also be interpreted as static response of gene transcriptional expressions.
Equivalent model for centralized observations
The goal is to obtain the posterior distributions and hence the estimates of A, x n ∀n, γ n ∀n, given y n ∀n and π g , l ∀g, l, which is the TF binding prior information extracted from existing database. For convenience, we let Θ denote all the known and unknown variables.
Gibbs sampling solution
The proposed BNFM model is high dimensional and analytically intractable, so a Gibbs sampling solution is proposed. Gibbs sampling devises a Markov Chain Monte Carlo scheme to generate random samples of the unknowns from the desired but intractable posterior distributions and then approximate the (marginal) posterior distributions with these samples. The key of Gibbs sampling is to derive the conditional posterior distributions and then draw samples from them iteratively until convergence is reached. The proposed Gibbs sampler can be summarized as follows:
Gibbs sampling for BNFA
for n = 1 to N
Note that are marginalized and therefore does not need to be sampled. The algorithm iterates until the convergence of samples, which can be assessed by the scheme described in , [chap. 11.6]. The samples after convergence will be collected to approximate the marginal posterior distributions and the estimates of the unknowns. Since µ x can be approximated and calculated numerically, the factor x n can be recovered from the centralized factor with (9). The required conditional distributions of the above proposed Gibbs sampling solution are detailed in the next.
Conditional distributions of the proposed Gibbs sampling solution
For simplicity, we let x n and y n denote the centralized factors and data in this section.
where, and , and π g , l is the prior knowledge of the probability of a g , l to be non-zero. When π g , l = 0. 5, i.e, a noninformative prior on sparsity is assumed, depends only on BF 01, and when BF 01 > 1. Since model selection based BF 01 favors a g , l = 0, it suggests that this Bayesian solution favors sparse model even when π g , l = 0. 5.
where, π s l,n = 1 – sgn (x l , n + µ x )
Test on simulated system
The proposed BNFM model was first tested on a simulated system, in which the microarray data consists of the expression profiles of 150 genes with 40 samples. The samples form 5 clusters and the 150 genes were assumed to be regulated by 10 TFs. The sparsity of loading matrix was set at 10%, which means that on average each gene is regulated by 1 TFs, and each TF regulates 15 genes. To simulate a practical imperfect database, the precision and recall of the prior knowledge were both set equal to 0.9 each, i,e., 90% of the database recorded regulations indeed happened in this specific data set (10% of the database recorded regulations may be context-specific and didn’t happen in the data); and 90% of the true regulations was recorded in the database (10% of true regulations are not in the database). This setting indicates that the recorded prior regulations may not exist in the experiment, and the unknown regulations could exist. Since this is a relatively large data set involving sampling of many variables, instead of examining convergence based on , [chap. 11.6], we adopted a more practical strategy by running a single MCMC chain for 10000 iterations with a burn-in period of 2000 iterations .
The F metrics satisfy all the 4 formal constraints defined in  including cluster homogeneity, cluster completeness, rag bag, and cluster size vs. quantity. We adopt the F metrics to evaluate the clustering result in the following tests. Similarly, a Van Rijsbergen’s F metric that combines the target prediction precision and recall is used to measure the target prediction result. Since our model can avoid sign ambiguity problem, the loading and factor estimations were evaluated using its Pearson’s correlation with their true values.
Test on breast cancer data
List of tested 15 TFs and aliases
BPc; C/EBP; C/EBP alpha; C/EBPalpha; CBP;
c-Ets-1; c-Ets-1 54; c-Ets-1A; Ets1; p54; p54c-Ets-1.
c-Fos; FBJ osteosarcoma oncogene; p55(c-fos).
c-Myc; MYC; v-myc myelocytomatosis viral oncogene homolog (avian).
ATF-47; CREB; CREB-341; CREB-A; CREB-isoform1; CREB1;
activating transcription factor 2; ATF2; CRE-BP1; CREB2; CREBP1;
AT225; early growth response protein 1;
AGP/EBP; ANF-2; C/EBP beta; C/EBP-beta; C/EBPbeta; CEBPB;
NFkappaB; Nuclear Factor kappa B.
ASp53; LFS1; NSp53; p53; p53as; RSp53; tp53; TRP53;
activating transcription factor 1; ATF1; EWS-ATF1; FUS/ATF-1;
acute-phase response factor; APRF;
signal transducer and activator of transcription 1.
activating enhancer binding protein 2 alpha; activator protein-2;
ATF-47; CREB; CREB1; cyclic AMP response element-binding protein;
List of tested 199 genes
Survival test of clustering results
Survival test of previous results
We proposed a new approach to uncover the transcriptional regulatory networks from microarray gene expression profiles. We discuss next a few distinct features of it.
First, to reflect the fact that a TF only regulates a small number of genes among the whole genome, the loading matrix of the factor model is constrained by a sparse prior , which directly reflects our existing knowledge of the particular TF-gene regulation, i.e., if the regulation exists according to prior knowledge, the probability of the corresponding component of the loading matrix to be non-zero is large; or otherwise, very small. The introduction of sparsity significantly constrains the factor model and helps to enable the inference of a set of correlated samples.
To model the samples correlation due to, for instance, cancer subtypes, the samples are modeled by a Dirichlet process mixture (DPM), which imposes clustering effect among samples and can automatically determine the optimal number of clusters from data rather than be predefined in the algorithm. Forth, other types of data, such as ChIP-chip data [39–41] and DNA methylation data  can be conveniently integrated with gene expression data  under the proposed framework by setting a slightly different prior probabilities to the loading matrix. Integrating additional data types can potentially improve the accuracy of the reconstructed networks. .
However, the proposed model is not without shortcomings. First, this model can not yet capture regulations from TFs that are not specified in the prior knowledge database. In reality, it is possible that some TFs that are not specified in the prior knowledge actually regulate the gene transcription. Second, the algorithm may not converge in a reasonable number of iterations on a large data set, thus cannot yet be applied to genome wide data set. Because the model parameters are high dimensional and highly correlated, the speed of convergence may significantly slow down on a large data set [44, 45]. However, such restriction on the size of genes and TFs forces us to focus the analysis on most relevant genes and TFs, making the analysis more targeted and easy to interpret. We demonstrate in section Result, how such analysis can be carried out starting from a whole genome microarray data. With the advancement in ChIP-seq technology and increasing knowledge of TFs biological functions, the proposed model could be applied for a genome-wide study in the future.
However, the precision or recall of the prior knowledge database are only arbitrarily specified (both 90%). In practice, the quality of prior knowledge should be evaluated first before more reasonable prior probabilities of regulations can be assigned.
A Bayesian factor model that has sparse loading matrix, non-negative factors, and correlated samples, was proposed to unveil the latent activities of transcription factors and their targeted genes from observed gene mRNA expression profiles. By naturally incorporating the prior knowledge of TF regulated genes, the sparsity constraint of the loading matrix, the non-negativity constraints of TF activities, the regulation coefficients and TF activities can be estimated. A Gibbs sampling solution was proposed and model inference. The effectiveness and validity of the model and the proposed Gibbs sampler were evaluated on simulated systems. The proposed method was applied to the breast cancer microarray data and a TF regulated network for breast cancer data was reconstructed. The inferred TF activities indicates 3 patients clusters, two of which possess significant differences in survival time after treatment. These results demonstrated that the BNFM provides a viable approach to reconstruct TF mediated regulatory networks and estimate TF activities from mRNA expression profiles. The BNFM will be an important tool for not only understanding the transcriptional regulation but also predicting the clinical outcomes of treatment.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
- Hobert O: Gene regulation by transcription factors and microRNAs. Science 2008,319(5871):1785. 10.1126/science.1151651PubMedView ArticleGoogle Scholar
- Huang Y, Tienda-Luna I, Wang Y: Reverse engineering gene regulatory networks. Signal Processing Magazine, IEEE 2009, 26: 76–97.View ArticleGoogle Scholar
- Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 2003,4(9):117. 10.1186/gb-2003-4-9-117PubMed CentralPubMedView ArticleGoogle Scholar
- Gygi S, Rochon Y, Franza B, Aebersold R: Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology 1999,19(3):1720.PubMed CentralPubMedGoogle Scholar
- Sabatti C, James G: Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics 2006,22(6):739. 10.1093/bioinformatics/btk017PubMedView ArticleGoogle Scholar
- Sanguinetti G, Lawrence N, Rattray M: Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. Bioinformatics 2006,22(22):2775. 10.1093/bioinformatics/btl473PubMedView ArticleGoogle Scholar
- Yu T, Li K: Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics 2005,21(21):4033. 10.1093/bioinformatics/bti656PubMedView ArticleGoogle Scholar
- Boulesteix A, Strimmer K: Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theoretical Biology and Medical Modelling 2005, 2: 23. 10.1186/1742-4682-2-23PubMed CentralPubMedView ArticleGoogle Scholar
- Kao K, Yang Y, Boscolo R, Sabatti C, Roychowdhury V, Liao J: Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis. Proceedings of the National Academy of Sciences 2004,101(2):641. 10.1073/pnas.0305287101View ArticleGoogle Scholar
- Meng J, Zhang JM, Qi YA, Chen Y, Huang Y: Uncovering Transcriptional Regulatory Networks by Sparse Bayesian Factor Model. Eurasip Journal On Advances In Signal Processing 2010.Google Scholar
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel A, Kel-Margoulis O, et al.: TRANSFAC (R): transcriptional regulation, from patterns to profiles. Nucleic acids research 2003, 31: 374. 10.1093/nar/gkg108PubMed CentralPubMedView ArticleGoogle Scholar
- Ideker Trey, JHL Dutkowski: Boosting Signal-to-Noise in Complex Biology: Prior Knowledge Is Power. Cell 2011,144(6):860–863. 10.1016/j.cell.2011.03.007PubMed CentralPubMedView ArticleGoogle Scholar
- Qi Q, Zhao Y, Li M, Simon R: Non-negative matrix factorization of gene expression profiles: a plug-in for BRB-ArrayTools. Bioinformatics 2009,25(4):545. 10.1093/bioinformatics/btp009PubMedView ArticleGoogle Scholar
- Hoyer P: Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research 2004, 5: 1469.Google Scholar
- Brunet J, Tamayo P, Golub T, Mesirov J: Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the United States of America 2004,101(12):4164. 10.1073/pnas.0308531101PubMed CentralPubMedView ArticleGoogle Scholar
- Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M: High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association 2008,103(484):1438–1456. 10.1198/016214508000000869PubMed CentralPubMedView ArticleGoogle Scholar
- Sudderth E: Graphical models for visual object recognition and tracking. PhD thesis. Massachusetts Institute of Technology; 2006.Google Scholar
- Ferguson T: A Bayesian analysis of some nonparametric problems. The annals of statistics 1973,1(2):209–230. 10.1214/aos/1176342360View ArticleGoogle Scholar
- Socci N, Lee D, Sebastian Seung H: The rectified Gaussian distribution. Advances in Neural Information Processing Systems 1998, 350–356.Google Scholar
- Cui X, Churchill G: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003,4(4):210. 10.1186/gb-2003-4-4-210PubMed CentralPubMedView ArticleGoogle Scholar
- Wong C: Differential Expression and Annotation. 2009.Google Scholar
- Wilson D, Charoensawan V, Kummerfeld S, Teichmann S: DBD-taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Research 2008,36(Database issue):D88.PubMed CentralPubMedGoogle Scholar
- Tipping M, Bishop C: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 1999,61(3):611–622. 10.1111/1467-9868.00196View ArticleGoogle Scholar
- Gelman A, Carlin J, Stern H, Rubin D: Bayesian data analysis. London, Glasgow, et al 1995.Google Scholar
- Thompson W, Newberg L, Conlan S, McCue L, Lawrence C: The Gibbs centroid sampler. Nucleic Acids Research 2007.Google Scholar
- Van Rijsbergen C: Foundation of evaluation. Journal of Documentation 1974,30(4):365–373. 10.1108/eb026584View ArticleGoogle Scholar
- Bagga A, Baldwin B: Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics Morristown, NJ, USA; 1998:79–85.View ArticleGoogle Scholar
- Amigó E, Gonzalo J, Artiles J, Verdejo F: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 2009,12(4):461–486. 10.1007/s10791-008-9066-8View ArticleGoogle Scholar
- Hoadley K, Weigman V, Fan C, Sawyer L, He X, Troester M, Sartor C, Rieger-House T, Bernard P, Carey L, et al.: EGFR associated expression profiles vary with breast tumor subtype. BMC genomics 2007, 8: 258. 10.1186/1471-2164-8-258PubMed CentralPubMedView ArticleGoogle Scholar
- Mullins M, Perreard L, Quackenbush J, Gauthier N, Bayer S, Ellis M, Parker J, Perou C, Szabo A, Bernard P: Agreement in breast cancer classification between microarray and quantitative reverse transcription PCR from fresh-frozen and formalin-fixed, paraffin-embedded tissues. Clinical chemistry 2007,53(7):1273. 10.1373/clinchem.2006.083725PubMedView ArticleGoogle Scholar
- Herschkowitz J, Simin K, Weigman V, Mikaelian I, Usary J, Hu Z, Rasmussen K, Jones L, Assefnia S, Chandrasekharan S, et al.: Identification of conserved gene expression features between murine mammary carcinoma models and human breast tumors. Genome biology 2007,8(5):R76. 10.1186/gb-2007-8-5-r76PubMed CentralPubMedView ArticleGoogle Scholar
- Herschkowitz J, He X, Fan C, Perou C: The functional loss of the retinoblastoma tumour suppressor is a common event in basal-like and luminal B breast carcinomas. Breast Cancer Res 2008,10(5):R75. 10.1186/bcr2142PubMed CentralPubMedView ArticleGoogle Scholar
- Perou C, Sørlie T, Eisen M, van de Rijn M, Jeffrey S, Rees C, Pollack J, Ross D, Johnsen H, Akslen L, et al.: Molecular portraits of human breast tumours. Nature 2000,406(6797):747–752. 10.1038/35021093PubMedView ArticleGoogle Scholar
- Sørlie T, Perou C, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen M, Van De Rijn M, Jeffrey S, et al.: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America 2001,98(19):10869. 10.1073/pnas.191367098PubMed CentralPubMedView ArticleGoogle Scholar
- Sørlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al.: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America 2003,100(14):8418. 10.1073/pnas.0932692100PubMed CentralPubMedView ArticleGoogle Scholar
- Shai R, Shi T, Kremen T, Horvath S, Liau L, Cloughesy T, Mischel P, Nelson S: Gene expression profiling identifies molecular subtypes of gliomas. Oncogene 2003,22(31):4918–4923. 10.1038/sj.onc.1206753PubMedView ArticleGoogle Scholar
- Kim P, Tidor B: Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Research 2003,13(7):1706. 10.1101/gr.903503PubMed CentralPubMedView ArticleGoogle Scholar
- Li T, Ding C: The relationships among various nonnegative matrix factorization methods for clustering. Data Mining, 2006.ICDM’06. Sixth International Conference on 2006, 362–371.Google Scholar
- Lieb J, Liu X, Botstein D, Brown P: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nature genetics 2001,28(4):327–334. 10.1038/ng569PubMedView ArticleGoogle Scholar
- Iyer V, Horak C, Scafe C, Botstein D, Snyder M, Brown P: Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001,409(6819):533–538. 10.1038/35054095PubMedView ArticleGoogle Scholar
- Ren B, Robert F, Wyrick J, Aparicio O, Jennings E, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al.: Genome-wide location and function of DNA binding proteins. Science’s STKE 2000,290(5500):2306.Google Scholar
- Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature genetics 2003, 33: 245–254. 10.1038/ng1089PubMedView ArticleGoogle Scholar
- Tasheva E, Klocke B, Conrad G: Analysis of transcriptional regulation of the small leucine rich proteoglycans. Mol Vis 2004, 10: 758–772.PubMedGoogle Scholar
- Justel A: Gibbs sampling will fail in outlier problems with strong masking. Journal of Computational and Graphical Statistics 1996,5(2):176–189. 10.2307/1390779Google Scholar
- Borgs C, Chayes J, Frieze A, Kim J, Tetali P, Vigoda E, Vu V: Torpid mixing of some Monte Carlo Markov chain algorithms in statistical physics. ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, Volume 40 1999, 218–229.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.