- Open Access
Identifying transcription factors and microRNAs as key regulators of pathways using Bayesian inference on known pathway structures
© Roqueiro et al; licensee BioMed Central Ltd. 2012
- Published: 21 June 2012
Transcription factors and microRNAs act in concert to regulate gene expression in eukaryotes. Numerous computational methods based on sequence information are available for the prediction of target genes of transcription factors and microRNAs. Although these methods provide a static snapshot of how genes may be regulated, they are not effective for the identification of condition-specific regulators.
We propose a new method that combines: a) transcription factors and microRNAs that are predicted to target genes in pathways, with b) microarray expression profiles of microRNAs and mRNAs, in conjunction with c) the known structure of molecular pathways. These elements are integrated into a Bayesian network derived from each pathway that, through probability inference, allows for the prediction of the key regulators in the pathway. We demonstrate 1) the steps to discretize the expression data for the computation of conditional probabilities in a Bayesian network, 2) the procedure to construct a Bayesian network using the structure of a known pathway and the transcription factors and microRNAs predicted to target genes in that pathway, and 3) the inference results as potential regulators of three signaling pathways using microarray expression profiles of microRNA and mRNA in estrogen receptor positive and estrogen receptor negative tumors.
We displayed the ability of our framework to integrate multiple sets of microRNA and mRNA expression data, from two phenotypes, with curated molecular pathway structures by creating Bayesian networks. Moreover, by performing inference on the network using known evidence, e.g., status of differentially expressed genes, or by entering hypotheses to be tested, we obtain a list of potential regulators of the pathways. This, in turn, will help increase our understanding about the regulatory mechanisms relevant to the two phenotypes.
- Bayesian Network
- Random Forest
- KEGG Pathway
- Differential Expression Analysis
- Conditional Probability Table
Transcription factors (TFs) and microRNAs are well-known regulators of gene expression. The former bind directly to the regulatory regions of genes whereas the latter regulate the expression of genes at a post-transcriptional stage. Although they have different mechanisms of regulation, evidence suggests that TFs and microRNAs regulate target genes in a coordinated way . In order to facilitate the elucidation of these regulatory mechanisms, several databases have been released based on the analysis of sequence information for predicted regulatory interactions. Backes et al.  have compiled a dictionary on microRNAs and their putative pathways based on the enrichment of the predicted microRNAs targets for each pathway in KEGG  and TRANSPATH . Le Bechec et al.  have created a database (MIR@NT@N) that stores predicted interactions between: a) a TF and its target genes (including microRNAs) and b) microRNAs and their predicted target genes. These databases facilitate the retrieval of regulatory interactions based on a query list as input but the expression data of mRNA and microRNA are not effectively explored. The analysis tool mirConnX , recently published, allows the input of concurrent microRNA and mRNA profiling data for an integrative analysis. The targets of TFs and microRNAs are selected based on the association strength between the regulator and its target. In all the above mentioned work, the analysis of the interactions is focused solely on direct targets. In this work we propose a novel integrative method to analyze microRNA and mRNA expression data in conjunction with sequence-based predicted regulators and the structures of existing pathways. We combine all this information into Bayesian networks, which allow the prediction of pathway regulators, not only based on direct targets but also by inference of the most probable effect of the regulators on other downstream genes. (The preliminary results have been presented at the BIBM 2011 conference ).
Bayesian networks  have been extensively used for the reconstruction of gene networks based on microarray expression data. In this context the goal was the inference of interactions and statistical dependencies among genes. These dependencies were, in turn, used to learn the dynamic structure of a regulatory network . This methodology has been the foundation for numerous algorithmic approaches. In all these cases, the Bayesian network (BN) -or its more generic dynamic counterpart (DBN) - was used as a tool to reverse engineer the gene network, i.e., the interactions between genes were inferred from observational data. In this work, we do not focus on the task of learning the structure of the BN from expression data. Our goal is to use a known network structure, describing interactions between genes and proteins, for Bayesian inference. The network structure can be any experimentally confirmed interaction network (for example, pathways obtained from KEGG  or from the Pathway Interaction Database ). Due to the fact that only some TFs and no microRNAs are included in the above mentioned pathways, we extend the pathways to contain TFs and microRNAs that are predicted to target nodes in the pathway. We further compute conditional probabilities between the nodes in the extended network using expression data, with the ultimate goal of building a BN for each individual pathway. Finally, these BNs receive as evidence a list of differentially expressed genes and provide as output a ranked list of TFs and microRNAs that best explain the expression level of genes in the network. As a result of this, the output TFs and microRNAs are hypothesized to be putative regulators of the pathway.
A. Pre-processing of raw microarray data
Analyzed ER+/ER- expression datasets
Number of samples
(mRNA) GSE5847 
(mRNA) GSE7390 
(mRNA) GSE3494 
(mRNA) GSE2603 
(mRNA) GSE2990 
(mRNA) GSE2034 
(mRNA) GSE19783 
(microRNA) GSE19536 
(mRNA) GSE22219 
(microRNA) GSE22216 
Gene expression analysis was performed using packages in Bioconductor . The Robust Multichip Averaging algorithm (RMA)  with quantile normalization was used for normalization of the Affymetrix microarrays. Additionally, to minimize the noise level in the subsequent task of data discretization, Affymetrix detection calls were used, only for Affymetrix data, to identify probesets with low or no level of expression.
The raw microRNA data from Enerly were normalized with the RMA algorithm using the AgiMicroRna package . The mRNA data in the Enerly study as well as the mRNA and microRNA data in the Buffa study were already normalized.
The Chip Description Files (CDFs) provided by the manufacturers were used to map probes in the Illumina and Agilent microarrays. A custom CDF was used to map probesets to unique Entrez gene Ids  for the Affymetrix microarrays. See Additional file 1, Supplemental Methods, Section A for more details about the pre-processing of microarray data.
B. Discretization of expression data
A characteristic of BNs is that a node in the network must have distinct (and finite) discrete states. This required a discretization method to convert the expression data obtained from a microarray into discrete values to be fed to the BN. We decided to use 5 states to discretize the expression values of all genes, namely 1 = very low, 2 = medium low, 3 = medium, 4 = medium high and 5 = very high.
We implemented three discretization methods and compared them in order to determine the most appropriate one for our BNs. The first method was named Sigma-mu and was based on the mean (μ) and standard deviation (σ) of all the expression values of a microarray. The expression level of a gene/microRNA was compared against how many standard deviations away from the mean it was. The five discrete values were assigned as: very low and very high (≥2σ from μ); medium low and medium high (≥1σ from μ); and medium (< 1σ from μ).
The second method was based on the quantiles of the expression values in a microarray. The density function of the expression values in a microarray was used to obtain estimates of the intervals that accumulated 20%, 40%, 60%, 80% and 100% of the expression values. A discrete value from 1 through 5 was assigned to a gene/microRNA based on the interval on which it fell. Please refer to Additional file 1, Supplemental Methods, Section B for details on these two methods.
The third method we implemented is based on the clustering of expression values of the genes/microRNAs in a microarray. Partition Around Medoids (PAM) was used as our clustering algorithm. The genes/microRNAs whose expression values were clustered in the lowest cluster -cluster 1, corresponding to the lowest expression levels- were discretized as very low. Conversely, the genes clustered in the highest cluster (cluster 5) were discretized as very high. Genes in clusters 2 and 4 were discretized as medium low and medium high respectively. Finally, genes in the remaining cluster (3) were discretized as medium.
Independent of the discretization method used, probesets marked as "Absent" in Affymetrix microarrays were given a discrete value of very low. This process was repeated for all microarrays to yield a discrete value for each gene/microRNA in each microarray.
C. Differential expression analysis
Determining what genes were differentially expressed in each of the eight datasets had different purposes. Differentially expressed genes in the first six datasets were used to determine the most appropriate discretization method, whereas the differentially expressed genes in the Enerly and Buffa datasets were used as evidence in the Bayesian inference process.
The differential expression analysis was performed on all normalized microarrays. For Affymetrix, a probeset was discarded if it was not marked as "Present" or "Marginal" in more than 85% of the samples in the study, or if the coefficient of variability (CV) of the expression values of the probeset was less than 50% across samples in the study. The limma package  with the Benjamini-Hochberg correction for multiple tests was used for differentially expression analysis. The adjusted p-value threshold was set to 0.05.
For the Agilent mRNA chips, the normalized expression data were downloaded from GEO and only the probes with unique Entrez gene Ids were kept. For the Agilent microRNA data, the probes with a detection signal of less than 10% of the samples or not associated with H.sapiens were discarded.
The normalized expression data of Illumina mRNA chips were downloaded from GEO and those probes with unique Entrez gene Ids were retained. Probes with a CV less than 20% were filtered out. For the Illumina microRNA chips, only probes associated with H.sapiens were retained. The differential expression analyses were performed with limma as described above. See Additional file 1, Supplemental Methods, Section C for more details.
D. Structure pre-processing for KEGG pathways
The KEGG database  provides experimental knowledge in many forms, one of them being molecular networks called KEGG pathway maps. For our work, the pathway maps were analyzed as networks, with directed edges between the nodes representing a known interaction. The pathways analyzed were related to signaling (KEGG Ids 04010-04350) and cancer (05200- 05223).
The structure of a pathway including nodes and edges was used as the backbone of a BN. Before the BN could be constructed, a pre-processing step was implemented on the pathway. This pre-processing yielded a new network, based on the original pathway, with the following properties:
No cycles: The KEGG pathway was transformed into a directed acyclic graph (DAG). Edges that created a loop were discarded.
Nodes with expression data: The Entrez ID of each node in the pathway was checked against the list of genes that had expression data (10,722 Entrez IDs from our microarray analysis, see Additional file 1, Supplemental Methods, Section B). Nodes with no expression data were removed. The parents and children (if any) of a removed node were updated to include new edges linking them.
Limited types of interactions: Only the following interactions annotated in a KEGG pathway were taken into consideration: a) gene expression relations: expression, repression and indirect effect; and b) protein-protein interactions: activation, inhibition and indirect effect.
The package KEGGgraph  in Bioconductor was used for parsing the raw KEGG Markup Language files.
E. The predicted targets of transcription factors and microRNAs
TF target prediction. bindSDb  is a database we developed to store experimentally proven and predicted transcription factor binding sites. For the prediction portion, the database returns a set of TFs that are predicted to bind to the promoter region of a gene based on sequence analysis. It uses the MATCH  algorithm to determine if a TF may bind to the promoter of the gene. Each TF was represented by one or more position weight matrices from TRANSFAC (ver. 2010.1) . In our work, for each gene in a pathway, or protein encoded by a gene, we obtained from bindSDb all the TFs that are predicted to bind to the promoter region of the gene (defined as +2Kb, -2Kb from the transcription start site).
microRNA target prediction. All microRNA-gene predictions were downloaded from TargetScan Human release 6.0 . TargetScan is a microRNA target prediction algorithm that searches highly conserved 3'UTR targets for 8-mer and 7-mer sites matching the seed region of microRNAs. We downloaded target predictions for 677 microRNA families, as defined by TargetScan, and obtained a total of 54,479 unique pairs between microRNA family and target gene.
F. Selection of TFs and microRNAs with Random Forest
Number of target TFs and microRNAs
Number of TFs per node
Number of microRNAs per node
MAPK signaling pathway
mTOR signaling pathway
p53 signaling pathway
G. Pathway extension
At this stage, we have all the required information to create a BN for a pathway. The modified pathway obtained after pre-processing in section D was extended to accommodate the TFs and microRNAs ranked in section F. Our RF analysis output two variable importance rankings for each gene: one for the TFs and one for the microRNAs. These rankings list the TFs and microRNAs in decreasing order of the variable importance score assigned to each of them. An extended pathway was then created by connecting each node in the pre-processed pathway with the nodes representing the top 5 TFs and top 3 microRNAs, only if their variable importance score was greater than zero. Note that the same TF may target more than one gene in the pathway. Therefore, the node for the TF was added just once with multiple edges going from this node to different target genes. The same consideration applied to the microRNAs. This newly merged pathway was then fed to the BN process.
H. Bayesian network construction
A directed acyclic graph G = (V, E) where V is a set of variables and E is a set of directed edges between the variables.
Each variable in V has a finite set of mutually exclusive states.
For each variable B with parents A 1 , A 2 ,..., A p there is a set of parameter probabilities in the form of conditional probability tables (CPTs) that capture P(B | A 1 , A 2 ,..., A p ).
If the node did not have any parents, the CPT was basically a vector representing the prior of the node. It was computed by obtaining the frequencies of each discrete value across all the appropriate microarrays (TFs and genes used the first six datasets of Table 1, whereas microRNAs used either the Enerly or Buffa dataset).
If the node had parents A 1 , A 2 ,..., A p , the CPT reflected the probability of all possible combinations of states between the node and its parents. The probability of each possible combination was obtained by counting and then dividing by the total number of observations. A high-dimensional matrix C of 5-by-5-by...(p + 1)-times was used to compute the CPT. The matrix C was initialized with 1s to assume that each possible combination of states was possible. Then, for each microarray, the discrete expression values of the node and its parents were obtained as a vector v= [v A1 , v A2 , ..., v Ap , v node ]. The contents of matrix C at the cell C[v A1 , v A2 , ..., v Ap , v node ] were then incremented by one. At the end, each position of C was divided by the sum of all elements in C. The matter of what set of microarrays to use was resolved in the following way:
If any of the node's parents A 1 , A 2 ,..., A p was a microRNA, either the Enerly or Buffa dataset was used.
Otherwise, the first six datasets listed in Table 1 were used.
This distinction was absolutely necessary. In order to compute the CPT of a node that had at least one microRNA as parent, we needed to process microarrays that had both expression values for genes/TFs as well as microRNAs. Evidently, the CPTs of nodes with a microRNA targeting them were created from fewer observations than nodes whose parents were only TFs or other pathway nodes.
An important aspect of a BN is the evidence, i.e., the values assigned to the observed nodes. For evidence, the 907 differentially expressed genes between ER+ and ER- samples that overlapped as the top 2000-ranked genes in the mRNA-Enerly and mRNA-Buffa datasets were used (See Additional file 1, Table S6). Only those differentially expressed genes that were part of a pathway (not as TFs but as KEGG pathway nodes) were used as evidence.
Because the size of the CPT for each variable X i is exponential on the number of parents of X i , this computation is prohibitive for large networks. To complicate matters further, we would like an answer to the question: what is the probability of X i = x given the evidence e? This requires the marginalization of X i from equation (1).
Since exact inference is computationally infeasible, we have to find an approximation to the marginal probability P(X i | e). In our work, this was achieved by using a Gibbs sampler. The marginal probabilities for all unobserved nodes were sampled at a rate of Q × number of nodes in the BN, with Q = 250. See the Results section for details on how Q was computed. The BN creation, Gibbs sampler, inference engine and marginalization of nodes were implemented with the Bayes Net toolbox (BNT) for Matlab .
Comparison of the three discretization methods
Data discretization has a strong effect over the conditional probabilities assigned to each node in the BNs. Therefore, we conducted a comparison of the three discretization algorithms described in the Methods section to determine the one that was most appropriate to our study.
We compared the discrete values obtained from each method to identify the one that created the largest contrast between the two phenotypes in the data (ER+ vs. ER- in this case). To detect this contrast, we used as reference the genes which we had determined to be differentially expressed in each dataset (see Additional file 1, Table S5). In theory, if a gene is differentially expressed in a dataset, it means that the expression values of the gene in the ER+ samples are different from the expression values of the same gene in ER- samples.
gene k must be differentially expressed in dataset d
ER+ and ER- are the samples for each phenotype
the distance measure d() is simply the absolute value of x - y, where x and y are discrete values between 1 and 5.
Approximation of marginal probability for Bayesian inference
Inference results on the breast cancer data
We have systematically constructed BNs for all the 34 KEGG pathways based on the procedures described in Methods. The number of nodes and edges in the original pathways and the number of nodes and edges in the expanded Bayesian networks are provided in the Additional file 2.
We present our inference results in an attempt at uncovering the relationships among TFs, microRNAs and pathway genes that are associated with ER+ and ER- breast tumors. ER+ and ER- tumors display different molecular patterns in terms of cell differentiation, proliferation, survival, invasion and angiogenesis. Understanding the distinct molecular mechanisms in tumors with different ER status will provide insight into potential novel targets for breast cancer treatment .
Each BN of a pathway was given two different sets of evidence corresponding to two scenarios. In scenario #1, the evidence was the discrete values in ER+ samples of the differentially expressed genes. After providing the BN with the evidence we ran the inference process and approximated the marginals for all unobserved nodes. In scenario #2, the same inference process was performed and the marginals were approximated. In this case, the evidence used was the discrete values in ER- samples of differentially expressed genes.
When analyzing the results, we decided to focus on nodes that fulfilled any of the following two conditions:
the node's marginals had one state with a probability larger than 0.8 in scenario #1 and lower than 0.8 in scenario #2 (or vice versa).
at least one of the node's marginals for one state had a 2-fold variation in probability between scenario #1 and scenario #2, with the resulting probability being larger than 0.5.
There is no particular reason why we chose these threshold values. They are in fact very stringent and served the purpose of providing a reduced set of results that were easy to manually validate against the true KEGG pathway structure.
Cell cycle pathway
Selected probability marginals for the cell cycle pathway (6 datasets + Buffa)
When inspecting the TFs that from sequence analysis and RF we have predicted to target CCND1 directly (NFIB, STAT6, SREBF1) we realize that their marginals are very similar in both scenarios. Because we know that the expression of CCND1 changed between scenarios #2 and #1, we are looking for a TF or microRNA that may also have changed between those scenarios and that may help explain the change in expression for CCND1. Neither of the TFs or microRNAs (not shown in Figure 7) that target CCND1 have a significant change in their marginals between scenarios. The TF TFE3 (transcription factor binding to IGHM enhancer 3) may provide a better explanation of why CCND1 is differentially expressed, even if TFE3 does not target CCND1 directly. In Figure 7 TFE3 is in light blue and targets SMAD3. Between scenarios #2 and #1 we can see (Table 3) that there is more certainty in scenario #1 that SMAD3 is at a lower state (a combined very low and medium low of 0.29+0.39 = 0.68). This implies a lower level of expression in that scenario (vs. 0.21+0.31 = 0.52 in scenario #2). The marginals have a moderate change from higher expression states in scenario #2 to lower states in #1. This transition is much sharper for Enerly (See Additional file 1, Table S8). In the Cell cycle pathway, SMAD3 promotes the expression of CDKN2B, which in turn regulates the expression of CCND1 and CDK4 by inhibiting them. Our BN simply keeps directed edges between nodes (as in Figure 7) but is not aware of the semantics of each edge (inhibition, expression, and so forth). Nevertheless our results adjust very well to the semantics of the pathway. When SMAD3 switches to a lower state (from scenario #2 to #1), CDNK2B has also a sharp increase of certainty of being in a very low expression state (from 0.48 in scenario #2 to 0.99 in scenario #1). Therefore, with a high chance of having low expression of CDKN2B, we also have a high chance of not inhibiting neither CCND1 nor CDK4 and this results in an increase in their expression level (for CCND1, from very low in scenario #2 to medium in #1; and for CDK4 it goes from a somewhat uncertain state of expression in scenario #2 to a 0.82 certainty of having medium high expression level in scenario #1).
Upon reviewing the TFs that are predicted to target SMAD3, we see that TFE3 is the only one with a marked contrast between scenarios. In scenario #2 there is 0.65 probability that its expression is medium high but this probability decreases to a 0.31 (more than 2-fold decrease) in scenario #1. This sharp decrease occurs because in scenario #1 there is more certainty of TFE3 being in a medium state of expression (0.69 vs. 0.04 in scenario #2). We therefore hypothesize that the transcription factor TFE3 is a key regulator in the Cell cycle pathway when comparing ER- and ER+ samples. We are not implying by any means that TFE3 affects directly the expression of SMAD3 but there is a clear relationship between their changes in expression levels and this allows us to postulate TFE3 as a regulator in the pathway.
In fact, TFE3 is a well-known transcription factor  and there is ample evidence of its synergizing effects with SMAD3 to enhance Transformer Growth Factor β (TGF-β) dependent transcription [35, 36].
p53 signaling pathway
The analysis of the p53 signaling pathway (KEGG Id 04115) provides an example of how to identify a regulator based on direct interactions between the regulator and genes in the pathway. For this pathway, our differential expression analysis reported 8 differentially expressed genes. Very few TFs passed our selection criteria and only one of them overlapped between the Enerly and Buffa datasets. This is the case of STAT5B known as signal transducer and activator of transcription 5B. STAT5B was predicted to target only 2 genes in this pathway: IGFBP3 (insulin-like growth factor binding protein 3) and PERP (p53 apoptosis effector related to PMP-22). These two genes are located in different parts of the pathway and are not directly related to each other. The marginals corresponding to IGFBP3 do not seem to have much of a variation between our two scenarios (Additional file 1, Tables S9 and S10). In contrast, PERP is differentially expressed (under-expressed) in ER+ samples. We can see that the TF STAT5B shifts its certainty of being in a state of medium low expression in scenario #2 to a more uncertain state in scenario #1. In fact, in scenario #1 we see an increment in the marginals corresponding to the lowest level of expressions (very low and medium low) which can be interpreted as a possible decrease of expression of STAT5B from scenario #2 to scenario #1.
STAT5 is one of the seven members of the STAT (signal transducers and activators of transcription) family of TFs and mediates the responses of cytokines, growth factors and hormones . It has been shown that STAT5 regulates apoptosis in a wide range of tumor cells . STAT5A and STAT5B are different proteins encoded by different genes.
PERP, a p53 transcriptional target, is induced specifically during apoptosis but not cell cycle arrest. Downregulation of PERP is associated with the aggressive, monosomy 3-type of uveal melanoma (UM) . It has not been proven that PERP is a direct target of STAT5B . But through our Bayesian inference process we were able to determine that STAT5B (by interacting with PERP) might be a key regulator in the p53 signaling pathway. This result was validated by two BNs constructed with different datasets (Enerly and Buffa)
ErbB signaling pathway
The previous results reported only TFs as potential regulators of their pathways. In the BNs of the ErbB signaling pathway (KEGG Id 04012), we identified a microRNA as a putative regulator. In this pathway there were 6 differentially expressed genes. One of them, PLCG2 (phospholipase C, gamma 2 phosphatidylinositol-specific) is under-expressed in ER+ (Additional file 1, Tables S11 and S12). Upon examining the TFs and microRNAs targeting that gene, only the microRNA hsa-miR-135b passed our selection criteria. In scenario #1 of Enerly, hsa-miR-135b reaches a certainty of being at a medium expression level (0.81) whereas in scenario #2 there is uncertainty about its level of expression, with higher probabilities in the very low and medium low states (0.23 and 0.36 respectively). This transition between scenarios #2 and #1 may be seen as an increase in the expression level of the microRNA. Because microRNAs have been shown to negatively regulate the expression of their target genes, if we couple the possible increase in expression of hsa-miR-135b between scenarios #2 and #1 with the fact that the expression of PLCG2 decreases between scenarios #2 and #1, we can propose with higher confidence that hsa-miR-135b is a potential regulator of the ErbB pathway by possibly affecting the expression of PLCG2. In this example, we also had validation of this fact between the Enerly and Buffa datasets.
We proposed an integrative bioinformatics methodology that combines a) the TFs and microRNAs that are predicted to target pathway genes, with b) microarray expression profiles of mRNA and microRNA, in conjunction with c) the known structure of molecular pathways. All these elements were integrated into a probabilistic framework (BN) that was used to make inferences about key TFs and microRNAs as regulators of the pathway. Using the procedures described in our work, one can systematically construct a BN for each individual pathway of interest. We have utilized 8 microarray expression datasets of mRNA and microRNA on ER+ and ER- breast tumors to demonstrate how to use the differentially expressed genes as evidence in order to infer key regulators in the constructed BNs. Another important use of our framework is to propose hypotheses about the expression levels of TFs or microRNAs and their effect on genes. We foresee the researcher posing questions of the form: "What would the expression level of genes g 1 and g 2 be if microRNA 3 is expressed at a very high level?"
Several technical issues deserve further investigation. When making inference about the expression level of a gene, TF or microRNA, we would ideally want to obtain the most probable explanation (MPE) given the evidence at hand. This evidence can be tangible, i.e., obtained from a microarray experiment, or, as it was mentioned before, it can be a set of hypotheses that interest us. In either case, an exact solution to the MPE problem in Bayesian inference has proven to be elusive due to the fact that approximating the MPE or finding the k-th MPE are both NP-hard problems . Thus, in this work we have decided to use the marginals as a proxy for MPE. In turn, we approximated the marginals for the unobserved nodes using a stochastic sampling algorithm (Gibbs sampler). We plan to improve our methodology by thoroughly examining different importance sampling algorithms that will minimize the variance between the drawn samples and the target distribution .
Finally, a self-imposed limitation of our model was the removal of edges that would create cycles in the network. Our next step will be to improve our probabilistic framework to use a dynamic Bayesian network (DBN) that allows for cycles and that better reflects the positive feedback present in many molecular pathways.
This paper presents a novel approach to the integrative analysis of microRNA and mRNA expression profiles with transcription factors and microRNAs within the context of molecular pathways. We developed a probabilistic framework (Bayesian network) which enables the inference of potential pathway regulators (transcription factors and microRNAs) that are likely causal regulators of the differentially expressed genes in a pathway. Our method may be useful to identify target genes of transcription factors and microRNAs.
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
- Martinez N, Walhout A: The interplay between transcription factors and microRNAs in genome-scale regulatory networks. BioEssays 2009, 31: 435–445. 10.1002/bies.200800212PubMed CentralPubMedView ArticleGoogle Scholar
- Backes C, Meese E, Lenhof H-P, Keller A: A dictionary on microRNAs and their putative target pathways. Nucleic Acids Research 2010,38(13):4476–4486. 10.1093/nar/gkq167PubMed CentralPubMedView ArticleGoogle Scholar
- Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 2010, 38: 355–360. 10.1093/nar/gkp896View ArticleGoogle Scholar
- Krull M, Voss N, Choi C, Pistor S, Potapov A, Wingender E: TRANSPATH®: an integrated database on signal transduction and a tool for array analysis. Nucleic Acids Research 2003,31(1):97–100. 10.1093/nar/gkg089PubMed CentralPubMedView ArticleGoogle Scholar
- Le Bechec A, Portales-Casamar E, Vetter G, Moes M, Zindy P-J, Saumet A, Arenillas D, Theillet C, Wasserman W, Lecellier C-H, et al.: MIR@NT@N: a framework integrating transcription factors, microRNAs and their targets to identify sub-network motifs in a meta-regulation network model. BMC Bioinformatics 2011,12(1):67. 10.1186/1471-2105-12-67PubMed CentralPubMedView ArticleGoogle Scholar
- Huang GT, Athanassiou C, Benos PV: mirConnX: condition-specific mRNA-microRNA network integrator. Nucleic Acids Research 2011, 39: W416-W423. 10.1093/nar/gkr276PubMed CentralPubMedView ArticleGoogle Scholar
- Roqueiro D, Huang L, Dai Y: Identifying Transcription Factors and microRNAs as Key Regulators of Pathways Using Bayesian Inference on Known Pathway Structures. BIBM 2011 2011, 228–233.Google Scholar
- Pearl J: Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning. Proceedings of the 7th Conference of the Cognitive Science Society 1985, 329–334.Google Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol 2000, 7: 601–620. 10.1089/106652700750050961PubMedView ArticleGoogle Scholar
- Schaefer CF, Anthony K, Krupa S, Buchoff J: PID: the Pathway Interaction Database. Nucleic Acids Res 2009, 37: 674–679. 10.1093/nar/gkn653View ArticleGoogle Scholar
- Boersma BJ, Reimers M, Yi M, Ludwig JA, Luke BT, Stephens RM, Yfantis HG, Lee DH, Weinstein JN, Ambs S: A stromal gene signature associated with inflammatory breast cancer. Int J Cancer 2008,122(6):1324–1332.PubMedView ArticleGoogle Scholar
- Desmedt C, Haibe-Kains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C: Biological Processes Associated with Breast Cancer Clinical Outcome Depend on the Molecular Subtypes. Clinical Cancer Research 2008,14(16):5158–5165. 10.1158/1078-0432.CCR-07-4756PubMedView ArticleGoogle Scholar
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, et al.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences of the United States of America 2005,102(38):13550–13555. 10.1073/pnas.0506230102PubMed CentralPubMedView ArticleGoogle Scholar
- Minn AJ, Gupta GP, Siegel PM, Bos PD, Shu W, Giri DD, Viale A, Olshen AB, Gerald WL, Massague J: Genes that mediate breast cancer metastasis to lung. Nature 2005,436(7050):518–524. 10.1038/nature03799PubMed CentralPubMedView ArticleGoogle Scholar
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, et al.: Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Journal of the National Cancer Institute 2006,98(4):262–272. 10.1093/jnci/djj052PubMedView ArticleGoogle Scholar
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, et al.: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005,365(9460):671–679.PubMedView ArticleGoogle Scholar
- Enerly E, Steinfeld I, Kleivi K, Leivonen S-K, Aure MR, Russnes HG, Rønneberg JA, Johnsen H, Navon R, Rødland E, et al.: miRNA-mRNA integrated analysis reveals roles for miRNAs in primary breast tumors. PLoS ONE 2011,6(2):e16915. 10.1371/journal.pone.0016915PubMed CentralPubMedView ArticleGoogle Scholar
- Buffa FM, Camps C, Winchester L, Snell CE, Gee HE, Sheldon H, Taylor M, Harris AL, Ragoussis J: microRNA-Associated Progression Pathways and Potential Therapeutic Targets Identified by Integrated mRNA and microRNA Expression Profiling in Breast Cancer. Cancer Research 2011,71(17):5635–5645. 10.1158/0008-5472.CAN-11-0489PubMedView ArticleGoogle Scholar
- Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004,5(10):R80. 10.1186/gb-2004-5-10-r80PubMed CentralPubMedView ArticleGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 2003,31(4):e15-e15. 10.1093/nar/gng015PubMed CentralPubMedView ArticleGoogle Scholar
- Lopez-Romero P: Pre-processing and differential expression analysis of Agilent microRNA arrays using the AgiMicroRna Bioconductor library. BMC Genomics 2011,12(1):64. 10.1186/1471-2164-12-64PubMed CentralPubMedView ArticleGoogle Scholar
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al.: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Research 2005,33(20):e175-e175. 10.1093/nar/gni179PubMed CentralPubMedView ArticleGoogle Scholar
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article3.PubMedGoogle Scholar
- Zhang JD, Wiemann S: KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor. Bioinformatics 2009,25(11):1470–1471. 10.1093/bioinformatics/btp167PubMed CentralPubMedView ArticleGoogle Scholar
- Roqueiro D, Frasor J, Yang D: bindSDb: A binding-information spatial database. Proceeding of the IEEE International Conference on Bioinformatics and Biomedicine Workshops 2010: 573–578.Google Scholar
- Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003,31(13):3576–3579. 10.1093/nar/gkg585PubMed CentralPubMedView ArticleGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34: D108–110. 10.1093/nar/gkj143PubMed CentralPubMedView ArticleGoogle Scholar
- Friedman RC, Farh KK-H, Burge CB, Bartel DP: Most mammalian mRNAs are conserved targets of microRNAs. Genome Research 2009, 19: 92–105.PubMed CentralPubMedView ArticleGoogle Scholar
- Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324View ArticleGoogle Scholar
- Jensen FV, Nielsen TD: Bayesian Networks and Decision Graphs, Second Edition. Information Science and Statistics Springer Science 2007.Google Scholar
- Kwisthout J: Most Probable Explanations in Bayesian networks: complexity and tractability. In Technical Report, ICIS-R10001. Radboud University Nijmegen; 2010.Google Scholar
- Murphy K: The Bayes Net Toolbox for Matlab. Computing Science and Statistics Proceeding of the Interface 2001.Google Scholar
- Osborne CK, Schiff R: Mechanisms of Endocrine Resistance in Breast Cancer. Annual Review of Medicine 2011,62(1):233–247. 10.1146/annurev-med-070909-182917PubMed CentralPubMedView ArticleGoogle Scholar
- Beckmann H, Su LK, Kadesch T: TFE3: a helix-loop-helix protein that activates transcription through the immunoglobulin enhancer muE3 motif. Genes & Development 1990,4(2):167–179. 10.1101/gad.4.2.167View ArticleGoogle Scholar
- Hua X, Miller ZA, Wu G, Shi Y, Lodish HF: Specificity in transforming growth factor β-induced transcription of the plasminogen activator inhibitor-1 gene: Interactions of promoter DNA, transcription factor μE3, and Smad proteins. Proc Natl Acad Sci U S A 1999,96(23):13130–13135. 10.1073/pnas.96.23.13130PubMed CentralPubMedView ArticleGoogle Scholar
- Hua X, Miller ZA, Benchabane H, Wrana JL, Lodish HF: Synergism between Transcription Factors TFE3 and Smad3 in Transforming Growth Factor-β-induced Transcription of theSmad7 Gene. Journal of Biological Chemistry 2000,275(43):33205–33208. 10.1074/jbc.C000568200PubMedView ArticleGoogle Scholar
- Basham B, Sathe M, Grein J, McClanahan T, D'Andrea A, Lees E, Rascle A: In vivo identification of novel STAT5 target genes. Nucleic Acids Research 2008,36(11):3802–3818. 10.1093/nar/gkn271PubMed CentralPubMedView ArticleGoogle Scholar
- Longley DB, Johnston PG: Apoptosis, Cell Signaling, and Human Diseases. 5-Fluorouracil. Edited by: Srivastava R. Humana Press; 2007:263–278.View ArticleGoogle Scholar
- Davies L, Spiller D, White MRH, Grierson I, Paroaon L: PERP expression stabilizes active p53 via modulation of p53-MDM2 interaction in uveal melanoma cells. Cell Death & Disease 2011,2(3):e136. 10.1038/cddis.2011.19View ArticleGoogle Scholar
- Cheng J, Druzdzel MJ: AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks. Journal of Artificial Intelligence Research 2000, 13: 155–188.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.