Identifying transcription factors and microRNAs as key regulators of pathways using Bayesian inference on known pathway structures

Background Transcription factors and microRNAs act in concert to regulate gene expression in eukaryotes. Numerous computational methods based on sequence information are available for the prediction of target genes of transcription factors and microRNAs. Although these methods provide a static snapshot of how genes may be regulated, they are not effective for the identification of condition-specific regulators. Results We propose a new method that combines: a) transcription factors and microRNAs that are predicted to target genes in pathways, with b) microarray expression profiles of microRNAs and mRNAs, in conjunction with c) the known structure of molecular pathways. These elements are integrated into a Bayesian network derived from each pathway that, through probability inference, allows for the prediction of the key regulators in the pathway. We demonstrate 1) the steps to discretize the expression data for the computation of conditional probabilities in a Bayesian network, 2) the procedure to construct a Bayesian network using the structure of a known pathway and the transcription factors and microRNAs predicted to target genes in that pathway, and 3) the inference results as potential regulators of three signaling pathways using microarray expression profiles of microRNA and mRNA in estrogen receptor positive and estrogen receptor negative tumors. Conclusions We displayed the ability of our framework to integrate multiple sets of microRNA and mRNA expression data, from two phenotypes, with curated molecular pathway structures by creating Bayesian networks. Moreover, by performing inference on the network using known evidence, e.g., status of differentially expressed genes, or by entering hypotheses to be tested, we obtain a list of potential regulators of the pathways. This, in turn, will help increase our understanding about the regulatory mechanisms relevant to the two phenotypes.


Supplemental Methods
A. Pre-processing of raw microarray data The raw data were downloaded from the Gene Expression Omnibus 1 . Table S1 summarizes information on the microarray platforms used in this work. For the first six datasets a custom CDF was used to find a unique mapping between probesets and Entrez gene Ids. The Affymetrix microarrays used in these studies relied, at the time of their creation, on probes defined upon an earlier genome and transcriptome annotation. This annotation differs significantly from our current knowledge, therefore for our analysis we used a custom CDF that conforms to the latest genome and transcriptome available information. We used a Bioconductor package [19] containing the custom CDF "HGU133A_Hs_ENTREZG" version 13.0 released on July 2010. The Agilent and Illumina microarrays were analyzed using their proprietary annotation files.
The expression data for each study (dataset) were normalized within microarrays of the same study. Normalization varied according to the platform and Table S2 summarizes the steps that were performed: Used normalized data from GEO Buffa (miR) Used normalized data from GEO There was a very large variability in the Affymetrix microarrays. This can be seen in Figure S1 (first column) where the densities of expression values for the microarrays are plotted as curves. The second column shows the same curves after normalization. The third column shows, for one randomly chosen microarray in the study, how the normalized density curve changes shape when only the expression values of probesets marked as Present (P) or Marginal (M) are used. This is due to the fact that probesets marked as Absent (A) still have an expression value different from zero. By using P/M calls we followed best practices recommended for Affymetrix microarrays and obtained more reliable data to input to our system.
For the combined mRNA-microRNA studies, only the microRNA expression data in Enerly had to be normalized. The rest of the expression data were originally normalized by the authors of the study. Figure S2 shows the density curves of the expression values in Enerly and Buffa. The first column of the figure has the mRNA data while the second column has the microRNA data. From the next section (section C, Differential expression analysis) it will become clear that we could not base our integrative methodology solely on differentially expressed genes. We needed to use all the genes and microRNAs for which we were able to obtain expression data from a microarray. In fact, in order to integrate all eight datasets we had to follow two steps: 1. Determine what genes/microRNAs were common to all eight datasets. 2. Because BNs require discrete states for their nodes, we had to identify a method to convert the expression data obtained from a microarray into discrete values 1. The common genes between the first six datasets and the mRNA-Enerly and mRNA-Buffa datasets were kept. The first six datasets, all of them of the same microarray platform, had 12,025 unique Entrez gene IDs. The mRNA-Enerly and mRNA-Buffa datasets had 17,177 and 15,627 unique genes respectively. A total of 10,722 unique Entrez IDs were common across all mRNA datasets. The microRNA-Enerly and microRNA-Buffa datasets had 498 and 510 microRNAs respectively with 308 microRNAs that overlapped between both datasets.
2. Regarding data discretization, our choice of how many discrete states to use was five: 1=very low, 2=medium low, 3=medium, 4=medium high and 5=very high.
In order to determine the best discretization method for our microarray data, we analyzed three discretization algorithms: 1. Sigma-mu (σ and μ) 2. Quantile distribution 3. Partition Around Medoids (PAM) Figure S3 illustrates the overall approach to data discretization. In the figure we see that the expression data of all the M microarrays in a given dataset are discretized one by one.
Enerly mRNA data microRNA data Buffa Figure S3. Data discretization overview The discretization methods used, as their input, the normalized expression values in the microarrays of a dataset d . The discretization process was repeated for all datasets.

B.1 Sigma-mu (σ and μ)
For each microarray i , the mean and standard deviation of the expression values of genes were obtained. Then, the genes were assigned a discrete value depending on how many standard deviations away they were from the mean ( Figure S4). The five discrete values were: very low, and very high (≥2σ from μ); medium low and medium high (≥1σ from μ); and medium (< 1σ from μ). Figure S4. Discretization using sigma-mu (σ and μ)

B.2 Quantiles
We applied the quantiles method to obtain estimates of the intervals that accumulated 20%, 40%, 60%, 80% and 100% of the expression values in microarray i . As input we used the density function of the expression values in the microarray ( Figure S4 above). We used the function quantile from the R package stats that provides sample quantiles. We then discretized the expression values according to the quantile in which they belonged. Quantile 1 (0% to 20%) was assigned a discrete value of very low. Quantile 2 (20% to 40%) received a discrete value of medium low; all the way to quantile 5 (80% to 100%) which was assigned a discrete value of very high.

B.3 Partition Around Medoids (PAM)
The expression values of all the genes in microarray i were clustered into 5 clusters using PAM (pam function from R package cluster). The lowest cluster (1) contained genes whose expression values were in the lowest range of expression level. The remaining clusters, 2 through 5 contained genes with higher expression values than those in the previous cluster. Therefore, genes whose expression values were clustered in the lowest and highest end of the spectrum (clusters 1 and 5) were discretized as very low and very high respectively. Genes in clusters 2 and 4 were discretized as medium low and medium high. Genes in the remaining cluster (3) were discretized as medium.
Note on discretization of Affymetrix microarrays Regardless of the discretization method used, the above processes were applied only to probesets marked as either "Present" or "Marginal" (P/M). Probesets marked as "Absent" were always given a discrete value of very low.

B4. Evaluation Criteria
Each algorithm yielded a set of discrete values for all genes in all microarrays of a dataset. Our goal was to determine which of the sets of discrete values created more contrast between ER+ and ER-samples. Firstly, the contrast was measured for differentially expressed genes and, secondly, for non-differentially expressed genes. The unweighted pair-group method arithmetic averages (UPGMA) formula below was used to measure the difference in discrete values for a given gene k in ER+ and ERsamples: In summary, all the discrete values of gene k in ER+ samples were compared against the discrete values of gene k in ER-samples. This process was repeated for all the differentially expressed genes in dataset d , accumulating the Δgene k for all of them.
Our requirements for a good discretization algorithm were the following: For differentially expressed genes: 1. The sum of all Δgene k should be large: we wanted a discretization algorithm that maximized the distance between discrete values of different phenotypes. 2. The number of genes that get the same discrete value in both phenotypes should be minimized. For example, we wanted to avoid as much as possible a case where gene k was given a discrete value of, say very high, in all M microarrays of dataset d . This is an important consideration because even if a gene was determined to be differentially expressed in our analysis, the difference in expression values between phenotypes can be subtle enough that the discretization method could potentially generate the same discrete values for both phenotypes.
Conversely, the same requirements (with opposite criteria) were used for genes that were not differentially expressed.

For the rest of the genes (not differentially expressed):
1. The sum of all Δgene k should be low: in this case, it had to be lower than the value obtained for differentially expressed genes. 2. The number of genes with the same discrete value in both phenotypes should be maximized: it is fair to say that if the gene is not differentially expressed, there should not be much variation between phenotypes.
The sum of all Δgene k per dataset was divided by the number of genes to obtain an average Δgene k . If the discrete values of gene k were the same in all M microarrays, we counted it as same. Otherwise, we counted it as different. These counts as well as the average Δgene k can be seen as columns in Tables S3 and S4.
Computing these values for the differentially expressed genes was straightforward and the results are shown in Figure S5. But obtaining these values for the rest of the genes required taking random samples of genes (not differentially expressed), computing the sum of all Δgene k , obtaining the counts same and different and repeating the process for 100 iterations. For each dataset the sample size r differed based on the number of differentially expressed genes in it. For example, in the Boersma dataset r = 690 (See Table S5 in the next section). After 100 iterations, the counts were averaged and the results for non-differentially expressed genes can be found in Table S4. The column Δgene k contains the average UPGMA for all the genes analyzed. The column same is the count of genes for which the discrete values were the same across all samples of both phenotypes. The column diff indicates the count of genes for which at least one discrete value was different between phenotypes. The addition of same + diff for each dataset is equal to the number of differentially expressed genes in that dataset (see Table  S5 in the next section) When comparing the counts of same vs. different obtained for differentially expressed genes we can see that PAM outperformed the other two methods. This was of outmost importance to us because the differentially expressed genes in Enerly and Buffa were used as evidence in the BN inference. Figure S5 shows these counts as percentages. The definitions for the columns are the same as in Table S3. The counts same and diff are averages of the 100 runs. The addition of same + diff is equal to sample size of the dataset, which coincides with the number of differentially expressed genes in that dataset (see Table S5)

Figure S5. Same vs. different discrete values in differentially expressed genes
For the non-differentially expressed genes, the Quantiles method performed better than the other two. Because the genes were not differentially expressed, we expected to see a higher value for same and a smaller value for different. Figure S6 shows the results for the non-differentially expressed genes.
Surprisingly, Sigma-mu performed very poorly in both cases. When looking at the shape of the density functions in Figure S1, our intuition suggested that this method would be appropriate. But under close examination we can see there is a positive skew in those density functions which, although not as prominent as in Figure S2, it makes this discretization method to perform badly in comparison to PAM and Quantiles. Lists of differentially expressed genes between ER+ and ER-samples were obtained for all datasets. These differentially expressed genes were used in two different contexts: • Differentially expressed genes obtained from the first six datasets were used to evaluate different discretization methods (See previous section B in this Supplemental file) • Differentially expressed genes obtained from the Enerly and Buffa datasets were used as evidence in the BN inference process. The remainder of this section provides technical details about the process we followed to obtain differentially expressed genes in each microarray platform. We close this analysis by comparing lists of differentially expressed genes obtained in different datasets to show that they do not overlap well. The following table summarizes the steps mentioned above and shows the different number of probesets that remained after each processing step. The column Original lists the number of probeset/probes after removing those not associated to an Entrez gene Id. The Filtering method column indicates how the probes were further filtered. 1. Discard probesets with CV < 0.5 across samples 2. For genes with multiple probes, only the probe with maximum variance across all samples was kept. 3. Probes with detection signal in less than 10% of the samples or probes not associated with H.sapiens were discarded. 4. Discard probesets with CV < 0.2 across samples. 5. microRNA probes not associated with H.sapiens were discarded

C2. Overlaps among differentially expressed gene lists
We identified discrepancies when analyzing the lists of differentially expressed genes that were obtained for each dataset. Although the patients in these different studies were divided according to two distinct phenotypes such as ER+ and ER-the differentially expressed genes across studies did not have a large overlap.
For each dataset we ranked the genes according to an ascending order of adjusted pvalue. We examined the overlap across multiple datasets for all the differentially expressed genes as well as the overlap for the top 100-ranked genes, 200-ranked genes, and other top rankings. Additionally, we wanted to determine if a gene was reported with a different status in two datasets, e.g. as down-regulated in ER+ samples of one dataset and reported as up-regulated in ER+ of a different dataset. Table S6 shows the details. It can be seen that only 4 genes overlap when taking the top 100-ranked differentially expressed genes in the first 6 Affymetrix datasets. This was a clear sign, early on, that our integrative analysis could not simply rely on differentially expressed genes. As was mentioned before, for our Bayesian inference we used all genes in the microarrays regardless if they were differentially expressed or not. The information about differentially expressed genes in each of the six Affymetrix datasets was nonetheless useful as was illustrated in the previous section (Section B, Discretization of expression data) When comparing the Enerly and Buffa datasets, there were 907 genes that overlapped among the top 2000-ranked differentially expressed genes. These 907 genes were used as evidence for the BN inference process. In these two datasets there are four differentially expressed genes with opposed expression statuses (Table S7). These four genes were discarded from our analysis.

Supplemental Results
Gibbs sampling, approximation of marginal probability for Bayesian inference Figure S7. A toy BN with 36 nodes Figure S8. The Gibbs sampling error for the 36-node network Inference results on the breast cancer data    (de)* (de) *de: indicating that the gene is differentially expressed and was used as evidence. (de)* (de) *de: indicating that the gene is differentially expressed and was used as evidence.
In addition to the results mentioned in the main text, here we detail other findings obtained for the MAPK signaling pathway which are also related to the TF STAT5B identified in the p53 signaling pathway.
MAPK signaling pathway (KEGG Id 04010) As it was the case for the p53 signaling pathway, here we also had the chance to identify a regulator based on direct interactions between the regulator and genes in the pathway. We had 15 differentially expressed genes reported for this pathway. Here too, our strict selection criteria yielded few TFs (only 5 in Enerly). Yet, one of them, STAT5B might be considered a regulator for this pathway based on our results. We predicted 6 target genes of STAT5B: MAP3K12, NFKB2, RRAS2, FGF23, MAPK10 and PTPRR. These genes had no edges connecting them directly in the pathway. The marginals corresponding to the last 3 genes do not vary much between scenario #1 and #2 (Supplemental Tables S13 and S14). In contrast, the first 3 genes do show a difference between the scenarios and MAP3K12 is in fact differentially expressed and up-regulated in ER+ samples. The TF STAT5B shifts its certainty of being in state medium low in scenario #2 to a more uncertain state in scenario #1. In scenario #1 we see an increment in the marginals corresponding to higher levels of expression with respect to scenario #2. The fact that STAT5B may slightly increase its expression in scenario #1 provides a better explanation of why NFKB2 and RRAS2 also have a shift in marginals towards higher expression states from scenario #2 to #1. Therefore, we postulate that STAT5B is a potential regulator of the MAPK signaling pathway in breast cancer when comparing ER+ and ER-patients. Because STAT5B showed to be of importance in two different pathways, this reinforces our hypothesis that STAT5B is a key regulator in breast cancer patients.