#### Variability analysis

For each molecular phenotype (antibody) we used a statistical model to quantify the biological and experimental variation inherent in the phenotype. As mentioned before, the aim was to partition the total variability in a molecular phenotype into variability that is attributable to familiality (additive genetic, dominant genetic and common environmental), individual environmental, common visit, individual visit and experimental components, respectively. Note that familial and individual environmental variability together make up stable biological variability.

As mentioned previously, the measured quantity of interest was the intensity of fluorescence emitted when an immobilised antibody captured an antigen in a sample. For each antibody let Y

_{ijkl} denote the normalized and transformed intensity of the

*l* th aliquot replicate (1, 2), at the kth visit (1, 2) of the jth twin (1, 2) from the ith twin pair (1, 2, ...,77). The data for each antibody was modelled using a linear mixed-effects model [

30–

32]:

${\mathsf{\text{Y}}}_{\mathsf{\text{ijkl}}}={\mathrm{\mu}}_{\mathsf{\text{p}}\left(\mathsf{\text{i}},\mathsf{\text{j}},\mathsf{\text{k}},\mathsf{\text{l}}\right)}+{\mathsf{\text{A}}}_{\mathsf{\text{ij}}}+{\mathsf{\text{D}}}_{\mathsf{\text{ij}}}+{\mathsf{\text{C}}}_{\mathsf{\text{i}}}+{\mathsf{\text{E}}}_{\mathsf{\text{ij}}}+{\mathsf{\text{W}}}_{\mathsf{\text{ik}}}+{\mathsf{\text{V}}}_{\mathsf{\text{ijk}}}+{\mathrm{\epsilon}}_{\mathsf{\text{ijkl}}},$

(1)

where μ

_{p(i, j, k, l)} is the mean intensity for samples on plate p (the function p(i, j, k, l) maps sample aliquots to plates), A is the additive genetic effect, D is the dominant genetic effect, C is the common environment effect, E is the individual or unique environment effect, W is the common visit effect, V is the individual visit effect, and lastly ε represents the residual experimental effects. The plate mean μ

_{p(i, j, k, l)} is a fixed effect, while A, D, C, E, W, and V are random effects. It is assumed that the random effects are additive and independent. This implies that the total phenotypic variance can be decomposed as:

$\mathsf{\text{Var}}\left(\mathsf{\text{Y}}\right)=\mathsf{\text{Var}}\left(\mathsf{\text{A}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{D}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{C}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{E}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{W}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{V}}\right)+\mathsf{\text{Var}}\left(\mathrm{\epsilon}\right).$

The covariance between phenotypic measurements on a pair of samples gathered at the same visit from monozygotic twins is Var(A) + Var(D) + Var(C) + Var(W), and for dizygotic twins is (1/2)Var(A) + (1/4)Var(D) + Var(C) + Var(W). Familiality (fam) is the proportion of variance attributable to genetic and common environmental effects:

$\mathsf{\text{fam}}=\left(\mathsf{\text{Var}}\left(\mathsf{\text{A}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{C}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{D}}\right)\right)\u2215\mathsf{\text{Var}}\left(\mathsf{\text{Y}}\right).$

For twin data, the variances Var(A), Var(C) and Var(D) are not all identifiable. However familiality is an estimable function of these variance parameters.

For each antibody, maximum likelihood (ML) estimates of the variance components were obtained by fitting model (1) using the lmer function from the lme4 package in R [

33] after normalization, removal of outliers and Box-Cox transformations. The model was reduced prior to fitting. The variances of A, C and D are not identifiable. In order to address non-identifiability we re-parameterized the three non-identifiable variances, Var(A), Var(C) and Var(D) by defining two new random effects H and M (whose variances are identifiable) as follows:

$\mathsf{\text{Var}}\left(\mathsf{\text{H}}\right)=\left(1\u22152\right)*\mathsf{\text{Var}}\left(\mathsf{\text{A}}\right)+\left(1\u22154\right)*\mathsf{\text{Var}}\left(\mathsf{\text{D}}\right)+\mathsf{\text{Var}}\left(\mathsf{\text{C}}\right)$

$\mathsf{\text{Var(M)=(1/2)*Var}}\left(\mathsf{\text{A}}\right)+\left(3\u22154\right)*\mathsf{\text{Var}}\left(\mathsf{\text{D}}\right)$

Subsequently the familial variance (which represents the combined effects of genetics and common environment) was obtained as the sum of the estimated variances of H and M, i.e familial variance = Var(H) + Var(M).

Note that if x_{1} and x_{2} are the phenotype measurements of two monozygotic (MZ) twins then the covariance matrix of the measurements is such that: Var(x_{1}) = Var(x_{2}) = Var(H) + var(M) +Var(E) and covariance(x_{1}, x_{2}) = Var(H) + Var(M). Similarly if x_{1} and x_{2} are the phenotype measurements of two dizygotic (DZ) twins then covariance matrix of the measurements is such that: Var(x_{1}) = Var(x_{2}) = Var(H) + Var(M) + Var(E) and covariance(x_{1}, x_{2}) = Var(H). A similar approach has previously been used and elaborated on metabolomics data [34].

The PQN pre-processing procedure that we applied should mitigate any systematic differences between measurements in the different wells introduced by the instrumental time drift between plate well 1 through to plate well 96. However, in order to investigate if there were any substantial significant intensity-drift effects remaining after PQN we also fit a second model (hereafter referred to as model 2) by introducing a linear, plate-specific well location term into model 1, where well location is a number between 1 and 96 representing the position of a sample in the plate wells. That is in model 2, for each antibody we fit a linear mixed-effects model with linear, plate-specific intensity-drift fixed effects and the random effects A, D, C, E, W, and V as before. Subsequently Akaike's information criterion [

35,

36] was used to compare the fit of models 1 and 2. Akaike's information criterion (AIC) is a measure of the goodness of fit of an estimated statistical model and is defined as:

$\mathsf{\text{AIC}}=\mathsf{\text{2k-2ln(L),}}$

where k is the number of parameters in the model, and L is the maximized value of the likelihood function for the estimated model. For a given a data set, several competing models may be ranked according to their AICs and the model with the smallest AIC is chosen as the one that fits the data best. The pre-processed data and R script used to perform the variability analysis can be found as Additional Files 3 and 4.