In the following, two standard approaches and three recently proposed approaches for testing GxE interactions are discussed first. Afterwards, we propose two novel approaches that can be used to overcome the drawbacks of the initially discussed methods.

### Existing methods

First, existing methods for GxE interaction testing are discussed.

#### Single-SNP-based GxE interaction test

The GxE interaction test based on single SNPs tests each considered SNP independently for a GxE interaction^{3}. That is, for each (textrm{SNP}_j), (j in lbrace 1,ldots ,prbrace), a GLM

$$begin{aligned} g(mathbb {E}[Y mid textrm{SNP}_j, E, varvec{C}]) = beta _0 + beta _1 textrm{SNP}_j + beta _2 E + beta _3 textrm{SNP}_j times E + sum _{i=1}^m gamma _i C_i end{aligned}$$

(1)

is fitted, where also potential confounders (varvec{C} = begin{pmatrix} C_1&ldots&C_m end{pmatrix}) are included in the model to adjust the main effects of (textrm{SNP}_j) and the environmental variable *E* as well as their interaction effect for these variables. If a binary disease status is the considered outcome, logistic regression models via the link function (g = textrm{logit}) are fitted. If the considered outcome is continuous, the identity link (g = textrm{Id}) is used for fitting linear regression models.

In each of these models, the statistical hypothesis (H_0: beta _3 = 0) versus (H_1: beta _3 ne 0) is tested, i.e., whether there is an interaction effect of the SNP and the environmental variable *E* on the outcome. A Wald test is usually performed for testing this hypothesis. Alternatively, a score test or a likelihood-ratio test can be carried out for testing the same hypothesis^{19}. If, e.g., it should be tested whether a gene interacts with *E*, its SNPs are tested and the test decision for the whole gene is made by adjusting the individual SNP testing results for multiple testing. Usually, a Bonferroni correction is carried out^{3,20}. If, after the Bonferroni correction, for at least one SNP the null hypothesis could be rejected, the global null hypothesis of no GxE interaction on the gene is rejected as well.

This approach has the advantage of not having to train a model but to directly perform statistical testing. Moreover, it is very simple, straightforward, and computationally feasible, since the individual tests can also be parallelized. However, due to considering individual SNPs and performing adjustment for multiple testing, statistical power for detecting a GxE interaction is lost.

#### GRS-based GxE interaction test

In contrast to the single SNP test, the GRS-based GxE interaction test aggregates multiple SNPs into one model and uses this model to test if there is an interaction in the considered genomic region^{3}. Usually, the GRS is a linear combination of SNPs

$$begin{aligned} widehat{textrm{GRS}} = {hat{alpha }}_0 + {hat{alpha }}_1 {textrm{SNP}}_1 + ldots + {hat{alpha }}_p textrm{SNP}_p end{aligned}$$

(2)

that is either constructed internally or externally.

External GRS rely on summary statistics of independent studies and use the individual SNP effect sizes for determining their weights ({hat{alpha }}_1, ldots , {hat{alpha }}_p)^{21,22}. This approach, thus, requires the availability of appropriate study data, i.e., the same outcome, the same genomic region, and the same population type had to be analyzed^{23}. Furthermore, the external approach only allows the construction of linear GRS, i.e., in general not taking interactions between genetic loci into account.

Alternatively, GRS can be constructed internally^{3,24}, which means that the available data has to be divided into independent training and test data sets. The GRS is constructed using the training data and evaluated on the test data. This data splitting is crucial to avoid overfitting, i.e., to avoid detecting effects that are solely made up of statistical noise and are recognized due to the model adapting to this statistical noise. The internal approach also allows to generalize the task of constructing GRS to a statistical learning problem, in which a function is to be fitted that maps the SNPs to the outcome and that does not necessarily have to be linear^{6}.

When internally constructing GRS, usually a GLM-based procedure such as the elastic net^{25} is utilized^{17,26}. The elastic net also fits a linear model (2)—yielding the weights ({hat{alpha }}_1, ldots , {hat{alpha }}_p) and intercept ({hat{alpha }}_0)—and regularizes the effect coefficients (varvec{alpha } = begin{pmatrix} alpha _1&ldots&alpha _p end{pmatrix}). This is done by including the penalty term

$$begin{aligned} R_xi (varvec{alpha }) = frac{1}{2} (1-xi ) ||varvec{alpha }||_2^2 + xi ||varvec{alpha }||_1 end{aligned}$$

in the optimization problem

$$begin{aligned} min _{alpha _0, varvec{alpha }} left{ -frac{1}{N} ell (alpha _0, varvec{alpha }) + lambda R_xi (varvec{alpha }) right} , end{aligned}$$

in which (ell) is the log-likelihood function of the considered parameters, (lambda) is the penalty strength, and (xi in [0,1]) is a parameter controlling the balance between the (L_1) penalty and the (L_2) penalty, i.e., the lasso penalty^{27} and the ridge penalty^{28}, respectively. The lasso penalty leads to shrinking the coefficients of unimportant SNPs to zero while the ridge penalty assigns similar weights to highly correlated SNPs, which, e.g., might be the case for SNPs in high LD (linkage disequilibrium). Thus, the elastic net simultaneously performs a variable selection and a properly handling of SNPs in high LD. However, as for GLMs, only marginal SNP effects are modeled if no prior knowledge about which loci might interact is available, which is usually the case.

After constructing the GRS on training data, predictions (widehat{textrm{GRS}} = {hat{alpha }}_0 + {hat{alpha }}_1 textrm{SNP}_1 + ldots + {hat{alpha }}_p textrm{SNP}_p) on independent test data are performed. These predicted values of the GRS for the subjects in the test data set are then used to fit the GLM

$$begin{aligned} g(mathbb {E}[Y mid widehat{textrm{GRS}}, E, varvec{C}]) = beta _0 + beta _1 widehat{textrm{GRS}} + beta _2 E + beta _3 widehat{textrm{GRS}} times E + sum _{i=1}^m gamma _i C_i. end{aligned}$$

(3)

As for the single-SNP-based test, if a binary disease status is the phenotype of interest, the (textrm{logit}) is used as link function *g* for fitting logistic regression models in both the GRS construction step and the GxE testing step. For continuous phenotypes, linear regression models are fitted using the identity as link function *g*.

For testing if the considered genomic region interacts with *E* regarding the outcome *Y*, the statistical test (H_0: beta _3 = 0) versus (H_1: beta _3 ne 0) is performed. Similar to the single-SNP-based test, this hypothesis is most commonly tested using a Wald test. A GxE interaction is present if (H_0) is rejected at a prespecified level of significance. In contrast to the single-SNP-based test, this test result directly reflects the desired test decision such that no adjustment for multiple testing has to be performed.

The drawback of this GRS-based testing approach is the requirement for splitting the available data into independent training and test data sets, where simulation studies suggest that a random 50:50 split should be used^{23}. However, since in this case only 50% of the data can be used for actually testing the GxE interaction, substantial statistical power for detecting the GxE interaction is lost.

#### Set-based gene–environment interaction test

SBERIA^{9} is a GxE interaction test that also utilizes a weighted sum of SNPs, similar to the GRS-based procedure. In SBERIA, all SNPs are univariately screened for either the association with the environmental factor or the association with the outcome. The results of this screening are used for constructing a weighted sum of SNPs. More precisely, this sum is constructed as (widehat{textrm{GRS}} = w_1 textrm{SNP}_1 + ldots + w_p textrm{SNP}_p) with

$$begin{aligned} w_j = varepsilon + {left{ begin{array}{ll} 0, &{} text {if},p,text {value for},textrm{SNP}_j > theta \ -1, &{} text {if},p,text {value for},textrm{SNP}_j le theta ,text {and correlation of}, textrm{SNP}_j ,text {with},E,(text {or},Y) ,text {negative} \ +1, &{} text {if},p,text {value for},textrm{SNP}_j le theta ,text {and correlation of}, textrm{SNP}_j ,text {with},E,(text {or},Y) ,text {positive}. end{array}right. } end{aligned}$$

The offset (varepsilon) is usually chosen as 0.0001 and the significance threshold (theta) as 0.1. The GxE interaction is then tested as in the GRS-based test using the GLM from Eq. (3). However, in contrast to the GRS-based test, the weighted sum utilized in SBERIA only considers the magnitude of the genetic effects to a limited extent. Nonetheless, through this limited modeling, the overfitting problem of the GRS-based testing does not arise such that the full data can be utilized for constructing the weighted sum and testing the GxE interaction even in low sample size scenarios.

#### Gene–environment set association test

GESAT^{10} is a GxE interaction test that belongs to the class of variance component tests. In variance component tests, the GLM

$$begin{aligned} g(mathbb {E}[Y mid textbf{SNP}, E, varvec{C}]) = delta _0 + varvec{delta }_1^T textbf{SNP} + delta _2 E + varvec{delta }_3^T textbf{SNP} times E + sum _{i=1}^m gamma _i C_i end{aligned}$$

is considered for testing the GxE interaction, where (textbf{SNP} = begin{pmatrix} textrm{SNP}_1&ldots&textrm{SNP}_p end{pmatrix}) is the vector of all considered SNPs, (varvec{delta }_1) is the vector of corresponding main effects and (varvec{delta }_3) is the vector of corresponding GxE interaction effects. The GxE interaction effects are modeled as random effects with mean zero and a common variance (tau ge 0). Testing the presence of a GxE interaction anywhere in the considered set of SNPs is now equivalent to testing (H_0: tau = 0) versus (H_1: tau > 0). In GESAT, a score test is used for testing this hypothesis. For computing the score test statistic, the main effects have to be estimated under the null model only incorporating main effects. In GESAT, this is done by applying ridge regression. The authors have shown that the score test statistic follows—under the null distribution of no GxE interaction—asymptotically a mixture of (chi ^2)-distributions.

#### Adaptive combination of Bayes factor method

ADABF^{13} is a recently proposed GxE interaction testing approach that tries to overcome the issues of classical tests, i.e., the need for data splitting or for too conservative multiple testing adjustment, by considering Bayes factors. ADABF starts by individually screening all considered SNPs for associations with the outcome. Only the (p_S le p) SNPs passing this screening (e.g., only SNPs that are significantly associated with respect to a level of significance of 5%) are used for testing the GxE interaction itself. Similar to the single-SNP-based test, individual GLMs (see Eq. (1)) are fitted for each considered SNP. Then, Bayes factors

$$begin{aligned} textrm{BF} = frac{mathbb {P}(textrm{data} mid H_{1})}{mathbb {P}(textrm{data} mid H_{0})} end{aligned}$$

are computed for each SNP and the corresponding hypothesis (H_0: beta _3 = 0) versus (H_1: beta _3 ne 0) of the GxE interaction coefficient of this SNP. Prior knowledge from previous studies is used for configuring the variance of the prior distributions of both the main effects and the GxE interaction effects. Since it is of interest to test the whole considered set of SNPs for a GxE interaction and not just single SNPs, the Bayes factors are combined into summary scores

$$begin{aligned} S_k = sum _{l=1}^k log (textrm{BF}_{(l)}) end{aligned}$$

with (textrm{BF}_{(l)}) ((l in lbrace 1,ldots ,p_S rbrace)) being the decreasingly sorted Bayes factors for the considered SNPs such that (textrm{BF}_{(1)} ge ldots ge textrm{BF}_{(p_S)}). These summary scores are also computed under the null distribution of no GxE interaction, i.e., by randomly sampling GxE interaction effects from a multivariate normal distribution with mean zero (corresponding to no effect) and a covariance matrix incorporating LD (linkage disequilibrium) between the SNPs. Afterwards, the original summary scores and the sampled versions are compared for deriving *p* values for every (k in lbrace 1,ldots ,p_S rbrace). Minima of these *p* values are computed for deriving a final *p* value that tests the global null hypothesis of no GxE interaction across all considered SNPs.

#### Bootstrap aggregating

To overcome the loss in statistical power through limited modeling or data splitting, we propose employing bagging (bootstrap aggregating)^{16} for constructing the GRS in GxE interaction testing. Bagging is an ensemble approach that constructs *B* single models and combines them to one prediction model by averaging over the predictions of the individual models. The number of models *B* is chosen prior to fitting the model and should be set to a sufficiently high number such that more iterations do not considerably change the ensemble model. Each individual model is fitted by randomly drawing a bootstrap sample from the complete available data set, i.e., drawing *N* observations with replacement from a data set consisting of *N* observations, and using this sample to train the model such as a GLM via elastic net. A key property of bagging is that it reduces the variance of the predictions, thus, stabilizing the predictions^{29}.

Since in every iteration a bootstrap sample is used for training the model, there is complementary data left that was not used for training this sub model. These data are called OOB (out-of-bag) data. Utilizing this fact, unbiased predictions on the complete data set can be made. For each observation, those models are gathered that did not use this observation for training. These models are used to temporarily construct an ensemble and the OOB prediction is generated by calculating the average over these models, so that for an observation ((varvec{x}, y)), its OOB prediction ({hat{y}}_{textrm{OOB}}) is calculated by

$$begin{aligned} {hat{y}}_{textrm{OOB}} = frac{1}{|mathcal {F}_{(varvec{x}, y)}|} sum _{f in mathcal {F}_{(varvec{x}, y)}} f(varvec{x}) end{aligned}$$

where

$$begin{aligned} mathcal {F}_{(varvec{x}, y)} = left{ f in mathcal {F} mid (varvec{x}, y) notin T_f right} end{aligned}$$

is the set of all trained models that did not use the considered observation for training, (mathcal {F}) is the set of all trained models in the ensemble, and (T_f) is the training data set used for training *f*. Thus, the OOB prediction for each observation is constructed by models that never have seen this specific observation, resembling test data predictions.

### Proposed methods

In the following, two novel GxE interaction testing methods based on bagging are introduced.

#### GxE interaction testing through bagging

For avoiding the data splitting problem in GxE interaction testing, we propose constructing the GRS using bagging, e.g., bagging using the elastic net as the base learner, and computing the OOB prediction for all individuals in the whole data set. These predictions can then be used as a predictor in the GLM (3) as before and the statistical hypotheses (H_0: beta _3 = 0) versus (H_1: beta _3 ne 0) are tested using a Wald test analogously to the conventional GRS-based test. Note that in contrast to the conventional GRS-based test, the GLM (3) is fitted and tested using all available data. Similarly, the GRS is in this case also fitted using all available observations. Therefore, neither the modeling step nor the testing step suffer from reduced sample sizes in this approach.

Figure 1 illustrates the proposed bagged GxE interaction testing approach considering (N = 5) subjects and (B = 5) bagging iterations/bootstrap samples (note that both numbers should actually be much higher; here, we only consider these small numbers for illustration purposes). First, for each out of (B=5) bagging iterations, a bootstrap sample is drawn from the original sample consisting of (N = 5) observations and used for training the respective model such as a GLM through elastic net. For example, in the first iteration, the model is fitted using the observations 1, 3, 4, and 5. Next, for each of the (N=5) observations, those models are selected that did not use the respective observation for training and are used for generating the OOB predictions for the GRS. For example, for the first observation, the models 2 and 4 are used to predict its GRS by averaging their predictions, since observation 1 was not used for fitting the models 2 and 4. These predicted GRS values are then used as the values of the predictor to fit the GLM (3) and test whether the GxE interaction is associated with the outcome via its coefficient (beta _3) using a Wald test.

The GRS can be an arbitrary summary of genetic loci. Hence, if loci from multiple genes should be tested in a single GxE interaction test, the bagged GRS is constructed using all considered loci at once. For example in the real data application, a GxE interaction is tested for an association-based SNP selection that can potentially lead to loci from multiple different genes and for a gene-based SNP selection that was derived by considering multiple genes at once.

#### Random-forests-based GxE interaction test

Common GRS construction procedures such as the elastic net rely on linear modeling of genetic effects. Thus, these approaches usually model only marginal genetic effects unless prior knowledge about which loci might interact is available. Instead, more flexible modeling techniques such as random forests^{18}, which are theoretically able to model every possible interaction, can also be used to construct GRS. It has been previously shown^{6} that these predictions can substantially outperform the standard method elastic net in the construction of GRS.

Therefore, we propose using random forests for the GRS construction step in testing GxE interactions. Random forests is an extension of bagging that uses decision trees^{30} as its base learner. The individual decision trees are further randomized by selecting random subsets of the predictor set in the recursive fitting procedure. This additional randomization leads to an increased variance reduction. Due to employing bagging, random forests is a natural candidate for applying the OOB-predictions-based GxE interaction test discussed in the previous section. Here, the sub models that are used for computing OOB predictions are the individual randomized decision trees.

### Ethics approval and consent to participate

The study was conducted in accordance to the declaration of Helsinki. The SALIA cohort study has been approved by the Ethics Committees of the Ruhr-University Bochum and the Heinrich Heine University Düsseldorf. Written informed consent was received from all participants.