## Abstract

Genome-wide prediction approaches represent versatile tools for the analysis and prediction of complex traits. Mostly they rely on marker-based information, but scenarios have been reported in which models capitalizing on closely-linked markers that were combined into haplotypes outperformed marker-based models. Detailed comparisons were undertaken to reveal under which circumstances haplotype-based genome-wide prediction models are superior to marker-based models. Specifically, it was of interest to analyze whether and how haplotype-based models may take local epistatic effects between markers into account. Assuming that populations consisted of fully homozygous individuals, a marker-based model in which local epistatic effects inside haplotype blocks were exploited (LEGBLUP) was linearly transformable into a haplotype-based model (HGBLUP). This theoretical derivation formally revealed that haplotype-based genome-wide prediction models capitalize on local epistatic effects among markers. Simulation studies corroborated this finding. Due to its computational efficiency the HGBLUP model promises to be an interesting tool for studies in which ultra-high-density SNP data sets are studied. Applying the HGBLUP model to empirical data sets revealed higher prediction accuracies than for marker-based models for both traits studied using a mouse panel. In contrast, only a small subset of the traits analyzed in crop populations showed such a benefit. Cases in which higher prediction accuracies are observed for HGBLUP than for marker-based models are expected to be of immediate relevance for breeders, due to the tight linkage a beneficial haplotype will be preserved for many generations. In this respect the inheritance of local epistatic effects very much resembles the one of additive effects.

- haplotype
- epistasis
- local epistatic effect
- genome-wide prediction
- GenPred
- Shared Data Resources
- Genomic Selection

Genome-wide regression is a powerful tool to analyze and predict quantitative traits which are regulated by many genes (Meuwissen *et al.* 2001). Various genome-wide prediction approaches have been explored and applied for human (Yang *et al.* 2010, de los Campos *et al.* 2010, 2013b), animal (Hayes *et al.* 2013, de los Campos *et al.* 2013a), and plant populations (Crossa *et al.* 2014, Heslot *et al.* 2015, Hickey *et al.* 2017). In most genome-wide prediction models, effects of molecular markers such as single nucleotide polymorphisms (SNPs) were used as explanatory variables (Cuyabano *et al.* 2015a). Alternatively, molecular markers can be combined into haplotypes, which are then used to implement genome-wide prediction models (Calus *et al.* 2008). Haplotype-based prediction approaches are favored if alleles at quantitative trait loci (QTL) were more closely linked to haplotype alleles than individual SNPs (Zondervan and Cardon 2004). Moreover, it is hypothesized that haplotypes can capture epistatic interactions between SNPs (Clark 2004, Zhang *et al.* 2014). Therefore, haplotype-based approaches potentially boost prediction accuracies (Cuyabano *et al.* 2014, 2015a, b).

The potential to exploit local epistatic effects among markers in haplotype-based prediction is interesting with respect to two points. First, epistasis has been recognized as a biologically influential component contributing to the genetic architecture of quantitative traits (Carlborg and Haley 2004, Mackay 2014, Jiang *et al.* 2017). The role of epistasis in genome-wide prediction has been extensively studied, but mostly in terms of marker-based approaches. Several marker-based models either implicitly or explicitly including epistatic effects in addition to main effects were developed (Xu 2007, Gianola and van Kaam 2008, Wittenburg *et al.* 2011, Jiang and Reif 2015, Vitezica *et al.* 2017). Taking epistasis into account can increase prediction accuracies (Wang *et al.* 2012, Muñoz *et al.* 2014, He *et al.* 2016). Second, decomposing epistasis into global and local effects is pivotal for evaluating the long-term impact of epistasis in plant and animal breeding, as there is a reduced chance that local epistatic effects will disappear after generations of recombination (Akdemir and Jannink 2015). First attempts in exploiting additive and local epistatic effects for genome-wide prediction were carried out with marker-based models resulting in good predictive performance and useful explanatory information (Akdemir and Jannink 2015, Akdemir *et al.* 2017, He *et al.* 2017). Nevertheless, it has not been clarified why and how the haplotype-based approaches take local epistasis into account at the level of statistical models.

The aims of this study were 1) to provide a formal theoretical explanation how haplotype-based genome-wide prediction models intrinsically exploit local epistatic effects among markers, 2) to investigate with simulation studies under which circumstances haplotype-based models perform better than marker-based models, and 3) to explore the potential of haplotype-based genome-wide prediction models using three published empirical data sets.

## THEORY

This section was organized as follows: First we introduced two genome-wide prediction models. Haplotype effects were used as explanatory variables in the haplotype-based genomic best linear unbiased prediction (HGBLUP) model, while additive and local epistatic effects among markers were utilized as predictors in the locally extended genomic best linear unbiased prediction (LEGBLUP) model. Then we proved that the haplotype-based model HGBLUP exploits local epistatic effects among markers by establishing a link between HGBLUP and LEGBLUP for the case in which all loci are homozygous. At the end of section, two examples were given to illustrate the theoretical results.

Throughout the section, we made following conventions: Let be the number of genotypes, be the number of markers. In this study we only considered bi-allelic markers. Suppose that the whole genome is divided into non-overlapping haplotype blocks; local epistasis is defined as interaction effects among two or more markers within a defined haplotype block. Let be the number of blocks. For , let be the number of markers in the -th block. Let be the number of different haplotype alleles in the -th block. Linkage phases were assumed to be known. Vectors (matrices) are always denoted by lower (upper) case Latin or Greek letters in bold font.

### The HGBLUP model

This model has been used in previous studies (*e.g.*, Cuyabano *et al.* 2014, 2015a) and here we called it HGBLUP. Independent from the definition of haplotype blocks, the HGBLUP model can be described as follows:[1]where is the -dimensional vector of phenotypic records, is an -dimensional vector of one’s, is a common intercept term, is the -dimensional vector of haplotype effects in the -th haplotype block, is the corresponding design matrix of the -th block, the -entry of is the number of the -th haplotype allele in the -th genotype (hence, it is 0, 1 or 2), and is the residual term. In the model we assumed that is a fixed parameter, for any , and . We assumed no covariance structure among these variables.

The formulation of this model is similar to ridge regression best linear unbiased prediction (RR-BLUP, Meuwissen *et al.* 2001) except that the marker effects were replaced by haplotype effects. Note that there are in total unknown parameters in the model. This number can be even larger than the number of markers, which makes the computational load very high. However, the model can be implemented in an alternative way similar to the marker-based genomic best linear unbiased prediction model (GBLUP, VanRaden 2008):[2]where , , , and are the same as in Equation 1; is an -dimensional vector of genotypic values. We assumed that is a fixed parameter, , and , where . Setting , it becomes obvious that the two models are statistically equivalent, as the equivalence between GBLUP and RR-BLUP (Habier *et al.* 2007).

### The LEGBLUP model

This model is a local version of the extended GBLUP (EGBLUP) (Jiang and Reif 2015). EGBLUP exploits epistasis between any pair of markers while LEGBLUP only considers local epistasis inside each haplotype block. Assuming only digenic epistasis, the model can be described as follows:[3]where , , , and are the same as in Equation 1, is the matrix of marker profiles, the -entry of is the number of a specific allele of the -th marker carried by the -th genotype (hence, it is 0, 1 or 2), is the -dimensional vector of marker additive effects, is the design matrix for additive-by-additive epistatic effects for markers in the -th haplotype block, is the -dimensional vector of epistatic effects in the -th block. In the model we assumed that is a fixed parameter, , for any , and . We assumed no covariance structure among these variables.

Note that there are unknown variables in the model with , and this number can be very large. Hence, the model can be implemented in an alternative way as:[4]where , , , and are the same as in Equation 1, is an -dimensional vector of additive genotypic values, is an -dimensional vector of genetic values accounting for local epistasis. We assumed that is a fixed parameter, , and , where , , and is the Hadamard product, *i.e.*, entry-wise product, of matrices. Setting and , it reveals that the two models are statistically equivalent. The reason is the same as the equivalence between EGBLUP and an extended RR-BLUP model including epistasis (Jiang and Reif 2015).

Note that in the above descriptions the LEGBLUP model only includes digenic local epistasis, *i.e.*, local epistatic effects between two markers. In fact, the LEGBLUP model can be generalized to include all possible higher-order epistatic interaction effects within each haplotype block, as the EGBLUP model. Briefly, we only need to extend Equation 4 to:[5]where is the vector of genetic values accounting for (r-1)-th order epistasis, *i.e.*, epistatic interactions among r markers. The kinship matrix for can be derived using t-fold Hadamard product of (Jiang and Reif 2015). This model is denoted as full LEGBLUP.

### The link Between HGBLUP and LEGBLUP

We first concentrated on a single haplotype block, thus subscripts to differentiate blocks can be ignored. We assumed markers and haplotype alleles and then the HGBLUP model (Equation 1) reduces to:[6]where is an -dimensional vector of haplotype effects and is an design matrix. The assumptions were that is a fixed parameter, , and .

For LEGBLUP, a unified expression of Equation 3 was needed to extend it to full LEGBLUP. We defined ** to be a vector whose components are marker main effects together with epistatic effects up to the -th order, ***i.e.*, all possible epistatic effects among any number of markers in the block, not only digenic epistatic effects. Thus, the dimension of is:where denote the Gaussian binomial coefficients. Let be the corresponding design matrix. With these notations, the full LEGBLUP model can be simply written as:[7]The assumptions were that is a fixed parameter, , , and is a diagonal matrix containing different unknown variance parameters for additive effects and different orders of epistatic effects.

#### Claim:

If all loci under consideration are homozygous, then there exists a matrix such that .

The above claim was the key to bridge HGBLUP and LEGBLP. As its proof requires more techniques in linear algebra, we presented it as a separate subsection below.

Now we assumed all loci to be homozygous. Setting , HGBLUP (Equation 6) can be expressed as:[8]The newly defined vector has the same design matrix as in the LEGBLUP model (Equation 7). Thus, includes marker effects as well as epistatic effects among markers. Accordingly, Equation 8 is the same as Equation 7 and hence HGBLUP has the same base equation as LEGBLUP.

Nevertheless, there is one important difference between the two models. In LEGBLUP, the covariance matrix for is assumed to be a diagonal matrix , hence, no covariance between different variables is assumed. But in HGBLUP, although the distribution of is still multivariate normal, its covariance structure is:In general, the matrix is semi-positive definite but not diagonal. Thus, HGBLUP implicitly assumes a non-trivial covariance structure.

Now it is straightforward to generalize the results to the case of a full model including all blocks, since no inter-block effects are modeled and the linear transformation can be independently applied to each block.

### The proof of the claim

As the loci under consideration are homozygous, there are independent variables in the LEGBLUP model (Equation 7). For the HGBLUP model, there are at most different haplotype alleles, *i.e.*, . But note that if there are haplotype alleles, the number of independent variables in the model is because of collinearity, similar to the biallelic case (*e.g.*, SNP markers) in which there is only one independent variable. Hence, we can assume . We shall consider two cases.

#### Case 1: All possible haplotype alleles occur:

We assumed that all possible haplotype alleles occur in the data, then . We started from the HGBLUP model (Equation 6). Recall that for any and , the -entry of , denoted by , is the number of the -th haplotype allele carried by the -th individual. Since all marker loci are homozygous, must be 0 or 2. As we assumed that all possible haplotype alleles occur in the data, for any () there exists () such that and for all and . Combining the rows of the design matrix results in an submatrix . It is clear that is invertible because it can be transformed to by row permutation. Correspondingly, we took the rows of the design matrix in the LEGBLUP model (Equation 7). This also yielded an submatrix . We observed that is also invertible. The proof of this fact was presented separately at the end of this subsection. As is invertible, we can define and hence with being invertible.

We then claimed that . In fact, for any and , the -th row of must coincide with for some because exhaust all the possibilities of row vectors for . Correspondingly, the -th row of must coincide with . Since and , are corresponding rows in and , . As it holds for any , .

#### Case 2: Not all possible haplotype alleles occur:

Now we assumed that not all possible haplotype alleles occur in the data (). In contrast to the case that considers all haplotype alleles, the submatrix in the LEGBLUP is not but . So we need to adjust our arguments. In fact, has full row rank: using the results in the previous case, can be viewed as a submatrix of a full rank matrix. Hence, there exists a right inverse which is a matrix such that . Defining , we still obtain , and hence .

#### The proof of the fact that is invertible:

Recall that * is an matrix, where . The columns of can be naturally indexed by the setIn fact we can denote the entries in by **. When , is just the number of alleles of the -th marker carried by the -th genotype, which serves as the coefficient for the main additive effect of the -th marker. When , is the coefficient of the epistatic effects among the markers , ,… and for the -th genotype. With the above notations, the column vectors of * can be denoted by .

The rows of * can also be labeled by the set , which is trivial because ** has the same number of rows as columns. But we can introduce the following natural labeling: If a genotype is coded as 2 in the markers , ,…, and 0 in the remaining ones, then we label the corresponding row as . With these notations, the entries in ** can be written as , where , , . By definition we have:[9]To show that is invertible, it is sufficient to show that the column vectors (, ) span the space , where denotes the set of rational numbers. The space has a natural basis , , where is the vector whose -entry is 1 and all other entries are zeros.*

We first considered . In this case we have only one vector , which is the coefficient of the epistatic effects among all markers. From Equation 9 we know that the only non-zero entry in is and it equals . So .

Next we considered the case . In this case we have vectors , where denotes that is absent in the sequence . Again using Equation 9, we know that there are only two non-zero entries in , namely and , both values are . Hence, .

Repeating the procedure for smaller, we can see that all basis vectors can be written as linear combinations of the vectors , which completes the proof.

### The case of heterozygous loci

Recall Equation 6 for HGBLUP and Equation 7 for LEGBLUP. Different from the case in which homozygous loci are considered, now the elements in the design matrix can take the value 1, in addition to 0 and 2. More precisely, when the paternal and maternal haplotypes are different, the corresponding row vector of will have two non-zero entries, both being 1. This essential difference makes it impossible to find a matrix such that holds in general. So there does not exist any linear transformation such that the base equations of HGBLUP and LEGBLUP become the same. This result was proved by giving a counterexample (see Example 2 in the next subsection).

### Illustration of the theoretical results

In this section, two examples were provided illustrating the theoretical findings for homozygous (Example 1; Table 1) and heterozygous loci (Example 2; Table 2).

#### Example 1:

We considered six individuals and one haplotype block with two SNP markers (Table 1). As outlined above the two homozygous genotypes were coded as 0 and 2 resulting in four different haplotype alleles.

The vector of haplotype effects is and the corresponding design matrix is . As the fourth column of can be obtained by subtracting the sum of the other three columns in the vector , the last variable can be dropped resulting in and . Then the HGBLUP model has the following form:[10]with the assumptions , .

The vector of marker effects is with the design matrix with the third column of being the element-wise product of the first two columns; so the LEGBLUP model has the form:[11]with the assumptions , .

We took the first three rows in and formed the submatrix , as each of the first three individuals carries a different haplotype allele. So . Then we accordingly took the first three rows in to form the submatrix ** and defined , thus, . Assuming resulted in withHence, the HGBLUP model (Equation 10) is equivalent to[12]with the assumptions , .**

Since in Equation 12 the parameters have exactly the same design matrix as in Equation 11, the base equations of HGBLUP and LEGBLUP are indeed the same. Setting and , we can see that the only difference between the models is that in LEGBLUP (Equation 11) the covariance between additive and epistatic effects was zero while in HGBLUP (Equation 12) the covariance was .

#### Example 2:

We considered six genotypes and a haplotype block with two SNP markers (Table 2). In contrast to Example 1, we assumed presence of heterozygous loci. The vector of haplotype effects is with the design matrix . As outlined in the first example, we can simply set owing to linear dependency and . The vector of marker effects is with the design matrix .

In the following, we showed that there does not exist any matrix such that , *i.e.*, the HGBLUP and LEGBLUP model have the same base equations. For the proof, we assumed the contrary, that there exists a matrix such that . Then for any submatrix of and the corresponding submatrix ** of , must hold. Let be the submatrix of consisting of the first three rows, so . Accordingly, . For **** to be true, the only choice for is that ****. Nevertheless,which is a contradiction. In fact, we can clearly see that only the last row of differs from . So the problem occurs when at least two loci are heterozygous for some genotypes.**

## Materials and Methods

### Simulation study

Based on the genomic data of a panel of maize lines belonging to the flint heterotic pool (Bauer *et al.* 2013), simulated traits were generated. Six scenarios were considered with different types and patterns of epistatic QTL effects (Table 3).

In all scenarios, 100 markers were randomly sampled as QTL for each of the 10 chromosomes, resulting in 1,000 QTL per scenario. In scenario 1, only additive effects were simulated. Hence, the genetic values are , where is the vector of additive QTL effects and is the marker design matrix. The additive effects were independently sampled from a normal distribution of mean 0 and variance , *i.e.*, . In scenario 2, we simulated additive and global epistatic effects. 1,000 pairs of markers were randomly selected to present digenic epistatic effects. Hence, the genetic values are , where and ** are the same as in scenario 1, and is the vector of epistatic effects with design matrix ****. The epistatic effects were also independently sampled from a normal distribution, ***i.e.*, **.**

In scenarios 3 to 6, we simulated additive effects and local epistatic effects. For local epistasis, we first randomly divided each chromosome into non-overlapping blocks, each consisting of 2 to 5 markers. Epistatic effects were simulated only inside individual blocks. Thus, the simulated genetic values , where w is the number of blocks, is the vector of additive and epistatic effects inside the k-th block and is the corresponding design matrix. In scenarios 3 and 5, only digenic epistatic effects were simulated. In scenarios 4 and 6, all possible epistatic effects were considered, hence, including higher-order epistasis. In scenarios 3 and 4, all effects were assumed to be independent. In scenarios 5 and 6, epistatic effects inside individual blocks were assumed to be correlated.

For each scenario, we considered one trait with two different simulated heritabilities ( = 0.7 or 0.5) and two ratios of variances ( = 4:3 or 3:1). In case of = 4:3, the covariance matrices of genetic effects in the individual haplotype blocks were directly derived using the method described in the Theory section, *i.e.*, the covariance matrix equals the matrix , which gave the ratio 4:3 and determined the variances for higher-order epistatic effects. To simulate a situation in which the variance of epistatic effects was less relevant, we considered also a 3:1 ratio. In this case, we modified the matrix as follows; we changed from to and accordingly modified all variance terms of higher-order epistasis by keeping the ratio of any two epistatic variance terms (*e.g.*, ). The variance of additive effects and all correlations were not changed. When only digenic epistatic effects were simulated, the rows and columns corresponding to higher-order epistasis were deleted.

As the final step, the phenotypic values were simulated as , where is the simulated genetic value as described above and is the environmental error term. The error terms were independently sampled from a normal distribution, *i.e.*, , where and is the genetic variance calculated from the simulated genetic values. For each scenario, trait heritability and variance ratio, simulations were repeated 20 times.

### Empirical data

#### Mouse data:

The mouse data set used for this study comprised 1,940 heterogeneous stock mice genotyped with 12,545 SNP markers. The measured traits were body weight at age of six weeks and growth slope between six and ten weeks of age (Valdar *et al.* 2006).

#### Rice data:

The rice data set comprised a diversity panel of 413 varieties genotyped with an Affymetrix 44K SNP array (Zhao *et al.* 2011). Individuals were highly homozygous. After quality control, 39,601 SNP markers were used in this study. Phenotypic data of 26 traits with contrasting genetic architectures were available.

#### Maize data:

The maize data set comprised a large half-sib maize panel from the flint heterotic pool generated within the European PLANT-KBBE CornFed project (Bauer *et al.* 2013). The panel consisted of 11 half-sib families with 833 doubled haploid (DH) lines. After quality control for missing rate and minor allele frequency, 29,466 SNP markers were used for subsequent analyses. Phenotypic traits under consideration were dry matter yield, dry matter content, plant height, days to tasseling and days to silking (Lehermeier *et al.* 2014).

### Genome-wide prediction

For the simulated and empirical data, we considered three marker-based models, GBLUP, EGBLUP and LEGBLUP, and one haplotype-based model, HGBLUP (Figure 1). For LEGBLUP and HGBLUP, we defined haplotype blocks using fixed lengths, varying from 2 to 5 (10) SNPs for the simulated (empirical) data. For the mouse data set, in which the linkage phase of the marker data are unknown, we treated each allele of a heterozygous locus as having equal probability (*i.e.*, 50%) to be maternal or paternal. The prediction accuracy (ability) was defined as the Pearson correlation between the predicted and the simulated (observed) genetic values for simulated (empirical) data. For each model the mean prediction accuracy was estimated with fivefold cross validation. All models were implemented using the statistical software R (R Core Team 2016) with the package BGLR (Peréz and de los Campos 2014).

### Data Availability

All empirical data used in this study have been published. The mouse data set was included in the R package SynbreedData (Wimmer *et al.* 2015, https://cran.r-project.org/web/packages/synbreedData/index.html). The rice data set was published in Zhao *et al.* (2011) and can be downloaded from https://ricediversity.org/data/sets/44kgwas/. The genomic data of the maize data set was published in Bauer *et al.* (2013) and can be downloaded from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50558. The phenotypic data of the maize data set was published as File S1 in Lehermeier *et al.* (2014) and can be downloaded from http://www.genetics.org/content/198/1/3.supplemental. Figure S1 shows the prediction accuracies of GBLUP, EGBLUP, LEGBLUP and HGBLUP for simulated traits with heritability 0.5 and = 4:3. Figure S2 shows the prediction accuracies of the four models for simulated traits with heritability 0.7 and = 3:1. Figure S3 shows the prediction accuracies of the four models for simulated traits with heritability 0.5 and = 3:1. Table S1 provides the prediction accuracies of the four models for the 26 agronomic traits in the rice data set. File S1 contains the R code used to generate the data for the simulation study. File S2 and File S3 contain sample genomic and physical map data sets for running the code.

## Results and Discussion

### Modeling haplotype effects exploit local epistasis among markers

We compared two genome-wide prediction models to study whether local epistatic effects among markers are formally taken into account by modeling haplotype effects. The first model utilizes haplotype effects as predictors and has been used in previous studies (Cuyabano *et al.* 2014, 2015a), here we called it HGBLUP. The HGBLUP model is similar to the well-known GBLUP model (VanRaden 2008) which exploits a marker-derived relationship matrix among genotypes. In HGBLUP the marker-derived relationship matrix is replaced by the haplotype-derived relationship matrix. Note that modeling a haplotype-derived relationship matrix is equivalent to explicitly modeling haplotype effects (Equation 1, 2), just like the equivalence between GBLUP and RRBLUP (Habier *et al.* 2007). The second model we considered takes into account additive effects as well as additive-by-additive local epistatic effects among markers and was termed LEGBLUP. LEGBLUP is a modified version of EGBLUP (Jiang and Reif 2015). EGBLUP exploits epistasis between any pair of markers while LEGBLUP only considers local epistasis inside each haplotype block (Equation 3, 4). Note that local higher-order epistatic effects can either be included (Equation 5) or excluded (Equation 3, 4) in the LEGBLUP model. The relationship between the different models was illustrated in Figure 1.

A theoretical link between HGBLUP and the full LEGBLUP including local higher-order epistatic effects was established for the case in which all marker loci were assumed to be homozygous. Then the HGBLUP model was proven to be almost statistically equivalent to the LEGBLUP model (Figure 2, and see **Theory** for details). More precisely, the base equation of HGBLUP (Equation 6) is linearly transformable to the one of LEGBLUP (Equation 7). After transformation, only one difference remains; the HGBLUP model assumes non-trivial covariance structure for the additive and local epistatic effects (Equation 8), while in the LEGBLUP model all effects are assumed to be independent (Equation 7). This theoretical derivation provided a formal explanation why and how haplotype-based genome-wide prediction models exploit local epistatic effects among markers.

Note that although almost all genome-wide prediction models assume independent marker effects, it was anticipated that some of the effects may be spatially correlated within chromosomes (Gianola *et al.* 2003). Moreover, it was reported that the prediction accuracy can be increased by the Bayesian antedependence model considering correlated marker effects (Yang and Tempelman 2012). Hence, the covariance structure among the additive and local epistatic effects suggested by the HGBLUP model can be beneficial and is interesting for further study.

A counterexample showed that the base equation of HGBLUP cannot be linearly transformed into the one of LEGBLUP in case that heterozygous loci need to be considered (Figure 2, Example 2 in **Theory**). Hence, further empirical studies are needed to compare HGBLUP with marker-based models to provide more insight into the similarities between marker- and haplotype-based prediction approaches for non-inbred populations.

Our theoretical derivations did not rely on a specific definition of the haplotype blocks in the HGBLUP model. This is important to note since the performance of haplotype-based models has been shown to depend on the method to define the haplotype blocks in experimental studies (Calus *et al.* 2008, 2009, Cuyabano *et al.* 2014, Jónás *et al.* 2016).

### Simulation studies showed that haplotype-based models indeed capture local epistatic effects

Simulation studies were used to scrutinize that the HGBLUP model exploits local epistatic effects among markers. Six scenarios which differed with respect to the nature and pattern of epistatic effects were utilized (Table 3) to compare the performance of HGBLUP with those of GBLUP and EGBLUP (Figure 3). In scenarios in which no local epistatic effects were simulated, the highest prediction accuracies were achieved by GBLUP (Figure 3a) and EGBLUP (Figure 3b) and no benefit was observed for HGBLUP. In the four scenarios in which local epistasis was simulated, considering window sizes from 3 to 5 HGBLUP clearly outperformed GBLUP and EGBLUP in two cases (Figure 3d, f), but not in the scenarios in which only digenic local epistatic effects were simulated (Figure 3c, e). According to the theoretical derivation, the HGBLUP model assumes correlated local epistatic effects and considers not only local digenic but also higher-order epistatic effects among markers. This explains why the HGBLUP model did not perform well in scenarios in which the latter assumption was not fulfilled.

The results shown in Figure 3 were obtained for a trait with a simulated heritability of 0.7 and a ratio of 4:3 for the simulated variance of additive effects to that of epistatic effects, . The ratio 4:3 represents an optimized ratio for HGBLUP as it was derived in the linear transformation from HGBLUP to LEGBLUP (see **Materials and Methods** for details). We observed that the findings in case of a lower heritability of 0.5 in conjunction with equaling 4:3 followed the same pattern (Figure S1), suggesting that the conclusions are valid for traits with a range of heritabilities. If was set to 3:1, the advantage of HGBLUP was reduced in scenarios in which higher-order local epistatic effects were simulated (Figure S2d, f and Figure S3d, f). These results are expected as the relevance of epistasis was purposely weakened by the applied ratio of 3:1 for . In summary, the results of the simulation studies confirm that local epistasis is indeed exploited by the HGBLUP model.

### Haplotype-based models are especially useful when local higher-order epistasis is important

Our theoretical derivations showed that the haplotype-based model HGBLUP is able to exploit local epistatic effects among markers, since HGBLUP and the marker-based model LEGBLUP were shown to be almost statistically equivalent. As a next step, we asked under which circumstances the haplotype-based model outperforms the marker-based model. In order to minimize the demand on computational resources, discussed in detail in the next subsection, we implemented the LEGBLUP model such that only additive and digenic local epistatic effects were considered (Equation 3). Under these constraints, two differences exist between HGBLUP and LEGBLUP. First, higher-order local epistasis is considered in HGBLUP but not in LEGBLUP. Second, HGBLUP assumes correlated local epistatic effects, while LEGBLUP assumes independent effects. The relative impact of these factors was assessed by comparing the performances of HGBLUP and LEGBLUP in our simulation study. In scenarios in which higher-order local epistasis was simulated, HGBLUP outperformed LEGBLUP regardless whether correlated or independent local epistatic effects were simulated (Figure 3d, f). In contrast, in scenarios in which only digenic local epistasis was simulated, the prediction accuracies of LEGBLUP were higher than those of HGBLUP (Figure 3c, e). In scenario 4 (Figure 3d), the assumption that local epistatic effects were independent should have favored LEGBLUP, nonetheless HGBLUP outperformed LEGBLUP suggesting that the influence of the effect pattern was masked by the inclusion of higher-order local epistasis. In scenario 5 (Figure 3e), local epistatic effects were assumed to be correlated, this should have favored HGBLUP, yet LEGBLUP yielded higher prediction accuracies than HGBLUP, indicating that the exclusion of higher-order epistasis had a stronger effect than the effect pattern. Thus, among the assumptions favoring HGBLUP, the presence of higher-order local epistasis was found to be the most important. This conclusion holds for a range of simulated heritabilities (Figure S1). However, when the ratio of the simulated variance of additive effects to that of epistatic effects increased the advantage of HGBLUP decreased (Figure S2d, f) and/or even disappeared at certain window sizes (Figure S3d, f).

The contribution of higher-order epistasis to the phenotypic variation of complex traits is poorly understood because higher-order epistasis is difficult to detect in genetic mapping studies (Taylor and Ehrenreich 2015). Nevertheless, evidences for higher-order gene interactions from model organisms were reported (Pettersson *et al.* 2011, Taylor and Ehrenreich 2014) and new approaches were developed to detect them (Sailer and Harms 2017). The comparisons of the prediction accuracies of HGBLUP *vs.* single marker-based approaches pave the way for a new approach to provide insights into the relevance of higher-order epistasis for complex traits.

### Haplotype-based models are computationally efficient in exploiting local epistasis

In the analyses of the experimental data, the LEGBLUP model was implemented in a way that only additive and digenic epistatic effects were included (Equation 3). Thus, two kinship matrices were considered in the LEGBLUP model, the additive kinship matrix and the digenic local epistatic kinship matrix. In contrast, the HGBLUP model is based on a single kinship matrix. We compared the speed of HGBLUP and LEGBLUP with 100 cross validations using a maize data set with 833 individuals and 29,466 SNP markers (see **Materials and Methods**). The computer used for the test was equipped with Intel(R) Core(TM) i7-6700 CPU (3.40 GHz) and 32.0 GB RAM. The computational time was with 51 min for the LEGBLUP model nearly twice as long compared to the HGBLUP model which took only 28 min. Although the full LEGBLUP model potentially may yield comparable prediction accuracies as HGBLUP when higher-order epistasis is relevant, it would be far less efficient than HGBLUP, therefore we did not implement the full LEGBLUP model which includes local higher-order epistasis in our data analyses. In summary, the haplotype-based model HGBLUP is computationally much more efficient in exploiting local epistasis compared to marker-based models. This point may be of particular relevance for future studies since ultra-high density SNP data sets are emerging for plant and animal populations owing to the rapid progress with regard to genotyping-by-sequencing approaches (Scheben *et al.* 2017).

### The performance of haplotype-based genome-wide prediction models in empirical data sets

Our theoretical and simulation results have shown that the HGBLUP model increases the prediction accuracy for inbred populations when local epistasis is abundant. To explore the potential of HGBLUP, we compared the performance of HGBLUP with the three marker-based models GBLUP, EGBLUP, and LEGBLUP using one animal data set and two crop data sets.

The mouse data set comprised non-inbred genotypes. For both analyzed traits (Figure 4), the HGBLUP model clearly outperformed the other three models, suggesting that the HGBLUP model also exploits local epistasis in case heterozygous loci need to be considered. This result is of particular relevance since it was not possible to prove theoretically that the HGBLUP model is able to exploit local epistatic effects in case of non-inbred populations (Figure 2). Given that haplotype-based genome-wide prediction models have been successfully applied in non-inbred cattle populations and outperformed alternative marker-based models (Boichard *et al.* 2012, Cuyabano *et al.* 2014, 2015a, b, Jónás *et al.* 2016), the haplotype-based genome-wide prediction model is an attractive tool for non-inbred populations.

For the rice data set, 26 agronomic traits (Zhao *et al.* 2011) with different genetic architectures were evaluated (Table S1). We observed that for three traits, such as protein content, HGBLUP outperformed all other models (Figure 5a). For two traits, including flowering time, HGBLUP gave slightly higher prediction accuracies than GBLUP and LEGBLUP, but lower ones than EGBLUP (Figure 5b). There were six traits for which only EGBLUP outperformed the other models, as shown for panicle fertility (Figure 5c). For the remaining fifteen traits, including plant height, GBLUP yielded the best prediction accuracies (Figure 5d).

For the maize data set, HGBLUP provided no benefit for the five traits under consideration. In fact, in all cases the best prediction accuracy was observed for the GBLUP model which only takes additive effects into account (Figure 6).

The contrasting results we observed for different traits in the crop data sets indicated that the haplotype-based model will not generally boost prediction accuracies in crop populations. Instead, the effectiveness of HGBLUP may depend on the complexity of the trait. Analysis of the trait flowering time in rice and maize revealed that HGBLUP increased prediction accuracies in rice (Figure 5b), but not in maize (Figure 6d, e). As a matter of fact, HGBLUP failed to increase prediction accuracies regardless which trait was analyzed for the maize data set, in contrast to the results for the rice data set. These findings are in accordance with those obtained in a recent study (Akdemir and Jannink 2015) where a semiparametric mixed model with multiple marker-derived local epistatic genomic relationship matrices was applied to wheat, barley, and maize data. It was observed that the local epistatic model performed well in the wheat and barley data sets but not in the maize data set, possibly indicating the different relevance of epistasis in selfing and outcrossing species (Garcia *et al.* 2008).

As the models GBLUP, EGBLUP and HGBLUP capitalize on different genetic effects in prediction, comparing the prediction accuracies of these models provides a first insight into the genetic architecture of a particular trait in a given organism. There is however a risk of misinterpreting local epistatic effects due to “apparent epistasis” (Wood *et al.* 2014), a phenomenon which refers to the fact that multi-locus genotype tags may mimic tight linkage disequilibrium with an unobserved functional variant in the genome for a single marker. In such a case, the HGBLUP model would actually exploit the hidden additive effects of the unobserved variants, instead of the local epistatic effects. The fact that HGBLUP incorporates both additive and local epistatic effects for prediction is of particular relevance for breeders; in cases in which HGBLUP outperforms GBLUP, local epistatic effects or effects that are due to apparent epistasis are expected to be passed on for several generations, very much like additive effects.

## Conclusions

In this study, we investigated the relationship between haplotype-based and marker-based genome-wide prediction models. We provided a mathematical proof that modeling haplotype effects is equivalent to modeling main and local epistatic effects of markers, but with a different covariance matrix. Our simulation study confirmed the theoretical results and revealed that haplotype-based models are superior to marker-based models when there is abundant higher-order local epistasis. The fact that haplotype-based models exploit local epistasis among markers is especially relevant for applied breeding as the local additive-by-additive epistatic effects can last for generations like the additive effects. Thus, haplotype-based models have the potential to increase the accuracy of genomic selection. This hypothesis was partly supported by our empirical data analyses as we observed in certain cases that modeling local epistasis is indeed better than only modeling main effects. Further studies are needed to find out for which traits and in which species the haplotype-based models can be beneficial in genomic selection.

## Acknowledgments

We thank Dr. Yusheng Zhao for fruitful discussions about the mathematical proof of the relationship between LEGBLUP and HGBLUP. Y.J. was supported by the Federal Ministry of Education and Research of Germany (Grant FKZ031B0184A).

## Footnotes

*Communicating editor: J. Ross-Ibarra*Supplemental material available at Figshare: https://doi.org/10.25387/g3.5986933

- Received December 21, 2017.
- Accepted March 11, 2018.

- Copyright © 2018 Jiang
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.