Because it is a mixed-effects model, it contains both fixed effects and random effects. A fixed effect is one whose possible levels are all known and observable, such as sex, age, and breed or variety. A random effect is one whose levels are drawn at random from a population and are therefore uncertain, such as the individual additive genetic effect and the maternal effect (Formula 2).
In the model y = Xβ + Zμ + ε, y is the observation vector; β is the vector of fixed effects; μ is the vector of random effects, which follows a normal distribution μ ~ N(0, G) with mean vector 0 and variance-covariance matrix G; X is the incidence matrix of the fixed effects; Z is the incidence matrix of the random effects; and ε is the vector of random errors, whose elements need not be independent and identically distributed, that is, ε ~ N(0, R). It is also assumed that Cov(μ, ε) = 0, that is, the random effects and the residuals are uncorrelated, so the variance-covariance matrix of y becomes Var(y) = ZGZ' + R. If the term Zμ is absent, the model is a fixed-effects model; if the term Xβ is absent, it is a random-effects model.
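The variance structure can be checked numerically. The following is a minimal sketch, assuming a toy design with 6 records and 3 random-effect levels; the sizes, the values in G and R, and all variable names are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 6, 3                                  # toy sizes (assumed): 6 records, 3 random-effect levels

Z = np.zeros((n, q))
Z[np.arange(n), np.arange(n) % q] = 1.0      # each record belongs to exactly one level

G = 0.5 * np.eye(q)                          # assumed Var(mu)
R = np.diag(rng.uniform(0.5, 1.5, size=n))   # unequal residual variances: errors need not be i.i.d.

V = Z @ G @ Z.T + R                          # Var(y) = Z G Z' + R

# Monte Carlo check: simulate the random part of y = X beta + Z mu + e.
# X beta is a constant shift and does not contribute to Var(y).
draws = 200_000
mu = rng.multivariate_normal(np.zeros(q), G, size=draws)
e = rng.multivariate_normal(np.zeros(n), R, size=draws)
y_random = mu @ Z.T + e
V_empirical = np.cov(y_random, rowvar=False)

print(np.abs(V_empirical - V).max())         # small if the formula holds
```

Records that share a random-effect level get a nonzero covariance through ZGZ', which is how the mixed model relaxes the independence assumption.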
In the traditional linear model, in addition to linearity, the response variable is assumed to satisfy normality, independence, and homogeneity of variance. The linear mixed model retains the assumption that the phenotype is normally distributed but does not require independence or homogeneity of variance. This broadens its range of application, and it is widely used in genomic selection.
C. R. Henderson proposed the statistical method of best linear unbiased prediction (BLUP) long ago, but its application was limited because computing technology lagged behind the theory. Only in the mid-1970s did advances in computing make it practical to apply BLUP in breeding. BLUP incorporates the advantages of the least squares method. When the covariance matrix is known, BLUP is an ideal method for analyzing target traits in animal and plant breeding. Its name and meaning are as follows:
In the linear mixed model, BLUP is the prediction of the random effects, and BLUE (Best Linear Unbiased Estimation) is the estimation of the fixed effects. Fixed effects and random genetic effects can be estimated and predicted in the same system of equations.
The BLUP method was first applied to animal breeding. The traditional animal model is called ABLUP because it solves the mixed model equations (MME) using pedigree information. The MME proposed by Henderson are as follows:
Where X is the incidence matrix of the fixed effects, Z is the incidence matrix of the random effects, and y is the vector of observations. R and G are defined as:
Where A is the pedigree-based genetic relationship matrix, so the equations can be rewritten as:
It can be further transformed into:
In the formula, X, Z and y are all known and the inverse of the relationship matrix, A⁻¹, can be calculated. The value k is calculated as follows:
By estimating the residual and additive variance components and solving the equations, we obtain the fixed effect estimates (BLUE) and the random effect predictions (BLUP).
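The formulas referred to above did not survive in this text, so the following is only a sketch under the standard assumptions for the animal model: R = Iσe², G = Aσa², and k = σe²/σa², which reduce the MME to a system whose random-effect block is Z'Z + kA⁻¹. The three-animal pedigree (two unrelated parents and their offspring), the records, and the variance components are invented for illustration.

```python
import numpy as np

def solve_mme(X, Z, y, A, k):
    """Solve the MME assuming R = I*sigma_e^2 and G = A*sigma_a^2, with k = sigma_e^2 / sigma_a^2."""
    A_inv = np.linalg.inv(A)
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + k * A_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    solution = np.linalg.solve(lhs, rhs)
    n_fixed = X.shape[1]
    return solution[:n_fixed], solution[n_fixed:]   # (BLUE of fixed effects, BLUP of random effects)

# Hypothetical pedigree: animals 1 and 2 are unrelated, animal 3 is their offspring.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
X = np.ones((3, 1))                  # a single fixed effect: the overall mean
Z = np.eye(3)                        # one record per animal
y = np.array([10.0, 12.0, 11.5])     # invented phenotypes
sigma_a2, sigma_e2 = 2.0, 4.0        # assumed variance components
k = sigma_e2 / sigma_a2

blue, blup = solve_mme(X, Z, y, A, k)
print(blue, blup)
```

In practice the variance components are estimated first (for example by REML), and A⁻¹ is built directly from the pedigree rather than by inverting A.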
As the traditional BLUP method, ABLUP constructs the genetic relationship matrix from pedigree information and then obtains breeding values. It was widely used in early animal breeding but is rarely used on its own today.
In 2008, VanRaden proposed the GBLUP (Genomic Best Linear Unbiased Prediction) method, which is based on a G matrix built from all SNP markers. The formula is as follows:
In the formula, p_i is the minor allele frequency at locus i, and Z is the genotype matrix of the individuals.
GBLUP estimates individual breeding values directly by constructing the genomic relationship matrix G to replace the pedigree-based genetic relationship matrix A.
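Since the formula itself is missing here, the sketch below assumes the commonly cited first VanRaden form, G = ZZ' / (2 Σ p_i(1 − p_i)), where Z is the 0/1/2 genotype matrix centered by twice the allele frequency. The function name, the simulated genotypes, and the choice to estimate p_i from the sample are assumptions made for illustration.

```python
import numpy as np

def vanraden_g(M):
    """Genomic relationship matrix from an n x m genotype matrix coded 0/1/2.

    Assumes G = Z Z' / (2 * sum(p_i * (1 - p_i))), with allele frequencies
    p_i estimated from the genotypes themselves.
    """
    p = M.mean(axis=0) / 2.0            # frequency of the counted allele at each locus
    Z = M - 2.0 * p                     # center each locus by its expected genotype
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

rng = np.random.default_rng(42)
freqs = rng.uniform(0.1, 0.5, size=200)                      # hypothetical allele frequencies
M = rng.binomial(2, freqs, size=(10, 200)).astype(float)     # 10 individuals, 200 SNPs
G = vanraden_g(M)
print(G.shape, np.round(np.diag(G), 2))
```

Because every column of Z sums to zero, this G is singular, which is one reason why G is blended with A22 before inversion in single-step methods.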
GBLUP is solved in the same way as the traditional BLUP method; only the construction of the relationship matrix differs, with G used in place of A. Besides VanRaden's genomic relationship matrix there are other ways to construct the G matrix, although VanRaden's method is the most widely used. For example, Yang et al. proposed computing the G matrix with weights:
Goddard et al. proposed computing the G matrix on the basis of the pedigree A matrix:
GBLUP is now widely used in animal and plant breeding and remains popular because it is efficient and robust. However, GBLUP assumes that all markers contribute equally to the G matrix, whereas in a real genome only a few markers have major effects and most have little effect, so GBLUP still has considerable room for improvement.
In animal breeding, for various reasons, many individuals with pedigree records and phenotypic information have not been genotyped. Single-step GBLUP (ssGBLUP) addresses the problem of estimating genomic breeding values for genotyped and non-genotyped individuals in the same breeding population.
ssGBLUP combines traditional BLUP and GBLUP: it integrates the pedigree-based genetic relationship matrix A and the genomic relationship matrix G into a new relationship matrix H, so that the breeding values of genotyped and non-genotyped individuals can be estimated at the same time.
The H matrix is constructed as follows:
In the formula, A and G are the A matrix and the G matrix respectively, and the subscripts 1 and 2 denote non-genotyped and genotyped individuals respectively. Since G may be singular, it cannot always be inverted directly. VanRaden proposed defining G_w = (1-w)G + wA_22, so that the inverse of H can be written as:
Where w is the weighting factor, that is, the proportion of the polygenic genetic effect.
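The H inverse formula is missing from this text. The sketch below assumes the form widely used for single-step methods: H⁻¹ equals A⁻¹ plus (G_w⁻¹ − A_22⁻¹) added to the block of genotyped individuals. The four-animal pedigree (two non-genotyped founders and two genotyped full sibs), the G values, and the function name are hypothetical.

```python
import numpy as np

def h_inverse(A, G, n_genotyped, w=0.05):
    """Inverse of H, assuming the genotyped individuals are ordered last in A.

    Assumed form: H^-1 = A^-1 + [[0, 0], [0, G_w^-1 - A22^-1]],
    with G_w = (1 - w) * G + w * A22.
    """
    A22 = A[-n_genotyped:, -n_genotyped:]
    G_w = (1.0 - w) * G + w * A22            # blending makes G_w invertible
    H_inv = np.linalg.inv(A)
    H_inv[-n_genotyped:, -n_genotyped:] += np.linalg.inv(G_w) - np.linalg.inv(A22)
    return H_inv

# Hypothetical pedigree: animals 1 and 2 are unrelated founders (not genotyped),
# animals 3 and 4 are their full-sib offspring (genotyped).
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
G = np.array([[1.02, 0.45],
              [0.45, 0.98]])                 # invented genomic relationships of animals 3 and 4

H_inv = h_inverse(A, G, n_genotyped=2, w=0.05)
print(np.round(H_inv, 3))
```

Working with H⁻¹ directly is convenient because the MME only need the inverse, and only the genotyped block differs from A⁻¹.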
After the H matrix is constructed, the MME are solved in the same way as in traditional BLUP:
ssGBLUP is usually more accurate than GBLUP because, in addition to the genotyped individuals, it uses pedigree records and phenotypic data. It has become one of the most commonly used animal models in animal breeding. In plant breeding, comprehensive pedigree information is often lacking and the individuals in a population are easy to genotype, so ssGBLUP has not been widely adopted there.
If the relationship matrix used in GBLUP is replaced by the matrix of SNP markers used as covariates, the model is built on the markers and the individuals are then predicted from them. This is the idea of RRBLUP (Ridge Regression Best Linear Unbiased Prediction).
Why not use the least squares method? Least squares treats marker effects as fixed effects, fits regressions on the SNPs segment by segment, and then sums the significant SNP effects in each segment to obtain the individual's genomic breeding value. Because it considers only a few significant SNPs, it is prone to multicollinearity and overfitting.
RRBLUP improves on the least squares method and can estimate the effects of all SNPs. It assumes that marker effects are random and normally distributed, estimates the effect of each marker with a linear mixed model, and then sums the marker effects to obtain the individual's estimated breeding value.
In general, the number of markers in genotype data is much larger than the number of samples (p >> n). RRBLUP is computed over markers, so its running time is longer than that of GBLUP, while its accuracy is the same. (PS: This situation is slowly changing in various countries, especially in the United States, where chip genotype data exist for more than 4 million cattle, so it may be one of the future development directions.)
GBLUP is the representative direct method. It treats individuals as random effects, uses the relationship matrix built from the reference and prediction populations as the variance-covariance matrix, estimates the variance components iteratively, and then solves the mixed model equations to obtain the estimated breeding values of the individuals to be predicted. RRBLUP is the representative indirect method: it first estimates the effect of each marker and then sums the marker effects to obtain the breeding value. The following figure compares the similarities and differences between the two methods:
The direct method estimates the breeding value u itself, while the indirect method estimates the sum of marker effects Mg. When K = MM' and the marker effects g follow independent normal distributions (as shown in the figure above), the breeding values estimated by the two methods are the same, that is, û = Mĝ.
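The equivalence can be verified numerically. This is a minimal sketch, assuming M is an individuals-by-markers matrix, K = MM', a single overall mean as the only fixed effect, and a known variance ratio λ = σe²/σg²; the data are simulated and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 8, 30                                         # fewer individuals than markers
M = rng.integers(0, 3, size=(n, p)).astype(float)    # simulated 0/1/2 genotypes
y = rng.normal(size=n)                               # simulated phenotypes
lam = 2.0                                            # assumed ratio sigma_e^2 / sigma_g^2
ones = np.ones((n, 1))

# Indirect method (RRBLUP): solve a (1 + p) x (1 + p) system for the mean and marker effects.
lhs = np.block([[ones.T @ ones, ones.T @ M],
                [M.T @ ones, M.T @ M + lam * np.eye(p)]])
rhs = np.concatenate([ones.T @ y, M.T @ y])
solution = np.linalg.solve(lhs, rhs)
g_hat = solution[1:]
u_rrblup = M @ g_hat                                 # breeding value = sum of marker effects

# Direct method (GBLUP): work with the n x n matrix K = M M'.
K = M @ M.T
V = K + lam * np.eye(n)                              # Var(y) up to the factor sigma_g^2
one = ones[:, 0]
mu_hat = (one @ np.linalg.solve(V, y)) / (one @ np.linalg.solve(V, one))
u_gblup = K @ np.linalg.solve(V, y - mu_hat)

print(np.abs(u_rrblup - u_gblup).max())              # agrees up to rounding error
```

The two routes differ only in the size of the system solved: p + 1 equations for the marker-based form versus n for the individual-based form, which is why GBLUP is cheaper when p >> n.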
Genomic selection methods based on BLUP theory assume that all markers have the same genetic variance, but in fact only a few SNPs across the genome have effects and are linked to QTLs affecting the trait, and most SNPs have no effect. When the variance of the marker effects is assigned a prior distribution, the model becomes a Bayesian method. The common Bayesian methods were also first put forward by Meuwissen (the person who proposed GS) and mainly include BayesA, BayesB, BayesC, BayesCπ, BayesDπ, Bayesian LASSO and so on.
BayesA assumes that every SNP has an effect that follows a normal distribution, and that the variance of the effect follows a scaled inverse chi-square distribution. The BayesA method presets two parameters, the degrees of freedom v and the scale parameter S, and uses Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique, to compute the marker effects.
BayesB assumes that only a few SNPs have effects, whose variances follow an inverse chi-square distribution, while most SNPs have no effect (which is closer to the real situation across the genome). The BayesB method uses a mixture distribution as the prior for the variance of the marker effects, which makes it difficult to construct the full conditional posterior distributions of the marker effects and variances, so BayesB uses Gibbs sampling together with MH (Metropolis-Hastings) sampling to sample the marker effects and variances jointly.
The BayesB method introduces a parameter π. The variance of a marker effect is assumed to be zero with probability π and to follow an inverse chi-square distribution with probability 1-π. When π is 0, all SNPs have effects and the method is equivalent to BayesA. When genetic variation is controlled by a few QTLs with large effects, the BayesB method is highly accurate.
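The mixture prior can be illustrated by sampling from it. This is only a sketch of the prior, not of the full MCMC sampler; the function name and the values of the degrees of freedom and scale (4.2 and 0.01) are assumptions chosen for illustration.

```python
import numpy as np

def sample_bayesb_variances(n_markers, pi, nu, s2, rng):
    """Draw marker-effect variances from the BayesB prior.

    With probability pi the variance is exactly 0; otherwise it follows a
    scaled inverse chi-square distribution, drawn as nu * s2 / chi2(nu).
    """
    has_effect = rng.random(n_markers) >= pi
    variances = np.zeros(n_markers)
    variances[has_effect] = nu * s2 / rng.chisquare(nu, size=int(has_effect.sum()))
    return variances

rng = np.random.default_rng(1)
var_b = sample_bayesb_variances(10_000, pi=0.95, nu=4.2, s2=0.01, rng=rng)  # BayesB-like prior
var_a = sample_bayesb_variances(10_000, pi=0.0, nu=4.2, s2=0.01, rng=rng)   # pi = 0: every marker has an effect
effects_b = rng.normal(0.0, np.sqrt(var_b))   # effects are normal given their variance

print((var_b == 0).mean(), (var_a > 0).all())
```

Setting π to 0 gives every marker a nonzero variance, which is the BayesA prior; a large π concentrates the genetic variance on a few markers.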
The parameter π in BayesB is set by the user, which introduces a subjective influence on the results. BayesC, BayesCπ, BayesDπ and other methods were developed to improve BayesB. The BayesC method treats π as an unknown parameter with a uniform prior U(0, 1) and assumes that SNPs with effects have different variances. The BayesCπ method builds on BayesC by assuming that all SNP effects share the same variance, and is solved by Gibbs sampling. The BayesDπ method treats both π and the scale parameter S as unknown. Assuming that the prior and posterior distributions of S are Gamma distributions, with a Gamma(1, 1) prior, S can be sampled directly from its posterior distribution.
The following figure illustrates the variance distributions of marker effects assumed by the different methods:
Bayesian LASSO (Least Absolute Shrinkage and Selection Operator) assumes that the marker effects follow a normal distribution whose variance follows an exponential distribution, which amounts to a Laplace distribution for the marker effects. Bayesian LASSO differs from BayesA in the distribution assumed for the marker effects: BayesA assumes that they follow a normal distribution. The Laplace distribution allows extreme values, very large or very small, to occur with greater probability.
As the Bayesian methods above show, the key difficulty of the Bayesian approach lies in making reasonable assumptions about the prior distributions of the hyperparameters.
Bayesian models usually have more parameters to estimate than BLUP methods, which improves prediction accuracy but also increases the amount of computation. MCMC needs tens of thousands of iterations, and every iteration must re-estimate all marker effects. The process is sequential and cannot be parallelized, so it consumes a great deal of computing time, which limits its use in animal and plant breeding practice, where timeliness matters.
To improve speed and accuracy, many researchers have optimized the prior assumptions and parameters of the Bayesian methods and proposed fastBayesA, BayesSSVS, fBayesB, emBayesR, EBL, BayesTA and others. At present, however, the most commonly used Bayesian methods are still those described above.
The prediction accuracy of a model depends largely on whether its assumptions fit the genetic architecture of the phenotype being predicted. In general, after parameter tuning the Bayesian methods are slightly more accurate than the BLUP methods, but they are slower and less robust. We should therefore weigh the pros and cons according to our own needs and make a reasonable choice. (PS: BLUP methods are what is used in animal breeding and in actual production.)
In addition to parametric methods based on BLUP and Bayesian theory, genomic selection also includes semi-parametric methods (such as RKHS, see next article) and nonparametric methods such as machine learning (ML). Machine learning is a branch of artificial intelligence that focuses on predicting the outcomes of unobserved individuals (unlabeled data) by applying highly flexible algorithms to the known attributes (features) and outcomes (labeled data) of observed individuals. The outcomes can be continuous, categorical, or binary. In animal and plant breeding, the labeled data correspond to the training population with genotypes and phenotypes, the unlabeled data correspond to the test population, and the features used for prediction are the SNP genotypes.
Compared with traditional statistical methods, machine learning methods have many advantages:
The support vector machine (SVM) is a typical nonparametric method and a supervised learning method. It can be used for regression as well as classification. Based on the principle of structural risk minimization, SVM balances the fit to the training samples against the complexity of the model. Especially when we do not know enough about our own population data, SVM may be an alternative method for genomic prediction.
The basic idea of SVM is to find the separating hyperplane that divides the training data correctly and has the largest geometric margin. In support vector regression (SVR), an approximation error is used in place of the margin between the optimal separating hyperplane and the support vectors used in SVM. SVR uses an ε-insensitive linear loss function: when the difference between the observed value and the predicted value is less than ε, the error is zero. The goal of SVR is to minimize both the empirical risk and the squared norm of the weights. That is, the hyperplane is estimated by minimizing the empirical risk.
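The ε-insensitive loss and the SVR objective can be written out in a few lines. This is a minimal sketch assuming a linear model f(x) = w·x + b and the usual primal objective 0.5‖w‖² + C Σ loss; the function names and example values are illustrative.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

def svr_objective(w, b, X, y, C, epsilon):
    """Assumed linear SVR objective: 0.5 * ||w||^2 + C * sum of epsilon-insensitive losses."""
    predictions = X @ w + b
    return 0.5 * w @ w + C * epsilon_insensitive_loss(y, predictions, epsilon).sum()

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))   # errors within 0.1 cost nothing

X = np.array([[1.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, 3.0])
print(svr_objective(np.array([1.0, 0.0]), 0.0, X, y, C=2.0, epsilon=0.5))
```

The constant C trades off the flatness of the function (small ‖w‖) against tolerance for errors larger than ε.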
Figure 1 below compares SVM regression (panel A) with SVM classification (panel B). Here ξ and ξ* are slack variables, C is a user-defined constant, w is the weight vector, and φ denotes the feature-space mapping.
When SVM is used for prediction, high-dimensional, large data sets make the computation very complex. Kernel functions greatly simplify the inner products and thereby avoid the curse of dimensionality. The choice of kernel function, which must take the distribution of the training samples into account, is therefore the key to SVM prediction. The most commonly used kernel functions are the linear kernel, the Gaussian kernel (RBF), and the polynomial kernel. Of these, RBF is the most adaptable and can be applied to training samples of any distribution, given an appropriate width parameter. Although it sometimes leads to overfitting, it is still the most widely used kernel function.
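As an illustration of how a kernel replaces explicit inner products in a high-dimensional feature space, here is a minimal sketch of the Gaussian (RBF) kernel, assuming the parameterization K(x, x') = exp(−γ‖x − x'‖²), where γ plays the role of the width parameter; the genotype data are simulated.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dist = np.maximum(sq_dist, 0.0)      # guard against tiny negative values from rounding
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(5, 50)).astype(float)   # 5 individuals, 50 simulated SNPs
K = rbf_kernel(X, gamma=1.0 / X.shape[1])            # 1 / number of features is one common heuristic
print(np.round(K, 3))
```

A larger γ makes the kernel more local (only very similar genotypes are treated as related), which increases flexibility but also the risk of overfitting.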
Ensemble learning is also one of the most common families of algorithms in machine learning. It trains a series of learners and combines their results according to some rule, thereby obtaining better results than a single learner. In plain terms, a group of weak learners is combined into a strong learner. In the field of GS, random forest (RF) and gradient boosting machine (GBM) are two widely used ensemble learning algorithms.
RF is an ensemble method based on decision trees, that is, a predictor made up of multiple decision trees. In genomic prediction, RF, like SVM, can be used as either a classification model or a regression model. When it is used for classification, the individuals in the population must first be divided into classes according to their phenotypic values. The RF algorithm can be divided into the following steps:
Finally, RF combines the outputs of the classification trees or regression trees to make a prediction. In classification, the class of an unobserved individual is predicted by counting the votes (usually one vote per decision tree) and choosing the class with the most votes. In regression, the outputs of the ntree trees are averaged.
Two important factors affect the results of an RF model. The first is the number of covariates randomly sampled at each node (mtry, that is, the number of SNPs). When building regression trees, mtry defaults to p/3 (where p is the number of predictors), and when building classification trees, mtry defaults to √p. The second is the number of decision trees. Many studies show that more trees are not always better, and building trees is also very time-consuming. When GS is applied to plant breeding, the ntree of RF is usually set between 500 and 1000.
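The two ingredients described above, the default mtry and the way tree outputs are combined, can be sketched in a few lines. The defaults below (p/3 for regression, √p for classification) follow the common convention of random forest implementations and are an assumption here; the function names are hypothetical, and no trees are actually grown.

```python
import math
from collections import Counter

def default_mtry(p, task):
    """Number of SNPs sampled at each node under the assumed conventional defaults."""
    if task == "regression":
        return max(p // 3, 1)
    if task == "classification":
        return max(int(math.sqrt(p)), 1)
    raise ValueError("task must be 'regression' or 'classification'")

def aggregate(tree_outputs, task):
    """Combine per-tree outputs: average for regression, majority vote for classification."""
    if task == "regression":
        return sum(tree_outputs) / len(tree_outputs)
    return Counter(tree_outputs).most_common(1)[0][0]

print(default_mtry(3000, "regression"))         # SNPs tried per split for 3000 markers
print(default_mtry(10000, "classification"))
print(aggregate([1.0, 2.0, 3.0], "regression"))
print(aggregate(["high", "low", "high"], "classification"))
```

With tens of thousands of SNPs, mtry is usually treated as a tuning parameter rather than left at its default.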
When GBM uses decision trees as base learners, it is called a gradient boosting decision tree (GBDT), which, like RF, contains multiple decision trees. There are, however, many differences between the two. The biggest is that RF is based on the bagging algorithm, which means that it votes over multiple results, or simply averages them, to choose the final result. GBDT is based on the boosting algorithm: in each iteration it builds a weak learner that compensates for the shortcomings of the current model. GBM handles different learning tasks by using different loss functions.
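The boosting idea can be made concrete with a deliberately small sketch: one-split regression trees (stumps) fitted to the current residuals under squared-error loss, for which the residual is the negative gradient. This is an illustration under these assumptions, not a full GBDT implementation; the data and all names are invented.

```python
import numpy as np

def fit_stump(X, residual):
    """Find the single split (feature j, threshold t) that best fits the residuals."""
    best = (np.inf, 0, 0.0, residual.mean(), residual.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:            # dropping the max keeps both sides non-empty
            left = X[:, j] <= t
            left_mean, right_mean = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - left_mean) ** 2).sum() + ((residual[~left] - right_mean) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, X):
    j, t, left_mean, right_mean = stump
    return np.where(X[:, j] <= t, left_mean, right_mean)

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the residuals of the current model and adds a damped correction."""
    f0 = y.mean()
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                          # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(X, residual)
        stumps.append(stump)
        pred = pred + learning_rate * predict_stump(stump, X)
    return f0, stumps, learning_rate

def boost_predict(model, X):
    f0, stumps, learning_rate = model
    pred = np.full(X.shape[0], f0)
    for stump in stumps:
        pred = pred + learning_rate * predict_stump(stump, X)
    return pred

rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(60, 20)).astype(float)  # simulated SNP genotypes
y = 1.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.3, size=60)
model = gradient_boost(X, y)
mse_mean = np.mean((y - y.mean()) ** 2)
mse_boost = np.mean((y - boost_predict(model, X)) ** 2)
print(mse_mean, mse_boost)
```

Unlike bagging, the trees here are built sequentially and each depends on the errors of the ones before it, which is why boosting cannot be parallelized across trees in the same way as RF.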
Although many studies have applied classic machine learning algorithms to genomic prediction, the gain in accuracy is still limited and the methods are time-consuming. Among the many machine learning algorithms, no single method improves predictive ability across the board, and the optimization methods and parameters differ from one application to another. Compared with classical machine learning algorithms, deep learning (DL) may be a better choice for future genomic prediction.
Traditional machine learning algorithms, such as SVM, are generally shallow models. In addition to the input and output layers, a deep learning model contains many hidden layers, and the depth of this structure is what gives deep learning its name. The essence of DL is to learn more useful features by building a machine learning model with many hidden layers and training it on massive data, thereby improving the accuracy of classification or prediction. The modeling process of a DL algorithm can be divided into the following three steps:
In the field of GS there are many DL algorithms, including the Multilayer Perceptron (MLP), the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN).
MLP is an artificial neural network (ANN) model that maps a set of inputs to a single output. An MLP includes at least one hidden layer, as shown in Figure 2 below. In addition to an input layer and an output layer, the network shown there includes four hidden layers, in which every node is connected to the nodes of the previous layer with its own weight (w). Finally, the input is mapped to the output through activation functions.
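A forward pass through such a network can be sketched as follows, assuming ReLU activations in the hidden layers, a linear output layer for a quantitative trait, and randomly initialized weights; the layer sizes (10 SNP inputs, four hidden layers of 8 nodes) are illustrative, and no training is shown.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Propagate inputs through the hidden layers (ReLU) and a linear output layer."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)        # weighted sum from the previous layer, then activation
    return activation @ weights[-1] + biases[-1]     # linear output for a continuous phenotype

rng = np.random.default_rng(11)
layer_sizes = [10, 8, 8, 8, 8, 1]                    # input, four hidden layers, output (assumed sizes)
weights = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

genotypes = rng.integers(0, 3, size=(5, 10)).astype(float)   # 5 individuals, 10 simulated SNPs
predictions = mlp_forward(genotypes, weights, biases)
print(predictions.shape)
```

Training would adjust the weights and biases by backpropagation to minimize a loss such as the mean squared error between predicted and observed phenotypes.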
CNN is a feedforward neural network with convolution operations and a deep structure. It has the ability of representation learning and can classify its input in a shift-invariant way according to its hierarchical structure. The hidden layers of a CNN are of three types: convolution layers, pooling layers, and fully connected layers, each with a different function. The convolution layer mainly extracts features from the input data; the pooling layer performs feature selection and information filtering on the feature maps output by the convolution layer; and the fully connected layer is similar to the hidden layer of an ANN and is generally located in the last part of the hidden layers of the CNN. The structure of a CNN is shown in Figure 3 below.
It should be noted that deep learning is not a panacea. The prerequisite for using DL is a training data set that is large enough and of good quality. According to GS studies in animals and plants, some DL algorithms have no obvious advantage over traditional genomic prediction methods. However, there is consistent evidence that DL algorithms capture nonlinear patterns more effectively. DL can therefore be combined with traditional GS models to assist breeding using data from different sources. In short, DL will become increasingly important as breeding data grow in volume.
The above are the common prediction models in GS, although different sources may classify them differently. The following briefly introduces some important methods not covered above, some of which are extensions of the three classes of methods above.
Reproducing kernel Hilbert space (RKHS) regression is a typical semiparametric method. It uses a Gaussian kernel function to fit the following model:
Where α follows a multivariate normal distribution with mean 0 and covariance matrix K_h σ_α²; ε ~ N(0, I_n σ²); K_h is a kernel function representing the correlation between individuals, in which d_ij is the squared Euclidean distance between individuals i and j computed from their genotypes, and the smoothing parameter h is defined as half of the mean value of d_ij.
The RKHS model can be solved with a Gibbs sampler in a Bayesian framework or as a linear mixed model.
GBLUP is still widely used in animal and plant breeding. It assumes that all markers have the same effect. In practice, however, any marker that is unrelated to the target trait but is used to estimate the genetic relationship matrix dilutes the contribution of the QTLs. Many studies have improved on it, mainly in the following respects:
Following these ideas, sBLUP (SUPER BLUP) further refines TABLUP for traits controlled by a few genes, so that only the markers associated with the trait are used to construct the genomic relationship matrix.
To account for population structure in the genetic relationship matrix, individuals can be grouped according to the similarity of their genetic relationships, and the compressed groups can then be used as covariates in place of the original individuals, with individuals in the same group sharing the same genetic relationship. When the genomic relationship matrix is constructed, the genetic effect of the group can thus be used instead of the individual value, and each individual is predicted from the group it belongs to. This is called cBLUP (compressed BLUP).
All of the ideas above involve integrating validated and newly discovered loci into the model. Where do these loci come from? The most common source is, naturally, genome-wide association studies (GWAS). There is a natural connection between GS and GWAS. Incorporating significantly associated loci from GWAS into GS has the direct benefit of maintaining prediction ability over multiple generations and the indirect benefit of increasing the number of validated variants.
The following figure compares various methods of GWAS-assisted genomic prediction. A represents molecular marker-assisted selection (MAS), in which only a few major loci are used; B represents the classic GS method, in which all markers are used and all marker effects are treated equally; C assigns marker scores according to weights; D treats significantly associated markers as fixed effects; E treats significantly associated markers as an additional random effect (with its own kernel); F divides the chromosomes into segments, and the G matrix built from each segment is assigned its own random effect.
The results of GWAS-assisted genomic prediction are more complicated, and simply including association signals in the model does not necessarily improve accuracy. The actual performance depends on the genetic architecture of the trait.
GS has two different strategies for estimating genetic effects. One focuses on estimating breeding values, that is, the additive effects passed from parents to offspring. Non-additive effects (such as dominance and epistasis) are tied to specific genotypes and cannot be inherited directly, so when variance components are estimated, non-additive effects are usually treated as noise together with random environmental effects. The other strategy considers both additive and non-additive effects and is usually used to study heterosis. Heterosis is generally considered to result from both dominance and epistasis, so if non-additive effects are substantial and are simply ignored, the genetic estimates will be biased.
The use of heterosis is an important research topic in plant breeding, especially in major crops such as rice and maize. Including non-additive genetic effects in GS models for hybrid prediction is also one of the active topics in genomic prediction for crop breeding.
Of course, the composition of heterosis varies with the trait, and the genomic prediction of different traits needs to be combined with the identification of heterotic QTLs. General combining ability (GCA, which reflects additive effects) and specific combining ability (SCA, which reflects non-additive effects) may arise from different genetic effects, so GCA and SCA should be considered separately when predicting F1 hybrids. The GCA model can be based on GBLUP, with the focus on constructing the genomic relationship matrix. There are two approaches to the SCA model: one is to include the panel of heterozygous SNP loci in the GBLUP model as fixed effects; the other is to use nonlinear models, such as Bayesian and machine learning methods. Machine learning is reported to perform similarly to conventional statistical models for low-heritability traits under additive models, but to perform better under non-additive models.
Traditional GS models often consider only a single phenotypic trait in a single environment and ignore the relationships among multiple traits or multiple environments that exist in practice. Some studies have improved the accuracy of genomic prediction by modeling multiple traits or multiple environments simultaneously. Taking the multi-trait (MT) model as an example, the multivariate (MV) model can be expressed by the following formula:
Where y = [y_1^T, y_2^T, …, y_s^T]^T; b = [b_1^T, b_2^T, …, b_s^T]^T; a = [a_1^T, a_2^T, …, a_s^T]^T; ε = [ε_1^T, ε_2^T, …, ε_s^T]^T, and s is the number of traits. The non-genetic effect b is a fixed effect, while the additive effect a and the residual ε are random effects that follow multivariate normal distributions: a ~ N(0, G_a0 ⊗ G σ_a²), ε ~ N(0, R_ε ⊗ I_m σ_ε²), where G is the G matrix, ⊗ is the Kronecker product, m is the number of phenotypic observations, I_m is the m × m identity matrix, and X and Z_a are the incidence matrices of the fixed effects and the random additive effects, respectively. The covariance matrices G_a0 of the additive effects and R_ε of the residuals can be expressed as:
σ_ai² and σ_εi² are the additive and residual variances of the i-th trait, respectively. ρ_aij and ρ_εij are the additive and residual correlations between traits i and j, respectively.
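The Kronecker structure can be made concrete with a small example. This sketch assumes two traits measured on the same three individuals, with the trait-level covariance matrices built from variances and correlations as covariance_ij = ρ_ij σ_i σ_j; all numbers and names are invented for illustration.

```python
import numpy as np

def trait_covariance(variances, correlation):
    """Trait-level covariance matrix: element (i, j) is rho_ij * sd_i * sd_j."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    return np.asarray(correlation, dtype=float) * np.outer(sd, sd)

# Hypothetical genomic relationship matrix for three individuals.
G = np.array([[1.00, 0.50, 0.25],
              [0.50, 1.00, 0.50],
              [0.25, 0.50, 1.00]])

G_a0 = trait_covariance([2.0, 0.5], [[1.0, 0.6], [0.6, 1.0]])   # additive (co)variances of the two traits
R_e = trait_covariance([1.0, 1.5], [[1.0, 0.2], [0.2, 1.0]])    # residual (co)variances of the two traits

var_a = np.kron(G_a0, G)            # covariance of the stacked additive effects [a_1; a_2]
var_e = np.kron(R_e, np.eye(3))     # covariance of the stacked residuals [e_1; e_2]

print(var_a.shape)
print(np.round(var_a[:3, 3:], 3))   # cross-trait block = additive covariance * G
```

The off-diagonal blocks are what allow records on one trait to inform predictions for a genetically correlated trait.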
Multi-trait selection is generally used when traits share part of their genetic architecture, that is, when they are genetically correlated. It is especially suitable for low-heritability traits (that are correlated with high-heritability traits) or for traits that are difficult to measure.
The growing environment of crops is not as easy to control as that of animals, and most traits are quantitative and easily influenced by the environment. Multi-environment trials therefore play an important role, and the genotype-by-environment interaction (G × E) is also a focus of current genomic selection research.
In addition to GBLUP, multivariate models can also be based on Bayesian linear regression or on nonlinear machine learning methods.
As we know, genes are expressed in phenotypic characteristics only after transcription, translation, and a series of regulatory steps, so the genome reflects the potential for a phenotype only to a certain extent. With the development of multi-omics technology, integrating multi-omics data for genomic prediction has also become an important direction of GS research.
In plant breeding, besides genomics, transcriptomics and metabolomics are the two omics layers studied most in GS at present. Transcriptome-based prediction uses the correlation between gene expression and traits, and metabolome-based prediction uses the correlation between traits and the levels of small molecules that regulate the phenotype; both may improve prediction for some specific traits. The best approach is to integrate the data of every omics layer into the model, but this greatly increases the complexity of the model.
The accuracy of phenotyping directly affects model construction. For some complex traits, observing and recording by eye is clearly not advisable, and phenotypic surveys are time-consuming, laborious, and very costly. High-throughput phenomics is therefore also an important direction for the development of GS. The phenotype covers a wide range: when a trait cannot be measured simply, multi-omics data such as proteomics and metabolomics can also be used.
Considering the cost-effectiveness, omics technology is still in the research stage in animal and plant breeding, but it represents the future application direction.