Bayes' theorem is a theorem in probability theory that relates the conditional and marginal probability distributions of random variables.
Usually, the probability of event A given event B differs from the probability of event B given event A; however, there is a definite relationship between the two, and Bayes' theorem is the statement of this relationship. One purpose of the Bayes formula is to derive a fourth probability from three known probabilities.
The formula is:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

where P(A|B) is the probability (conditional probability) that event A occurs given that event B has occurred. In Bayes' theorem, each term has a conventional name: P(A) is the prior probability, P(B|A) is the likelihood, P(B) is the normalizing constant (the evidence), and P(A|B) is the posterior probability.
According to these terms, Bayes' theorem can be expressed as:
Posterior probability = (likelihood × prior probability) / normalizing constant
In other words, the posterior probability is proportional to the product of the prior probability and the likelihood.
At the same time, the denominator P(B) can be decomposed by the law of total probability:

$$P(B) = \sum_{i} P(B|A_i)P(A_i)$$
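As a quick check of the formula, here is a minimal numeric sketch in Python. The numbers (a 1% spam prior and the two likelihoods for a hypothetical word) are made up purely for illustration.

```python
# Hypothetical numbers: 1% of emails are spam (prior); the word "offer"
# appears in 60% of spam emails and in 5% of normal emails (likelihoods).
p_spam = 0.01                # P(A): prior
p_word_given_spam = 0.60     # P(B|A): likelihood
p_word_given_ham = 0.05      # P(B|not A)

# Denominator P(B) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)     # ~0.108
```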
If P(X, Y|Z) = P(X|Z)P(Y|Z), or equivalently P(X|Y, Z) = P(X|Z), then events X and Y are said to be conditionally independent given event Z; that is, once Z has occurred, whether X occurs has nothing to do with whether Y occurs.
Applied to natural language processing, this means that, conditioned on the category of an article, the features (words) of the article are assumed to be mutually independent. Generally speaking, within a given category the words would still be correlated (so the assumption does not actually hold), which makes it a very strong assumption, but it makes the problem easy to solve.
Let the input space $\mathcal{X} \subseteq \mathbb{R}^n$ be a set of n-dimensional vectors and the output space be the set of class labels $\mathcal{Y} = \{c_1, c_2, \dots, c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. X is a random variable defined on the input space and Y is a random variable defined on the output space; P(X, Y) is the joint probability distribution of X and Y. The training data set

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

is generated independently and identically distributed according to P(X, Y), so the naive Bayes model is also a generative model.
The naive Bayes algorithm learns the joint probability distribution P(X, Y) from the training set; specifically, it learns the prior probability distribution and the conditional probability distribution. The prior probability distribution is

$$P(Y = c_k),\quad k = 1, 2, \dots, K$$

and the conditional probability distribution is

$$P(X = x|Y = c_k) = P(X^{(1)} = x^{(1)}, \dots, X^{(n)} = x^{(n)}|Y = c_k),\quad k = 1, 2, \dots, K$$
The joint probability distribution P(X, Y) = P(X|Y)P(Y) is then obtained from these two probabilities.
The conditional probability distribution $P(X = x|Y = c_k)$ has an exponential number of parameters, so estimating it directly is infeasible. Suppose the j-th feature $x^{(j)}$ can take $S_j$ distinct values, j = 1, 2, ..., n, and Y can take K values; then the number of parameters is $K\prod_{j=1}^{n} S_j$.
Estimating this exponential number of parameters is not feasible in practice. Therefore, the naive Bayes algorithm makes an assumption about the features, namely the conditional independence assumption on the conditional probability distribution, which is a strong assumption. With this assumption, the parameter estimation becomes feasible, and this is the origin of the name "naive" Bayes. Under the assumption, the conditional probability factorizes as

$$P(X = x|Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)}|Y = c_k)$$

and the number of parameters drops to $K\sum_{j=1}^{n} S_j$.
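A tiny Python sketch of this parameter-count difference, with hypothetical sizes (n binary features and K classes), counting table entries:

```python
# Hypothetical sizes: n binary features (S_j = 2 for every j) and K classes.
n, K, S = 10, 3, 2

full_joint_params = K * S ** n    # without the assumption: K * prod_j S_j
naive_params = K * S * n          # with the assumption:    K * sum_j S_j

print(full_joint_params)          # 3072
print(naive_params)               # 60
```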
In classification with the naive Bayes algorithm, for a given input x the posterior probability distribution is computed with the learned model, and the class with the largest posterior probability is output as the class of x. The posterior probability is computed according to Bayes' theorem:

$$P(Y = c_k|X = x) = \frac{P(Y = c_k)\prod_{j} P(X^{(j)} = x^{(j)}|Y = c_k)}{\sum_{k} P(Y = c_k)\prod_{j} P(X^{(j)} = x^{(j)}|Y = c_k)}$$

The formula above is the posterior probability distribution. For the same input x the denominator is the same for every class, and the final output is the class with the largest posterior probability, so we can simplify: only the numerators need to be compared. The final class output is therefore:

$$y = \arg\max_{c_k} P(Y = c_k)\prod_{j} P(X^{(j)} = x^{(j)}|Y = c_k)$$
If we take the logarithm of the product on the right, the product becomes a sum, which is simpler to compute (addition is always cheaper than multiplication). This gives a variant of the above formula:

$$y = \arg\max_{c_k}\left[\log P(Y = c_k) + \sum_{j}\log P(X^{(j)} = x^{(j)}|Y = c_k)\right]$$

At the same time, this form can also be viewed as a linear model over the log-probabilities with all weight coefficients equal to 1.
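Here is a minimal sketch of that decision rule in Python (NumPy). The arrays `log_prior` and `log_cond` are assumed to be pre-computed log probabilities; the names and shapes are illustrative, not part of any particular library.

```python
import numpy as np

def predict_class(log_prior, log_cond):
    """Pick the class with the largest log posterior (up to a constant).

    log_prior: shape (K,), log P(Y = c_k)
    log_cond:  shape (K, n), log P(X^(j) = x^(j) | Y = c_k) for the
               observed feature values of one sample
    """
    scores = log_prior + log_cond.sum(axis=1)
    return int(np.argmax(scores))
```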
After introducing the probability model of naive Bayes, our main remaining problem is how to estimate the parameters of this model; once the parameters are estimated, we can predict the class of an input vector x. Different types of naive Bayes are used to solve for these parameters, and three types are introduced in detail: Bernoulli naive Bayes, multinomial naive Bayes, and Gaussian naive Bayes. The different types solve for the parameters differently; the fundamental reason is that the assumed distribution of the conditional probability P(X = x|Y = c_k) is different, that is, given the class, the assumed distribution of X is different: the Bernoulli type assumes a Bernoulli distribution (more precisely, a multivariate Bernoulli distribution), the multinomial type assumes a multinomial distribution, and the Gaussian type assumes a Gaussian distribution (more precisely, a multivariate Gaussian distribution). We now go through the three types.
Bernoulli naive Bayes should, strictly speaking, be called "multivariate Bernoulli naive Bayes", since it assumes that P(X = x|Y = c_k) is a multivariate Bernoulli distribution. Before we look at the multivariate Bernoulli distribution, let's first introduce the (one-dimensional) Bernoulli distribution.
The Bernoulli distribution, also known as the two-point distribution or 0-1 distribution, is a discrete probability distribution. A random variable X follows the Bernoulli distribution with parameter p (0 < p < 1) if it takes the value 1 with probability p and the value 0 with probability 1 − p.
The simplest example is tossing a coin: the result is either heads or tails.
The two cases can be combined into a single expression,

$$P(X = x) = p^x (1 - p)^{1 - x},\quad x \in \{0, 1\}$$

since when x = 1 the probability is P(X = 1) = p and when x = 0 it is P(X = 0) = 1 − p. Writing it this way is convenient because taking the logarithm later turns the power into a multiplication.
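A two-line Python sketch of this combined probability mass function (the parameter value 0.7 is just an example):

```python
def bernoulli_pmf(x, p):
    # p^x * (1 - p)^(1 - x): equals p when x = 1 and 1 - p when x = 0
    return p ** x * (1 - p) ** (1 - x)

print(bernoulli_pmf(1, 0.7))  # 0.7
print(bernoulli_pmf(0, 0.7))  # 0.3 (up to float rounding)
```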
Now that we know what the Bernoulli distribution is, let's look at the multivariate Bernoulli distribution.
The multivariate Bernoulli distribution, in plain terms, carries out several different Bernoulli trials at the same time: x is a vector, and the parameter p is also a vector whose components are the parameters of the individual Bernoulli trials:

$$P(X = x) = \prod_{i=1}^{n} p_i^{x_i}(1 - p_i)^{1 - x_i}$$
Bernoulli naive Bayes assumes that the document generation model P(X = x|Y = c_k) is a multivariate Bernoulli distribution. Because of the feature-independence assumption made earlier, it takes this product form, in which x is a binary indicator vector (each dimension has a value of 0 or 1) indicating whether the feature of that dimension appears or not. The feature set has n features, which determines the dimension of x.
Because of the independence between features, the multivariate Bernoulli distribution becomes a product of Bernoulli distributions. Note that, because each feature is a 0-1 Bernoulli variable, the probability that it appears is p and the probability that it does not appear is 1 − p. After the parameters of the model have been estimated, a feature that does not appear must still contribute its "absent" factor (1 − p) to the product; multiplying the two vectors directly does not give the final result.
The corresponding Bernoulli naive Bayes model is:

$$P(Y = c_k|X = x) = \frac{P(Y = c_k)\prod_{i=1}^{n} p_{ki}^{x_i}(1 - p_{ki})^{1 - x_i}}{P(X = x)}$$

where $p_{ki} = P(x_i = 1|Y = c_k)$.
To simplify the computation we can ignore the denominator. Although the resulting values are then no longer true probabilities, the order relationship among the posterior probabilities of the same sample is unchanged. Likewise, taking the logarithm of both sides leaves the order of the posterior probabilities unchanged. Therefore,

$$y = \arg\max_{c_k}\left[\log P(Y = c_k) + \sum_{i=1}^{n}\left(x_i\log p_{ki} + (1 - x_i)\log(1 - p_{ki})\right)\right]$$
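A minimal NumPy sketch of this scoring rule, assuming the prior and the per-class, per-feature probabilities `p[k, i] = P(x_i = 1 | Y = c_k)` have already been estimated (the names and shapes are illustrative):

```python
import numpy as np

def bernoulli_nb_predict(x, prior, p):
    """x: binary vector of length n; prior: shape (K,); p: shape (K, n)."""
    x = np.asarray(x)
    # Both the "feature present" and the "feature absent" factors are used.
    log_post = (np.log(prior)
                + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1))
    return int(np.argmax(log_post))
```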
Now that we understand the multivariate Bernoulli distribution, the next step is to estimate the parameters.
The process of parameter estimation is also the learning process of the naive Bayes classifier, and maximum likelihood estimation can be used. The maximum likelihood estimate of the prior probability is

$$P(Y = c_k) = \frac{\sum_{j=1}^{N} I(y_j = c_k)}{N},\quad k = 1, 2, \dots, K$$

where I(·) is the indicator function: I(x) = 1 if the condition x is true and I(x) = 0 if it is false. In words, this probability equals the proportion of samples of class $c_k$ in a data set of N samples.
The maximum likelihood estimate of the conditional probability is:

$$p_{ki} = P(x_i = 1|Y = c_k) = \frac{\sum_{j=1}^{N} I(x_{j,i} = 1, y_j = c_k)}{\sum_{j=1}^{N} I(y_j = c_k)}$$

In words, this conditional probability equals the proportion of samples of class $c_k$ (a subset of the data set) whose i-th feature equals 1. Since the feature follows a Bernoulli distribution, only one of the two probabilities, say p, needs to be estimated, because the two probabilities of the same variable sum to 1.
After these parameters are estimated, naive Bayes has completed its learning process and can then be used for prediction (application is the ultimate goal).
Because it is a Bernoulli distribution, the parameter p lies in [0, 1], and in particular an estimate of 0 may occur, that is, a probability of 0.
For example, suppose that in all samples of the current class, feature i always appears ($x_i = 1$). From the maximum likelihood estimate of the conditional probability above, the estimated probability that feature i does not appear is then 0. Correspondingly, when a new sample x arrives that happens not to contain the i-th feature (is that just bad luck? No), this zero probability makes the posterior probability of that class computed by the Bayes formula above equal to 0. This situation should be avoided, so how do we avoid it?
When estimating the conditional probability with maximum likelihood, we make a small change to the numerator and denominator:

$$P_\lambda(x_i = 1|Y = c_k) = \frac{\sum_{j=1}^{N} I(x_{j,i} = 1, y_j = c_k) + \lambda}{\sum_{j=1}^{N} I(y_j = c_k) + S_i\lambda}$$

where $S_i$ is the number of different values the i-th feature can take; since each feature here is binary (0/1), $S_i = 2$. Multiplying $\lambda$ by $S_i$ in the denominator ensures that the conditional probabilities over the different values still sum to 1, without favoring any particular value and treating them all equally.
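Below is a minimal sketch of this smoothed estimation in NumPy. `X` is an (N, n) binary matrix, `y` holds class indices 0..K-1, and `alpha` plays the role of λ; all names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def bernoulli_nb_fit(X, y, K, alpha=1.0):
    """Estimate the prior and the smoothed per-feature Bernoulli parameters."""
    X, y = np.asarray(X), np.asarray(y)
    N, n = X.shape
    prior = np.zeros(K)
    p = np.zeros((K, n))
    for k in range(K):
        Xk = X[y == k]                                  # samples of class k
        prior[k] = len(Xk) / N                          # P(Y = c_k)
        # Smoothed P(x_i = 1 | Y = c_k); the 2 is the number of values
        # a binary feature can take (S_i = 2).
        p[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
    return prior, p
```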
Multinomial naive Bayes assumes that P(X = x|Y = c_k) is a multinomial distribution. Before learning multinomial naive Bayes, what is the multinomial distribution?
First extend the univariate Bernoulli distribution to a D-dimensional one-hot vector x (exactly one component equals 1, the rest are 0). Assuming the probability of the i-th outcome is $\mu_i$, with $\sum_{i=1}^{D}\mu_i = 1$, we obtain the discrete (categorical) distribution:

$$P(x|\mu) = \prod_{i=1}^{D}\mu_i^{x_i}$$

where x is a D-dimensional vector. On this basis, just as the Bernoulli distribution is extended to the binomial distribution, this distribution is extended to the multinomial distribution, which describes the probability of the counts of the outcomes (e.g., words) observed in N independent trials. Its probability mass function can be expressed as:

$$P(x_1, \dots, x_D|N, \mu) = \frac{N!}{x_1!\cdots x_D!}\prod_{i=1}^{D}\mu_i^{x_i}$$

The expectation and variance of the multinomial distribution are $E[x_i] = N\mu_i$ and $\mathrm{Var}(x_i) = N\mu_i(1 - \mu_i)$.
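A short Python sketch of this probability mass function, with made-up counts and probabilities:

```python
from math import factorial

def multinomial_pmf(x, p):
    """Probability of counts x = (x_1, ..., x_D) in N = sum(x) draws."""
    N = sum(x)
    coeff = factorial(N)
    for xi in x:
        coeff //= factorial(xi)      # multinomial coefficient N! / (x_1! ... x_D!)
    prob = float(coeff)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob

print(multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2]))  # ≈ 0.225 (3 * 0.25 * 0.3)
```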
Now apply the multinomial distribution to naive Bayes. For document classification, the document generation model, given the document class, is assumed to be a multinomial distribution. The correspondence is: each trial draws one word, the D possible outcomes are the D words of the vocabulary, and the count $x_t$ is the frequency of the t-th word in the document.
Note that, before specializing to the multinomial naive Bayes model for text classification, the general multinomial conditional probability (including a distribution P(|x|) over the document length $|x| = \sum_{t=1}^{d} x_t$) is:

$$P(X = x|Y = c_k) = P(|x|)\,\frac{|x|!}{x_1!\cdots x_d!}\prod_{t=1}^{d} p_{kt}^{x_t}$$

Our multinomial naive Bayes probability model is then:

$$P(Y = c_k|X = x) \propto P(Y = c_k)\,P(|x|)\,\frac{|x|!}{x_1!\cdots x_d!}\prod_{t=1}^{d} p_{kt}^{x_t}$$
For convenience, we assume that the length of the article is independent of its class (which is not strictly true; for example, a relatively long email is more likely to be normal mail than spam), that is, the distribution P(|x|) has nothing to do with the class of the article. Moreover, because the article is assigned to the class with the largest posterior probability and P(|x|) is the same for every class, the length term P(|x|) can be dropped.
Furthermore, for convenience, we usually take the logarithm of both sides, which converts the powers into linear operations:

$$\log\left(P(Y = c_k)\,\frac{|x|!}{x_1!\cdots x_d!}\prod_{t=1}^{d} p_{kt}^{x_t}\right) = \log P(Y = c_k) + \log\frac{|x|!}{x_1!\cdots x_d!} + \sum_{t=1}^{d} x_t\log p_{kt}$$

We can also omit the factorial term involving the article length, since it does not depend on the class, and obtain:

$$y = \arg\max_{c_k}\left[\log P(Y = c_k) + \sum_{t=1}^{d} x_t\log p_{kt}\right]$$

This is now a linear operation, as efficient and simple as linear regression.
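A minimal NumPy sketch of this multinomial scoring rule, assuming `theta[k, t] = P(word t | Y = c_k)` and the prior have already been estimated (names and shapes are illustrative):

```python
import numpy as np

def multinomial_nb_predict(x, prior, theta):
    """x: word-count vector of length d; prior: shape (K,); theta: shape (K, d)."""
    x = np.asarray(x)
    # The length term and the factorials are dropped: they are the same
    # for every class and do not change the arg max.
    log_post = np.log(prior) + (x * np.log(theta)).sum(axis=1)
    return int(np.argmax(log_post))
```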
Mapping the document model onto the multinomial distribution gives multinomial naive Bayes. Once we have chosen the assumed distribution, the remaining work is to estimate the d conditional probabilities of each class and the prior distribution under that assumption. In addition, note that the multinomial naive Bayes model uses the bag-of-words model: each component represents the frequency of the t-th feature, that is, the term frequency; sometimes tf-idf values can be used instead.
The process of parameter estimation is also the learning process of the naive Bayes classifier, and maximum likelihood estimation can again be used. The maximum likelihood estimate of the prior probability is

$$P(Y = c_k) = \frac{\sum_{j=1}^{N} I(y_j = c_k)}{N},\quad k = 1, 2, \dots, K$$

where I(·) is the indicator function: I(x) = 1 if the condition x is true and I(x) = 0 if it is false. In words, this probability equals the proportion of samples of class $c_k$ in a data set of N samples.
The maximum likelihood estimate of the conditional probability is:

$$p_{kt} = \frac{\sum_{j: y_j = c_k} x_{j,t}}{\sum_{j: y_j = c_k}\sum_{s=1}^{d} x_{j,s}}$$

In words, the conditional probability equals the ratio of the total number of times the t-th feature appears (counting word frequency, so it is no longer just 0 or 1) to the total number of words in the samples of class $c_k$ (document lengths are taken into account by summing the word frequencies).
For ease of understanding, write the total number of occurrences of the t-th feature in the samples of class $c_k$ as $N_{kt}$, and the total number of words in all samples of class $c_k$ (the sum of their lengths, counting frequencies) as $N_k$; the estimate is then abbreviated as $p_{kt} = N_{kt}/N_k$.
Similar to the Bernoulli naive Bayes model, some dimension may have a total count of 0 in the data set; for document classification this means the word never appears in any article of that class (a poorly chosen dictionary, i.e., poor feature selection), and this situation produces a probability of 0. So we modify the conditional probability slightly:

$$p_{kt} = \frac{N_{kt} + \alpha}{N_k + \alpha d}$$

where d is the data dimension (there are d features; multiplying $\alpha$ by d in the denominator guarantees that the conditional probabilities over the d features still sum to 1). When $\alpha = 1$ this is called Laplace smoothing, but $\alpha$ can also be smaller than 1.
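A minimal NumPy sketch of this smoothed estimation, where `X` is an (N, d) matrix of word counts and `y` holds class indices 0..K-1; the names are illustrative assumptions:

```python
import numpy as np

def multinomial_nb_fit(X, y, K, alpha=1.0):
    """Estimate the prior and the smoothed per-word probabilities theta[k, t]."""
    X, y = np.asarray(X), np.asarray(y)
    N, d = X.shape
    prior = np.zeros(K)
    theta = np.zeros((K, d))
    for k in range(K):
        Xk = X[y == k]
        prior[k] = len(Xk) / N                  # P(Y = c_k)
        counts = Xk.sum(axis=0)                 # total count of each word in class k
        theta[k] = (counts + alpha) / (counts.sum() + alpha * d)
    return prior, theta
```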
Gaussian naive Bayes assumes that P(X = x|Y = c_k) is a multivariate Gaussian distribution. Before learning Gaussian naive Bayes, what are the Gaussian distribution and the multivariate Gaussian distribution?
The Gaussian distribution, also known as the normal distribution, is the most widely used distribution in practice. For a single variable, the Gaussian distribution has two parameters, the mean $\mu$ and the variance $\sigma^2$, and its probability density function is

$$N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

For a D-dimensional vector x, the multivariate Gaussian density is

$$N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right)$$

where $\mu$ is the D-dimensional mean vector and $|\Sigma|$ is the determinant of the D×D covariance matrix $\Sigma$. The expectation of the multivariate Gaussian distribution is $\mu$ and its covariance is $\Sigma$.
In particular, if the D dimensions are independent of each other, the multivariate Gaussian distribution can be written as the product of the probability density functions of D univariate Gaussian distributions.
The Gaussian naive Bayes model assumes that the conditional probability P(X = x|Y = c_k) is a multivariate Gaussian distribution. Combined with the conditional-independence assumption on the features made earlier, we can model the conditional probability of each feature separately, and the conditional probability of each feature then follows a univariate Gaussian distribution.
Under class $c_k$, the Gaussian distribution of the i-th feature is:

$$P(x_i|Y = c_k) = \frac{1}{\sqrt{2\pi\sigma_{ki}^2}}\exp\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma_{ki}^2}\right)$$

where $\mu_{ki}$ and $\sigma_{ki}^2$ denote the mean and variance of the i-th feature under class $c_k$.
Since the features are assumed to be independent of each other, the conditional probability is:

$$P(X = x|Y = c_k) = \prod_{i=1}^{d} P(x_i|Y = c_k)$$

where there are d features.
Gaussian naive Bayes then becomes:

$$y = \arg\max_{c_k} P(Y = c_k)\prod_{i=1}^{d} P(x_i|Y = c_k)$$
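A minimal NumPy sketch of this rule, assuming per-class means `mu`, variances `var`, and the prior have already been estimated (names and shapes are illustrative):

```python
import numpy as np

def gaussian_nb_predict(x, prior, mu, var):
    """x: length-d vector; prior: shape (K,); mu, var: shape (K, d)."""
    x = np.asarray(x)
    # Log of the univariate Gaussian density, summed over the d features.
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    log_post = np.log(prior) + log_lik.sum(axis=1)
    return int(np.argmax(log_post))
```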
Now that we understand the multivariate Gaussian distribution, the next step is to estimate the parameters $\mu_{ki}$ and $\sigma_{ki}^2$.
The prior probability is estimated in the same way as before, so it is not repeated here. What mainly needs to be estimated is the mean and variance of each Gaussian distribution, and the method is still maximum likelihood estimation.
The estimate of the mean is the average of the i-th feature over all samples of class $c_k$:

$$\mu_{ki} = \frac{1}{N_k}\sum_{j: y_j = c_k} x_{j,i}$$

and the estimate of the variance is the variance of the i-th feature over all samples of class $c_k$:

$$\sigma_{ki}^2 = \frac{1}{N_k}\sum_{j: y_j = c_k}(x_{j,i} - \mu_{ki})^2$$

where $N_k$ is the number of samples of class $c_k$.
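A minimal NumPy sketch of these maximum likelihood estimates, where `X` is an (N, d) matrix of continuous feature values and `y` holds class indices 0..K-1; the names are illustrative assumptions:

```python
import numpy as np

def gaussian_nb_fit(X, y, K):
    """Estimate the prior and the per-class feature means and variances."""
    X, y = np.asarray(X), np.asarray(y)
    N, d = X.shape
    prior = np.zeros(K)
    mu = np.zeros((K, d))
    var = np.zeros((K, d))
    for k in range(K):
        Xk = X[y == k]
        prior[k] = len(Xk) / N
        mu[k] = Xk.mean(axis=0)     # sample mean of each feature in class k
        var[k] = Xk.var(axis=0)     # maximum likelihood (biased) variance
    return prior, mu, var
```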
For a continuous feature value of a new sample, its probability density is obtained by plugging it into the corresponding Gaussian distribution.
After all the parameters are estimated, the conditional probability of a given sample can be computed and its class determined, which completes the model's prediction.