The Text Classification Principle of the Naive Bayes Algorithm


Preface

This article mainly studies how the Naive Bayes algorithm classifies text.

Bayesian algorithm

The Bayesian method converts the probability of "belonging to a certain category given certain features" into the probability of "having certain features given a certain category"; it belongs to supervised learning.

Posterior probability = prior probability x adjustment factor

This is the essence of Bayesian inference: we first estimate a "prior probability", then incorporate the experimental results to see whether they strengthen or weaken it, and thus obtain a "posterior probability" closer to the truth.

Formula

p(yi|x) = p(yi) * p(x|yi) / p(x)

p(belonging to category yi given feature x) = p(belonging to category yi) * p(having feature x given category yi) / p(having feature x)

According to the formula, the probability of "belonging to a certain category given certain features" can be converted into the probability of "having certain features given a certain category".

  • Prior probability

Here p(yi) is called the prior probability: the probability that event yi occurs before event x is observed.

  • Posterior probability

p(yi|x) is called the posterior probability: the probability that event yi occurs given that event x has been observed. This is the value we want to compute.

  • Adjustment factor

p(x|yi)/p(x) is the adjustment factor, also called the likelihood function, which brings the estimated probability closer to the true probability.
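As a concrete illustration with made-up numbers: suppose 20% of all e-mails are spam, so the prior is p(spam) = 0.2; the word "free" appears in 60% of spam e-mails (p(free|spam) = 0.6) and in 5% of non-spam e-mails, giving p(free) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16. Then

p(spam|free) = p(spam) * p(free|spam) / p(free) = 0.2 * 0.6 / 0.16 = 0.75

The adjustment factor 0.6 / 0.16 = 3.75 is greater than 1, so observing "free" strengthens the prior 0.2 into a posterior of 0.75.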

Naive Bayes algorithm

Naive Bayes builds on the assumption of independence between random variables. In terms of text classification, Naive Bayes regards any two words in a sentence as independent of each other; that is, the dimensions of an object's feature vector are mutually independent. This is the theoretical basis of Naive Bayes. The process is as follows (a code sketch of these stages follows the list):

- Stage 1: generate the training sample set from the training data: TF-IDF.
- Stage 2: compute p(yi) for each category.
- Stage 3: compute the conditional probability p(ai|yi) of every feature attribute under each category.
- Stage 4: compute p(x|yi)p(yi) for each category.
- Stage 5: take the category with the largest p(x|yi)p(yi) as the category of x.
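
A minimal Python sketch of these five stages (illustrative only, not the original author's code: for simplicity it uses raw word counts instead of TF-IDF weights, and it adds Laplace smoothing, which the text does not cover, to avoid zero probabilities):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Stages 1-3: build the sample set and estimate p(yi) and p(ai|yi).

    samples: list of (word_list, category) pairs.
    """
    class_docs = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for words, cat in samples:
        class_docs[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)

    total_docs = sum(class_docs.values())
    priors = {c: n / total_docs for c, n in class_docs.items()}  # stage 2: p(yi)

    cond = {}                           # stage 3: p(ai|yi), Laplace-smoothed
    for cat, counts in word_counts.items():
        total = sum(counts.values())
        cond[cat] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    """Stages 4-5: compute p(x|yi)p(yi) per category and return the argmax.

    Log probabilities are summed rather than multiplied directly
    (see the underflow note further below).
    """
    best_cat, best_score = None, float("-inf")
    for cat, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in vocab:              # words never seen in training are skipped
                score += math.log(cond[cat][w])
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Toy data, made up for illustration
samples = [(["free", "prize", "win"], "spam"),
           (["meeting", "tomorrow"], "ham"),
           (["win", "free", "cash"], "spam"),
           (["project", "meeting", "notes"], "ham")]
priors, cond, vocab = train(samples)
print(classify(["free", "cash"], priors, cond, vocab))  # -> spam
```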

Problem

Suppose x is a text to be classified, characterized by the features {a1, a2, ..., am}; the category set {y1, y2, ..., yn} is known. Find the category x belongs to.

Basic idea

If p(yi|x) = max{p(y1|x), p(y2|x), p(y3|x), ..., p(yn|x)}, then x belongs to category yi.
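
In code this decision rule is simply an argmax over the candidate categories (the numbers here are hypothetical):

```python
# Hypothetical posterior probabilities p(yi|x) for three categories
posteriors = {"y1": 0.1, "y2": 0.7, "y3": 0.2}
category = max(posteriors, key=posteriors.get)  # -> "y2", so x belongs to y2
```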

How to calculate p(yi|x)

Using Bayes' formula

p(yi|x) = p(x|yi)*p(yi) / p(x)

The problem is thus converted into calculating p(x|yi)p(yi) for each category and taking the category with the largest p(x|yi)p(yi) as the category of x. The denominator p(x) can be dropped because it is the same for every category and does not affect the comparison.

Calculate p(x|yi)

Since Naive Bayes assumes that the features are mutually independent, we have

p(x|yi) = p(a1|yi) * p(a2|yi) * ... * p(am|yi)
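
Multiplying many small probabilities like this quickly underflows to zero in floating-point arithmetic, which is why implementations usually sum logarithms instead; a small illustration with arbitrary values:

```python
import math

probs = [1e-5] * 100  # 100 per-word likelihoods p(ai|yi), arbitrarily small

product = 1.0
for p in probs:
    product *= p      # 1e-500 is below the double range, so this ends up 0.0

log_sum = sum(math.log(p) for p in probs)  # stays representable

print(product)        # 0.0
print(log_sum)        # about -1151.3
```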


Calculate p(ai|yi)

p(ai|yi) can be estimated from the training set (whose samples already carry category labels) by computing the conditional probability of each feature under every category.
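
For example (made-up counts): if the "sports" category of the training set contains 1000 word occurrences in total and the word "ball" accounts for 30 of them, then p(ball|sports) = 30/1000 = 0.03. Note that a word that never occurs under some category would make the whole product p(x|yi) zero; a common remedy, not covered in this text, is Laplace smoothing, e.g. p(ball|sports) = (30 + 1) / (1000 + V), where V is the vocabulary size.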

With this, p(yi|x), and hence the category x belongs to, has been solved step by step: everything reduces to computing the conditional probabilities p(ai|yi) of the features under each category from the training set.

During training, the factor p(x|yi) = p(a1|yi) * p(a2|yi) * ... * p(am|yi) of the adjustment factor is computed from the training set, so the quality of the training set directly affects the accuracy of the prediction results.

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a common weighting method in information retrieval and data mining.

TF-IDF = TF * IDF

The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) and rarely in other articles (high IDF), it is considered to have good discriminating power and to be well suited for classification.

TF

It stands for Term Frequency: the frequency with which a word appears in a document.

TFi = Ndti / (Ndt1 + ... + Ndtn)

That is, the number of occurrences of the word in the document divided by the total number of occurrences of all words in the document.
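
For example, if a document contains 100 word occurrences in total and a given word appears 3 of those times, its TF in that document is 3/100 = 0.03.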

IDF

It stands for Inverse Document Frequency, an adjustment coefficient for a word's importance that measures whether the word is common. If a word is rare overall but appears many times in one article, it probably reflects the characteristics of that article and is exactly the keyword we need.

The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

IDF = log(Nd/Ndt)

Nd is the total number of documents and Ndt is the number of documents containing the word.
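
For example, with Nd = 1000 documents, a rare word appearing in 10 of them gets IDF = log(1000/10) = log(100) ≈ 4.6 (natural log), while a common word appearing in 900 of them gets IDF = log(1000/900) ≈ 0.1.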

If a word is very common, Ndt is large, so its IDF is small and its TF-IDF value is small, which weakens the weight of common words.
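
A minimal Python sketch of the TF and IDF formulas above (illustrative code, not the author's; variable names follow the Nd/Ndt notation, and the toy documents are made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: TF-IDF weight} dict per document."""
    n_docs = len(docs)          # Nd: total number of documents
    df = Counter()              # Ndt: number of documents containing each word
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)        # occurrences of all words in this document
        weights.append({w: (c / total) * math.log(n_docs / df[w])  # TF * IDF
                        for w, c in counts.items()})
    return weights

docs = [["free", "prize", "win", "free"],
        ["meeting", "tomorrow", "free"],
        ["project", "meeting", "notes"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Note that a word occurring in every document gets IDF = log(1) = 0, so its TF-IDF weight vanishes entirely; this is exactly the weakening effect described above.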

Summary

The Naive Bayes algorithm decomposes the problem step by step and finally solves it through the training set; this way of thinking is worth studying and reflecting on.
