What is Chinese word segmentation?

Main methods of Chinese word segmentation

Existing word segmentation algorithms fall into three major categories: methods based on string matching, methods based on understanding, and methods based on statistics.

1. Word segmentation method based on string matching

This method is also called mechanical word segmentation. Following a chosen strategy, it matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary. If a string is found in the dictionary, the match succeeds and a word is recognized. By scanning direction, string matching methods divide into forward matching and reverse matching. By which length is tried first, they divide into maximum (longest) matching and minimum (shortest) matching. By whether segmentation is combined with part-of-speech tagging, they divide into pure segmentation methods and integrated methods that segment and tag together. Several commonly used mechanical word segmentation methods are as follows:

1) Forward maximum matching method (from left to right);

2) Reverse maximum matching method (from right to left);

3) Minimum segmentation (minimize the number of words cut out in each sentence).
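The three methods listed above can be sketched in Python as follows. This is a minimal illustration, not a production segmenter: the function names, the toy dictionary, the max_len parameter, and the sample string 研究生命起源 are all assumptions made for the example, and characters that match no dictionary entry are simply emitted as single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


def fewest_words(text, dictionary, max_len=4):
    """Minimum segmentation: dynamic programming over prefix lengths."""
    n = len(text)
    best = [0] + [n + 1] * n  # best[i]: fewest words covering text[:i]
    prev = [0] * (n + 1)      # prev[i]: start index of the last word
    for i in range(1, n + 1):
        for size in range(1, min(max_len, i) + 1):
            piece = text[i - size:i]
            usable = size == 1 or piece in dictionary
            if usable and best[i - size] + 1 < best[i]:
                best[i], prev[i] = best[i - size] + 1, i - size
    words, i = [], n
    while i > 0:  # walk the back-pointers to recover the words
        words.append(text[prev[i]:i])
        i = prev[i]
    return words[::-1]


TOY_DICT = {"研究", "研究生", "生命", "起源"}  # hypothetical toy dictionary
sample = "研究生命起源"
print(forward_max_match(sample, TOY_DICT))  # ['研究生', '命', '起源']
print(reverse_max_match(sample, TOY_DICT))  # ['研究', '生命', '起源']
print(fewest_words(sample, TOY_DICT))
```

On this input the forward scan greedily takes 研究生 and strands 命, while the reverse scan recovers 研究 / 生命 / 起源, which shows how the two scanning directions can disagree on the same string.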

The above methods can also be combined. For example, forward maximum matching and reverse maximum matching can be combined into a two-way (bidirectional) matching method. Because single Chinese characters can themselves form words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistical results show that forward maximum matching used alone has an error rate of 1/169, while reverse maximum matching used alone has an error rate of 1/245. Even so, this accuracy falls far short of practical needs. Word segmentation systems in actual use treat mechanical segmentation only as a preliminary pass and rely on various other kinds of linguistic information to improve accuracy further.
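A two-way matching method can be sketched by running both scans and choosing between them when they disagree. The tie-breaking rule below (fewer words, then fewer single-character words, then the reverse result) is an assumed heuristic for illustration only; the text does not specify one. The function names and the toy dictionary are likewise hypothetical.

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Maximum matching; scans right to left when reverse is True."""
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            # Single characters are always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return words[::-1] if reverse else words


def bidirectional_match(text, dictionary, max_len=4):
    """Run both scans; if they differ, pick one by an assumed heuristic."""
    fwd = max_match(text, dictionary, max_len)
    rev = max_match(text, dictionary, max_len, reverse=True)
    if fwd == rev:
        return fwd

    def cost(words):
        singles = sum(1 for w in words if len(w) == 1)
        return (len(words), singles)

    # Ties go to the reverse result, which the text reports as more accurate.
    return fwd if cost(fwd) < cost(rev) else rev


TOY_DICT = {"研究", "研究生", "生命", "起源"}  # hypothetical toy dictionary
print(bidirectional_match("研究生命起源", TOY_DICT))  # ['研究', '生命', '起源']
```

Where the two scans agree, the result is returned as is; the positions where they differ are also a cheap signal that the string contains an ambiguity worth further analysis.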

One approach is to improve the scanning method, known as feature scanning or mark segmentation. It first identifies and cuts out words with obvious characteristics in the string to be analyzed and uses them as breakpoints, so that the original string is divided into smaller strings before mechanical segmentation is applied, which reduces the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which greatly improves accuracy.
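The feature scanning idea can be sketched as a pre-pass that cuts the string at obvious markers before mechanical segmentation runs on each remaining piece. Which markers count as "obvious" is an assumption here: the sketch uses punctuation, whitespace, and runs of digits or Latin letters. The regular expression, function names, and toy dictionary are all hypothetical.

```python
import re

# Assumed markers: punctuation/whitespace runs, or runs of digits/Latin letters.
MARKER = re.compile(r"([，。！？、；：,.!?;:\s]+|[0-9A-Za-z]+)")


def forward_max_match(text, dictionary, max_len=4):
    """Plain forward maximum matching, used on each chunk."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def segment_with_markers(text, dictionary, max_len=4):
    """Split at markers first, then segment the chunks between them."""
    words = []
    # re.split keeps the markers because the pattern has a capture group.
    for chunk in MARKER.split(text):
        if not chunk:
            continue
        if MARKER.fullmatch(chunk):
            token = chunk.strip()  # drop pure whitespace, keep other markers
            if token:
                words.append(token)
        else:
            words.extend(forward_max_match(chunk, dictionary, max_len))
    return words


TOY_DICT = {"北京", "然后", "回家"}  # hypothetical toy dictionary
print(segment_with_markers("我在2008年去北京，然后回家", TOY_DICT))
```

Because no dictionary match can span a marker, each chunk is shorter and offers fewer chances for a wrong long match.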

A general model can be established for mechanical word segmentation. Academic papers cover this topic, so it is not discussed in detail here.

2. Word segmentation method based on understanding

This method recognizes words by having the computer simulate human understanding of sentences. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and an overall control part. Coordinated by the control part, the word segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge segmentation ambiguities, which simulates the way people understand sentences. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly. Understanding-based word segmentation systems are therefore still at the experimental stage.

3. Word segmentation method based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects well how credible it is that they form a word. The frequency of each combination of adjacent characters in a corpus can be counted and their mutual information calculated. Mutual information is defined for two Chinese characters X and Y from the probability that they occur adjacently, and it reflects how tightly the two characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered a possible word. The method only needs to count the frequency of character groups in the corpus and needs no segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction. It has limitations, however. It often extracts character groups that co-occur frequently but are not words, such as 这一 ("this one"), 之一 ("one of"), 有的 ("some"), 我的 ("mine"), and 许多的 ("many"). It also recognizes common words with poor accuracy and has a large time and space overhead.
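The statistic described above can be sketched as follows. The text does not give the exact formula, so this sketch assumes the commonly used pointwise mutual information, log2 of P(XY) / (P(X) P(Y)), where P(XY) is the relative frequency of X immediately followed by Y. The function name and the four-sentence toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Score each adjacent character pair by pointwise mutual information."""
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs   # probability of X followed by Y
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


toy_corpus = ["北京欢迎你", "我爱北京", "北京很大", "你很大"]  # hypothetical
scores = adjacent_mutual_information(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs scoring above a chosen threshold become word candidates. On a corpus this small, pairs seen only once can score very high, which hints at why the method needs a large corpus and why frequent non-word pairs can still slip through.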

Practical statistical word segmentation systems use a basic segmentation dictionary (a dictionary of common words) for string matching and, at the same time, use statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation while exploiting the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguities automatically.
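One way such a combination could look is sketched below: dictionary matching runs first, and leftover single characters are then joined into a new-word candidate when the adjacent pair occurs often enough in a corpus. This is an assumed design for illustration, not a description of any particular system; the min_count threshold, function names, toy dictionary, and toy corpus (built around the name 王军虎, Wang Junhu) are all hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Dictionary pass: forward maximum matching."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def count_adjacent_pairs(corpus):
    """Statistical pass, step 1: count adjacent character pairs."""
    pairs = Counter()
    for sentence in corpus:
        pairs.update(a + b for a, b in zip(sentence, sentence[1:]))
    return pairs


def merge_new_words(words, dictionary, pair_counts, min_count=2):
    """Statistical pass, step 2: join frequent runs of unmatched characters."""
    merged = []
    for w in words:
        prev = merged[-1] if merged else None
        can_extend = (
            prev is not None
            and len(w) == 1
            and prev not in dictionary  # never extend a dictionary word
            and pair_counts[prev[-1] + w] >= min_count
        )
        if can_extend:
            merged[-1] = prev + w
        else:
            merged.append(w)
    return merged


TOY_DICT = {"广州"}                                       # hypothetical
toy_corpus = ["王军虎去广州了", "王军虎回来了", "我去广州"]  # hypothetical
pair_counts = count_adjacent_pairs(toy_corpus)
rough = forward_max_match("王军虎去广州了", TOY_DICT)
print(rough)
print(merge_new_words(rough, TOY_DICT, pair_counts))
```

The dictionary pass alone breaks the name into single characters; the statistical pass rejoins it because 王军 and 军虎 each occur twice in the toy corpus while 虎去 occurs only once.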

There is no conclusion yet as to which word segmentation algorithm is most accurate. No mature word segmentation system can rely on a single algorithm; it has to combine several. As the author understands it, the word segmentation algorithm of Massive Technology uses a "compound" segmentation method. "Compound" here is analogous to a compound prescription in traditional Chinese medicine, in which different medicinal materials are combined to treat a disease. Similarly, recognizing Chinese words requires multiple algorithms to handle different problems.

Problems in word segmentation

With mature word segmentation algorithms, can the problem of Chinese word segmentation be easily solved? Far from it. Chinese is a very complex language, and it is even harder for a computer to understand. In Chinese word segmentation, two major problems have not been completely overcome.

1. Ambiguity identification

Ambiguity means that the same sentence may be segmented in two or more ways. For example, in 表面的, both 表面 ("surface") and 面的 ("minibus taxi") are words, so the phrase can be divided into 表面 / 的 or 表 / 面的. This is called cross ambiguity, and it is very common. The "kimono" example mentioned earlier is in fact an error caused by cross ambiguity: 化妆和服装 ("makeup and clothing") can be divided into 化妆 / 和 / 服装 or into 化妆 / 和服 / 装, where 和服 means "kimono". Without human knowledge to draw on, it is difficult for a computer to know which segmentation is correct.
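Cross ambiguity can be detected mechanically by looking for dictionary words that partially overlap in the string. The sketch below assumes toy dictionaries and uses the strings 表面的 and 化妆和服装 as inputs, taken here to correspond to the "surface" and "makeup and clothing" examples above; the function name is hypothetical. It only finds the overlaps and does not decide which segmentation is right.

```python
def find_cross_ambiguities(text, dictionary, max_len=4):
    """Return pairs of dictionary words that partially overlap in text."""
    # Collect the (start, end) span of every multi-character dictionary word.
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(i + max_len, len(text)) + 1)
        if text[i:j] in dictionary
    ]
    overlaps = []
    for a_start, a_end in spans:
        for b_start, b_end in spans:
            # Partial overlap: b starts inside a and ends after a.
            if a_start < b_start < a_end < b_end:
                overlaps.append((text[a_start:a_end], text[b_start:b_end]))
    return overlaps


print(find_cross_ambiguities("表面的", {"表面", "面的"}))
# [('表面', '面的')]
print(find_cross_ambiguities("化妆和服装", {"化妆", "和服", "服装"}))
# [('和服', '服装')]
```

Each reported pair shares at least one character, so only one of the two words can appear in any single segmentation of the string.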

Even if computers could resolve both cross ambiguity and combined ambiguity (where a string is a word in one context but not in another), one hard problem would remain: true ambiguity. True ambiguity means that, given a sentence alone, even a person cannot tell which strings should be words. For example, 乒乓球拍卖完了 can be divided into 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis rackets are sold out") or 乒乓球 / 拍卖 / 完 / 了 ("the table tennis ball auction is over"). Without other sentences for context, probably no one can tell whether 拍卖 ("auction") is a word here.
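Enumerating every dictionary-consistent segmentation makes this kind of ambiguity concrete. The sketch below assumes a toy dictionary and uses the string 乒乓球拍卖完了, taken here to be the table tennis example above; the function name is hypothetical. Unlike the matching sketches, it accepts a piece only if it is in the dictionary, so the dictionary must include the single-character words.

```python
def all_segmentations(text, dictionary, max_len=4):
    """Return every way to split text entirely into dictionary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for size in range(1, min(max_len, len(text)) + 1):
        head = text[:size]
        if head in dictionary:
            for rest in all_segmentations(text[size:], dictionary, max_len):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, including the single-character words needed.
TOY_DICT = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完", "了"}
for words in all_segmentations("乒乓球拍卖完了", TOY_DICT):
    print(" / ".join(words))
# 乒乓 / 球拍 / 卖 / 完 / 了
# 乒乓球 / 拍卖 / 完 / 了
```

Both outputs consist only of valid words, so the dictionary alone cannot choose between them. The recursion is exponential in the worst case, which is acceptable for a short illustrative string but not for real text.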

2. New word recognition

New words are known in technical terms as unregistered words (out-of-vocabulary words): words that are not in the dictionary but are nonetheless genuine words. The most typical case is a person's name. People easily understand that in the sentence 王军虎去广州了 ("Wang Junhu went to Guangzhou"), 王军虎 (Wang Junhu) is a word because it is a person's name, but a computer finds it difficult to recognize. Adding 王军虎 to the dictionary as a word does not scale: there are so many names in the world, with new ones appearing every moment, that including them all would be a huge project in itself. Even if that work could be completed, problems would remain. For example, in the sentence 王军虎头虎脑的 ("Wang Jun is sturdy and simple-looking"), can 王军虎 still count as a word?

Besides personal names, new words include organization names, place names, product names, trademark names, abbreviations, and elided forms. All of these are difficult to handle, yet they are words that people use often, so for search engines, new word recognition in a word segmentation system is very important. The accuracy of new word recognition has become one of the important indicators for evaluating a word segmentation system.

Application of Chinese word segmentation

At present, Chinese natural language processing technology lags far behind that for Western languages. Many processing methods for Western languages cannot be adopted directly for Chinese, because Chinese requires a word segmentation step. Chinese word segmentation is the basis for other Chinese information processing, and search engines are just one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all require word segmentation. The need for word segmentation may hold back some research, but it also brings opportunities to some companies, because foreign computer processing technology that wants to enter the Chinese market must first solve the problem of Chinese word segmentation. In research on the Chinese language, Chinese researchers have a very clear advantage over foreign ones.

Word segmentation accuracy is very important to search engines, but if segmentation is too slow, it is unusable for search engines no matter how accurate it is, because a search engine has to process hundreds of millions of web pages. If word segmentation takes too long, it seriously slows down how quickly the search engine can update its content. For search engines, therefore, both the accuracy and the speed of word segmentation must meet very high requirements.

At present, most of those who study Chinese word segmentation are scientific research institutions. Tsinghua University, Peking University, Harbin Institute of Technology, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, IBM Research, Microsoft China Research, and others all have their own research teams. Among commercial companies, however, almost none apart from Massive Technology truly specializes in Chinese word segmentation.