The working mechanism of search engines

A search engine is a product that wins or loses on its technology. Every component of a search engine, including the page collector, the indexer, and the retriever, is a focus of competition among search engine providers.

In recent years, the commercialization of search engines has been very successful: well-known search engine companies such as Google, Yahoo (in this article, "Yahoo" refers specifically to the English-language Yahoo), and Baidu have all gone public. This has drawn many companies into the field and led to large investments of manpower and capital; even the software giant Microsoft could not resist the temptation and is actively building its own search engine. In terms of performance, however, current search engines are still unsatisfactory: the returned results are often far from what the user actually wants, and retrieval effectiveness is not high. This article analyzes the working principles of search engines and the technologies that implement them, in order to understand the factors that limit improvements in the search engine user experience.

The working process of search engines

The data centers of large Internet search engines generally run thousands or even hundreds of thousands of computers, and dozens of machines are added to the cluster every day to keep pace with the growth of the Web. The collection machines gather web page information automatically at an average speed of dozens of pages per second, and the retrieval machines provide a fault-tolerant, scalable architecture that handles tens of millions or even hundreds of millions of user queries every day. Enterprise search engines can be deployed at different scales, from a single computer to a computer cluster.

The general working process of a search engine is as follows: first it collects web pages from the Internet, then it preprocesses the collected pages and builds a web page index database; it then responds to user query requests in real time, sorts the retrieved results according to certain rules, and returns them to the user. The essential function of a search engine is to provide full-text retrieval of the text information on the Internet.

Figure 1: Workflow of a search engine

A search engine receives retrieval requests from users through a client program; the most common client today is the browser, but it could just as well be a much simpler web application developed by a user. The search request entered by the user is usually a keyword, or several keywords connected by logical operators. The search server converts each search keyword into a wordID using the system's keyword dictionary, looks up the corresponding docID list in the index database (the inverted file), scans the entries in the docID list and matches them against the wordIDs, and extracts the web pages that satisfy the query. It then computes the relevance between each page and the keywords, and returns the top K results (the number per page differs between search engines), ordered by relevance, to the user. The processing flow is shown in Figure 1.
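
As a rough sketch of this lookup flow (the `lexicon`, `inverted_index`, and `weights` tables below are invented stand-ins, not any real engine's data structures), a keyword query might be resolved along these lines:

```python
# Hypothetical, highly simplified illustration of the keyword -> wordID -> docID flow.
from collections import defaultdict

# Keyword dictionary: maps each term to its wordID.
lexicon = {"search": 0, "engine": 1, "index": 2}

# Inverted file: maps each wordID to the set of docIDs containing that word.
inverted_index = {0: {1, 2, 5}, 1: {2, 5, 7}, 2: {3, 5}}

# Pretend per-(docID, wordID) relevance weights; a real engine derives these
# from term frequencies, link analysis, and many other signals.
weights = defaultdict(float, {(2, 0): 0.75, (2, 1): 0.5,
                              (5, 0): 0.5, (5, 1): 0.25, (5, 2): 0.25})

def search(query, top_k=10):
    word_ids = [lexicon[t] for t in query.lower().split() if t in lexicon]
    if not word_ids:
        return []
    # Candidate documents: those containing every query term (AND semantics).
    candidates = set.intersection(*(inverted_index[w] for w in word_ids))
    # Score each candidate by summing its per-term weights and keep the top K.
    scored = [(sum(weights[(d, w)] for w in word_ids), d) for d in candidates]
    return sorted(scored, reverse=True)[:top_k]

print(search("search engine"))   # [(1.25, 2), (0.75, 5)]
```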

Figure 2 shows the system architecture of a typical search engine, including the page collector, the indexer, the retriever, and the index files. The implementation of the main components is described below.

Figure 2: The relationship between the components of a search engine

Collector

The function of the collector is to roam the Internet, discovering and gathering information. It collects many types of content, including HTML pages, XML documents, newsgroup articles, FTP files, word processing documents, and multimedia. The collector is a computer program whose implementation often uses distributed and parallel processing techniques to improve the efficiency of information discovery and updating. Commercial search engine collectors can gather millions of web pages or more every day. A collector generally has to run continuously, gathering as much new information of all types as possible, as quickly as possible. Because information on the Internet is updated quickly, previously collected information must also be refreshed regularly to avoid dead and invalid links. In addition, because Web information changes dynamically, the collector, analyzer, and indexer must update the database periodically; the update cycle is usually on the order of weeks or even months. The larger the index database, the harder it is to keep up to date.

There is far too much information on the Internet for even a powerful collector to gather it all. The collector therefore traverses the Internet and downloads documents according to a particular search strategy, typically using a breadth-first strategy as the main approach, supplemented by a linear strategy.

In its implementation, the collector maintains a hyperlink queue (or stack) containing a set of starting URLs. The collector starts from these URLs, downloads the corresponding pages, extracts new hyperlinks from them, and adds the new links to the queue or stack; the process repeats until the queue is empty. To improve efficiency, search engines partition the Web space by domain name, IP address, or country-code domain and use multiple collectors working in parallel, so that each collector is responsible for one subspace. To allow the service to be expanded later, a collector should be able to change its search scope.
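
Assuming a hypothetical `extract_links` helper, and with politeness rules, robots.txt handling, and retries omitted, this collection loop could be sketched as follows; popping from the front of the frontier gives breadth-first collection, popping from the back gives depth-first:

```python
# Minimal crawler skeleton (illustrative only): maintain a frontier of URLs,
# download each page, and push newly discovered hyperlinks back onto the frontier.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def extract_links(base_url, html):
    # Crude href extraction for illustration; a real collector would use an HTML parser.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def crawl(start_urls, max_pages=100, breadth_first=True):
    frontier = deque(start_urls)          # the hyperlink queue (or stack)
    seen = set(start_urls)
    collected = []
    while frontier and len(collected) < max_pages:
        url = frontier.popleft() if breadth_first else frontier.pop()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # skip dead or unreachable links
        collected.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected
```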

1. Linear collection strategy

The basic idea of the linear collection strategy is to start from one IP address and search each subsequent IP address in increasing order, ignoring the hyperlinks in each site's HTML files that point to other Web sites. This strategy is not suitable for large-scale collection (mainly because many IP addresses are dynamic), but it can be used for an exhaustive search of a small range. A collector using this strategy can discover sources of new HTML files that are seldom or never referenced by other HTML files.

2. Depth-first collection strategy

The depth-first collection strategy was commonly used in the early development of collectors. Its aim is to reach the leaf nodes of the structure being searched. Depth-first collection follows the hyperlinks in an HTML file until it can go no further, then backtracks to the previous HTML file and continues with that file's remaining hyperlinks; the search ends when no hyperlinks are left to choose. Depth-first collection is suitable for traversing a specified site or a deeply nested set of HTML files, but for large-scale collection it may never come back out, since the Web structure can be very deep.

3. Breadth-first collection strategy

The breadth-first collection strategy searches all the content at one level before moving on to the next level. If an HTML file contains three hyperlinks, one of them is selected and its target HTML file is processed; the collector then returns and selects the second hyperlink of the first page, processes the corresponding HTML file, and returns again. Once all hyperlinks at the same level have been processed, the search proceeds to the hyperlinks found in the HTML files just processed. This guarantees that shallow levels are processed first, so the collector does not get stuck when it encounters an endless deep branch. The breadth-first strategy is easy to implement and widely adopted, but it takes a long time to reach deeply nested HTML files.

4. Inclusion-based collection strategy

Some web pages are collected through user submission. For example, a commercial website can submit an application for inclusion to a search engine, and the collector then gathers that site's pages in a targeted way and adds them to the search engine's index database.

Analyzer

Generally, the web pages or documents downloaded by the collector must first be analyzed in order to build the index. Document analysis techniques generally include word segmentation (some engines, such as AltaVista, extract words only from certain parts of a document), filtering (using stop-word lists), and conversion (some systems normalize singular and plural forms, remove affixes, or map synonyms). These techniques are closely tied both to the specific language and to the system's indexing model.

Indexer

The function of the indexer is to analyze and process the information gathered by the collector, extract index items from it to represent each document, and generate the index table of the document collection.

There are two types of index items: metadata index items and content index items. Metadata index items are unrelated to the semantic content of the document, for example the author name, URL, update time, encoding, length, and link popularity. Content index items reflect the content of the document, such as keywords and their weights, phrases, and single words. Content index items can further be divided into single index items and multi-word (phrase) index items. For English, single index items are English words, which are easy to extract because words are separated by natural delimiters (spaces); for continuously written languages such as Chinese, the text must first be segmented into words. In a search engine, each single index item is usually assigned a weight indicating how well it discriminates the document; the weight is also used to compute the relevance of query results. The methods used are generally statistical, information-theoretic, or probabilistic; the methods for extracting phrase index items include statistical, probabilistic, and linguistic approaches.

To find specific information quickly, the usual approach is to build an index database: documents are represented in a form convenient for retrieval and stored in the index database. The index database format is a special data storage format that depends on the indexing mechanism and algorithm. The quality of the index is one of the key factors in the success of a Web information retrieval system: a good index model should be easy to implement and maintain, fast to search, and economical in space. Search engines generally borrow index models from traditional information retrieval, including inverted files, the vector space model, and probabilistic models. In the vector space model, for example, each document d is represented as a normalized vector V(d) = (t1, w1(d); ...; ti, wi(d); ...; tn, wn(d)), where ti is an index term and wi(d) is the weight of ti in d, generally defined as a function of the frequency tfi(d) of ti in d.
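
A minimal sketch of this representation, assuming a naive whitespace tokenizer and raw term frequency as the weight function (both purely illustrative choices), might look like this:

```python
# Illustrative vector space representation: each document becomes a normalized
# vector of term weights, here using raw term frequency tf_i(d) as w_i(d).
import math
from collections import Counter

def doc_vector(text):
    tf = Counter(text.lower().split())                    # tf_i(d) per term
    norm = math.sqrt(sum(w * w for w in tf.values()))
    return {term: w / norm for term, w in tf.items()}     # normalized V(d)

docs = {1: "search engines index web pages",
        2: "web pages link to other web pages"}
vectors = {doc_id: doc_vector(text) for doc_id, text in docs.items()}
print(vectors[1]["search"])   # each weight divided by the vector's length
```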

The output of the indexer is an index table, which generally takes the inverted form (inversion list), that is, it maps each index item to the documents that contain it. The index table may also record the positions at which index items appear within a document, so that the retriever can compute adjacency or proximity relationships between index items. An indexer can use a centralized or a distributed indexing algorithm. When the amount of data is very large, indexing must be performed in real time (instant indexing); otherwise it cannot keep up with the rapid growth in the volume of information. The indexing algorithm has a great impact on indexer performance (for example, response speed under large-scale peak query loads), and the effectiveness of a search engine depends largely on the quality of its index.
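
As an illustration only, a bare-bones inversion step that records the position of every index term in every document (so that proximity can later be computed) could be written as:

```python
# Illustrative indexer: builds an inverted table mapping each term to the
# documents it appears in, together with its positions inside each document.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(lambda: defaultdict(list))   # term -> docID -> [positions]
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "search engines build an index", 2: "the index maps terms to documents"}
index = build_inverted_index(docs)
print(dict(index["index"]))   # {1: [4], 2: [1]} -- positions enable proximity scoring
```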

Retriever

The function of the retriever is to quickly find documents in the index database according to the user's query, evaluate the relevance between each document and the query, sort the results to be returned, and implement some form of user relevance feedback. The information retrieval models commonly used by search engines include set-theoretic, algebraic, probabilistic, and hybrid models. They can query any word in the text, whether it appears in the title or in the body.

The retriever finds documents relevant to the user's query in the index, processing the query in the same way that indexed documents were analyzed. In the vector space model, for example, the user query q is first represented as a normalized vector V(q) = (t1, w1(q); ...; ti, wi(q); ...; tn, wn(q)); then the relevance between the query and each document in the index database is computed by some method, for instance as the cosine of the angle between the query vector V(q) and the document vector V(d); finally, all documents whose relevance exceeds a threshold are returned to the user in order of decreasing relevance. Of course, the search engine's relevance judgment does not necessarily match the user's actual needs.
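
A hedged sketch of this relevance computation follows; the query and document vectors are made-up values standing in for the normalized V(q) and V(d) described above:

```python
# Illustrative cosine-similarity ranking between a query vector and document vectors.
import math

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

query_vec = {"search": 0.7, "engine": 0.7}
doc_vecs = {1: {"search": 0.5, "engine": 0.5, "web": 0.7},
            2: {"trademark": 0.9, "registration": 0.4}}

threshold = 0.1
results = sorted(((cosine(query_vec, d), doc_id) for doc_id, d in doc_vecs.items()),
                 reverse=True)
print([(round(score, 2), doc_id) for score, doc_id in results if score > threshold])
```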

User interface

The function of the user interface is to provide users with a visual interface for entering queries and viewing results: it accepts query conditions, displays the query results, and provides relevance feedback mechanisms. Its main purpose is to make the search engine convenient to use and to let users obtain useful information from it efficiently and in multiple ways.

The design and implementation of user interfaces must be based on the theories and methods of human-computer interaction to adapt to human thinking and usage habits.

In the query interface, the user formulates the terms to be retrieved and various simple or advanced search conditions according to the search engine's query syntax. A simple interface provides only a text box in which the user types a query string; a more complex interface lets the user constrain the query with, for example, logical operators (AND, OR, NOT), proximity relations (adjacent, NEAR), domain ranges (such as edu or com), occurrence position (such as title or body), time, and length. Some companies and institutions are currently considering standards for query options.
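
For illustration, simple boolean conditions of this kind can be evaluated directly on posting sets; the postings below are invented:

```python
# Illustrative evaluation of simple boolean query conditions over posting sets.
postings = {
    "search": {1, 2, 5, 7},
    "engine": {2, 5, 7},
    "trademark": {3, 5},
}

# "search AND engine NOT trademark"
result = (postings["search"] & postings["engine"]) - postings["trademark"]
print(sorted(result))   # [2, 7]
```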

In the result output interface, the search engine displays the search results as a linear list of documents, including each document's title, abstract, snapshot, hyperlink, and other information. Since relevant and irrelevant documents are mixed together in the results, the user has to browse through them one by one to find the required documents.

Chinese word segmentation technology for search engines

Automatic Chinese word segmentation is the basis of web page analysis. During web page analysis, Chinese and English are processed differently because of an obvious difference between Chinese and English text: English words are separated by spaces, whereas Chinese text has no separator between words. This means that before a Chinese web page can be analyzed, its sentences must be cut into sequences of words; this is Chinese word segmentation. Automatic Chinese word segmentation involves many natural language processing techniques and evaluation standards. In a search engine, the main concerns are the speed and accuracy of segmentation. Accuracy matters a great deal, but if segmentation is too slow then even very high accuracy is useless, because a search engine must process hundreds of millions of web pages; if segmentation takes too long, it seriously slows the updating of the search engine's content. Search engines therefore place high demands on both the accuracy and the speed of word segmentation.

At present, the most mature technique for automatic Chinese word segmentation is mechanical (dictionary-based) segmentation, which matches the Chinese character string to be analyzed against the entries of a segmentation dictionary according to some strategy. Depending on the matching strategy, mechanical segmentation includes the forward maximum matching algorithm, the reverse maximum matching algorithm, the minimum word count algorithm, and others. The advantage of this approach is that segmentation is fast and reasonably accurate, but it handles unregistered words poorly. Experimental results show that the error rate of forward maximum matching is about 1/169 and that of reverse maximum matching about 1/245. Another common approach is statistical segmentation, which counts the frequency of character groups in a corpus and needs no dictionary, so it is also called dictionary-free segmentation. However, this method often treats frequent character groups that are not actually words as words, recognizes common words less accurately, and has relatively high time and space costs. In practical search engine applications, mechanical segmentation is generally combined with statistical segmentation: string matching is performed first, and statistical methods are then used to recognize unregistered new words. This retains the speed and efficiency of matching-based segmentation while gaining the statistical method's ability to recognize new words automatically and resolve segmentation ambiguity.
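
A compact sketch of the forward maximum matching idea follows; the miniature dictionary here is purely illustrative, and a production segmenter would be far more elaborate:

```python
# Illustrative forward maximum matching: at each position, take the longest
# dictionary word that matches, falling back to a single character.
def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"搜索", "搜索引擎", "引擎", "工作", "原理"}
print(forward_max_match("搜索引擎工作原理", dictionary))
# ['搜索引擎', '工作', '原理']
```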

The segmentation dictionary is an important factor in automatic Chinese word segmentation. Its size is generally around 60,000 entries, and a dictionary that is either too small or too large is problematic: if it is too small, some words will not be segmented out; if it is too large, ambiguity increases sharply during segmentation, which also hurts accuracy. The selection of entries for the segmentation dictionary is therefore very strict. For the Internet, where new words appear constantly, a dictionary of about 60,000 entries is not enough, yet adding new words to it indiscriminately lowers segmentation accuracy. The usual solution is an auxiliary dictionary of roughly 500,000 entries. Beyond dictionary size, the real difficulty of automatic Chinese word segmentation lies in handling segmentation ambiguity and recognizing unregistered words; how to deal with these two problems has long been a hot topic in the field.

1. Ambiguity processing

Ambiguity means that a character string may have two or more possible segmentations.

For example, some character strings can be segmented in more than one way because neighboring characters can combine into words on either side; this is called cross ambiguity. Cross ambiguity is very common: the phrase 化妆和服装 ("makeup and clothing") can be segmented either as 化妆/和/服装 ("makeup / and / clothing") or as 化妆/和服/装 ("makeup / kimono / wear"). Without human knowledge, it is hard for a computer to tell which segmentation is correct.

Compared with combination ambiguity, cross ambiguity is relatively easy to handle; combination ambiguity has to be resolved by considering the whole sentence.

For example, in the sentence "This doorknob is broken", the characters 把手 ("handle") form a word, but in the sentence "Please take your hand away" the same characters do not; in the sentence "The general appointed a lieutenant general", 中将 ("lieutenant general") is a word, but in the sentence "Output will triple in three years" those same characters no longer form a word. How is a computer supposed to recognize these cases?

Even if a computer could resolve cross ambiguity and combination ambiguity, a harder problem remains: true ambiguity. True ambiguity means that even a person, given only the sentence, cannot say which reading is intended. For example, "乒乓球拍卖完了" can be segmented as 乒乓球/拍卖/完了 ("the table-tennis ball auction has ended") or as 乒乓/球拍/卖完了 ("the table-tennis rackets are sold out"); without surrounding context, nobody can say whether 拍卖 ("auction") should count as a word here.

The usual way to handle ambiguity is to use an algorithm similar to dynamic programming, turning the disambiguation problem into an optimization problem. During the optimization, auxiliary information such as word frequencies or probabilities is used to find the most probable segmentation, which is optimal in a certain sense.
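
One way to sketch this idea is a Viterbi-style dynamic program that picks the segmentation with the highest total log-probability; the word counts below are invented purely so that the cross-ambiguity example above comes out in favour of the more natural reading:

```python
# Illustrative maximum-probability segmentation: among all dictionary-compatible
# segmentations, pick the one whose words have the highest total log-probability.
import math

freq = {"化妆": 40, "和服": 30, "服装": 60, "和": 120,
        "装": 15, "服": 10, "化": 10, "妆": 5}          # invented counts
total = sum(freq.values())

def segment(text, max_len=4):
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)    # best[i] = (score, start of last word)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = best[start][0] + math.log(freq.get(word, 0.5) / total)
            if score > best[end][0]:
                best[end] = (score, start)
    words, i = [], n                      # backtrack to recover the best path
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]

print(segment("化妆和服装"))   # ['化妆', '和', '服装'] rather than ['化妆', '和服', '装']
```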

2. Unregistered word processing

Unregistered words (out-of-vocabulary words) are words that do not appear in the segmentation dictionary; they are also called new words. The most typical cases are personal names, place names, and technical terms. For example, in the sentence "Wang Junhu went to Guangzhou", a person easily sees that "Wang Junhu" is a word because it is a person's name, but a computer finds this hard to recognize. Simply adding "Wang Junhu" to the dictionary does not solve the problem either: there are countless names in the world and new ones appear every moment, so listing them all would be an enormous project in itself. Even if that work could be completed, problems would remain; for example, in the sentence "Wang Junhutouhunao", should "Wang Junhu" still be counted as a word?

Besides personal names, unregistered words include organization names, place names, product names, trademarks, abbreviations, and acronyms, all of which are difficult to handle, and these are exactly the words people commonly search for. For a search engine, the segmentation system's ability to recognize new words is therefore very important. At present, unregistered words are generally handled statistically: character groups that occur frequently are first extracted from a corpus, and those that satisfy certain rules are then added to the auxiliary dictionary as new words.
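
A rough sketch of this statistical step, counting recurring character strings in a tiny invented corpus and keeping the frequent ones that are not yet in the dictionary:

```python
# Illustrative new-word candidate extraction: count character n-grams in a corpus
# and keep those that recur often enough and are not already in the dictionary.
from collections import Counter

def new_word_candidates(corpus, dictionary, n=2, min_count=3):
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return [(gram, c) for gram, c in counts.most_common()
            if c >= min_count and gram not in dictionary]

corpus = ["博客很流行", "他写博客", "博客平台", "流行趋势"]
dictionary = {"流行", "平台", "趋势"}
print(new_word_candidates(corpus, dictionary, min_count=3))   # [('博客', 3)]
```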

At present, automatic Chinese word segmentation is already widely used in search engines, with segmentation accuracy above 96%. When analyzing and processing web pages on a large scale, however, existing Chinese segmentation technology still has many shortcomings, such as the ambiguity problem and the handling of unregistered words discussed above. For this reason, research institutions at home and abroad, including Peking University, Tsinghua University, the Chinese Academy of Sciences, the Beijing Language Institute, Northeastern University, IBM Research, and Microsoft Research China, have long paid attention to and studied automatic Chinese word segmentation. The main reason is that Chinese information on the Internet keeps growing, and processing it will become a huge industry with a broad market and unlimited business opportunities. Nevertheless, for Chinese word segmentation technology to serve the processing of Chinese information on the Internet better and to turn into products, a great deal of basic research and system integration work remains to be done.

Challenges faced by search engines

It is impossible for today's search engines to be both "broad" and "deep", because the two are contradictory requirements that cannot be satisfied at the same time.

With the rapid growth of Internet information, achieving "breadth" is becoming harder and harder, and from the perspective of actually using the information it is also unnecessary. "Depth", by contrast, is an indicator that people increasingly value and pursue. In addition, a multi-level search service system is still far from established: traditional search emphasizes navigation while neglecting precise information services. It is like a pedestrian asking for directions: the pedestrian needs not only the general direction but also concrete road signs.

People now often talk about the next generation of search engines. How will the next generation differ from the second generation? How are the two related? What features should it include? These questions all deserve answers, but the answers vary. Perhaps the next generation of search engines will incorporate stronger intelligence and richer human-computer interaction to improve relevance computation. Perhaps they will run not only on large server farms but also on PC clusters with shared computing resources, or even be embedded in "search chips". Perhaps the boundaries of their index databases will blur further, or perhaps become clearer. And perhaps the commercial barriers that today's search giants keep erecting with capital and brands will, in the end, fail to withstand the disruption of innovative search technology, just as Google quietly dismantled AltaVista.

[Related links]

Technical schools of search engines

Search engine technology can be divided into three schools. The first is the automation school, which processes information automatically with computer programs; typical representatives are Google and Ghunt. The second is the manual processing school, centered on the manual classification of information; the early Yahoo is its typical representative, and newer community-driven search such as Web 2.0 services and online excerpting are recent developments of this school. The third is the fusion school, which emphasizes intelligent human-computer interaction and collaboration; the current English-language Yahoo search engine is developing this kind of technology, MSN Live also shows growing attention to fusion techniques, and IFACE professional search, which incorporates user knowledge and machine learning methods, can be regarded as a typical representative of the fusion school among Chinese search engines.

If search engines are classified by web page library capacity, relevance computation technology, user search experience, and business model, their development so far spans roughly two generations. The first generation of search engines (1994 to 1997) indexed on the order of millions of web pages, using full-text retrieval and distributed parallel computing, but rarely re-collected pages or refreshed the index. Retrieval was slow, often requiring waits of ten seconds or more, and the volume of retrieval requests they could handle was also very limited. Their business model was still exploratory and had not yet taken shape.

Most second-generation search engines (1998 to the present) adopt distributed collaborative processing. Their web index databases generally hold tens of millions of pages or more and use a scalable index architecture capable of responding to tens of millions or even hundreds of millions of user retrieval requests every day. In November 1997, the most advanced search engines of the time claimed to be able to index 100 million web pages. The second-generation search engines represented by Google achieved great success by computing relevance through link analysis (page authority) and click analysis (page popularity). In addition, search engines that answer natural language questions have improved the user experience to some extent. More importantly, the second generation established the mature business model now common to search engines; the search services of Google, Overture, Baidu, and others all profit from it.

Explanation of related terms

A full-text search engine uses a robot program called a spider to automatically collect and discover information on the Internet according to some strategy. The indexer builds a web page index database from the collected information, and the retriever searches the index database according to the query conditions entered by the user and returns the results. The service it provides is full-text retrieval of web pages.

A directory index search engine collects information mainly by hand. After editors review the information, they write summaries manually and place the entries into a predetermined classification framework. Most of the information is organized at the level of whole websites, providing directory browsing and direct retrieval services. Users can find the information they need by browsing categories alone, without entering keywords.

A metasearch engine is a system that draws on the resource libraries of several search engines to provide an information service through a unified query interface and unified result presentation. A metasearch engine works by means of other search engines and has no index database of its own: it submits the user's query to several search engines simultaneously, removes duplicates from the returned results, re-sorts them, and returns them to the user as its own results.
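
As a sketch of such merging (reciprocal-rank fusion is just one possible re-ranking rule, chosen here for illustration, and the result lists are invented):

```python
# Illustrative metasearch merge: combine ranked result lists from several
# engines, drop duplicates, and re-rank by summed reciprocal rank.
from collections import defaultdict

def merge_results(result_lists):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
print(merge_results([engine_a, engine_b]))   # ['u2', 'u1', 'u4', 'u3']
```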

Automatic classification technology lets a computer automatically assign documents to particular categories of an existing category (or topic) system according to classification criteria. At present, automatic classification cannot completely replace the corresponding human work; it only provides a cheaper alternative.

Text clustering technology is a fully automatic process in which a computer groups a large existing collection of texts (documents). Clustering can give an overview of the content of a large text collection, reveal hidden similarities, and make it easy to browse similar or related texts.

Web article excerpting, also known as web excerpting, provides functions for collecting, classifying, excerpting, and tagging content pages, saving them to an information database, and sharing that database, mainly to satisfy users' needs for reading web content and accumulating information and knowledge.