Human society has entered the era of big data. Traditional information storage and communication media are steadily being replaced by computers, data volumes are growing exponentially, and data has become one of the most important economic resources of the 21st century. Commercial banks hold vast amounts of real transaction data; closely combining internal and external information as well as structured and unstructured data, identifying information more accurately, mining it effectively, and converting data value into economic value has therefore become an important way for them to strengthen their core competitiveness. The rapid development of web crawler technology offers commercial banks a brand-new strategy for acquiring information accurately and integrating it into applications effectively.
Overview of Web Crawler Technology
Web crawler is a loose translation of terms such as spider, robot, and crawler; it is an efficient information-gathering tool. Built on search engine technology, it searches, crawls, and saves information from any HTML (Hypertext Markup Language) standard web page on the Internet. Its mechanism is to send a request to a specific Internet site, interact with the site after establishing a connection, obtain the information in HTML format, then move on to the next site and repeat the process. Through this automatic working mechanism, the target data is saved locally for later use. When a web crawler accesses a hyperlink, it can automatically extract the addresses of other web pages from the HTML tags, so it can achieve efficient, standardized information acquisition automatically.
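The fetch-parse-follow cycle described above can be illustrated with a minimal sketch. It assumes the widely used Python libraries requests and BeautifulSoup (bs4); the seed URL and output file name are arbitrary placeholders, not part of any particular bank system.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_page(url, timeout=10):
    """Request a page and return its HTML text, or None on failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None

def extract_links(base_url, html):
    """Pull hyperlink targets out of the page's <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

# One iteration of the crawl cycle: fetch, save locally, discover new addresses.
seed = "https://example.com"                 # placeholder seed site
html = fetch_page(seed)
if html:
    with open("page_0.html", "w", encoding="utf-8") as f:
        f.write(html)                        # save the target data locally
    next_urls = extract_links(seed, html)    # addresses pointing to other pages
```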
With the increasingly extensive application of Internet in human economy and society, the scale of information it covers is growing exponentially, and the form and distribution of information are diversified and globalized. The traditional search engine technology can no longer meet the increasingly refined and specialized information acquisition and processing needs, and it is facing great challenges. Since its birth, web crawler has developed rapidly and become the main research hotspot in the field of information technology. At present, the mainstream web crawler search strategies are as follows.
Depth-first search strategy
Early crawlers gave priority to depth: within an HTML file, the crawler selects one hyperlink tag and searches downward along it until the hyperlink chain reaches its lowest level; it then determines, through a logical test, that the search at that level is finished, exits that loop, and returns to the loop one level up to search the remaining hyperlink tags, until every hyperlink in the initial file has been traversed. The advantage of the depth-first strategy is that it can reach all the information of a website, especially deeply nested document sets. The disadvantage is that as data structures grow more complex, a site's vertical levels can increase without bound and cross-references between levels create infinite loops; the traversal can then only be ended by forcibly terminating the program, and the heavy repetition and redundancy make the quality of the obtained information hard to guarantee.
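The depth-first traversal can be sketched as a recursive routine. The visited set and depth cap shown here are the usual safeguards against the cross-references and infinite loops mentioned above; fetch_page and extract_links are assumed helpers like those in the previous sketch, and the depth limit is an arbitrary illustrative value.

```python
def crawl_depth_first(url, visited=None, depth=0, max_depth=5):
    """Follow one hyperlink chain as deep as possible before backtracking."""
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:   # depth cap guards against endless descent
        return visited
    visited.add(url)
    html = fetch_page(url)                    # assumed helper from the earlier sketch
    if html is None:
        return visited
    for link in extract_links(url, html):
        # Recurse into each hyperlink before moving on to the next sibling link.
        crawl_depth_first(link, visited, depth + 1, max_depth)
    return visited
```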
Width-first search strategy
The counterpart of the depth-first strategy is the width-first (breadth-first) search strategy. Its mechanism is a top-down cycle: first search all the hyperlinks in the first-level page, and only after that traversal is complete start the search cycle for the second-level pages. When all the hyperlinks of a given layer have been processed, a new round of retrieval begins from the next-layer hyperlinks collected during that layer's search (using them as seeds), so shallow links always take precedence. One advantage of this model is that no matter how complicated the vertical structure of the search object is, infinite loops are largely avoided; another is that there are well-defined algorithms for finding the shortest path between two HTML files. Since most of the functionality expected of a crawler can be realized straightforwardly with the width-first strategy, it is often considered the best choice. Its disadvantage is that it is time-consuming, so it is not well suited to exhaustively traversing a specific site or deeply nested HTML files.
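The width-first cycle is naturally expressed with a first-in, first-out queue: all links of the current layer are drained before the next layer's links (the new seeds) are processed. The helpers and the page limit below are assumptions carried over from the earlier sketches.

```python
from collections import deque

def crawl_breadth_first(seed, max_pages=100):
    """Visit pages layer by layer: shallow links always take precedence."""
    queue = deque([seed])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()              # take the oldest (shallowest) link first
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)             # assumed helper from the earlier sketch
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in visited:
                queue.append(link)         # next layer's links wait their turn
    return visited
```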
Focused search strategy
Different from the depth-first and width-first strategies, the focused search strategy accesses data sources according to a "matching priority" principle: based on a specific matching algorithm, it actively selects documents related to the topic of interest and uses the resulting priorities to guide subsequent crawling. A focused crawler assigns a priority score to the hyperlinks on every page it visits and inserts the links into its crawl queue according to that score. This strategy lets the crawler keep following the pages with the highest potential relevance until it has obtained target information of sufficient quantity and quality. Clearly, the core of the focused search strategy is the design of the priority scoring model, that is, how the value of a link is judged; different scoring models give different scores to the same link, which directly affects the efficiency and quality of information collection. By the same mechanism, the scoring model for hyperlink tags naturally extends to the evaluation of whole HTML pages, since every page is composed of many hyperlink tags. In general, the more valuable a link, the more valuable its page, which provides the theoretical and technical support for the specialization and wide application of search engines. At present, common focused search strategies include "reinforcement learning" and "context graph" approaches.
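A focused crawler replaces the FIFO queue with a priority queue ordered by its scoring model. The keyword-overlap score below is only a placeholder for whatever matching algorithm is actually adopted (reinforcement learning, context graphs, or otherwise), and extract_links_with_anchors is an assumed helper that also returns anchor text.

```python
import heapq

def score_link(url, anchor_text, topic_keywords):
    """Toy priority score: count how many topic keywords appear in the URL and anchor text."""
    text = (url + " " + anchor_text).lower()
    return sum(kw in text for kw in topic_keywords)

def crawl_focused(seed, topic_keywords, max_pages=100):
    """Always expand the highest-scoring link first ('matching priority')."""
    frontier = [(-1, seed)]                  # heapq is a min-heap, so scores are negated
    visited = set()
    while frontier and len(visited) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)               # assumed helper from the earlier sketch
        if html is None:
            continue
        for link, anchor in extract_links_with_anchors(url, html):  # assumed helper
            s = score_link(link, anchor, topic_keywords)
            if s > 0 and link not in visited:
                heapq.heappush(frontier, (-s, link))
    return visited
```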
From an application point of view, the mainstream search platforms in China currently adopt the width-first strategy, mainly because in the domestic network environment the vertical value density of information is low while the horizontal value density is high. This, however, misses network documents with low citation rates: the horizontal enrichment effect of the width-first strategy causes information sources with few inbound links to be ignored indefinitely. Supplementing it with a linear search strategy alleviates the problem by continuously introducing newly updated information into the existing data warehouse and deciding, through multiple rounds of value judgment, whether to keep it, rather than simply discarding it and shutting new information out of the cycle.
Development trend of web crawler technology
In recent years, web crawler technology has developed continuously and its search strategies have been steadily optimized. Its future development mainly presents the following trends.
Dynamic web page data
Traditional web crawler technology was largely limited to capturing static page information, and the model was relatively simple. In recent years, as Web 2.0 and AJAX techniques have become mainstream, dynamic pages have replaced static pages as the primary vehicle of online information because of their strong interactivity. AJAX uses a JavaScript-driven asynchronous request and response mechanism to update data continuously without refreshing the whole page. Traditional crawlers, however, lack a JavaScript interface and the ability to interact with it, so they struggle to trigger the asynchronous calls that update the page without refreshing it, cannot analyze the returned content, and therefore cannot save the required information.
In addition, front-end frameworks that encapsulate JavaScript, such as jQuery, make many adjustments to the DOM structure. Even the main dynamic content of a page is often not sent from the server as static tags when the request is established, but is drawn dynamically by asynchronous calls in response to user actions. On the one hand, this model greatly improves the user experience and reduces the interaction burden on the server; on the other hand, it is a serious challenge for crawlers accustomed to a relatively fixed static DOM structure. Traditional crawlers are mainly "protocol driven", but in the Web 2.0 era, in a dynamic interactive environment built on AJAX, the crawler engine must be "event driven" to obtain continuous data feedback from the server. To become event driven, a crawler must solve three technical problems: first, interactive analysis and interpretation of JavaScript; second, handling, interpretation, and dispatch of DOM events; third, semantic extraction of dynamic DOM content.
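One common way to obtain JavaScript interpretation and DOM event handling is to drive a headless browser instead of issuing raw HTTP requests. The sketch below uses the Playwright library as one possible choice; the target URL, CSS selector, and wait condition are illustrative assumptions, not a prescribed interface.

```python
# A minimal "event-driven" collection sketch using a headless browser
# (Playwright here; Selenium or similar tools work along the same lines).
from playwright.sync_api import sync_playwright

def fetch_dynamic_page(url, selector="#content"):
    """Render the page, let asynchronous calls complete, then read the final DOM."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(selector)     # wait for asynchronously drawn content
        html = page.content()                # DOM after JavaScript has run
        browser.close()
        return html
```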
Distributed data collection
A distributed crawler system is a crawler system running on a computer cluster. The crawler program on each node of the cluster works on the same principle as a centralized crawler; the difference is that a distributed system must coordinate task division, resource allocation, and information integration across the different computers. A master node is deployed in the system, and the local centralized crawling work on each node is invoked through it. Information interaction between nodes is therefore critical, and the key to a successful distributed crawler system is whether task coordination can be designed and realized; the underlying hardware communication network is also very important. Because multiple nodes crawl pages in parallel and resources can be allocated dynamically, the search efficiency of a distributed crawler system is far higher than that of a centralized one.
After continuous evolution, distributed crawler systems have developed their own characteristics in system composition, and their working mechanisms and storage structures keep innovating. Mainstream distributed crawler systems generally adopt a "master-slave" internal structure, in which a master node controls the other slave nodes through task division, resource allocation, and information integration. In their working mode, distributed crawlers make wide use of cloud computing, taking advantage of its low cost and efficiency to reduce the investment required to build large-scale software and hardware platforms. For storage, distributed file storage is currently popular: files are kept in a distributed network system, which makes it easier to manage data across multiple nodes. The most commonly used distributed file system is HDFS, based on Hadoop.
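The master-slave division of labour can be sketched with a shared task queue: the master pushes seed URLs, each slave node claims a task, crawls it, and pushes newly discovered links back. Redis is used here purely as one convenient shared queue for illustration; the host name and key names are arbitrary assumptions, and fetch_page and extract_links are the helpers assumed earlier.

```python
import redis

r = redis.Redis(host="master-node", port=6379)   # assumed address of the shared queue

def master_seed(urls):
    """Master node: distribute the initial task division."""
    for url in urls:
        r.lpush("crawl:todo", url)

def slave_worker():
    """Slave node: repeatedly claim a task, crawl it, and report new links."""
    while True:
        task = r.brpop("crawl:todo", timeout=30)
        if task is None:
            break                                  # no work left, exit
        url = task[1].decode()
        if r.sadd("crawl:seen", url) == 0:
            continue                               # another node already took this URL
        html = fetch_page(url)                     # assumed helper from earlier sketches
        if html is None:
            continue
        r.hset("crawl:pages", url, html)           # integrate results centrally
        for link in extract_links(url, html):
            r.lpush("crawl:todo", link)
```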
Application of Web Crawler Technology in Commercial Banks
For commercial banks, applying web crawler technology helps achieve four kinds of "best understanding": becoming the bank that knows itself best, the bank that knows its customers best, the bank that knows its competitors best, and the bank that knows its operating environment best. The specific application scenarios are as follows.
Network public opinion monitoring
Internet public opinion is one of the manifestations of mainstream public opinion in today's society: it collects and displays the public's views and comments on social focal points and hot issues and spreads them through the Internet. For commercial banks, monitoring online public opinion is an important technical means of brand management and crisis public relations, using the network as a "mirror" to build the "bank that knows itself best".
As one of the mainstream information media in today's society, online public opinion spreads quickly and has great influence, so commercial banks need to establish automated systems for monitoring it. On the one hand, such a system lets a bank obtain more accurate information about social demand; on the other, it lets the bank spread its service concepts and features on the new opinion platforms and improve its business development. Because the web crawler plays an irreplaceable role in monitoring online public opinion, its quality of work largely determines the breadth and depth of opinion collection. By the type of object collected, web crawlers can be divided into general web crawlers and topic web crawlers. The general web crawler aims for larger data scale and wider coverage, regardless of the order in which pages are collected or how well they match a target theme. Against the background of exponential growth in online information, general web crawlers are limited in collection speed, information value density, and specialization. To alleviate this, the topic web crawler came into being. Unlike the general crawler, the topic web crawler pays more attention to how well page information matches the target topic and avoids irrelevant, redundant information. This screening is dynamic and runs through the entire workflow of the topic web crawler.
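The dynamic screening a topic web crawler applies can be illustrated by a simple relevance filter run on every fetched page before it is stored. The keyword list and threshold below are placeholders for a real topic-matching model, not values taken from any actual monitoring system.

```python
import re

TOPIC_KEYWORDS = ["bank", "service", "complaint", "fee", "branch"]  # illustrative terms only

def page_relevance(text, keywords=TOPIC_KEYWORDS):
    """Fraction of topic keywords that appear in the page text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(kw in words for kw in keywords) / len(keywords)

def keep_page(text, threshold=0.4):
    """Store only pages whose topic match exceeds the threshold."""
    if page_relevance(text) >= threshold:
        return True          # pass on to storage and opinion analysis
    return False             # drop irrelevant, redundant information
```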
Using crawler technology to monitor online public opinion helps a bank understand customers' attitudes toward and comments about it more comprehensively and deeply, gain insight into the strengths and weaknesses of its own operations, and at the same time guard against reputation risk and enhance its brand.
Panoramic portrait of customers
As competition among commercial banks grows increasingly fierce, profit margins are further compressed and the demands on customer marketing and risk control keep rising. In current bank management, marketing and risk processes, especially the identification and management of potential customers and post-loan risks, often require large amounts of manpower, material resources, and time. By introducing web crawler technology, a bank can effectively build panoramic customer portraits and become the "bank that knows its customers best"; this is a useful supplement to traditional customer relationship management and off-site risk control techniques and will greatly promote customer marketing and risk management.
A web crawler can be used to build a full-dimensional information view of bank customers: it takes simple identity information for personal customers, or the web address of a corporate customer, as input and, after crawling, outputs customer information in a preset format. Bank data staff take specific basic data as the raw material, enter keywords into the crawler system, combine them with websites related to customer information, package them into crawler seeds, and pass them to the crawler program; the crawler then starts the corresponding workflow, captures the web pages carrying customer-related information, and saves them. In addition, extending network public opinion monitoring from the bank itself to its customers allows the bank to learn promptly how the public evaluates those customers and to track their public opinion dynamics in time to guide business decisions.
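The "keywords in, seeds out" step described above can be sketched as a small function that packages customer identifiers into crawler seeds by combining them with a configurable list of source sites. Both the example source templates and the customer names below are illustrative assumptions, not real data sources or rules.

```python
from urllib.parse import quote

# Illustrative sources only; a real deployment would use whatever sites are relevant and permitted.
SOURCE_TEMPLATES = [
    "https://news.example.com/search?q={kw}",
    "https://court.example.com/search?party={kw}",
]

def build_customer_seeds(customer_keywords):
    """Package customer identity keywords into crawler seed URLs."""
    seeds = []
    for kw in customer_keywords:
        for template in SOURCE_TEMPLATES:
            seeds.append(template.format(kw=quote(kw)))
    return seeds

# Example: seeds for one corporate customer, passed on to the crawler program.
seeds = build_customer_seeds(["Example Manufacturing Co.", "example-manufacturing.com"])
```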
By using such a crawler system to collect, monitor, and update customer-related information in real time, a bank can not only understand customers' current situation more comprehensively, but also anticipate potential marketing opportunities and credit risks, effectively improve the efficiency of customer marketing and post-loan risk management, enhance its overall returns, and create a win-win situation for bank and customer.
Competitor analysis
At present, with the advent of interest rate liberalization and the impact of internet finance, competition among commercial banks is increasingly fierce, and new market participants and new products keep emerging, further intensifying business competition. In this context, it is more and more important for commercial banks to fully understand competitors' dynamics, become the "bank that knows its competitors best", adjust in time, and seize opportunities.
By building a whole-network information analysis and display platform based on web crawler technology, a bank can capture real-time data across the network, obtain other banks' product information and news in time, and learn competitors' situations at the first opportunity, which facilitates integration and analysis with the bank's own internal data. The crawler builds a dynamic data platform by collecting data in real time, capturing network data and storing it locally for later in-depth mining, analysis, and application. Web crawler technology not only makes it easier for bank decision makers to formulate precise policies to support operations, but also extends public opinion monitoring from the bank itself and its customers to its competitors, so that the competitive situation and competitors' strengths and weaknesses can be grasped in real time, achieving "knowing ourselves and knowing our rivals" and true information symmetry.
Industry vertical search
Vertical search narrows the search scope to a specific professional field and performs deeper integration of the initially retrieved web information, finally producing purer information for that field. With this method, bank data staff can greatly improve the efficiency of obtaining useful information. By crawling and analyzing financial topics, a commercial bank can fully understand regulatory policy trends, follow the development of regional and industrial economies, grasp the dynamics of the financial industry's own operating environment, check and adjust its strategies in time, keep up with the market, and become the "bank that knows its operating environment best".
Applying vertical search in the financial field can improve the information processing ability of financial institutions. Its greatest strength is that it performs targeted, specialized subdivision of data that is diverse in form and huge in scale, reduces junk information, gathers effective information, improves search efficiency, can even provide real-time data under certain conditions, and integrates the large volume of complex web data already available, giving users a more convenient, complete, and efficient retrieval service.
Conclusion
With the development of Internet technology and the explosion of data, web crawler technology provides a new technical path for commercial banks' data collection and information integration. Judging from banks' application practice, web crawlers have great development potential in daily operation and management. Applying web crawler technology can help banks transform into "smart banks" that know themselves, their customers, their competitors, and their operating environment best. It can be predicted that web crawler technology will become an important technical means for commercial banks to improve their refined management and intelligent decision-making.