What is big data? Is it a business model, a capability, a technology, or a collection of data? What is the difference between what we call "big data" today and "data" in the traditional sense? What are its characteristics, and where does it come from? In this article, the editor will walk you through the basics of big data.
>>>>> Big data concept
"Big data" refers to a data set with a particularly large amount of data and data categories, which cannot be captured, managed and processed by traditional database tools. "Big data" first refers to the amount of data? Large refers to a large data set, usually in 10TB? Regarding the scale, but in practical application, many enterprise users put multiple data sets together, which has formed a PB-level data volume; Secondly, it means that there are many kinds of data, which come from various data sources, and the types and formats of data are increasingly rich, which has broken through the previously defined category of structured data, including semi-structured and unstructured data. Secondly, the data processing speed (Velocity) is fast, and the data can be processed in real time under the condition of huge data. The last feature refers to the high authenticity of the data. With people's interest in new data sources such as social data, enterprise content, transaction and application data, the limitations of traditional data sources have been broken, and enterprises increasingly need effective information power to ensure their authenticity and security.
Baidu Zhidao - the concept of big data
Big data, or massive data, refers to information so voluminous that current mainstream software tools cannot, within a reasonable time, capture, manage, process, and organize it into a form that helps enterprises make more proactive business decisions. The 4 V's of big data: volume, velocity, variety, and veracity.
Internet Weekly-Big Data Concept
The concept of "big data" is far more than "a large amount of data (terabytes)" plus "the technology for processing it", nor is it merely the so-called "four V's". It covers what people can do on the basis of large-scale data that cannot be done at a small scale. In other words, big data lets us analyze massive data in unprecedented ways, obtain valuable products, services, or deep insights, and ultimately become a force for change.
Research firm Gartner - big data concept
"Big data" is a massive, high-growth and diversified information asset, which needs a new processing mode to have stronger decision-making, insight and discovery, and process optimization capabilities. In terms of data, "big data" refers to information that cannot be processed or analyzed by traditional processes or tools. It defines those data sets that are beyond the normal processing range and size, forcing users to adopt unconventional processing methods. Amazon Web Services (AWS) and big data scientist JohnRauser mentioned a simple definition: Big data is any massive data that exceeds the processing power of a computer. R&D team's definition of big data: "Big data is the biggest propaganda technology and the most fashionable technology. When this phenomenon occurs, the definition becomes very confusing. " Kelly said: "Big data may not contain all the information, but I think most of it is correct. Part of the view on big data is that it is so big that analyzing it requires multiple workloads, which is the definition of AWS. When your technology reaches the limit, it is also the limit of data. " Big data is not how to define it, but how to use it. The biggest challenge is which technologies can make better use of data and how to apply big data. Compared with traditional databases, the rise of open source big data analysis tools such as Hadoop and the value of these unstructured data services.
>>>>> Big data analysis
As everyone knows, the point of big data is not the data itself; what matters most is the analysis of it, since only through analysis can we extract intelligent, in-depth, and valuable information. More and more applications now involve big data, and its attributes of volume, velocity, and variety keep raising its complexity, so analysis methods are especially important in this field; they can be said to be the decisive factor in whether the final information is valuable. With that understanding, what are the commonly used methods and theories of big data analysis?
>>>>> Big data technology
Data collection: ETL tools extract data from distributed, heterogeneous sources, such as relational databases and flat files, into a temporary staging layer, where it is cleaned, transformed, and integrated; it is finally loaded into a data warehouse or data mart, becoming the basis for online analytical processing and data mining (a minimal sketch of this flow appears after this list).
Data access: relational databases, NoSQL, SQL, etc.
Infrastructure: cloud storage, distributed file storage, etc.
Data processing: NLP (Natural Language Processing) is the field that studies language problems in human-computer interaction. The key to natural language processing is making the computer "understand" natural language, so it is also called NLU (natural language understanding) or computational linguistics. It is at once a branch of language information processing and one of the core topics of artificial intelligence (AI).
Statistical analysis: hypothesis testing, significance testing, difference analysis, correlation analysis, t-tests, analysis of variance, chi-square analysis, partial correlation analysis, distance analysis, regression analysis (simple regression, multiple regression, stepwise regression, regression prediction with residual analysis, ridge regression, logistic regression, curve estimation), factor analysis, principal component analysis, and cluster analysis (including fast clustering and hierarchical clustering methods).
Data mining: classification, estimation, prediction, affinity grouping (association rules), clustering, description and visualization, and mining of complex data types (text, Web, graphics and images, video, audio, etc.).
Predictive modeling: predictive models, machine learning, modeling and simulation.
Presentation of results: cloud computing, tag clouds, relationship diagrams, etc.
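To make the ETL collection step above concrete, here is a minimal sketch in Python. It is only a toy under stated assumptions: the orders.csv input file, its order_id and amount columns, and the SQLite database standing in for a data warehouse are hypothetical stand-ins, not part of any particular product.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a flat file (CSV)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: clean and normalize rows, dropping malformed records."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or unparsable amount
        yield (row["order_id"].strip(), amount)

def load(records, db_path="warehouse.db"):
    """Load: write the cleaned records into a warehouse table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```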
>>>>> Big data features
To understand the concept of big data, we must first start from "big", which refers to the scale of the data. Big data generally means data volumes above 10 TB (1 TB = 1,024 GB). Big data differs from the "massive data" of the past, and its basic characteristics can be summarized by four V's (Volume, Variety, Value, Velocity): large volume, great variety, low value density, and high velocity.
First, the volume of data is huge, jumping from the TB level to the PB level.
Second, data types are numerous: web logs, videos, pictures, geographic information, and so on.
Third, value density is low. Take video as an example: during continuous monitoring, the useful data may amount to only one or two seconds.
Fourth, processing speed is fast: the "one-second rule." This last point fundamentally distinguishes big data from traditional data mining. The Internet of Things, cloud computing, the mobile Internet, connected cars, mobile phones, tablets, PCs, and the sensors scattered across the globe are all sources or carriers of data.
Big data technology refers to the technology of quickly obtaining valuable information from all kinds of massive data. The core of solving big data problems is big data technology. At present, "big data" not only refers to the scale of data itself, but also includes tools, platforms and data analysis systems for collecting data. The purpose of big data research and development is to develop big data technology and apply it to related fields, and promote its breakthrough development by solving huge data processing problems. Therefore, the challenge brought by the era of big data is not only how to deal with massive data to obtain valuable information, but also how to strengthen the research and development of big data technology and seize the forefront of the development of the times.
At present, China's big data R&D construction should focus on the following four aspects.
The first is to establish an operating mechanism. Big data construction is an orderly, dynamic, and sustainable systems-engineering effort; a sound operating mechanism is needed to keep every link of construction standardized and orderly, to achieve integration, and to get the top-level design right.
The second is to establish a set of construction standards. Without standards there is no system. Big data construction standards should be established by topic, cover all fields, and be kept dynamically updated, laying the foundation for network interconnection, information exchange, and resource sharing among information systems of all types and levels.
The third is to build a data-sharing platform. Data has vitality only when it keeps flowing and is fully shared. On the basis of the thematic databases, data exchange and data sharing among command information systems of all types and levels is achieved through data integration.
The fourth is to cultivate a professional team. Every link of big data construction needs professionals, so it is necessary to cultivate and bring up a professional team of big data builders who understand command, technology, and management.
>>>>> The role of big data
With the advent of the big data era, more and more people accept this judgment. So what does big data mean, and what will it change? Answering only from a technical point of view is not enough: big data is merely an object, and without people as the subject, no matter how big it is, it is meaningless. We need to place big data in the context of human life to understand why it is a transformative force of our times.
The power to change value
In the next decade, the core criterion of meaning for judging whether China possesses great wisdom will be national happiness. One aspect is people's livelihood: using big data to clarify what is meaningful, and to see whether we have done more meaningful things than before in relationships among people. The other is ecology: using big data to clarify what is meaningful, and to see whether we have done more meaningful things than before in the relationship between humanity and nature. In a word, it means moving from the confusion of the past ten years toward the clarity of the next ten.
The power to change the economy
Producers create value; consumers are where the meaning of value lies. Only what is meaningful is valuable: what consumers do not endorse cannot sell and cannot realize its value; only what consumers endorse can sell and realize its value. Big data helps us identify meaning from the consumer, the source, and thereby helps producers realize value. This is the principle behind stimulating domestic demand.
The power to change organizations
With the development of data infrastructure and data resources bearing the characteristics of the semantic web, organizational change becomes ever more inevitable. Big data will prompt network structures to produce the organizing power of the unorganized. The first embodiment of this structural feature is the variety of decentralized Web 2.0 applications, such as RSS, wikis, and blogs.
Big data has become the transformative force of the times because it gains wisdom by following meaning.
>>>>> Big data processing
Three major changes in thinking in the era of big data processing: use all the data rather than samples; accept efficiency over absolute accuracy; look for correlation rather than causation.
The process of big data processing
There are indeed many specific methods for processing big data, but based on long practice the author has distilled a generally applicable processing flow that should help everyone straighten out how big data is handled. The whole flow can be summarized in four steps: collection, import and preprocessing, statistics and analysis, and finally data mining.
Big data processing, step 1: collection
Big data collection means using multiple databases to receive data from clients (Web, app, sensors, and so on), against which users can run simple queries and processing. For example, e-commerce companies store each transaction in traditional relational databases such as MySQL and Oracle, and NoSQL databases such as Redis and MongoDB are also commonly used for collection.
The main characteristic and challenge of collection is high concurrency, since thousands of users may access and operate at the same time. Train-ticketing websites and Taobao, for instance, see peak concurrent visits in the millions, so a large number of databases must be deployed on the collection side to bear the load. How to balance load and shard data across these databases requires deep thought and careful design; a sketch of hash-based routing across shards follows.
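As a rough illustration of the sharding problem just described, the sketch below routes each write to one of several database instances with a stable hash. It assumes the redis-py client package and three illustrative local Redis ports; any key-value store would serve equally well.

```python
import zlib
import redis  # assumes the redis-py client package is installed

# A pool of collection-side database instances (illustrative local ports).
SHARDS = [
    redis.Redis(host="localhost", port=6379),
    redis.Redis(host="localhost", port=6380),
    redis.Redis(host="localhost", port=6381),
]

def shard_for(key: str) -> redis.Redis:
    """Route a key to a shard with a stable hash, so the mapping stays
    the same across processes and restarts."""
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

def record_event(user_id: str, event: str) -> None:
    """Append an event to the per-user list on that user's shard."""
    shard_for(user_id).rpush(f"events:{user_id}", event)

record_event("user42", "viewed:ticket_page")
```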
Big data processing, step 2: import/preprocessing
Although the collection side itself has many databases, to analyze these massive data effectively they must be imported from the front end into a centralized large-scale distributed database or distributed storage cluster, and some simple cleaning and preprocessing can be done during the import. Some users also apply Twitter's Storm to stream-process the data as it is imported, to meet the real-time computing needs of certain businesses.
The main characteristic and challenge of the import and preprocessing step is the sheer volume imported, which often reaches hundreds of megabytes, or even gigabytes, per second; a toy cleaning step is sketched below.
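A minimal sketch of the kind of light cleaning done at import time, assuming newline-delimited JSON records arriving from the collection tier; the field names are invented for illustration.

```python
import json

def preprocess(lines):
    """Parse, validate, and normalize raw records before bulk import.
    Malformed lines are counted and skipped rather than halting the job."""
    skipped = 0
    for line in lines:
        try:
            rec = json.loads(line)
            yield {
                "user_id": str(rec["user_id"]),
                "ts": int(rec["ts"]),              # normalize the timestamp
                "action": rec.get("action", "unknown"),
            }
        except (ValueError, KeyError, TypeError):
            skipped += 1
    print(f"skipped {skipped} malformed records")

# Example: clean a tiny batch before handing it to the bulk loader.
raw = ['{"user_id": 1, "ts": "1700000000", "action": "click"}', "not json"]
for record in preprocess(raw):
    print(record)
```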
Big data processing, step 3: statistics/analysis
Statistics and analysis mainly use distributed databases or distributed computing clusters to classify, summarize, and analyze the massive data stored in them, meeting the most common analysis needs. For real-time needs, options include EMC's Greenplum, Oracle's Exadata, and the MySQL-based Infobright; batch processing and semi-structured data needs can be served by Hadoop.
The main characteristic and challenge of the statistics and analysis step is that the analysis touches a huge volume of data, which occupies a great deal of system resources, especially I/O; a minimal group-by example follows.
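The common analysis needs described here often reduce to group-by aggregation. Below is a minimal pure-Python version over a few in-memory records; distributed engines such as those named above parallelize the same idea across machines. The transaction records are invented for illustration.

```python
from collections import defaultdict

# Toy transaction records standing in for rows in a distributed store.
transactions = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 75.5},
    {"region": "north", "amount": 43.2},
]

# Group-by aggregation: total, count, and mean amount per region.
totals = defaultdict(float)
counts = defaultdict(int)
for t in transactions:
    totals[t["region"]] += t["amount"]
    counts[t["region"]] += 1

for region in sorted(totals):
    mean = totals[region] / counts[region]
    print(f"{region}: total={totals[region]:.2f} mean={mean:.2f}")
```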
Big data processing, step 4: mining
Unlike the preceding statistics and analysis step, data mining generally has no preset theme; it runs various algorithms over the existing data to achieve predictive effects and meet higher-level analysis needs. Typical algorithms include K-means for clustering, SVM for statistical learning, and Naive Bayes for classification, and a commonly used tool is Mahout, which runs on Hadoop. The characteristic and challenge of this step is that the mining algorithms are complex and involve large volumes of data and computation, while commonly used data mining algorithms are mostly single-threaded.
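Of the algorithms named above, K-means is the simplest to show end to end. Below is a minimal single-threaded sketch on one-dimensional points, purely for illustration; a production system would use a library such as Mahout, as the text notes.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal K-means on 1-D points: repeatedly assign each point to the
    nearest centroid, then move each centroid to the mean of its cluster."""
    centroids = random.sample(points, k)  # pick k distinct starting points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster went empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8]
centroids, clusters = kmeans(data, k=3)
print("centroids:", centroids)
```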
In general, a relatively complete big data processing flow must include at least these four steps.
>>>>> Big data applications and case analysis
The key precondition of big data applications is the integration of "IT" and "operations". The connotation of operations here can of course be very broad, from running a retail store to running a city. Below are application cases of big data in different industries and organizations that I have compiled. To be clear, the following cases all come from the Internet; this article is for reference only, and on that basis I have simply sorted them into categories.
Big data application cases: the medical industry
[1] Seton Healthcare was the first customer to adopt IBM's latest Watson technology for analyzing and predicting healthcare content. The technology allows the enterprise to uncover large amounts of clinical medical information related to patients and to analyze patient data more effectively through big data processing.
[2] In a hospital in Toronto, Canada, premature babies generate more than 3,000 data readings per second. By analyzing these data, the hospital can know in advance which premature babies are developing problems and take targeted measures to prevent deaths.
[3] It also makes it easier for entrepreneurs to develop products, such as health apps that collect data through social networks. Perhaps in a few years the data they collect will make your diagnoses more accurate. For example, rather than the standard adult dosage of one tablet three times a day, a system could detect that the drug in your blood has been fully metabolized and automatically remind you to take the next dose.
Big data application cases: the energy industry
[1] Smart grid: Europe has now deployed the terminal side, the so-called smart meters. In Germany, to encourage solar power, households that install solar panels can not only buy electricity from the utility but also sell their surplus solar electricity back to the grid. The grid collects readings every five or ten minutes, and the collected data can be used to predict customers' consumption habits and infer how much electricity the whole grid will need in the next two to three months. With this forecast, a fixed amount of electricity can be purchased in advance from generators or suppliers; since electricity is a bit like a futures market, buying ahead is cheaper than buying on the spot, so forecasting cuts procurement costs.
[2] Vestas Wind Systems relies on BigInsights software and IBM supercomputers to analyze meteorological data and find the best locations for installing wind turbines and entire wind farms. With big data, analyses that used to take several weeks can now be completed in less than an hour.
Big data application cases: the communications industry
[1] XO Communications reduced customer churn by nearly half using IBM SPSS predictive analytics software. XO can now predict customer behavior, spot behavioral trends, and identify defective links, helping the company take timely steps to retain customers. In addition, IBM's new Netezza network analytics accelerator will help carriers make more scientific and reasonable decisions by providing a scalable platform with a single end-to-end view of network, service, and customer analytics.
[2] Telecom operators can analyze a variety of user behaviors and trends from tens of millions of customer records and sell the results to enterprises that need them. This is a brand-new information economy.
[3] China Mobile uses big data analysis to carry out targeted monitoring, early warning, and tracking across its entire business. The system automatically captures market changes the moment they occur and pushes them to the designated person in charge by the fastest means, so that he or she learns the market situation in the shortest time.
[4] NTT DoCoMo combines the location information of mobile phones with the information on the Internet to provide customers with information about nearby restaurants, and provide the last bus information service when the last bus time approaches.
Big data application cases: the retail industry
[1] "One of our customers is a leading professional fashion retailer, who provides services to customers through local department stores, the Internet and its mail-order catalog business. The company hopes to provide differentiated services for customers. How to position the company's differentiation? By collecting social information from Twitter and Facebook, they have a deeper understanding of the marketing model of cosmetics. Then, they realize that they must retain two types of valuable customers: high consumers and high influencers. I hope that by accepting free makeup services, users can conduct word-of-mouth publicity, which is a perfect combination of transaction data and interactive data, providing a solution to business challenges. " Informatica's technology helps retailers use data on social platforms to enrich customer master data and make their business services more targeted.
[2] Retail enterprises also monitor how customers walk through the store and interact with merchandise, combine these data with transaction records, and analyze them to advise on which goods to stock, how to place them, and when to adjust prices. This approach has helped a leading retailer cut inventory by 17% while increasing the share of high-margin private-label goods and maintaining its market share.