It can be said that big data is one of the hottest trends in the IT industry, and it has spawned a batch of new technologies. New technology brings new buzzwords: acronyms, technical terms and product names. Even the term "big data" itself is confusing: many people hear "big data" and assume it simply means "a lot of data," but big data involves far more than just the amount of data.
Here are some popular terms we think you should be familiar with, in alphabetical order.
ACID
ACID stands for atomicity, consistency, isolation and durability, a set of requirements or properties: if all four are observed, the integrity of a database transaction is guaranteed during processing. Although ACID has been around for some time, the rapid growth of transaction data has brought renewed attention to meeting ACID requirements when dealing with big data.
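To make this concrete, here is a minimal sketch of atomicity and consistency using Python's standard sqlite3 module (the table, account names and balances are invented for illustration): a transfer either applies both balance updates or, if a consistency check fails, rolls the whole transaction back.

```python
import sqlite3

# Open an in-memory database and create a simple accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # the 'with' block commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency check: no account may go negative.
            (low,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
            if low < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)   # succeeds and commits
transfer(conn, "bob", "alice", 999)  # fails; rolled back atomically
```

The second transfer leaves no trace: both of its updates are undone together, which is exactly the atomicity guarantee.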
The three Vs of big data
Today's IT systems are generating data that is "big" in volume, velocity and variety.
Volume: IDC predicts that the total amount of information in the world will reach 2.7 zettabytes (equivalent to 2.7 billion terabytes) this year, doubling every two years.
Velocity: It is not just the amount of data that worries IT managers, but also the ever-increasing speed at which data arrives from financial systems, retail systems, websites, sensors, radio-frequency identification (RFID) chips, and social networks such as Facebook and Twitter.
Variety: Five or maybe ten years ago, IT staff mainly dealt with alphanumeric data that could easily be stored in the rows and columns of a relational database. That is no longer the case. Today, unstructured data such as Twitter and Facebook posts, documents of all kinds and web content are all part of the big data mix.
Columnar database
Some new-generation databases (such as the open source Cassandra and HP's Vertica) are designed to store data by column rather than by row, as traditional SQL databases do. This design provides faster disk access and improves performance when dealing with big data. Columnar databases are especially popular for data-intensive business analytics applications.
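A toy sketch of the idea in Python (the records are invented): storing values column by column groups all values of one field together, so an aggregate over a single field only touches that field's data rather than whole rows.

```python
# Row store: a list of complete records.
rows = [
    {"date": "2012-01-01", "store": "NYC", "sales": 120},
    {"date": "2012-01-01", "store": "SF",  "sales": 90},
    {"date": "2012-01-02", "store": "NYC", "sales": 150},
]

# Column store: one contiguous list per field.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate over one column scans only that column's values.
total_sales = sum(columns["sales"])
```

On disk, this layout also compresses better, since values in one column tend to be similar.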
Data warehouse
The concept of the data warehouse has been around for about 25 years: data from multiple operational IT systems is copied into a secondary, offline database for business analytics applications.
However, with the rapid growth of data volumes, data warehouse systems are changing rapidly. They need to store more data, and more kinds of data, so data warehouse management has become a major challenge. Ten or twenty years ago, data might have been copied into the warehouse weekly or monthly; today's data warehouses are refreshed much more frequently, some even in real time.
ETL (extract, transform, load)
ETL software is needed when moving data from one database (such as the one behind a bank's transaction-processing system) to another (such as a data warehouse used for business analytics). Data moving between databases usually needs to be reformatted and cleaned along the way.
As data volumes grow rapidly, data must be processed much faster, which greatly raises the performance demands on ETL tools.
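A minimal ETL sketch in Python, using the standard sqlite3 module as the destination (the records, field layout and table name are invented): extract raw rows, transform them by cleaning and normalising, then load the survivors.

```python
import sqlite3

# Extract: raw records as they might arrive from an operational system.
raw = [
    ("  Alice ", "2012-03-01", "120.50"),
    ("BOB", "2012/03/02", "80"),
    ("alice", "2012-03-02", None),   # dirty record: missing amount
]

def transform(record):
    """Clean one record: normalise names and dates, parse amounts."""
    name, date, amount = record
    if amount is None:
        return None                  # drop records we cannot repair
    return (name.strip().lower(), date.replace("/", "-"), float(amount))

def load(conn, records):
    """Write the cleaned records into the destination table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, date TEXT, amount REAL)")
clean = [t for r in raw if (t := transform(r)) is not None]
load(conn, clean)
```

Real ETL tools add scheduling, error handling and throughput optimisation around exactly this three-step shape.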
Flume
Flume is a technology in the Apache Hadoop family (which also includes HBase, Hive, Oozie, Pig and Whirr), a framework for populating Hadoop with data. It uses software agents scattered across application servers, web servers, mobile devices and other systems to collect data and deliver it to a Hadoop system.
For example, companies can use Apache Flume running on a Web server to collect data from Twitter posts for analysis.
Geospatial analysis
One trend driving big data is the growing amount of geospatial data generated and collected by today's IT systems. As the saying goes, a picture is worth a thousand words, so it is no surprise that the growing volume of maps, charts, photographs and other location-based content is a major driver of today's explosive big data growth.
Geospatial analysis is a special form of data visualization (see "Visualization" below), which overlays data on a geographic map to help users understand the results of big data analysis more clearly.
Hadoop
Hadoop is an open source platform for developing distributed, data-intensive applications, overseen by the Apache Software Foundation.
Hadoop was invented by Doug Cutting, who developed it while working at Yahoo. He based Hadoop on the MapReduce concept from Google's labs and named it after his son's toy elephant.
In addition, HBase is a non-relational database, which was developed as a part of Hadoop project. Hadoop Distributed File System (HDFS) is a key component of Hadoop. Hive is a data warehouse system based on Hadoop.
In-memory database
When a computer processes a transaction or runs a query, it usually fetches data from a disk drive; when IT systems deal with big data, that process may be too slow.
An in-memory database system uses a computer's main memory to store frequently used data, greatly shortening processing times. In-memory database products include SAP HANA and Oracle TimesTen.
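As a tiny illustration of the idea, Python's standard sqlite3 module can hold a database entirely in RAM, so reads and writes never touch the disk (the table and data here are invented):

```python
import sqlite3

# ":memory:" asks SQLite for a database that lives entirely in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE hot_data (k TEXT PRIMARY KEY, v INTEGER)")
mem.executemany("INSERT INTO hot_data VALUES (?, ?)", [("a", 1), ("b", 2)])

# Lookups are served from memory, with no disk I/O on the query path.
(value,) = mem.execute("SELECT v FROM hot_data WHERE k = 'b'").fetchone()
```

Products like HANA and TimesTen apply the same principle at enterprise scale, with persistence and replication layered on top.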
Java
Java is a programming language developed by Sun Microsystems, now a subsidiary of Oracle, and released in 1995. Hadoop and many other big data technologies were developed in Java, and it remains a dominant development language in the big data field.
Kafka
Kafka is a high-throughput distributed messaging system originally developed at LinkedIn to manage the service's activity stream (data about website usage) and operational data pipeline (data about the performance of server components).
Kafka is very effective at handling large volumes of streaming data, a key problem in many big data computing environments. Storm, developed at Twitter, is another popular stream-processing technology.
Kafka is now an Apache Software Foundation open source project; don't let the literary namesake make you think it is unfinished software.
Latency
Latency refers to the delay of data transmission from one point to another, or the delay of one system (such as an application) responding to another system.
Although latency is not a new term, you will hear it more often now that data volumes keep growing and IT systems struggle to keep pace. Simply put, low latency is good and high latency is bad.
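Latency is straightforward to measure in code. A small Python sketch (the slow_lookup function is an invented stand-in for a slow remote call):

```python
import time

def measure_latency(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# A stand-in for a slow operation, such as a remote service call.
def slow_lookup(key):
    time.sleep(0.01)  # simulate ~10 ms of network/disk delay
    return key.upper()

result, latency = measure_latency(slow_lookup, "abc")
```

In practice you would record many such samples and look at percentiles, since worst-case latency often matters more than the average.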
Map/Reduce
Map/Reduce is a method of breaking a complex problem into smaller parts, distributing them across multiple computers, and finally recombining the results into an answer.
Google's search system uses the map/reduce concept, and the company has a framework called MapReduce.
A white paper published by Google in 2004 described its use of map/reduce. Doug Cutting, the father of Hadoop, fully recognised its potential and developed the first version of Hadoop, which also borrowed the map/reduce concept.
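The classic illustration is counting words. Here is a minimal single-machine sketch of the three phases in Python; in a real cluster, each map call would run on a different node and the framework would perform the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

Because each map call is independent and each reduce works on one key's values, both phases parallelise naturally, which is the whole point of the model.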
NoSQL database
Most mainstream databases (such as Oracle Database and Microsoft SQL Server) are based on a relational architecture and use Structured Query Language (SQL) for development and data management.
But a new generation of database systems called "NoSQL" (which some now read as "not only SQL") is based on architectures that proponents consider better suited to big data.
Some NoSQL databases are designed for scalability and flexibility, while others excel at handling documents and other unstructured data. Typical NoSQL databases include Hadoop/HBase, Cassandra, MongoDB and CouchDB, and well-known vendors such as Oracle have also launched their own NoSQL products.
Oozie
Apache Oozie is an open source workflow engine that helps manage Hadoop-oriented processing. With Oozie, you can define a series of jobs in multiple languages (such as Pig and MapReduce) and then link them to one another. For example, once a job that collects data from operational applications has finished, a programmer can launch the data analysis and query jobs.
Pig
Pig is another Apache Software Foundation project, a platform for analysing huge data sets. Essentially, Pig is a programming language for developing parallel queries that run on Hadoop.
Quantitative data analysis
Quantitative data analysis refers to the use of complex mathematical or statistical models to explain financial and commercial behaviors and even predict future behaviors.
The rapid growth of the data being collected today makes quantitative analysis more complicated; but if companies learn how to use that mass of data to gain better visibility, understand their business more deeply and gain insight into market trends, more data can mean more analytical opportunity.
One problem is a serious shortage of people with these analytical skills. McKinsey, the well-known consulting firm, has said that the United States alone needs 1.5 million analysts and managers with big data analysis skills.
relational database
Relational database management systems (RDBMSs) are the most widely used databases today; they include IBM's DB2, Microsoft's SQL Server and Oracle Database. Most enterprise transaction-processing systems run on an RDBMS, from banking applications and retail point-of-sale systems to inventory management software.
However, some argue that relational databases may not keep up with the explosive growth in data volume and variety. For example, RDBMSs were originally designed to handle alphanumeric data and are not as effective with unstructured data.
Sharding
As a database grows, it becomes harder and harder to handle. Sharding is a database partitioning technique that splits a database into smaller, more manageable pieces. Specifically, the database is partitioned horizontally, so that different rows of its tables are managed separately.
Sharding lets the pieces of a huge database be spread across multiple servers, improving the database's overall speed and performance.
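A common sharding scheme routes each row by a hash of its key, so the same key always lands on the same server and rows spread roughly evenly. A Python sketch (the server names are invented):

```python
import hashlib

# Hypothetical shard servers; in practice this list comes from configuration.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(key, shards=SHARDS):
    """Route a row to a shard by hashing its key. The hash is stable,
    so a given key is always stored on, and read from, the same server."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

placement = {user: shard_for(user) for user in ["u1001", "u1002", "u1003"]}
```

One caveat this simple modulo scheme has: adding or removing a server remaps most keys, which is why production systems often use consistent hashing instead.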
In addition, Sqoop is an open source tool for moving data from non-Hadoop sources (such as relational databases) into the Hadoop environment.
Text analysis
One factor behind the big data problem is the growing amount of text collected for analysis from social media sites such as Twitter and Facebook, from external news sources, and even from within companies themselves. Because text is unstructured (unlike the structured data typically stored in relational databases), mainstream business analytics tools are often helpless when faced with it.
Text analytics applies a range of methods (keyword search, statistical analysis, linguistic research and so on) to gain insight from text-based data.
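A minimal Python sketch of one such method, keyword frequency over stopword-filtered tokens (the stopword list and sample text are invented):

```python
import re
from collections import Counter

# A tiny stopword list; real text analytics uses much larger ones.
STOPWORDS = {"the", "a", "is", "and", "of", "to", "too"}

def keyword_counts(text, top=3):
    """Tokenise free text, drop stopwords, and return the most frequent terms."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(top)

posts = "The new phone is great. Great battery, and the camera is great too."
top_terms = keyword_counts(posts)
```

Even this crude counting surfaces what the text is about; statistical and linguistic methods build on the same tokenised foundation.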
Unstructured data
Not long ago, most data was structured data: alphanumeric information (such as financial data from sales transactions) that was easily stored in relational databases and analysed by business intelligence tools.
But today, a large share of the 2.7 zettabytes of stored data is unstructured: text-based documents, Twitter messages, photos posted on Flickr, videos posted on YouTube and so on. (Interestingly, 35 hours of video content are uploaded to YouTube every minute.) Processing, storing and analysing all this messy unstructured data is often a difficult problem for today's IT systems.
Visualization
As data grows, it becomes harder and harder for people to understand it through static charts and graphs. That has driven the development of a new generation of data visualization and analysis tools that present data in new ways, helping people make sense of massive amounts of information.
These tools include colour-coded heat maps, three-dimensional graphics, animated visualizations that show change over time, and geospatial presentations that overlay data on a geographic map. Today's advanced visualization tools are also more interactive: they let users zoom in on a subset of the data, for example, and examine it more closely.
Whirr
Apache Whirr is a set of Java class libraries for running big data cloud services. More specifically, it speeds up the process of deploying Hadoop clusters on virtual infrastructure such as Amazon Elastic Compute Cloud (EC2) and Rackspace.
XML
Extensible Markup Language (XML) is used to transmit and store data (not to be confused with HTML, which is used to display data). With XML, programmers can create common data formats and share both information and format across the Internet.
Because XML documents can be very large and complex, they are often considered a source of big data challenges for IT departments.
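Parsing XML is straightforward with Python's standard library. A small sketch using an invented order format:

```python
import xml.etree.ElementTree as ET

# A small document in a hypothetical order format.
doc = """
<orders>
  <order id="1"><customer>alice</customer><total>120.50</total></order>
  <order id="2"><customer>bob</customer><total>80.00</total></order>
</orders>
"""

root = ET.fromstring(doc)
# The parsed tree mirrors the document: tags become navigable elements,
# attributes are read with .get() and child text with .findtext().
totals = {o.get("id"): float(o.findtext("total")) for o in root.findall("order")}
```

For the very large documents mentioned above, streaming parsers (such as ElementTree's iterparse) avoid loading the whole tree into memory.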
Yottabyte
A yottabyte is a data-storage metric equivalent to 1,000 zettabytes. According to estimates by the well-known research firm IDC, the total amount of data stored in the world will reach 2.7 zettabytes this year, 48% more than in 2011. So we still have a long way to go before we hit the yottabyte mark, but at big data's current growth rate, that day may come sooner than we think.
By the way, 1 zettabyte equals 10^21 bytes of data: that is 1,000 exabytes (EB), 1 million petabytes (PB), or 1 billion terabytes (TB).
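The decimal scale is easy to verify in code: each step up multiplies by 1,000.

```python
# SI decimal prefixes: each unit is 1,000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit):
    """Number of bytes in one of the given unit, e.g. bytes_in('ZB') == 10**21."""
    return 1000 ** UNITS.index(unit)

# How many terabytes fit in one zettabyte: a billion.
terabytes_per_zettabyte = bytes_in("ZB") // bytes_in("TB")
```

(Storage vendors use these decimal units; memory sizes are often quoted in binary units of 1,024 instead.)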
ZooKeeper
ZooKeeper is a service created by Apache Software Foundation to help Hadoop users manage and coordinate Hadoop nodes across distributed networks.
ZooKeeper is tightly integrated with HBase, the Hadoop-related database. It is a centralized service for maintaining configuration information, naming, distributed synchronization and other group services; IT managers use it to implement reliable messaging, synchronized process execution and redundant services.