Inventory of six tools for big data analysis:
1. Apache Hadoop
Hadoop is a software framework for the distributed processing of large amounts of data. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of working data to ensure that processing can be redistributed away from failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabyte-scale data. In addition, Hadoop is community-developed and runs on ordinary commodity servers, so its cost is relatively low and anyone can use it.
The Hadoop framework is written in Java, so it is ideal for running on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.
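To give a concrete sense of what a Hadoop application looks like, here is a minimal word-count sketch using the standard MapReduce Java API. It is the classic introductory example rather than anything specific to a particular deployment, and it assumes the Hadoop MapReduce libraries are on the classpath and that input and output paths are passed as command-line arguments.

```java
// Minimal Hadoop MapReduce word-count sketch (classic introductory example).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`.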
2. HPCC
HPCC is the abbreviation of High Performance Computing and Communications. In 1993, the Federal Coordinating Council for Science, Engineering and Technology of the United States submitted to Congress the report "Grand Challenges: High Performance Computing and Communications", also known as the report of the HPCC Plan, i.e., the U.S. President's scientific strategy project. Its purpose is to solve a number of important scientific and technological challenges by strengthening research and development. HPCC is the United States' plan for implementing the information superhighway, and carrying it out will cost tens of billions of dollars. Its main objectives are to develop scalable computing systems and related software to support terabit-level network transmission performance, to develop gigabit network technology, and to expand the network connection capabilities of research and education institutions.
The project is mainly composed of five parts:
1. High-Performance Computing Systems (HPCS), including research on future generations of computer systems, system design tools, advanced typical systems, and the evaluation of existing systems;
2. Advanced Software Technology and Algorithms (ASTA), including software support for grand-challenge problems, new algorithm design, software branches and tools, and computing and high-performance computing research centers;
3. National Research and Education Network (NREN), including the research and development of intermediate stations and gigabit-level transmission;
4. Basic Research and Human Resources (BRHR), including basic research, training, education, and course materials. It is designed to increase the stream of innovative ideas in scalable high-performance computing by rewarding investigator-initiated, long-term research, to enlarge the pool of skilled and trained personnel by improving education and training in high-performance computing and communications, and to provide the infrastructure needed to support these investigations and research activities;
5. Information Infrastructure Technology and Applications (IITA), which aims to ensure the leading position of the United States in the development of advanced information technology.
3. Storm
Storm is free, open-source software: a distributed, fault-tolerant real-time computation system. Storm can process huge streams of data very reliably, doing for real-time processing what Hadoop does for batch processing. Storm is simple, supports many programming languages, and is a lot of fun to use. Storm originated at Twitter; other well-known companies using it include Groupon, Taobao, Alipay, Alibaba, Happy Elements, Admaster, and so on.
Storm has many application areas: real-time analytics, online machine learning, continuous computation, distributed RPC (Remote Procedure Call, i.e., requesting services from remote programs over the network), ETL (short for extraction-transformation-loading), and so on. Storm's processing speed is impressive: in tests, each node processed one million data tuples per second. Storm is scalable, fault-tolerant, and easy to set up and operate.
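As a rough illustration of how a Storm application is wired together, the sketch below builds a tiny topology in Java. It assumes Storm 1.x package names (org.apache.storm) and loosely follows the well-known "exclamation" starter example; the spout, bolt, and topology names are illustrative only.

```java
// Minimal Storm topology sketch (assumes Storm 1.x, package org.apache.storm).
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclamationTopology {

  // A trivial bolt that appends "!!!" to every incoming word tuple.
  public static class ExclamationBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      collector.emit(new Values(tuple.getString(0) + "!!!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // The bundled test spout keeps emitting random words; the bolt transforms them.
    builder.setSpout("words", new TestWordSpout(), 2);
    builder.setBolt("exclaim", new ExclamationBolt(), 3).shuffleGrouping("words");

    Config conf = new Config();
    conf.setDebug(true);

    // Run in-process for demonstration; a real deployment would submit the
    // topology to a cluster with StormSubmitter.submitTopology(...).
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("exclamation", conf, builder.createTopology());
    Thread.sleep(10_000);   // let the topology run briefly
    cluster.shutdown();
  }
}
```

With debug enabled, the locally running topology logs each emitted tuple, which makes it easy to see the stream flowing from the spout through the bolt.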
4. Apache Drill
To help enterprise users find more effective ways to speed up Hadoop data queries, the Apache Software Foundation launched an open-source project called "Drill". Apache Drill implements Google's Dremel.
This project will create an open-source, Hadoop-oriented version of Google's Dremel tool (which Google uses to speed up the Internet applications of its data analysis tools), and "Drill" will help Hadoop users query massive data sets much faster.
The "Drill" project is in fact inspired by Google's Dremel project: Dremel helps Google analyze and process massive data sets, including analyzing crawled Web documents, tracking application data installed on Android Market, analyzing spam, analyzing test results on Google's distributed build system, and so on.
By developing the "Drill" Apache open-source project, organizations are expected to establish Drill's API and a flexible, powerful architecture, thereby helping to support a broad range of data sources, data formats, and query languages.
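As a small illustration of how Drill is typically used, the sketch below runs a SQL query from Java over Drill's JDBC driver. It assumes a Drill instance reachable at localhost with the Drill JDBC driver on the classpath; the connection URL and the bundled cp.`employee.json` sample table are illustrative defaults taken from Drill's documentation, not anything required by the project.

```java
// Minimal sketch of querying Apache Drill from Java over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
  public static void main(String[] args) throws Exception {
    // "cp.`employee.json`" refers to a sample file Drill ships on its classpath;
    // files, HBase tables, or Hive tables registered as storage plugins are queried the same way.
    String url = "jdbc:drill:drillbit=localhost";
    String sql = "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
      while (rs.next()) {
        System.out.println(rs.getString("full_name") + " - "
            + rs.getString("position_title"));
      }
    }
  }
}
```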
5. RapidMiner
RapidMiner is one of the world's leading data mining solutions, built to a large extent on advanced technology. Its data mining tasks cover a wide range of data analysis techniques, and it can simplify the design and evaluation of data mining processes.
Functions and features
Free data mining technologies and libraries
100% Java code (can run on any operating system)
The data mining process is simple, powerful and intuitive
Internal XML provides a standardized format for representing and exchanging data mining processes
Large-scale processes can be automated with a simple scripting language
Multi-level data views, ensuring valid and transparent data
Interactive prototyping with a graphical user interface
Command line (batch mode) for automated large-scale applications
Java API (application programming interface)
Simple plug-in and extension mechanism
Powerful visualization engine with visual modeling of many cutting-edge high-dimensional data sets
More than 400 data mining operators are supported
It (formerly known as YALE) has been successfully applied in many different application fields.
6. Pentaho BI
Unlike traditional BI products, Pentaho BI is a process-centric, solution-oriented framework. Its purpose is to integrate a series of enterprise BI products, open-source software, APIs, and other components to facilitate the development of business intelligence applications. Its appearance enables a series of independent, business-intelligence-oriented products such as Jfree and Quartz to be integrated into complex, complete business intelligence solutions.
The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, a standalone Pentaho platform, a Pentaho solution example, and a pre-configured Pentaho web server. Among them, the Pentaho platform is the most important part of the SDK and includes the main body of the Pentaho platform's source code. The Pentaho database provides the data services needed for the normal operation of the Pentaho platform, including configuration information, solution-related information, and so on; it is not strictly necessary for the platform and can be replaced by other database services through configuration. The standalone Pentaho platform is an example of the platform's independent running mode, demonstrating how to make the Pentaho platform run independently without the support of an application server. The Pentaho solution example is an Eclipse project that demonstrates how to develop relevant business intelligence solutions for the Pentaho platform.
The Pentaho BI platform is built on a foundation of servers, engines, and components. These provide the system's J2EE server, security, portal, workflow, rule engine, charting, collaboration, content management, data integration, analysis, and modeling capabilities. Most of these components are standards-based and can be replaced with other products.