The correct description of Hadoop is:
A distributed system infrastructure developed by the Apache Foundation: a software framework that provides both a storage system and a computing framework. It mainly solves the problems of massive data storage and computation, and is the cornerstone of big data technology.
The core of Hadoop is HDFS (Hadoop Distributed File System) and MapReduce. HDFS is a distributed file system that spreads large amounts of data across many machines for storage. This distributed storage approach ensures reliability and high availability, and storage capacity can be expanded simply by adding nodes. HDFS also provides block replication and failure-recovery mechanisms to keep data safe.
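As a rough illustration of how an application talks to HDFS, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem Java API. The file path is hypothetical, and the NameNode address is assumed to come from a core-site.xml on the classpath:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently replicates its blocks across DataNodes
        Path path = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }
}
```

The client code never needs to know which DataNodes hold the blocks; the NameNode resolves that behind the FileSystem abstraction.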
MapReduce is a distributed computing model that decomposes a large-scale data set into many small tasks executed in parallel on multiple machines. The model has two main components: the Mapper and the Reducer. The Mapper splits the input data into key-value pairs and processes each pair to produce intermediate results. The Reducer merges the intermediate results by key and writes the final output. Through the MapReduce model, a complex computation can be broken into simple subtasks, improving efficiency and scalability.
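The classic illustration of this model is word counting. The sketch below follows the canonical WordCount example from the Hadoop documentation: the Mapper emits a (word, 1) pair for every word it sees, and the Reducer sums the counts per word; input and output paths are taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: splits each input line into words and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all counts for one word and sums them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note how the framework, not the user code, handles splitting the input, shuffling intermediate pairs to the right Reducer, and rerunning failed tasks.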
In addition to HDFS and MapReduce, Hadoop includes many other components and tools, such as YARN (Yet Another Resource Negotiator), Hive, Pig, HBase, and so on. YARN is the resource manager: it coordinates computing resources and schedules tasks across the cluster.
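As a small illustration of YARN's client-facing side, the following sketch uses the org.apache.hadoop.yarn.client.api.YarnClient API to ask the ResourceManager for the applications it is tracking; it assumes a reachable cluster whose yarn-site.xml is on the classpath:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address etc. from yarn-site.xml
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it knows about
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s %s %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```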
Hive is a data warehouse tool built on Hadoop that provides HiveQL (Hive Query Language), an SQL-like language for querying and analyzing data. Pig is a data-flow processing tool based on the Pig Latin scripting language, which helps users write and execute complex data processing tasks. HBase is a distributed, column-oriented database that can store massive amounts of structured data on a Hadoop cluster.
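For a flavor of how HBase is used from code, here is a minimal sketch with the standard HBase Java client. The table name "users" and column family "info" are hypothetical and would need to be created beforehand (for example, via the HBase shell):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // hypothetical "users" table with an "info" column family
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column info:name, value "alice"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Read the cell back by row key
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

Unlike a relational database, HBase is accessed by row key and column family rather than by SQL queries, which is what makes it suitable for very large, sparsely structured tables.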