Hadoop: a batch computing framework that emphasizes batch processing, used mainly for data mining and analysis.
Spark: an open-source cluster computing system based on in-memory computing, whose goal is to make data analysis faster. Spark is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark perform better on certain workloads. In other words, by enabling in-memory distributed datasets, Spark not only supports interactive queries but can also optimize iterative workloads.
Spark is implemented in Scala, which it also uses as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects (see the short example after this paragraph).
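As a rough sketch of that integration, assuming a local master and made-up data, an RDD built from a local Scala collection can be transformed with the same map/filter style used on ordinary collections:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalCollectionStyle {
  def main(args: Array[String]): Unit = {
    // Local-mode context purely for illustration.
    val sc = new SparkContext(new SparkConf().setAppName("local-collection-style").setMaster("local[*]"))

    // A distributed dataset built from a local Scala collection.
    val numbers = sc.parallelize(1 to 10)

    // The same map/filter style you would use on a local List.
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

    println(evenSquares.collect().mkString(", "))
    sc.stop()
  }
}
```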
Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system; this is supported through a third-party cluster framework called Mesos. Spark, developed by the AMP Lab (Algorithms, Machines, and People Lab) at UC Berkeley, can be used to build large-scale, low-latency data analysis applications.
Although Spark is similar to Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of workload: one that reuses a working dataset across parallel operations (such as machine learning algorithms). To optimize these workloads, Spark introduces the concept of in-memory cluster computing, caching datasets in memory to shorten access latency, as in the sketch below.
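A minimal illustration of that reuse, assuming the spark-shell (where `sc` is predefined) and a hypothetical HDFS path: the dataset is cached once and then served from memory for each subsequent parallel operation.

```scala
// Hypothetical path; the point is reusing one cached dataset across several operations.
val logs = sc.textFile("hdfs:///logs/app.log").cache()

// The first action materializes the dataset and keeps it in memory.
val errors = logs.filter(_.contains("ERROR")).count()

// The second action reads from the in-memory cache instead of going back to HDFS.
val warnings = logs.filter(_.contains("WARN")).count()
```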
On the data processing side, Hadoop is already familiar to most people. Built on Google's Map/Reduce model, Hadoop gives developers the map and reduce primitives, which make parallel batch programs simple and elegant. Spark provides far more dataset operation types than the two operations, Map and Reduce, that Hadoop offers. Transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and so on; actions include count, collect, reduce, lookup, save, and others. These varied dataset operation types are convenient for applications built on top (see the word-count sketch below). The communication model between processing nodes is also no longer limited to Hadoop's single data-shuffle pattern: users can name, materialize, and control the partitioning of intermediate results. In short, the programming model is more flexible than Hadoop's.
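For example, a word count in the spark-shell (with `sc` predefined and hypothetical input/output paths) chains several transformations and then triggers them with actions:

```scala
val lines = sc.textFile("hdfs:///data/input.txt")

// Transformations: lazily build up the computation.
val counts = lines
  .flatMap(_.split("\\s+"))      // flatMap: one line -> many words
  .map(word => (word, 1))        // map: word -> (word, 1)
  .reduceByKey(_ + _)            // reduceByKey: sum the counts per word

// Actions: trigger execution.
println(counts.count())                        // count
counts.collect().take(5).foreach(println)      // collect
counts.saveAsTextFile("hdfs:///data/output")   // save
```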
2. Is Spark superior to other tools in terms of fault tolerance?
Spark's paper, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", describes two ways to provide fault tolerance for distributed dataset computation: checkpointing the data or logging the updates. Spark appears to use the latter. Although logging updates seems to save storage space, the data processing model is a DAG-like operation process, so when one node in the graph fails, the complexity of the lineage-chain dependencies may force all computing nodes to recompute, which is not cheap either. Whether to store the data, log the updates, or make a checkpoint is ultimately left to the user; in effect, the ball is kicked back to the user. In my view, users should weigh, according to their business type, the IO and disk-space cost of storing data against the cost of recomputation, and choose the cheaper strategy. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that produced a given dataset. When a node fails, Spark reconstructs the dataset from this stored information, and the other nodes help with the reconstruction.
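A sketch of that trade-off in code, assuming the spark-shell (`sc` predefined) and hypothetical HDFS paths: rely on lineage alone, persist the data, or checkpoint it, which writes it out and truncates the lineage chain.

```scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val raw     = sc.textFile("hdfs:///data/events")
val cleaned = raw.filter(_.nonEmpty).map(_.toLowerCase)

// Default: rely on lineage only -- a lost partition is recomputed from `raw`.

// Pay IO/disk space to avoid recomputation:
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

// Or write the data out and truncate the lineage chain entirely:
cleaned.checkpoint()

cleaned.count()  // the first action materializes (and checkpoints) the dataset
```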
3. What are the characteristics of Spark's data processing capability and efficiency?
Spark provides high-performance data processing, so users get fast feedback and a better experience. Another class of applications is data mining: because Spark makes full use of memory for caching and uses a DAG to eliminate unnecessary steps, it is well suited to iterative computation, and many machine learning algorithms converge through repeated iteration, which makes them a natural fit for parallelization with Spark. Implementing common algorithms in parallel with Spark, such as those users run from R, can greatly reduce the data mining workload.
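As a toy sketch of such an iterative job (spark-shell assumed, synthetic data, and a made-up update rule chosen only to show the pattern), the working dataset is cached once and read from memory on every iteration:

```scala
// Cache the working dataset once; every iteration reuses it from memory.
val points = sc.parallelize(Seq(1.0, 4.0, 9.0, 16.0, 25.0)).cache()

var guess = 1.0
for (_ <- 1 to 10) {
  // Average "error" over the cached dataset; no reload or recompute per iteration.
  val gradient = points.map(p => guess - math.sqrt(p)).mean()
  guess -= 0.5 * gradient
}
println(s"converged estimate: $guess")
```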
Compared with Twitter's Storm framework, Spark's streaming data processing model is interestingly different. Storm essentially pushes each event through a pipeline as an independent transaction processed in a distributed fashion. Spark takes the opposite approach: it collects events over a short interval (say, 5 seconds) into an RDD and then processes that batch with the usual set of Spark operations. The authors claim that this model is more robust against slow nodes and failures, and that a 5-second interval is fast enough for most applications.
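A minimal Spark Streaming sketch of this micro-batch model, with a hypothetical socket source on localhost:9999 and the 5-second interval mentioned above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")
    // Collect incoming events into 5-second batches; each batch becomes an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: a plain text socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Process every 5-second batch with the usual Spark operations.
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```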
Summary
Books such as the Hadoop, HBase, and Hive definitive guides, works on large-scale distributed storage systems, ZooKeeper, and internet-scale data mining and distributed processing cover material that differs from and complements this book, so they are all worth reading in full.