Druid is a data storage and analysis system originally developed by Metamarkets. It is designed specifically for high-performance OLAP (OnLine Analytical Processing) on massive data sets. Druid is currently incubating under the Apache Foundation.
The main features of Druid:

Common application fields of Druid:

As a SaaS company, Youzan has many business scenarios, with very large volumes of both real-time and offline data.
Before adopting Druid, some developers used Spark Streaming or Storm to handle certain OLAP scenarios.
Such solutions required not only writing real-time tasks but also carefully designing the storage to support queries. The problems were: long development cycles; an initial storage design that struggled to keep up with iterating requirements; and poor scalability.
After adopting Druid, developers only need to fill in a data ingestion configuration, specifying dimensions and metrics, to complete data ingestion. And as the Druid features described above show, Druid supports SQL, so an application can query the data just as it would through an ordinary JDBC connection.
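As an illustration, the kind of ingestion configuration a developer fills in might look like the following sketch (the DataSource name, dimension, and metric names here are hypothetical, and the spec is abbreviated):

```json
{
  "dataSource": "app_events",
  "dimensions": ["app_id", "event_type", "city"],
  "metrics": [
    {"type": "count", "name": "pv"},
    {"type": "longSum", "name": "duration_sum", "fieldName": "duration"}
  ],
  "granularity": "hour"
}
```

Once ingested, the data can be queried over JDBC with ordinary SQL, e.g. `SELECT city, SUM(pv) FROM app_events GROUP BY city`.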
With the help of Youzan's self-developed OLAP platform, configuring data ingestion has become even simpler and more convenient; creating a real-time task takes only about 10 minutes, which greatly improves development efficiency.
Druid's architecture is a Lambda architecture, divided into a real-time layer (Overlord, MiddleManager) and a batch layer (Broker and Historical).
The main nodes include (note: all of Druid's components ship in a single software package and are started with different commands):

4.1 The main goals of the Youzan OLAP platform:

4.2 Youzan OLAP platform architecture

The Youzan OLAP platform is used to manage Druid and the surrounding component management systems. The main functions of the OLAP platform:

The ingestion method used by the OLAP platform is the Tranquility tool. The platform allocates a different number of Tranquility instances to each DataSource according to its traffic volume; the DataSource configuration is pushed to the Agent-Master, which collects the resource usage of each server and selects a resource-rich machine on which to start the Tranquility instance. Currently, only the servers' memory resources are considered.
The OLAP platform also supports starting, stopping, scaling out, and scaling in Tranquility instances.
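The placement decision described above can be sketched as follows. This is a minimal, hypothetical Python rendering (the real Agent-Master is not described in detail here); as the text notes, only free memory is considered:

```python
def pick_host(hosts):
    """Pick the host with the most free memory -- the only resource
    the platform currently considers when placing a Tranquility
    instance, per the description above."""
    return max(hosts, key=lambda h: h["free_mem_mb"])["name"]

# Hypothetical server inventory reported to the Agent-Master.
hosts = [
    {"name": "druid-01", "free_mem_mb": 4096},
    {"name": "druid-02", "free_mem_mb": 16384},
    {"name": "druid-03", "free_mem_mb": 8192},
]
print(pick_host(hosts))  # druid-02
```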
Streaming data processing frameworks all have a time window, and data arriving later than the window period is discarded. How, then, can late data still be built into Segments without keeping the real-time task's window open for too long?
We developed a Druid data compensation feature: a streaming ETL, configured through the OLAP platform, stores the raw data on HDFS. The Flume-based streaming ETL guarantees that data belonging to the same hour (by event time) lands in the same file path. The OLAP platform then triggers a Hadoop batch task, either manually or automatically, to build the Segments offline.
The Flume-based ETL uses an HDFS Sink to write the data and implements a timestamp Interceptor that buckets Events into files according to their timestamp field (one folder is created per hour), so delayed data is correctly archived into the file for the corresponding hour.
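The interceptor's hourly bucketing rule can be sketched like this. This is a hypothetical Python rendering for illustration only (the real implementation is a Flume Java Interceptor, and the exact path layout is an assumption):

```python
from datetime import datetime, timezone

def hourly_path(base, event_ts_ms):
    """Map an event's timestamp (milliseconds since the epoch) to the
    HDFS folder for its event-time hour, so that late-arriving events
    are still archived under the correct hour's path."""
    t = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return f"{base}/{t:%Y-%m-%d}/{t:%H}"

# An event stamped 2021-01-01 09:59 UTC that arrives after 10:00 is
# still archived under hour 09, keyed by event time rather than
# arrival time.
print(hourly_path("/flume/events", 1609495140000))  # /flume/events/2021-01-01/09
```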
As more services come on board and the system runs longer, the data volume keeps growing. Historical nodes load a large number of Segments, and we observed that most queries concentrate on the most recent days; in other words, recent hot data is queried far more often, so separating hot and cold data is important for query efficiency.
Druid provides a Historical Tier grouping mechanism and a data loading Rule mechanism, which together can effectively separate hot and cold data through configuration alone.
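As a sketch of what this looks like in configuration: Historicals are assigned to a tier via the `druid.server.tier` property, and load Rules decide which tier holds which data. The tier name and periods below are hypothetical:

```json
[
  {"type": "loadByPeriod", "period": "P7D",
   "tieredReplicants": {"hot": 2, "_default_tier": 1}},
  {"type": "loadForever",
   "tieredReplicants": {"_default_tier": 1}}
]
```

With rules like these, Segments from the last 7 days are replicated onto the `hot` tier (typically the machines with more memory or SSDs), while older data stays only on the default tier.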