1. A topic is usually created first. For example, TopicA has three partitions and a replication factor of two (two replicas per partition: one leader plus one follower). Two replicas of the same partition are never placed on the same server.
2. Basic Kafka workflow: a broker is a Kafka node. Brokers are distributed and independent of each other; each registers itself with ZooKeeper at startup, and a dedicated ZooKeeper node records the list of brokers: /brokers/ids.
3. Send path: the user first builds the ProducerRecord message object to be sent, then calls KafkaProducer#send() to send it.
4. In Kafka's workflow, messages are classified by topic. Producers produce messages to a topic, and consumers consume messages from that same topic.
5. Like other replicated middleware, Kafka always sends data to the leader replica of a partition, which writes it to disk sequentially. The leader then synchronizes the data to each follower replica, so even if the leader fails, the service keeps running normally.
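The routing step described above (a record goes to one partition, whose leader takes the write) can be sketched as follows. This is a hypothetical simplified model: real Kafka uses a murmur2 hash for keyed records and a sticky partitioner for unkeyed ones, so the hash and round-robin fallback here are stand-ins.

```python
import hashlib
from typing import Optional

def choose_partition(key: Optional[bytes], num_partitions: int, counter: int = 0) -> int:
    """Pick the partition a record goes to before it is sent to that
    partition's leader replica (simplified sketch, not Kafka's partitioner)."""
    if key is None:
        # No key: spread records across partitions round-robin.
        return counter % num_partitions
    # Keyed record: hash the key so the same key always lands on the same
    # partition, which is what preserves per-key ordering.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition every time.
p1 = choose_partition(b"user-42", 3)
p2 = choose_partition(b"user-42", 3)
```

Because the mapping is deterministic for a given key, all records for one key form a single ordered stream on one partition's leader.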
Kafka partitioning
In the previous example (Kafka producer: writing data to Kafka), the ProducerRecord object contains the target topic, a key, and a value.
Step 1: collect all of the topic's partitions into a TopicPartition list, sort that list by hashCode, and finally deal the partitions out to the consumer threads round-robin.
In Kafka, each topic contains multiple partitions. By default, a partition can be consumed by only one consumer within a consumer group, which is what creates the partition-assignment problem.
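The round-robin assignment from Step 1 can be sketched like this. It is an assumed simplification of Kafka's RoundRobinAssignor: sorting here uses the (topic, partition) tuples themselves in place of Java's hashCode, and consumers are plain strings.

```python
from collections import defaultdict

def round_robin_assign(topics: dict, consumers: list) -> dict:
    """topics maps topic name -> partition count.
    Returns consumer -> list of (topic, partition) it will consume."""
    # Step 1: flatten every (topic, partition) pair into one sorted list.
    all_tps = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    # Step 2: deal partitions out to consumers in turn, so each partition
    # ends up with exactly one consumer in the group.
    assignment = defaultdict(list)
    for i, tp in enumerate(all_tps):
        assignment[consumers[i % len(consumers)]].append(tp)
    return dict(assignment)

# Three partitions of TopicA shared between two consumers.
plan = round_robin_assign({"TopicA": 3}, ["c1", "c2"])
```

Note that each partition appears under exactly one consumer, matching the one-consumer-per-partition rule above.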
The more partitions you have, the more file handles you need. You can raise the limit on open file handles by tuning operating-system parameters.
In short, Kafka's message storage combines partitions, LogSegments, and sparse indexes, which together make reads and writes efficient.
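The sparse-index idea can be illustrated with a small sketch: only every Nth offset is indexed, and a read binary-searches the index for the nearest indexed offset, then scans forward from that file position. This is an illustrative in-memory model, not Kafka's on-disk format.

```python
import bisect

class SparseIndex:
    def __init__(self, interval: int = 4):
        self.interval = interval
        self.offsets = []    # indexed message offsets (kept sorted)
        self.positions = []  # matching byte positions in the segment file

    def maybe_add(self, offset: int, position: int):
        # Index only every `interval`-th message: the index stays small,
        # but any lookup has a bounded forward scan.
        if offset % self.interval == 0:
            self.offsets.append(offset)
            self.positions.append(position)

    def lookup(self, target_offset: int) -> int:
        # Binary search for the greatest indexed offset <= target; the
        # caller then scans forward from the returned byte position.
        i = bisect.bisect_right(self.offsets, target_offset) - 1
        return self.positions[max(i, 0)]

idx = SparseIndex(interval=4)
for off in range(12):
    idx.maybe_add(off, off * 100)  # pretend each message is 100 bytes

start = idx.lookup(6)  # nearest indexed offset is 4, stored at byte 400
```

The trade-off is index size versus scan length: a smaller interval means faster lookups but a bigger index file.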
Kafka is a distributed messaging system that supports partitioning and multiple replicas, and relies on ZooKeeper for coordination.
Message Queuing (III): Consistency and Fault-Handling Strategies in Kafka
The server must handle messages idempotently; both the producer and the receiver of a message need idempotency. The sender also needs a timer that traverses unacknowledged messages and re-pushes them, to avoid broken transaction chains caused by message loss.
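The two halves of that strategy can be sketched together, under an assumed in-memory model: a receiver that deduplicates by message id (so reprocessing is harmless), and a sender-side timer pass that re-pushes every message still lacking an acknowledgement.

```python
# Receiver side: idempotent handling keyed by a unique message id.
processed_ids = set()
results = []

def handle(msg_id: str, payload: str) -> bool:
    """Processing the same id twice has no extra effect; always acks."""
    if msg_id in processed_ids:
        return True  # already applied: ack again, change nothing
    results.append(payload)
    processed_ids.add(msg_id)
    return True

# Sender side: messages stay in `unacked` until the receiver confirms.
unacked = {"m1": "hello", "m2": "world"}

def retry_pass():
    """One tick of the sender's timer: traverse and re-push unacked messages."""
    for msg_id in list(unacked):
        if handle(msg_id, unacked[msg_id]):
            del unacked[msg_id]  # ack received; stop retrying this one

retry_pass()
retry_pass()  # a second tick finds nothing left, and nothing is double-processed
```

Re-pushing guards against loss; the id-based dedup guards against the duplicates that re-pushing inevitably creates.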
High throughput: Kafka's throughput is high; even on a commodity cluster with modest single-node performance, a single node can move on the order of a million messages per second. High fault tolerance: Kafka's design supports multiple partitions and multiple replicas, giving it strong fault tolerance.
To ensure consistency, producers must retry after a failure, but retries can cause message duplication. One solution is to give each message a unique id and have the server deduplicate on its own initiative; this mechanism was absent from early Kafka versions and only arrived later with the idempotent producer.
In that (auto-ack) mode, RabbitMQ deletes the message as soon as it is delivered. If the consumer then fails while handling it (even though the queue already considers the message consumed), the message is lost. The fix is to acknowledge messages manually instead.
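Manual acknowledgement can be modeled with a toy queue: a delivered message is held in an "unacked" set until the consumer explicitly acks it, so a crash mid-processing leads to redelivery rather than loss. This is an illustrative model, not the RabbitMQ client API; the class and method names are made up.

```python
import collections

class AckQueue:
    def __init__(self):
        self.ready = collections.deque()
        self.unacked = {}  # delivery tag -> message body
        self._tag = 0

    def publish(self, body):
        self.ready.append(body)

    def get(self):
        # Deliver, but keep the message until it is explicitly acked.
        body = self.ready.popleft()
        self._tag += 1
        self.unacked[self._tag] = body
        return self._tag, body

    def ack(self, tag):
        del self.unacked[tag]  # only now may the broker delete it

    def requeue_unacked(self):
        # What happens when a consumer dies without acking.
        for tag, body in list(self.unacked.items()):
            self.ready.append(body)
            del self.unacked[tag]

q = AckQueue()
q.publish("order-1")
tag, body = q.get()
# Consumer "crashes" before acking: the message is redelivered, not lost.
q.requeue_unacked()
tag2, body2 = q.get()
q.ack(tag2)  # processed successfully this time
```

With auto-ack, the crash would have deleted "order-1" permanently; with manual ack it survives to be retried.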
Kafka's storage mechanism: messages produced by the producer are appended to the end of a log file, so the file keeps growing. To prevent oversized log files from making data lookup inefficient, Kafka splits the log into segments and indexes them.
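The segmenting half of that mechanism can be sketched as follows: when the active segment exceeds a size threshold, a fresh segment is started, named by the base offset of its first message (the naming scheme Kafka also uses on disk). Sizes and payloads here are made up for illustration.

```python
class SegmentedLog:
    def __init__(self, max_segment_bytes: int):
        self.max_segment_bytes = max_segment_bytes
        self.segments = {0: []}  # base offset -> list of (offset, payload)
        self.active_base = 0
        self.next_offset = 0
        self.active_size = 0

    def append(self, payload: bytes):
        would_overflow = self.active_size + len(payload) > self.max_segment_bytes
        if would_overflow and self.segments[self.active_base]:
            # Roll: start a new segment so no single file grows unbounded,
            # named by the offset of the first message it will hold.
            self.active_base = self.next_offset
            self.segments[self.active_base] = []
            self.active_size = 0
        self.segments[self.active_base].append((self.next_offset, payload))
        self.active_size += len(payload)
        self.next_offset += 1

log = SegmentedLog(max_segment_bytes=10)
for _ in range(6):
    log.append(b"abcd")  # 4 bytes each, so a segment holds two messages
```

Because each segment file is named by its base offset, finding the segment that contains a given offset is just a binary search over the file names.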
Kafka is a distributed message queue with high performance, persistence, multi-replica backup, and horizontal scalability. Producers write messages into the queue, and consumers take messages from it to run business logic. In architecture design it typically provides decoupling, peak shaving, and asynchronous processing.
What is Kafka's principle?
1. Kafka is a messaging system originally developed at LinkedIn as the basis of LinkedIn's activity stream and operational data processing pipelines. It is now used by many companies for all kinds of data pipelines and messaging systems.
2. Kafka's replication mechanism has multiple server nodes replicate the logs of other nodes' topic partitions. When a node in the cluster fails, requests to the failed node are redirected to other healthy nodes (on the consumer side this process is usually called rebalancing).
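The failover step can be sketched with an assumed in-memory model of partition metadata: each partition records a leader and its replica set, and when the leader's broker dies, a surviving replica is promoted so requests keep working. (Real Kafka's controller elects the new leader from the in-sync replica set; this sketch just takes the first survivor.)

```python
def fail_over(partition: dict, dead_broker: int) -> dict:
    """partition = {"leader": broker_id, "replicas": [broker_ids]}.
    Removes the dead broker and promotes a survivor if it was the leader."""
    survivors = [b for b in partition["replicas"] if b != dead_broker]
    if partition["leader"] == dead_broker:
        # Promote the first surviving replica (stand-in for Kafka's
        # controller electing from the ISR).
        partition["leader"] = survivors[0]
    partition["replicas"] = survivors
    return partition

# Partition led by broker 1, replicated on brokers 1, 2, 3.
p = {"leader": 1, "replicas": [1, 2, 3]}
p = fail_over(p, dead_broker=1)
```

After the failover, clients fetching metadata see broker 2 as the new leader and resume producing and consuming against it.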
3. Kafka refers to each broker by a globally unique number; different brokers must register with different broker ids. After creating its node, each broker records its own IP address and port information there.
4. Kafka achieves message ordering through a key-based order-preserving strategy: one topic, one partition, one consumer, consumed single-threaded internally; or write to N memory queues by key, and have N threads each consume one queue.
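The "N memory queues" pattern above can be sketched like this: hash each message key to one queue, so all messages with the same key stay in order while different keys are processed in parallel by different worker threads. The queues and hash here are simplified single-process stand-ins.

```python
import queue

NUM_QUEUES = 4
queues = [queue.Queue() for _ in range(NUM_QUEUES)]

def dispatch(key: str, payload: str):
    # Same key -> same queue -> same worker thread -> preserved order.
    # hash() is only stable within one process, which is fine for a sketch.
    q = queues[hash(key) % NUM_QUEUES]
    q.put(payload)

# Three events for the same key arrive in order 0, 1, 2.
for i in range(3):
    dispatch("order-7", f"event-{i}")

same_queue = queues[hash("order-7") % NUM_QUEUES]
drained = [same_queue.get() for _ in range(3)]
```

This mirrors what the key-hashing partitioner does across brokers, just inside one consumer process: ordering is guaranteed per key, not globally.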
Kafka interview questions
1. Talk about your understanding of Kafka's idempotency. Producer idempotency means that when the same message is sent more than once, the data is persisted only once on the server side, neither lost nor duplicated. The idempotency here is conditional: Kafka introduced it, along with transaction support, in version 0.11.
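The core of that mechanism can be sketched broker-side: the broker remembers the last sequence number it persisted for each producer id and partition, and silently drops retried batches whose sequence it has already seen. This is a simplified model; real Kafka also rejects sequence gaps and ties producer ids to transactions.

```python
log = []
last_seq = {}  # (producer_id, partition) -> last persisted sequence number

def broker_append(producer_id: int, partition: int, seq: int, payload: str) -> bool:
    """Append unless this (producer, partition, sequence) was already persisted."""
    key = (producer_id, partition)
    if seq <= last_seq.get(key, -1):
        return False  # duplicate retry: acknowledge it, but persist nothing
    log.append(payload)
    last_seq[key] = seq
    return True

broker_append(1, 0, 0, "a")
broker_append(1, 0, 0, "a")  # network retry of the same batch: deduplicated
broker_append(1, 0, 1, "b")
```

This is why the guarantee is conditional: the dedup state is scoped to one producer session and one partition, not across producers or arbitrary re-sends.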
2. Sharing some notes on Linux interview questions, breaking the Linux knowledge points down across load balancing, nginx, MySQL, Redis, Kafka, Zabbix, k8s, and so on. Personal technical notes for finding and filling gaps.
3. For a large company with strong infrastructure R&D capability, RocketMQ is a good choice. For real-time computing and log collection in the big-data field, Kafka is the industry standard: absolutely no problem, with a very active community, so the project will never be abandoned; it is close to the worldwide norm in this field.
4. For example, if you write that you are good at MySQL, jQuery, and Bootstrap, then we will ask about those. The questions won't be especially hard; they are just to confirm you really know the material and aren't exaggerating.
5. This includes remote-service-framework middleware, such as Dubbo and Apache RPC frameworks, and message-queue middleware, such as Alibaba's open-source distributed middleware RocketMQ and the high-throughput message publishing and streaming platform Kafka.
6. Everyone knows Kafka performs well, but few people really understand why. This is a sad story for me too: I once bombed an interview on exactly this topic. So, from a design standpoint, how does Kafka achieve high performance? Kafka writes messages to the hard disk and does not lose data.