Message System: Apache Kafka Basics and Comparison
Problem
In an enterprise, there can be hundreds or thousands of such connections, and for each connection one has to worry about the data type, the connection type, the schema, and so on. Managing these connections in a highly coupled environment only adds difficulty.
For example, if we have 3 source services and 4 destination services, we end up with a total of 3 × 4 = 12 connections.
Solution
The answer to the above problem is a messaging system. For the same example, the total number of connections drops to 3 + 4 = 7: in general, m sources and n destinations need m × n direct connections, but only m + n connections through a messaging system.
Why use a messaging system instead of a database in the above scenario?
With a database, every destination service would have to poll for new data. With a messaging system, all interested parties are notified as soon as the data they expect arrives, and each eventually consumes its respective data.
Message System
A messaging system is responsible for transferring data from one application to another, so that applications can focus on the data itself without getting bogged down in how it is transmitted and shared.
There are two types of messaging patterns:
Point to Point Messaging pattern
· Messages are persisted in a queue.
· A particular message can be consumed by at most one receiver (illustrated in the sketch below).
· There is no time dependency for the receiver: the message waits in the queue until some receiver consumes it.
· On receiving a message, the receiver sends an acknowledgement back to the sender.
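Kafka aside for a moment, the point-to-point semantics can be illustrated with a plain Java queue: two competing receivers take from the same queue, and every message is delivered to exactly one of them. A minimal, hypothetical sketch:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PointToPointDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        // Two competing receivers on the same queue.
        for (int i = 1; i <= 2; i++) {
            final int id = i;
            Thread receiver = new Thread(() -> {
                try {
                    while (true) {
                        // take() removes the message: no other receiver sees it.
                        String msg = queue.take();
                        System.out.println("receiver-" + id + " consumed: " + msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            receiver.setDaemon(true);
            receiver.start();
        }

        // The sender puts five messages; each is consumed by exactly one receiver.
        for (int n = 1; n <= 5; n++) {
            queue.put("message-" + n);
        }
        Thread.sleep(500); // let the daemon receivers drain the queue
    }
}
```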
Publish Subscribe Messaging pattern
· Messages are persisted in a topic.
· A particular message can be consumed by any number of consumers (see the sketch below).
· There is a time dependency for the consumer: in classic publish-subscribe, a subscriber only receives messages published while it is subscribed.
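By contrast, here is a minimal, hypothetical pub/sub sketch, where every subscriber registered on a topic receives its own copy of each published message:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class PubSubDemo {
    // A trivial in-memory topic: publishing delivers to every subscriber.
    static class Topic {
        private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

        void subscribe(Consumer<String> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(String message) {
            // Unlike a queue, each subscriber gets its own copy.
            subscribers.forEach(s -> s.accept(message));
        }
    }

    public static void main(String[] args) {
        Topic topic = new Topic();
        topic.subscribe(msg -> System.out.println("subscriber-1 got: " + msg));
        topic.subscribe(msg -> System.out.println("subscriber-2 got: " + msg));
        topic.publish("message-1"); // printed twice, once per subscriber
    }
}
```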
Apache Kafka
As mentioned in the official documentation:
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Kafka servers and clients communicate over a simple, high-performance TCP network protocol, and Kafka can be deployed in whatever environment the system requires: bare-metal hardware, virtual machines, containers, on-premises, or in the cloud.
Apache Kafka Architecture
Apache Kafka Basic Concepts
Events
An event is a record of something that happened, in the real world or in the business. Kafka records data in the form of events. In Kafka, an event (a single piece of data) has a key, a value, a timestamp, and optional metadata attached to it.
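Through the standard Kafka Java client, an event maps directly onto a ProducerRecord. A minimal sketch, with a placeholder topic, key, and value:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventExample {
    public static void main(String[] args) {
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "orders",                       // topic (placeholder name)
                null,                           // partition: null lets Kafka decide
                System.currentTimeMillis(),     // event timestamp
                "customer-42",                  // key (placeholder)
                "{\"item\":\"book\",\"qty\":1}" // value (placeholder JSON payload)
        );
        // Optional metadata travels in headers.
        event.headers().add("source", "web-shop".getBytes(StandardCharsets.UTF_8));
        System.out.println(event);
    }
}
```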
Producers and Consumers
The client applications that publish events (data) to Kafka are called Producers, and the applications that subscribe to these events are called Consumers.
· Producers can choose to receive acknowledgements of their data writes.
· The producer serializes each message (for example, a string) to bytes before sending it to Kafka, and the consumer deserializes the bytes on receipt (both sides are sketched below).
· Each consumer is associated with a Consumer Group.
· A Consumer Group is a set of related consumers, interested in the same type of data and performing the same operation on it, that together carry out a task.
· Within a group, each partition is read by exactly one consumer (a single consumer may read several partitions).
· Kafka stores the offset at which each Consumer Group has been reading.
· When a consumer in a group has processed data received from Kafka, it should commit the offsets to the internal topic __consumer_offsets.
· If a consumer dies, it will be able to resume from where it left off, thanks to the committed consumer offsets.
Note: Producers and Consumers are completely decoupled in Kafka.
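To make these bullets concrete, below is a minimal sketch using the standard Kafka Java client; the broker address, topic name, and group id are placeholder values. The producer serializes strings to bytes and receives acknowledgements through a callback:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order created");
            // The callback is how the producer "chooses to receive acknowledgement".
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("acked: partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any pending records
    }
}
```

And a matching consumer that joins a group, processes records, and commits its offsets so that a restarted consumer resumes where the group left off:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("group.id", "order-processors");          // the consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");           // commit manually below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
                // Commit after processing, so a restarted consumer resumes here.
                consumer.commitSync();
            }
        }
    }
}
```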
Topics
A stream of messages belonging to a particular category is called a Topic. In Kafka, events are organized and durably stored in topics. Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events. A topic is uniquely identified by its name.
Topics can be replicated as well as partitioned and distributed on multiple brokers.
NOTE: Events with the same event key are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition’s events in the same order as they were written.
· Order is guaranteed only within a partition.
· Data is spread across partitions (effectively at random) unless a key is provided; see the sketch below.
· Messages are never produced to or consumed from a replica; replicas exist purely for fault tolerance.
· Only one broker can be the leader for a given partition, and only that broker can receive or serve data for that partition.
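The following sketch (with a placeholder topic and broker address, and assuming the topic has more than one partition) demonstrates the key-to-partition behaviour: records with the same key always report the same partition, while records without a key are spread around:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedPartitioningDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Same key => same partition => ordering preserved for this key.
                RecordMetadata keyed = producer
                        .send(new ProducerRecord<>("orders", "customer-42", "event-" + i))
                        .get();
                // No key => partition chosen freely by the default partitioner.
                RecordMetadata unkeyed = producer
                        .send(new ProducerRecord<>("orders", null, "event-" + i))
                        .get();
                System.out.printf("keyed -> p%d, unkeyed -> p%d%n",
                        keyed.partition(), unkeyed.partition());
            }
        }
    }
}
```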
Servers/Brokers
The software processes that maintain and manage the published messages are known as Kafka servers. Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams, integrating Kafka with your existing systems such as relational databases, as well as with other Kafka clusters.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency.
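To give a feel for how a connector is defined, here is a minimal, hypothetical standalone configuration for the FileStreamSource connector that ships with Kafka; the file path and topic name are placeholders:

```properties
# file-source.properties: publish each new line of a file as an event
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# placeholder input file to stream from
file=/var/log/app/events.log
# placeholder destination topic
topic=app-events
```

Such a file would be passed to the connect-standalone.sh script together with a worker configuration.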
A Kafka cluster is a set of brokers that communicate with each other to perform management and maintenance tasks.
· Connecting to any one broker (a bootstrap broker) is enough to connect to the entire cluster.
· Brokers also manage the internal __consumer_offsets topic and are responsible for serving data to the right consumers.
· Each Broker knows about all the topics, brokers, and partitions.
· Brokers can be added in a running Kafka Cluster without any downtime.
Zookeeper
ZooKeeper is used to monitor the Kafka cluster and to coordinate with each broker. It maintains all metadata about the Kafka cluster in the form of key-value pairs.
The metadata includes configuration information and the health status of each broker. ZooKeeper is also used for controller election within the Kafka cluster.
A set of ZooKeeper nodes working together to manage another distributed system is known as a ZooKeeper cluster or ensemble.
Topic Replication
In Kafka, each broker holds some of the data, so there is a possibility of data loss if one of the brokers fails. As a precaution, Apache Kafka provides replication so that data survives a broker failure.
Replication factor: the number of copies of the data held across multiple brokers; its value should always be greater than 1 so that a backup exists. In Kafka, the replication factor is configured per topic.
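As a sketch of replication in practice, the Java AdminClient below creates a topic whose 3 partitions are each copied to 3 brokers, then prints which broker leads each partition and which replicas are in sync. The topic name and address are placeholders, and the cluster is assumed to have at least 3 brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicDescription;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each copied to 3 brokers (replication factor 3).
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();

            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("orders"))
                    .all().get().get("orders");
            description.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```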
Kafka Producers Acks Deep Dive
· (acks = 0) -> No response is requested.
If a broker goes offline or an exception occurs, we won't know and will lose the data. This is useful in scenarios where it is acceptable to lose some data, such as metrics collection or log collection.
· (acks = 1) -> Leader acknowledges. {The default before Kafka 3.0}
A response from the leader is requested, but replication is not guaranteed. If the acknowledgement is not received, the producer may retry. If the leader broker goes offline before the replicas have replicated the data, we have data loss.
· (acks = all) -> Replicas acknowledge.
The leader and all in-sync replicas acknowledge the data. This means more latency, but more safety as well. This setting must be used in conjunction with min.insync.replicas, which can be set at the broker or topic level.
Example: if min.insync.replicas = 2, then at least 2 brokers in the in-sync replica set (including the leader) must respond with an acknowledgement.
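A sketch of the acks = all setup described above; the topic name and address are placeholders, and min.insync.replicas is shown being set at the topic level:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllSetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(adminProps)) {
            // Topic-level: at least 2 in-sync replicas must confirm each write.
            NewTopic topic = new NewTopic("payments", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        producerProps.put("acks", "all"); // leader + in-sync replicas must ack
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // Sends now fail with NotEnoughReplicasException if fewer than
            // min.insync.replicas brokers are available for the partition.
        }
    }
}
```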
Kafka Message Compression
In Kafka, producers usually send text-based data, often in JSON format. In production, if throughput is high, applying compression is a good idea.
NOTE: Compression is enabled at the producer level and does not require any configuration change on brokers or consumers. The compression type is 'none' by default; if enabled, the value can be 'gzip', 'lz4', 'snappy' or, since Kafka 2.1, 'zstd'.
A compressed batch has the following advantages:
· The batch gets much smaller, with a compression ratio of up to 4x.
· Lower latency when sending the data.
· Better throughput.
· Better disk utilization in Kafka.
By tweaking 'linger.ms' and 'batch.size', we can increase the batch size and hence get better throughput (both settings appear in the sketch after the next list).
'linger.ms': the number of milliseconds a producer is willing to wait before sending a batch.
Default behavior in Kafka:
· The producer allows up to 5 requests in flight, i.e., up to 5 message batches being sent at the same time.
· If more messages have to be sent while others are in flight, the producer starts batching them, and sends the accumulated batch as soon as a slot frees up.
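The sketch below gathers the compression, batching, and in-flight settings discussed in this section into one producer configuration; the values are illustrative, not tuned recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        props.put("compression.type", "snappy"); // 'none' by default
        props.put("linger.ms", "20");            // wait up to 20 ms to fill a batch
        props.put("batch.size", "32768");        // 32 KB batches (default is 16 KB)
        props.put("max.in.flight.requests.per.connection", "5"); // the default of 5

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send records; each batch is compressed as a whole before transmission
        producer.close();
    }
}
```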
Comparison of Kafka with its contemporaries like RabbitMQ
RabbitMQ: A general-purpose message broker built around message queues, designed with a smart broker/passive consumer model.
· It has powerful routing capabilities.
· Easy to scale by addition/removal of competing consumers on a single queue.
· Can be configured for consistency, high availability, low latency, high throughput.
· It supports strict ordering.
· Wider use cases: event-driven microservices, publish-subscribe messaging, real-time analytics.
When to use RabbitMQ over Kafka?
· If we don't have specific requirements such as event replay, RabbitMQ gives greater flexibility and can meet high-throughput, real-time event processing needs at a lower cost of operation.
· Evolving application requirements.
· Decoupled producers and consumers.
· Consumers can each bring their own queue and bind it to exchanges.
When to use Kafka over RabbitMQ?
· Stream-processing requirements.
· High throughput.
· Need to join multiple streams.
· Need to scale the application.
· Need features such as Replay.
Note: Below is the link to a small GitHub project that fetches tweets from the Twitter API and publishes them to a topic on Kafka.