
Apache Kafka is one of the most popular choices for a durable, high-throughput messaging system. Kafka’s protocol doesn’t conform to any queue-agnostic standard (such as AMQP), and it provides concepts and semantics that are similar to, but still different from, other queuing systems. In this post I will cover some common Kafka tools and add-ons that you should consider employing when using Kafka as part of your system design.

Data mirroring

Most large-scale production systems deploy to multiple data centers (or availability zones / regions in the cloud), either to avoid a SPOF (Single Point of Failure) when a whole data center goes down, or to reduce latency by serving customers from systems closer to their geo-locations. Having all Kafka clients read across data centers to access data as needed is quite expensive in terms of network latency, and it hurts service performance. For the best throughput and latency, every service should ideally talk to a Kafka cluster within its own data center. The Kafka team therefore built a tool called MirrorMaker, which is also employed in production at LinkedIn. MirrorMaker itself is an installed daemon that sets up a configured number of replication streams pulling from the source cluster into the destination cluster; it can recover from failures and records its state in ZooKeeper.

With MirrorMaker in place, Kafka clients can read and write against the cluster in their own data center: data from remote clusters is replicated in asynchronously, and local changes are pulled by the other clusters in the same way.
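To make that concrete, here is a rough sketch (not the actual MirrorMaker code) of what a single replication stream boils down to with the Java client API: consume from the source cluster and re-produce into the destination cluster. The broker addresses and topic name below are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MiniMirror {
  public static void main(String[] args) {
    // Consumer pointed at the source (remote) cluster -- placeholder address.
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "source-cluster:9092");
    consumerProps.put("group.id", "mirror-stream");
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    // Producer pointed at the destination (local) cluster -- placeholder address.
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "destination-cluster:9092");
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(Collections.singletonList("events")); // placeholder topic
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
          // Re-publish each record under the same topic name on the destination cluster.
          producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
        }
      }
    }
  }
}

The real MirrorMaker runs many such streams in parallel and, as noted above, recovers from failures and records its progress in ZooKeeper.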

Auditing

Kafka often serves as a pub/sub queue between a frontend collecting service and a number of downstream services such as batching frameworks, logging services, or event processing systems. Kafka works really well with a variety of downstream services because it holds no state for each client (which is impossible for AMQP brokers), and it allows each consumer to consume the same partition at a different offset with high performance.
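As a small illustration of that offset model, the sketch below (Java client API; the broker address, topic, and group ID are placeholders) rewinds one consumer to the beginning of a partition. A consumer in a different group could be reading the same partition from its latest offset at the same time, and the broker keeps no shared state between them.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetRewind {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder
    props.put("group.id", "batch-reader");            // placeholder group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    TopicPartition partition = new TopicPartition("events", 0); // placeholder topic
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // Manually take ownership of one partition and jump back to its start.
      consumer.assign(Collections.singletonList(partition));
      consumer.seekToBeginning(Collections.singletonList(partition));
      System.out.println("next offset to read: " + consumer.position(partition));
    }
  }
}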

Systems also typically run not just one Kafka cluster but several. These clusters act as a pipeline, where a consumer of one Kafka cluster feeds into, for example, a recommendation system that writes its output into another set of Kafka clusters.

One common need for a data pipeline is logging/auditing: ensuring that all of the data produced at the source is reliably delivered into each stage, and, when it isn’t, knowing what percentage of the data is missing.

Kafka doesn’t provide this functionality out of the box, but it can be added using Kafka directly. One implementation is to give each stage of your pipeline an ID and, in the producer code at each stage, count the records sent within a configurable window, then push that count along with the stage ID into a dedicated topic (for example, counts).

For example, with a Kafka pipeline that consists of stages A -> B -> C, you could imagine simple code such as the following writing out counts once per configured window:

// Send the batch downstream and keep a running count for this stage.
producer.send(topic, messages);
sum += messages.count();
long now = System.currentTimeMillis();
// Once per window, publish this stage's count to the audit topic.
if (now - lastAuditedAt >= WINDOW_MS) {
  lastAuditedAt = now;
  auditing.send("counts", new Message(new AuditMessage(stageId, sum, lastAuditedAt).toBytes()));
  sum = 0; // start counting the next window from zero
}

At the very bottom of the pipeline, the counts topic holds the counts reported by every stage, and a custom consumer can pull in all of the count messages, group them by stage, and compare the sums. The results for each window can also be graphed to show how many messages are flowing through the system.
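Here is a rough sketch of that bottom-of-the-pipeline consumer. For simplicity it assumes each audit record decodes to a plain "stageId,sum,timestamp" string rather than the AuditMessage serialization above, and the broker address is a placeholder.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AuditConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder
    props.put("group.id", "pipeline-audit");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    Map<String, Long> totalsByStage = new HashMap<>();
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("counts"));
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<byte[], byte[]> record : records) {
          // Assumed encoding for this sketch: "stageId,sum,timestamp".
          String[] parts = new String(record.value(), StandardCharsets.UTF_8).split(",");
          totalsByStage.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
        }
        // If stage A's total is larger than stage C's, the gap is the data lost in between.
        System.out.println(totalsByStage);
      }
    }
  }
}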

This is how LinkedIn audits its production pipeline; incorporating the feature into Kafka itself has been suggested for a while, but that hasn’t happened yet.

Topic partition assignments

Kafka is highly available, since it replicates partitions, lets users define the number of acknowledgments a write requires, and lets users control which brokers hold each partition’s replicas. By default, if no assignment is given, replicas are assigned randomly.

Random assignment might not be suitable if you have stricter requirements on where these replicas are placed. For example, if you are hosting your data in the cloud and want to withstand an availability zone failure, spreading each partition’s replicas across more than one AZ is a good idea. Another example is rack awareness within your own data center.

You can definitely build an extra tool that generates a specific replica assignment based on all of this information.
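As a hypothetical sketch of such a tool, the code below maps broker IDs to availability zones (the mapping, topic name, and counts are made up), spreads each partition's replicas across distinct AZs, and prints JSON in the shape that Kafka's kafka-reassign-partitions.sh tool accepts.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AzAwareAssignment {
  public static void main(String[] args) {
    // Hypothetical broker-to-AZ mapping; in practice this would come from your inventory.
    Map<Integer, String> brokerAz = new LinkedHashMap<>();
    brokerAz.put(1, "us-east-1a");
    brokerAz.put(2, "us-east-1b");
    brokerAz.put(3, "us-east-1c");
    brokerAz.put(4, "us-east-1a");

    String topic = "events"; // placeholder topic
    int partitions = 4;
    int replicationFactor = 2;

    List<Integer> brokers = new ArrayList<>(brokerAz.keySet());
    StringBuilder json = new StringBuilder("{\"version\":1,\"partitions\":[");
    for (int p = 0; p < partitions; p++) {
      // Walk the broker list round-robin, skipping brokers in an AZ already used
      // for this partition, so replicas land in distinct availability zones.
      List<Integer> replicas = new ArrayList<>();
      Set<String> usedAzs = new HashSet<>();
      for (int i = 0; replicas.size() < replicationFactor && i < brokers.size(); i++) {
        int broker = brokers.get((p + i) % brokers.size());
        if (usedAzs.add(brokerAz.get(broker))) {
          replicas.add(broker);
        }
      }
      json.append(String.format("{\"topic\":\"%s\",\"partition\":%d,\"replicas\":%s}%s",
          topic, p, replicas, p < partitions - 1 ? "," : ""));
    }
    json.append("]}");
    System.out.println(json); // feed this output to kafka-reassign-partitions.sh
  }
}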

Conclusion

The tools described in this post are ones that companies in the Kafka community commonly employ, but depending upon your system there might be other needs to consider. The best way to find out whether someone has already implemented a similar feature as open source is to email the mailing list or ask on IRC (freenode #kafka).

About The Author

Timothy Chen is a distributed systems engineer at Mesosphere Inc. and The Apache Software Foundation. His interests include open source technologies, big data, and large-scale distributed systems. He can be found on GitHub as tnachen.
