Apache Kafka (Kafka) is an open source, distributed streaming platform that enables (among other things) the development of real-time, event-driven applications. So, what does that mean?
Today, billions of data sources continuously generate streams of data records, including streams of events. An event is a digital record of an action that happened and the time that it happened. Typically, an event is an action that drives another action as part of a process. A customer placing an order, choosing a seat on a flight, or submitting a registration form are all examples of events. An event doesn’t have to involve a person—for example, a connected thermostat’s report of the temperature at a given time is also an event.
These streams offer opportunities for applications that respond to data or events in real time. A streaming platform enables developers to build applications that continuously consume and process these streams at extremely high speeds, with a high level of fidelity and accuracy, based on the order in which the records occurred.
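As a rough illustration of the definition above, an event can be modeled as a small record that pairs an action with the time it happened, and a stream can be processed in order of occurrence. This is a minimal sketch in plain Python; the `Event` class, its field names, and the sample values are hypothetical and are not part of any Kafka API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """A record of an action and the time it happened (hypothetical shape)."""
    action: str
    occurred_at: datetime


def in_order_of_occurrence(events):
    """Return the events sorted by the time they happened."""
    return sorted(events, key=lambda e: e.occurred_at)


# Sample events, deliberately listed out of order.
events = [
    Event("seat_chosen", datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)),
    Event("order_placed", datetime(2024, 1, 1, 12, 0, 1, tzinfo=timezone.utc)),
    Event("thermostat_reading", datetime(2024, 1, 1, 12, 0, 3, tzinfo=timezone.utc)),
]

ordered_actions = [e.action for e in in_order_of_occurrence(events)]
```

Note that one of the sample events comes from a device rather than a person, matching the thermostat example above.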
LinkedIn developed Kafka in 2011 as a high-throughput message broker for its own use, then open-sourced it and donated it to the Apache Software Foundation. Today, Kafka has evolved into the most widely used streaming platform, capable of ingesting and processing trillions of records per day without any perceptible performance lag as volumes scale. Fortune 500 organizations such as Target, Microsoft, Airbnb, and Netflix rely on Kafka to deliver real-time, data-driven experiences to their customers.
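Kafka has three primary capabilities: it enables applications to publish and subscribe to streams of records, it stores those streams durably in the order in which the records were generated, and it processes streams of records in real time.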
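Developers can leverage these Kafka capabilities through four APIs: the Producer API (publish a stream of records to a topic), the Consumer API (subscribe to topics and process their records), the Streams API (transform input streams into output streams), and the Connector API (build reusable producers and consumers that link topics to existing applications or data systems).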
Kafka is a distributed platform—it runs as a fault-tolerant, highly available cluster that can span multiple servers and even multiple data centers. Kafka topics are partitioned and replicated in such a way that they can scale to serve high volumes of simultaneous consumers without impacting performance. As a result, according to Apache.org, “Kafka will perform the same whether you have 50 KB or 50 TB of persistent storage on the server.”
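To make the idea of a partitioned topic more concrete, the following toy model splits a topic into several append-only lists and routes each record by a hash of its key. It is a minimal sketch under stated assumptions: the `Topic` class and its methods are hypothetical, `zlib.crc32` is only a stand-in for whatever partitioner a real client uses, and replication is not modeled at all.

```python
import zlib


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash; real Kafka clients use a different partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Return all records in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)

# Records with the same key land in the same partition, in append order.
placements = [topic.append("customer-42", f"order-{i}") for i in range(3)]
```

Because each partition is an independent log, different partitions could in principle live on different servers and be read by different consumers in parallel, which is the intuition behind the scaling claim above.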
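Kafka is used primarily for creating two kinds of applications: real-time streaming data pipelines, which reliably move data between systems, and real-time streaming applications, which transform or react to streams of data.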
RabbitMQ is a very popular open source message broker, a type of middleware that enables applications, systems, and services to communicate with each other by translating messaging protocols between them.
Because Kafka began as a kind of message broker (and can, in theory, still be used as one) and because RabbitMQ supports a publish/subscribe messaging model (among others), Kafka and RabbitMQ are often compared as alternatives. But the comparisons aren't really practical, and they often dive into technical details that are beside the point when choosing between the two. For example, they note that Kafka topics can have multiple subscribers, whereas each RabbitMQ message can have only one, or that Kafka topics are durable, whereas RabbitMQ messages are deleted once consumed.
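The two differences cited in such comparisons can be illustrated with a pair of toy classes: a log whose records stay in place while each subscriber tracks its own position, and a queue that hands each message to one consumer and then discards it. This is a simplified sketch of the comparison as stated above, not the actual behavior or API of either product; `LogTopic`, `SimpleQueue`, and the subscriber names are all hypothetical, and real brokers are far more configurable.

```python
from collections import deque


class LogTopic:
    """Toy durable log: records stay put; each subscriber tracks its own offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}

    def publish(self, record):
        self.records.append(record)

    def poll(self, subscriber):
        """Return the records this subscriber has not seen yet."""
        start = self.offsets.get(subscriber, 0)
        self.offsets[subscriber] = len(self.records)
        return self.records[start:]


class SimpleQueue:
    """Toy queue: each message is handed to one consumer and then removed."""

    def __init__(self):
        self.messages = deque()

    def publish(self, message):
        self.messages.append(message)

    def consume(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self.messages.popleft() if self.messages else None


log = LogTopic()
queue = SimpleQueue()
for item in ("a", "b"):
    log.publish(item)
    queue.publish(item)

# Two independent subscribers both see every record in the log.
billing_view = log.poll("billing")
audit_view = log.poll("audit")

# Each queued message is delivered once; a third read finds nothing.
first, second, third = queue.consume(), queue.consume(), queue.consume()
```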
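The bottom line is that Kafka is a streaming platform for publishing, storing, and processing streams of records, whereas RabbitMQ is a message broker that lets applications using different messaging protocols exchange messages with one another.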
Kafka is frequently used with several other Apache technologies as part of a larger stream processing, event-driven architecture, or big data analytics solution.
Apache Spark is an analytics engine for large-scale data processing. You can use Spark to perform analytics on streams delivered by Apache Kafka and to produce real-time stream processing applications, such as click-stream analysis.
Apache NiFi is a data flow management system with a visual, drag-and-drop interface. Because NiFi can run as a Kafka producer and a Kafka consumer, it’s an ideal tool for managing data flow challenges that Kafka can’t address.
Apache Flink is an engine for performing computations on event streams at scale, with consistently high speed and low latency. Flink can ingest streams as a Kafka consumer, perform operations based on these streams in real time, and publish the results to Kafka or to another application.
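The consume, compute, and publish pattern described above can be sketched in plain Python as a count of events per fixed time window. This is only an illustration of the kind of computation a stream processor performs; it uses neither the Flink nor the Kafka API, and the function name, the 10-second window size, and the sample click data are assumptions made for the example.

```python
from collections import Counter


def count_per_window(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count keys.

    Returns a dict mapping each window's start time to a Counter of keys.
    """
    windows = {}
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


# Hypothetical input stream: (seconds since start, page clicked).
clicks = [(1, "/home"), (4, "/cart"), (7, "/home"), (12, "/home"), (15, "/cart")]

# "Publish" the results by collecting them into an output list, one entry per window.
output_stream = [
    (start, dict(counts))
    for start, counts in sorted(count_per_window(clicks, 10).items())
]
```

In a real deployment the input would arrive continuously from a topic and the output would be written to another topic or application, rather than being held in lists.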
Apache Hadoop is a distributed software framework that lets you store massive amounts of data in a cluster of computers for use in big data analytics, machine learning, data mining, and other data-driven applications that process structured and unstructured data. Kafka is often used to create a real-time streaming data pipeline to a Hadoop cluster.