Apache Kafka is an open-source, distributed, event-streaming platform that processes real-time data. Kafka excels at supporting event-driven applications and building reliable data pipelines, offering low latency and high-throughput data delivery.
Today, billions of data sources continuously produce streams of information, often in the form of events: foundational data structures that record any occurrence in a system or environment.
Typically, an event is an action that drives another action as part of a process. A customer placing an order, choosing a seat on a flight, or submitting a registration form are all examples of events. An event doesn’t have to involve a person; for instance, a connected thermostat’s report of the temperature at a given time is also an event.
Event streaming offers opportunities for applications to respond instantly to new information. Streaming data platforms like Apache Kafka allow developers to build systems that consume, process and act on data as it arrives while maintaining the order and reliability of each event.
Kafka has evolved into the most widely used event-streaming platform, capable of ingesting and processing trillions of records daily without perceptible performance lag as volumes scale. Over 80% of Fortune 500 organizations use Kafka, including Target, Microsoft, Airbnb and Netflix, to deliver real-time, data-driven customer experiences.
In 2011, LinkedIn developed Apache Kafka to meet the company’s growing need for a high-throughput, low-latency system capable of handling massive volumes of real-time event data. Built using Java and Scala, Kafka was later open-sourced and donated to the Apache Software Foundation.
While organizations already supported or used traditional message queue systems (e.g., AWS’s Amazon SQS), Kafka introduced a fundamentally different messaging system architecture.
Unlike conventional message queues that delete messages after consumption, Kafka retains messages for a configurable duration, enabling multiple consumers to read the same data independently. This capability makes Kafka ideal for messaging and event sourcing, stream processing, and building real-time data pipelines.
Today, Kafka has become the de facto standard for real-time event streaming. Industries that use Kafka include finance, ecommerce, telecommunications and transportation, where the ability to handle large volumes of data quickly and reliably is essential.
Kafka is a distributed platform; it runs as a fault-tolerant, highly available cluster that can span multiple servers and even multiple data centers.
Kafka has three primary capabilities: it lets applications publish and subscribe to streams of records, it stores those streams durably for as long as needed, and it processes the streams as they occur.
Producers (client applications) write records to topics: named logs that store the records in the order they occurred relative to one another. Topics are then split into partitions and distributed across a cluster of Kafka brokers (servers).
Within each partition, Kafka maintains the order of the records and stores them durably on disk for a configurable retention period. While ordering is guaranteed within a partition, it is not across partitions. Based on the application’s needs, consumers can independently read from these partitions in real time or from a specific offset.
Kafka ensures reliability through partition replication. Each partition has a leader (on one broker) and one or more followers (replicas) on other brokers. This replication helps tolerate node failures without data loss.
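As a minimal sketch of how this layout is created, the Java snippet below uses Kafka’s AdminClient to create a topic with three partitions and a replication factor of three; the topic name, counts and broker address are illustrative assumptions rather than values from a specific deployment.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address

        try (Admin admin = Admin.create(props)) {
            // Three partitions spread the topic across brokers; a replication
            // factor of three gives each partition one leader and two followers.
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```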
Historically, Kafka relied on Apache ZooKeeper, a centralized coordination service for distributed brokers. ZooKeeper ensured Kafka brokers remained synchronized, even if some brokers failed. In 2021, Kafka introduced KRaft (Kafka Raft) mode, eliminating the need for ZooKeeper by consolidating these coordination tasks into the Kafka brokers themselves. This shift reduces external dependencies, simplifies the architecture and makes Kafka clusters more fault-tolerant and easier to manage and scale.
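To show roughly what running without ZooKeeper looks like in practice, here is a minimal KRaft-mode server.properties sketch for a single node acting as both broker and controller; the node ID, ports, quorum voters and log directory are placeholder values.

```properties
# Minimal KRaft-mode configuration sketch (placeholder values).
# The node acts as both broker and controller, so no ZooKeeper ensemble is required.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
log.dirs=/tmp/kraft-combined-logs
```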
Developers can leverage Kafka’s capabilities through four primary application programming interfaces (APIs):
The Producer API enables an application to publish a stream of records to a Kafka topic. After a record is written to a topic, it can’t be altered or deleted. Instead, it remains in the topic for a preconfigured retention period (for example, two days) or until storage space runs out.
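A minimal Producer API sketch in Java might look like the following; the topic name, record key and value, and broker address are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key go to the same partition, preserving per-key order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
            producer.flush();
        }
    }
}
```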
The Consumer API enables an application to subscribe to one or more topics and to ingest and process the stream stored in the topic. It can work with records in the topic in real time or ingest and process past records.
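A corresponding Consumer API sketch, again with illustrative topic, group ID and broker address, might read those records as shown below; setting auto.offset.reset to earliest is what lets it process past records rather than only new ones.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest"); // start from the oldest retained record

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```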
The Streams API builds on the Producer and Consumer APIs and adds complex processing capabilities that enable an application to perform continuous, front-to-back stream processing. Specifically, the Streams API involves consuming records from one or more topics, analyzing, aggregating or transforming them as required, and publishing the resulting streams to the same topics or to other topics.
While the Producer and Consumer APIs can be used for simple stream processing, the Streams API enables the development of more sophisticated data- and event-streaming applications.
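For example, the Kafka Streams sketch below consumes a stream of click events, maintains a running count per key and publishes the totals to another topic; the topic names and application ID are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks");

        // Continuously count clicks per key and publish the running totals.
        KTable<String, Long> counts = clicks.groupByKey().count();
        counts.toStream().to("click-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```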
The Connect API lets developers build connectors: reusable producers or consumers that simplify and automate the integration of a data source into a Kafka cluster.
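Connectors are configured rather than coded. As a sketch, the properties below use the FileStreamSource example connector bundled with Kafka to stream lines from a file into a topic; the file path and topic name are illustrative.

```properties
# Example source-connector configuration (illustrative file and topic names).
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/orders.txt
topic=orders
```

In standalone mode, a file like this is passed to bin/connect-standalone.sh along with the worker configuration; in distributed mode, the same settings are submitted as JSON to the Connect REST API.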
Developers use Kafka primarily for creating two kinds of applications:
Applications designed specifically to move massive volumes of data or event records between enterprise systems, at scale and in real time. These apps must move the records reliably, without corruption, duplication or the other problems that typically occur when moving huge volumes of data at high speed.
For example, financial institutions use Kafka to stream thousands of transactions per second across payment gateways, fraud detection services and accounting systems, ensuring accurate, real-time data flow without duplication or loss.
Applications that are driven by record or event streams and that generate streams of their own. In the digitally driven world, we encounter these apps every day.
Examples include ecommerce sites that update product availability in real time or platforms that deliver personalized content and ads based on live user activity. Kafka drives these experiences by streaming user interactions directly into analytics and recommendation engines.
Kafka integrates with several other technologies, many of which are part of the Apache Software Foundation (ASF). Organizations typically use these technologies in larger event-driven architectures, stream processing or big data analytics solutions.
Some of these technologies are open source, while Confluent, a platform built around Kafka, provides enterprise-grade features and managed services for real-time data processing at scale. Companies like IBM, Amazon Web Services and others offer Kafka-based solutions (e.g., IBM Event Streams, Amazon Kinesis) that integrate with Kafka for scalable event streaming.
The Apache Kafka ecosystem includes:
Apache Spark is an analytics engine for large-scale data processing. You can use Spark to perform analytics on streams delivered by Apache Kafka and to produce real-time stream-processing applications, such as clickstream analysis.
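As a rough sketch (assuming the spark-sql-kafka-0-10 connector is on the classpath), the Java job below reads a Kafka topic with Spark Structured Streaming and keeps a running count of events per key; the topic name and addresses are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClickstreamJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("clickstream-analysis")
                .getOrCreate();

        // Each Kafka record arrives as a row with key, value, topic, partition,
        // offset and timestamp columns.
        Dataset<Row> clicks = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "clicks")
                .load();

        // Maintain running counts per key and print them to the console.
        clicks.groupBy("key").count()
                .writeStream()
                .outputMode("complete")
                .format("console")
                .start()
                .awaitTermination();
    }
}
```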
Apache NiFi is a data-flow management system with a visual, drag-and-drop interface. Because NiFi can run as a Kafka producer and a Kafka consumer, it’s an ideal tool for managing data-flow challenges that Kafka can’t address.
Apache Flink is an engine for performing large-scale computations on event streams with consistently high speed and low latency. Flink can ingest streams as a Kafka consumer, perform real-time operations based on these streams, and publish the results to Kafka or another application.
Apache Hadoop is a distributed software framework that lets you store massive amounts of data in a cluster of computers for use in big data analytics, machine learning, data mining and other data-driven applications that process structured and unstructured data. Kafka is often used to create a real-time streaming data pipeline to a Hadoop cluster.
Apache Camel is an integration framework with a rule-based routing and mediation engine. It supports Kafka as a component, enabling easy data integration with other systems (e.g., databases, messaging queues), thus allowing Kafka to become part of a larger event-driven architecture.
Apache Cassandra is a highly scalable NoSQL database designed to handle large amounts of data across many commodity servers without any single point of failure.
Kafka is commonly used to stream data to Cassandra for real-time data ingestion and for building scalable, fault-tolerant applications.
RabbitMQ is a popular open-source message broker that enables applications, systems and services to communicate by translating messaging protocols. Since Kafka began as a message broker (and can still be used as one) and RabbitMQ supports a publish/subscribe messaging model (among others), Kafka and RabbitMQ are often compared as alternatives. However, they serve different purposes and are designed to address different types of use cases. For example, Kafka topics can have multiple subscribers, whereas each RabbitMQ message can have only one. Additionally, Kafka topics are durable, whereas RabbitMQ messages are deleted once consumed.
When choosing between them, it’s essential to consider the specific needs of your application, such as throughput, message durability and latency. Kafka is well-suited for large-scale event streaming, while RabbitMQ excels in scenarios requiring flexible message routing and low-latency processing.
Integrating Apache Kafka and open-source AI transforms how organizations handle real-time data and artificial intelligence. When combined with open-source AI tools, Kafka enables the application of pre-trained AI models to live data, supporting real-time decision-making and automation.
Open-source AI has made artificial intelligence more accessible, and Kafka provides the infrastructure needed to process data in real time. This setup reduces the reliance on batch processing, allowing businesses to act on data immediately as it’s produced.
For example, an ecommerce company might use Kafka to stream customer interactions, such as clicks or product views, as they happen. Pre-trained AI models then process this data in real time, providing personalized recommendations or targeted offers. Kafka manages the data flow, while the AI models adapt based on incoming data, improving customer engagement.
By combining real-time data processing with AI models, organizations can make quicker decisions in fraud detection, predictive maintenance or dynamic pricing, enabling more responsive and efficient systems.
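A compact sketch of this pattern might consume click events, apply a model and publish recommendations back to Kafka. The topic names and broker address are illustrative, and the recommendFor() helper is hypothetical; a real system would call an actual model-serving library or service at that step.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecommendationPipeline {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "recommender");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("customer-clicks"));
            while (true) {
                for (ConsumerRecord<String, String> click : consumer.poll(Duration.ofMillis(500))) {
                    // Hypothetical inference step; a real system would invoke a model here.
                    String recommendation = recommendFor(click.key(), click.value());
                    producer.send(new ProducerRecord<>("recommendations", click.key(), recommendation));
                }
            }
        }
    }

    // Placeholder for a pre-trained model; not a real library call.
    private static String recommendFor(String customerId, String event) {
        return "suggested-item-for:" + customerId;
    }
}
```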
IBM Event Streams is event-streaming software built on open-source Apache Kafka. It is available as a fully managed service on IBM Cloud or for self-hosting.