Apache Kafka is an open-source, distributed streaming platform that allows developers to build real-time, event-driven applications. With Apache Kafka, developers can build applications that continuously use streaming data records and deliver real-time experiences to users.
Whether checking an account balance, streaming Netflix or browsing LinkedIn, today’s users expect near real-time experiences from apps. Apache Kafka’s event-driven architecture was designed to store data and broadcast events in real-time, making it both a message broker and a storage unit that enables real-time user experiences across many different kinds of applications.
Apache Kafka is one of the most popular open-source data processing systems available, with nearly 50,000 companies using it and a market share of 26.7%.
How does Apache Kafka work?
Kafka is a distributed system, meaning it is a collection of different software programs that share computational resources across multiple nodes (computers) to achieve a single goal. This architecture makes Kafka more fault-tolerant than other systems because it can cope with the loss of a single node or machine in the system and still function.
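The fault tolerance described above can be pictured with a small toy model. The sketch below is not Kafka's replication protocol; the `ToyCluster` class, its method names and the node names are all hypothetical, and the only idea it illustrates is that a record copied to more than one node survives the loss of one of them.

```python
from typing import Dict, List


class ToyCluster:
    """Toy model of replication across nodes (hypothetical, not Kafka's protocol)."""

    def __init__(self, node_names: List[str], replication_factor: int = 2) -> None:
        if replication_factor > len(node_names):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Each node holds its own copy of the records written to it.
        self.nodes: Dict[str, List[str]] = {name: [] for name in node_names}

    def write(self, record: str) -> None:
        # Copy the record to the first `replication_factor` live nodes.
        for name in list(self.nodes)[: self.replication_factor]:
            self.nodes[name].append(record)

    def fail(self, name: str) -> None:
        # Simulate losing a machine and everything stored on it.
        del self.nodes[name]

    def read_all(self) -> List[str]:
        # Any surviving node that holds a copy can serve the records.
        for records in self.nodes.values():
            if records:
                return list(records)
        return []


cluster = ToyCluster(["node-a", "node-b", "node-c"], replication_factor=2)
cluster.write("order-1")
cluster.write("order-2")
cluster.fail("node-a")  # one machine goes down
surviving = cluster.read_all()
print(surviving)  # ['order-1', 'order-2']
```

Because every record was written to two nodes, the copy on `node-b` is still readable after `node-a` disappears.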
Among distributed systems, Kafka has distinguished itself as one of the best tools for building microservices architectures, a cloud-native approach where a single application is composed of many smaller, connected components or services. In addition to cloud-native environments, developers are also using Apache Kafka on Kubernetes, an open-source container orchestration platform, to develop apps using serverless frameworks.
For developers, a big part of Kafka’s appeal is its architecture. Kafka uses a publish-subscribe messaging system built on asynchronous communication: producers publish messages without waiting for consumers to receive them, which makes it easier for developers to build advanced, architecturally complex applications. Kafka’s architecture is organized around three kinds of building blocks (events, producers and consumers) and it relies heavily on application programming interfaces (APIs) to function.
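To make the publish-subscribe idea concrete, here is a minimal in-memory sketch of the general pattern. It is not Kafka's API: `ToyPubSub`, its methods and the topic and subscriber names are hypothetical. It also gives each subscriber its own inbox, which is a simplification chosen for brevity, not a description of how Kafka stores messages.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class ToyPubSub:
    """Generic publish-subscribe sketch (hypothetical names, not Kafka's API)."""

    def __init__(self) -> None:
        # topic -> subscriber name -> that subscriber's pending messages
        self._inboxes: Dict[str, Dict[str, Deque[str]]] = defaultdict(dict)

    def subscribe(self, topic: str, subscriber: str) -> None:
        self._inboxes[topic][subscriber] = deque()

    def publish(self, topic: str, message: str) -> int:
        # The publisher never addresses a subscriber directly and does not
        # wait for anyone to read the message.
        inboxes = self._inboxes[topic]
        for inbox in inboxes.values():
            inbox.append(message)
        return len(inboxes)

    def drain(self, topic: str, subscriber: str) -> List[str]:
        # Subscribers collect their messages whenever they are ready.
        inbox = self._inboxes[topic][subscriber]
        messages = list(inbox)
        inbox.clear()
        return messages


bus = ToyPubSub()
bus.subscribe("orders", "billing")
bus.subscribe("orders", "shipping")
delivered = bus.publish("orders", "order-42 placed")
bus.publish("orders", "order-43 placed")

# Each subscriber reads later, independently of the publisher and of each other.
billing_messages = bus.drain("orders", "billing")
shipping_messages = bus.drain("orders", "shipping")
print(delivered, billing_messages, shipping_messages)
```

The publisher only knows the topic name, so new subscribers can be added without changing the publishing code.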
Important Kafka concepts
Apache Kafka works on four underlying concepts: events, streaming, producers and consumers. Here’s a brief look at how these concepts work together to give Apache Kafka its core capabilities.
Events and streaming
When a user interacts with a website, to register for a service or place an order for example, it’s described as an ‘event.’ In Kafka architecture, an event is any message that contains information describing what a user has done. For example, if a user has registered on a website, an event record would contain their name and email address.
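As an illustration, the registration event mentioned above could be modeled as a small record that is serialized to bytes before being sent. This is only a sketch: the `UserRegistered` class, the `type` field and the choice of JSON are assumptions made for the example, not a format that Kafka prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UserRegistered:
    """Hypothetical event describing a completed registration."""
    name: str
    email: str


def to_record(event: UserRegistered) -> bytes:
    # Add a type tag so a reader can tell what kind of event this is.
    payload = {"type": "user_registered", **asdict(event)}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def from_record(raw: bytes) -> UserRegistered:
    payload = json.loads(raw.decode("utf-8"))
    return UserRegistered(name=payload["name"], email=payload["email"])


event = UserRegistered(name="Ada Example", email="ada@example.com")
raw = to_record(event)
restored = from_record(raw)
print(raw)
print(restored == event)  # True
```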
Perhaps no other capability distinguishes Apache Kafka from other data storage architectures more than its ability to stream events, a capability known as ‘event streaming’ or just ‘streaming’. (Kafka Streams, a related but distinct term, is the name of Kafka’s client library for processing those streams.) Event streaming is when data that is generated by hundreds or even thousands of producers is sent simultaneously over a platform to consumers.
Producers and consumers
A ‘producer’, in Apache Kafka architecture, is anything that can create data: for example, a web server, an application or application component, an Internet of Things (IoT) device and many others. A ‘consumer’ is any component that needs the data that’s been created by the producer to function. For example, in an IoT app, the data could be information from sensors connected to the Internet, such as a temperature gauge or a sensor in a driverless vehicle that detects a traffic light has changed.
Kafka’s architecture is designed in such a way that it can handle a constant influx of event data generated by producers, keep accurate records of each event, and constantly publish a stream of these records to consumers.
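The flow described above (producers append events, the system keeps an ordered record, consumers read the stream) can be sketched as an append-only log in which each consumer tracks its own position. This is a toy, single-process model; `ToyLog`, `append`, `poll` and the consumer names are hypothetical and do not correspond to Kafka client calls.

```python
from typing import Dict, List


class ToyLog:
    """Append-only log with per-consumer read positions (hypothetical sketch)."""

    def __init__(self) -> None:
        self._records: List[str] = []
        # consumer name -> index of the next record that consumer will read
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        # Producers only ever add to the end; existing records never change.
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[str]:
        # Reading does not remove records, it only advances this consumer's offset.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start : start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for i in range(3):
    log.append(f"event-{i}")

fast_first = log.poll("analytics")              # reads everything so far
slow_first = log.poll("audit", max_records=1)   # reads one record at a time
log.append("event-3")
fast_next = log.poll("analytics")               # only the new record
slow_next = log.poll("audit")                   # resumes where it left off
print(fast_first, slow_first, fast_next, slow_next)
```

Because the log keeps every record and each consumer has its own offset, a slow consumer does not hold back a fast one.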
Apache Kafka use cases
Apache Kafka’s core capability of real-time data processing has thrown open the floodgates in terms of what apps can do across many industries. Using Kafka, enterprises are exploring new ways to leverage streaming data to increase revenue, drive digital transformation and create delightful experiences for their customers. Here are a few of the most striking examples.
Internet of Things (IoT)
The Internet of Things (IoT), a network of devices embedded with sensors allowing them to collect and share data over the Internet, relies heavily on Apache Kafka architecture. For example, sensors connected to a windmill use IoT capabilities to transmit data on things like wind speed, temperature and humidity over the Internet. In this architecture, each sensor is a producer, generating data every second that it sends to a backend server or database—the consumer—for processing.
Kafka architecture facilitates this back-and-forth transmission and receipt of data, as well as its processing, in real-time, allowing scientists and engineers to track weather conditions from hundreds or thousands of miles away. Kafka’s record-keeping and message-queue capabilities help ensure the quality and accuracy of the data that’s being gathered.
Financial services
In the same way that Kafka enables the gathering of data via IoT devices that can be streamed to consumers in real-time, it also enables the gathering and analysis of information from the stock market.
Kafka has been used for many business-critical, high-volume workloads that are essential to trading stocks and monitoring financial markets. Some of the world’s largest banks and financial institutions, such as PayPal, ING and JPMorgan Chase, use it for real-time data analysis, financial fraud detection, risk management in banking operations, regulatory compliance, market analysis and more.
Retail
Online retailers and e-commerce sites must process thousands of orders from their app or website every day, and Kafka plays a central role in making this happen for many businesses. Response time and customer relationship management (CRM) are key to success in the retail industry, so it’s important that orders are processed quickly and accurately.
Kafka helps simplify the communication between customers and businesses, using its data pipeline to accurately record events and keep records of orders and cancellations—alerting all relevant parties in real-time. In addition to processing orders, Kafka generates accurate data that can be analyzed to assess business performance and uncover valuable insights.
Healthcare
The healthcare industry relies on Kafka to connect hospitals to critical electronic health records (EHR) and confidential patient information. Kafka facilitates two-way communication that powers healthcare apps that rely on data that’s being generated in real-time by several different sources. Kafka’s capabilities also allow the sharing of knowledge in real-time; for example, promptly sharing a patient’s allergy to a certain medication can save lives.
In addition to helping doctors get real-time data that informs how they treat patients, Kafka is also critical to the medical research community. Its data storage and analytics capabilities help researchers scour medical data for insights into diseases and patient care, speeding medical breakthroughs.
Telecom
Telecommunications companies use Kafka for a variety of services. Primarily, its real-time data stream processing is used to monitor the networks that power millions of wireless devices worldwide. Kafka collects data on network operations that it streams in real-time to servers that are constantly analyzing it for any problems. Records that Kafka keeps for telecommunications companies include calls, texts, customer data, usage, dropped calls and more.
Gaming
Today’s most advanced gaming platforms rely on real-time communication between players hundreds and even thousands of miles apart. If there’s any lag time in a game where players’ reaction time is key to their success, performance will suffer. What’s more, the gaming industry has been booming of late, growing at a compound annual growth rate (CAGR) of 13.4% and increasing the scrutiny of its key operational metrics.
Kafka powers the lightning-fast communication and interaction between players that makes hyper-real gaming ecosystems so popular. New games rely on Kafka’s real-time streaming abilities as well as its real-time analytics and data-storage functions. Furthermore, Kafka’s streaming pipeline helps players keep track of each other in real-time by ensuring that player movements are transmitted to other players with minimal delay.
Benefits of Apache Kafka
Developers and engineers at some of the largest, most modern enterprises in the world use Kafka to build many real-time business applications. Apache Kafka is behind apps that serve the financial industry, online shopping giants, music and video streaming platforms, video game innovators and more. Developing with Kafka has many advantages over other platforms. Here are a few of its most popular benefits.
Speed
Kafka’s data processing system uses APIs in a way that helps it optimize data integration with many other database storage designs, such as the popular SQL and NoSQL architectures used for big data analytics.
Scalability
Kafka was built to address high-latency issues in batch-queue processing on some of the busiest websites in the world. It has what’s known as elastic, multi-cluster scalability, allowing workflows to be provisioned across multiple Kafka clusters, rather than just one, enabling greater scalability, high throughput and low latency.
Connectivity
Kafka Connect, a data streaming tool, comes with 120 pre-built connectors that enable Kafka to integrate with all the most popular backend data storage solutions, including Amazon S3, MongoDB, Google BigQuery, Elasticsearch, Azure, Datadog and more. Developers using Kafka can speed app development with support for whatever requirements their organization has.
Storage and tracking
Since some of the biggest and most demanding websites in the world use Kafka, it needs to be able to log user activity quickly and accurately to avoid disruptions. Kafka records frequent events like user registration, page views, purchases and other information related to website activity tracking in real-time. Then it groups the data by topic and stores it over a distributed network for fast, easy access.
Messaging
Kafka receives and keeps messages in a queue, a container used for the storing and transmitting of messages. The container connects the messages to consumer apps and the user. Kafka is designed in a similar way to other popular message brokers, like RabbitMQ; but unlike RabbitMQ and these other brokers, it organizes its messages into Kafka topics, and each message can carry a key that groups related messages together and can be used to filter messages by relevancy.
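One way to picture how a message key groups related messages is the sketch below. It assumes a topic split into a fixed number of partitions, with a hash of the key choosing the partition, which is broadly how Kafka routes keyed messages by default. The details are simplified: `partition_for` and `NUM_PARTITIONS` are hypothetical names, and CRC32 stands in for the hash function that real Kafka clients use.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # assumed partition count for this toy topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # A deterministic hash means the same key always maps to the same partition.
    # CRC32 is a stand-in here, not the hash a real Kafka client applies.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
events = [
    ("user-1", "page_view"),
    ("user-2", "purchase"),
    ("user-1", "purchase"),
    ("user-2", "page_view"),
]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All events for one key sit in one partition, in the order they were sent.
user_1_events = [v for k, v in partitions[partition_for("user-1")] if k == "user-1"]
print(user_1_events)  # ['page_view', 'purchase']
```

Keeping every message for a given key in one place is what lets a consumer process that key's events in order.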
Data processing
One of Kafka’s most appealing attributes is its ability to capture and store event data in real-time. Many other data pipelines must run in what’s called a scheduled batch, a batch of data that can only be processed at a pre-scheduled time. Kafka’s design allows for data to be processed in real-time, enabling technologies like IoT, analytics and others that depend on real-time data processing to function.
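The difference between scheduled-batch and real-time processing can be shown with a small, Kafka-independent sketch. The function names and the temperature readings are made up for illustration; the point is only that a streaming computation produces an up-to-date result after every event, while a batch computation produces one result after all the data has been collected.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    # Batch style: wait until the whole data set is available, then compute once.
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    # Streaming style: update the result as each reading arrives.
    count = 0
    total = 0.0
    for reading in readings:
        count += 1
        total += reading
        yield total / count


temperatures = [20.0, 22.0, 24.0]  # made-up sensor readings
running = list(streaming_average(temperatures))
final = batch_average(temperatures)
print(running)  # [20.0, 21.0, 22.0]
print(final)    # 22.0
```

Both approaches agree on the final value, but only the streaming version has an answer available before the last reading arrives.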
Learn more
Apache Kafka was built to store data and broadcast events in real-time, delivering dynamic user experiences across a diverse set of applications. IBM Event Streams helps businesses optimize Kafka with an open-source platform that can be deployed as either a fully managed service on IBM Cloud or on-premises as part of Event Automation.