Kafka 3.4 Documentation

Previous versions: 0.7.x, 0.8.0, 0.8.1.X, 0.8.2.X, 0.9.0.X, 0.10.0.X, 0.10.1.X, 0.10.2.X, 0.11.0.X, 1.0.X, 1.1.X, 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X, 2.5.X, 2.6.X, 2.7.X, 2.8.X, 3.0.X. 3.1.X. 3.2.X. 3.3.X.

1. Getting Started

1.1

Introduction

1.2 Use

cases

Here is a description of some of the popular use cases for Apache Kafka®. For an overview of several of these areas in action, check out this blog post.

Kafka messaging

works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, and so on). Compared to most messaging systems, Kafka has better performance, built-in partitioning, replication, and fault tolerance, making it a good solution for large-scale message processing applications.

In our experience, messaging uses are typically comparatively low-performance, but they can require low end-to-end latency and often rely on the strong durability guarantees Kafka provides.

In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Tracking Website Activity

The original use case for Kafka was to be able to reconstruct a user activity tracking pipeline as a set of publish-subscribe sources in real time. This means that site activity (page views, searches, or other actions users can take) is published in core topics with one topic per activity type. These sources are available for subscription for a variety of use cases, including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehouse systems for offline processing and reporting.

Activity tracking is typically very high, as many activity messages are generated for each user page view.

Metric

Kafka is often used for operational monitoring data. This involves aggregating distributed application statistics to produce centralized sources of operational data.

Record aggregation

Many people use Kafka as a replacement for a record aggregation solution. Log aggregation typically collects physical log files from servers and places them in a central location (a file server or HDFS perhaps) for processing. Kafka abstracts details from files and provides cleaner abstraction of log data or events as a message flow. This allows for lower latency processing and easier support for multiple data sources and distributed data consumption. Compared to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

Stream processing

Many Kafka users process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then added, enriched, or transformed into new topics for later consumption or follow-up processing. For example, a processing pipeline for recommending news articles could crawl article content from RSS feeds and publish it in an “articles” topic; Post-processing can normalize or deduplicate this content and publish the clean article content to a new topic; A final processing stage might attempt to recommend this content to users. These processing pipelines create graphs of real-time data flows based on individual topics. Starting at 0.10.0.0, a lightweight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. In addition to Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.

Event Sourcing

The event source is an application design style in which state changes are recorded as a sequence of records ordered over time. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.

Confirmation Log

Kafka can serve as a kind of external confirmation record for a distributed system. Logging helps replicate data between nodes and acts as a resynchronization mechanism for failed nodes to restore their data. The record compaction feature in Kafka helps support this usage. In this usage, Kafka is similar to the Apache BookKeeper project.

1.3 Quick

Start

Step 2: Start the

Kafka environment

NOTE: Your local environment must have Java 8+ installed

Apache Kafka can be started using ZooKeeper or KRaft. To get started with either configuration, follow one of the sections below, but not both.

Kafka with

ZooKeeper

Run the following commands to start

all services in the correct order: # Start the ZooKeeper service $ bin/zookeeper-server-start.sh config/zookeeper.properties

Open another terminal session and run:

# Start the Kafka broker service $ bin/kafka-server-start.sh config/server.properties

Once all services have been successfully launched, you will have a basic Kafka environment running and ready to use. Kafka with

KRaft

Generate a cluster UUID

$ KAFKA_CLUSTER_ID=”$(bin/kafka-storage.sh random-uuid)”

Format Log directories

$ bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties Start the Kafka server $ bin/kafka-server-start.sh config/kraft/server.properties Once the Kafka

server has started successfully, you will have a basic Kafka environment up and running and ready to use.

Step 3: Create a theme

to store your events

Kafka is a distributed event streaming platform that allows you to read, write, store, and process events (also called records or messages in documentation) on many machines.

Examples of events are payment transactions, geolocation updates from mobile phones, shipping orders, sensor measurements of IoT devices or medical equipment, and much more. These events are organized and stored in themes. Very simplified, a theme is similar to a folder in a file system, and the events are the files in that folder.

So before you can write your first events, you need to create a theme. Open another terminal session and

run : $ bin/kafka-topics.sh -create -topic quickstart-events -bootstrap-server localhost:9092

All Kafka command-line tools have additional options: run the kafka-topics.sh command without any arguments to display usage information. For example, it can also show you details such as the partition count of the new

topic: $ bin/kafka-topics.sh -describe -topic quickstart-events -bootstrap-server localhost:9092 Topic: quickstart-events Topic ID: NPmZHyhbR9y00wMglMH2sg PartitionCount: 1 ReplicationFactor: 1 Configs: Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 ISR: 0

Step 4: Write some events in the topic

A Kafka client communicates with Kafka runners over the network to write (or read) events. Once received, runners will store events in a durable, fault-tolerant manner for as long as you need, even forever.

Run the console producer client to write some events to the topic. By default, each line you enter will result in a separate event that will be written to the topic.

$ bin/kafka-console-producer.sh -topic quickstart-events -bootstrap-server localhost:9092 This is my first event This is my second event

You can stop the producer client with Ctrl-C at any time.

Step 5: Read

the events

Open another terminal session and run the console consumer client to read the events you just created:

$ bin/kafka-console-consumer.sh -topic quickstart-events -from-beginning -bootstrap-server localhost:9092 This is my first event This is my second event

You can stop the consumer client with Ctrl-C at any time

Feel free to experiment: for example, go back to your producer terminal (previous step) to write additional events and see how the events immediately appear in your consumer terminal.

Because events are stored durably in Kafka, they can be read as many times and by as many consumers as you want. You can easily check this by opening another terminal session and running the previous command again.

Step 6: Import/export your data as event flows with Kafka Connect

You probably have a lot of data in existing systems, such as relational databases or traditional messaging systems, along with many applications that already use these systems. Kafka Connect allows you to continuously ingest data from systems external to Kafka, and vice versa. It is an extensible tool that runs connectors, which implement custom logic to interact with an external system. It is therefore very easy to integrate existing systems with Kafka. To make this process even easier, there are hundreds of these connectors available.

this quickstart, we’ll see how to run Kafka Connect with simple connectors that import

data from a file into a Kafka theme and export data from a Kafka theme to a file.

First, make sure to add connect-file-{{fullDotVersion}}.jar to the plugin.path property in the Connect worker settings. For the purpose of this quickstart, we’ll use a relative path and consider the connectors package as a super jar, which works when quick start commands are run from the installation directory. However, it is worth noting that for production deployments it is always preferable to use absolute paths. See plugin.path for a detailed description of how to configure these settings.

Edit the config/connect-standalone.properties file, add or change the plugin.path configuration property to match the following, and save the file: > echo “plugin.path=libs/connect-file

-{{fullDotVersion}}.jar”

Then, start by creating some seed data to test with

: > echo -e “foonbar” > test.txt Or on Windows: > echo foo> test.txt > echo bar>> test.txt

Next, we’ll start two connectors that run in standalone mode, meaning they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is always the configuration for the Kafka Connect process, which contains a common configuration, such as the Kafka agents to connect to and the data serialization format. The remaining configuration files each specify a connector to create. These files include a unique connector name, the connector class to instantiate, and any other settings required by the connector.

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties These sample configuration files, included with Kafka

, use the default local cluster configuration that you started earlier and create two connectors: the first is a source connector that reads lines from an input file and produces each to a Kafka topic and the second is a receiver connector that reads messages from a Kafka theme and produces each as a in an output file.

During startup, you will see a number of log messages, including some indicating that the connectors are being instantiated. After the Kafka Connect process starts, the source connector should start reading test lines.txt and producing them in the connect-test topic, and the receiving connector should start reading messages from the connect-test topic and writing them to the test.sink.txt file. We can verify that the data has been delivered through the entire pipeline by examining the contents of the output file:

> more test.sink.txt foo bar Note that the data

is stored in Kafka’s connect-test theme, so we can also run a console consumer to view the theme data (or use custom consumer code to process it):

> bin/kafka-console-consumer.sh -bootstrap-server localhost:9092 -topic connect-test -from-beginning {“schema”:{“type”:”string”,”optional”:false},”payload”:”foo”} {“schema”:{“type”:”string”,”optional”:false},”payload”:”bar”} …

The connectors continue to process data, so we can add data to the file and watch it move through the pipeline

: > echo Another line>> test.txt

You should see the line appear in the console’s consumer output and in the receiving file.

Step 7: Process

your events with Kafka Streams

Once your data is stored in Kafka as events, you can process the data with the Kafka Streams client library for Java/Scala. It allows you to deploy mission-critical real-time applications and microservices, where input and/or output data is stored in Kafka topics. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology to make these applications highly scalable, elastic, fault-tolerant, and distributed. The library supports exact one-time processing, stateful operations and aggregations, windows, joins, event-time-based processing, and much more.

To give you a first idea, this is how the popular WordCount algorithm would be implemented:

KStream<String, String> textLines = builder.stream(“quickstart-events”); KTable<String, Long> wordCounts = textLines .flatMapValues(line -> Arrays.asList(line.toLowerCase().split(” “))) .groupBy((keyIgnored, word) -> word) .count(); wordCounts.toStream().to(“output-topic”, Produced.with(Serdes.String(), Serdes.Long()));

The Kafka Streams demo and application development tutorial demonstrate how to code and run a streaming application from start to finish.

Step 8: Finish

the Kafka environment

Now that you’ve reached the end of the quick start, feel free to take down Kafka’s environment or keep playing

. Stop producer and consumer customers with Ctrl-C, if

you have not already done so. Stop

the Kafka corridor with Ctrl-C.

Finally, if the Kafka section was followed with ZooKeeper, stop the ZooKeeper server with Ctrl-C.

If you also want to delete any data from your on-premises Kafka environment, including events you’ve created along the way, run the command

: $ rm -rf /tmp/kafka-logs /tmp/zookeeper /tmp/kraft-combined-logs

Congratulations!

The Apache Kafka Quick Start has completed successfully.

For more information, we suggest the following steps:

Read the short introduction to learn how Kafka works at a high level, its main concepts, and how it compares to other technologies. To understand Kafka in more detail, go to the Documentation.
Browse through the use cases to learn how other users in our global community are getting value

from Kafka.

Join a local Kafka meeting group and watch talks at Kafka Summit, the main conference of the Kafka community.

1.4 Ecosystem

There are a lot of tools that integrate with Kafka outside of the main distribution. The ecosystem page lists many of these, including flow processing systems, Hadoop integration, monitoring, and deployment tools.

Blogs

Kafka 3.4 Documentation