
Building a Streaming Analytics Stack with Apache Kafka and Druid

This is a guest blog from Fangjin Yang. Fangjin is the co-founder and CEO of Imply, a San Francisco based technology company, and one of the main committers of the Druid open source project. Fangjin previously held senior engineering positions at Metamarkets and Cisco. He holds a BASc in Electrical Engineering and a MASc in Computer Engineering from the University of Waterloo, Canada.

One popular trend in the data world recently is the rise of stream analytics. Organizations are increasingly striving to build solutions that can provide immediate access to key business intelligence insights through real-time data exploration. Architecting a data stack to transmit, store, and analyze streams at scale can be a difficult engineering feat without the proper tools. Luckily, existing open source solutions can be combined to form a flexible and scalable streaming analytics stack. In this blog post, we will use two popular open source projects, Apache Kafka and Druid, to build an analytics stack that enables immediate exploration and visualization of event data. Together they can act as a streaming analytics manager (SAM): Kafka delivers the events, and Druid makes them available for queries.

Apache Kafka
Apache Kafka is a publish-subscribe message bus that is designed for the delivery of streams. The architecture of Kafka is modeled as a distributed commit log, and Kafka provides resource isolation between things that produce data and things that consume data. Kafka is often used as a central repository of streams, where events are stored in Kafka for an intermediate period of time before they are routed elsewhere in a data cluster for further processing and analysis.

Druid
Druid is a streaming analytics data store that is ideal for powering user-facing data applications. Druid is often used to explore events immediately after they occur and to combine real-time results with historical events. Druid can ingest data at a rate of millions of events per second and is often paired with a message bus such as Kafka for high availability and flexibility.

Apache Kafka and Druid, BFFs
In our described stack, Kafka provides high throughput event delivery, and Druid consumes streaming data from Kafka to enable analytical queries. Events are first loaded in Kafka, where they are buffered in Kafka brokers before they are consumed by Druid real-time workers. By buffering events in Kafka, Druid can replay events if the ingestion pipeline ever fails in some way, and these events in Kafka can also be delivered to other systems beyond just Druid. When used together, they can help build streaming analytics apps.
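The buffering and replay behavior described above can be illustrated with a toy model. This is only a sketch, not how Kafka or Druid are implemented: it assumes a plain temporary file stands in for a Kafka topic, each line is one event, and an event's offset is its line number minus one. The function names `produce` and `consume_from` and the sample events are hypothetical.

```shell
#!/bin/sh
# Toy model of a buffered event log: a plain file where each line is one
# event and the 0-based line index plays the role of the offset.
log=$(mktemp)
trap 'rm -f "$log"' EXIT

# Append one event to the end of the log (hypothetical helper).
produce() {
  printf '%s\n' "$1" >> "$log"
}

# Print every event at or after the given 0-based offset (hypothetical helper).
consume_from() {
  tail -n +"$(( $1 + 1 ))" "$log"
}

produce '{"page":"Kafka","added":12}'
produce '{"page":"Druid","added":7}'
produce '{"page":"Kafka","added":3}'

# A consumer that had processed only offset 0 before failing
# can resume from offset 1 after a restart.
echo "replay from offset 1:"
consume_from 1

# A second, independent consumer can still read the whole log from the start.
echo "full read from offset 0:"
consume_from 0
```

Because the events stay in the log after they are read, a failed consumer loses nothing, and additional consumers can read the same events without affecting each other.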

In our tutorial, we are going to set up both Kafka and Druid, load some data, and visualize the data.

Getting started with Apache Kafka and Druid

You will need:
* Java 7 or better
* Node.js 4.x (to visualize the data)
* Linux, Mac OS X, or other Unix-like OS (Windows is not supported)

On Mac OS X, you can use Oracle’s JDK 8 to install Java and Homebrew to install Node.js.

On Linux, your OS package manager should be able to help for both Java and Node.js. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers packages for those OSes. If your Debian, Ubuntu, or Enterprise Linux OS does not have a recent enough version of Node.js, NodeSource offers packages for those OSes.
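Before continuing, it may help to confirm that both prerequisites are on your PATH. The following is a minimal sketch, assuming the executables are named `java` and `node`; the helper name `have` is our own. It only reports what it finds and leaves comparing the versions against the requirements above to you.

```shell
#!/bin/sh
# Return success if the named command is available on PATH (hypothetical helper).
have() {
  command -v "$1" >/dev/null 2>&1
}

for tool in java node; do
  if have "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT found, install it before continuing"
  fi
done

# Print the versions so they can be compared against the requirements.
# java -version writes to stderr, hence the redirect.
if have java; then java -version 2>&1 | head -n 1; fi
if have node; then node --version; fi
```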

We will load the Wikipedia edits data stream for our tutorial. We will be using Imply’s distribution of Druid 0.9.0 and Confluent’s distribution of Kafka 0.10.0.

We’ll also need to download a small program that pulls events from Wikipedia and loads them into Kafka.


Starting Druid

First, in your favorite terminal, download and unpack the Druid distribution.

curl -O
tar -xzf imply-1.2.1.tar.gz

At this time, let’s also download our helper program that will load edits from Wikipedia directly into Kafka.

curl -O
tar -xzf kafka-wikiticker.tar.gz

Next, you’ll need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything:

cd imply-1.2.1
bin/supervise -c ../kafka-wikiticker/conf/quickstart.conf

Starting Kafka

In a separate terminal, download and unpack the release archive of Confluent’s Kafka distribution.

curl -O
tar -xzf confluent-3.0.0-2.11.tar.gz
cd confluent-3.0.0

Start a Kafka broker by running the following command in the new terminal:

./bin/kafka-server-start ./etc/kafka/

That’s it! Your Wikipedia data should now be in Kafka, and this data should be flowing from Kafka to Druid. Let’s visualize this data now.


Visualizing your data

You can immediately begin visualizing data with our stack using Pivot at http://localhost:9095/pivot. Pivot is an open source data visualization application centered around two primary operations: filter and split. Filter is equivalent to WHERE in SQL, and split is equivalent to GROUP BY. You can drag and drop dimensions into Pivot and examine your data through a variety of different visualizations. Some examples of using Pivot are shown below:

Figure: Drag-and-drop UI

Figure: Contextual exploration


Please note that if you split on time, you may only see a single data point as only very recent events have been loaded.
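To make the filter and split operations described above more concrete, here is a small sketch that applies the same two ideas to a few lines of text with awk. It does not use Pivot or Druid. The sample events, their column layout (page, language, characters added), and the function name `filter_and_split` are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sample of edit events: page,language,characters_added
events='Kafka,en,20
Druid,en,7
Kafka,de,3
Druid,en,5'

# Filter (like WHERE language = 'en'), then split (like GROUP BY page),
# summing the characters added for each page.
filter_and_split() {
  printf '%s\n' "$events" |
    awk -F, '$2 == "en" { sum[$1] += $3 } END { for (p in sum) print p, sum[p] }' |
    sort
}

filter_and_split
# prints:
# Druid 12
# Kafka 20
```

The German-language edit is dropped by the filter, and the remaining rows are grouped by page, which is the same shape of result that a filter followed by a split produces in Pivot.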

Further reading

Kafka and Druid can be used to build powerful streaming analytics apps. If you want to learn more about how to load your own datasets into Kafka, there is plenty of information in the Confluent docs. For more information about loading your own data into Druid and about how to set up a highly available, scalable Druid cluster, check out Imply’s documentation.
