
How I Learned to Stop Worrying and Love the Schema

The rise in schema-free and document-oriented databases has led some to question the value and necessity of schemas. Schemas, in particular those following the relational model, can seem too restrictive, and the case has been made that software development can be faster and more agile without them. However, just because it’s possible to go without schemas doesn’t mean it’s wise to do so – this sort of local optimization can cause huge headaches within even a small organization.

The Hazards of Many Languages

Imagine for a moment that you work at a company where all employees are required to speak only in their native language. Intercommunication can work, but either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken in the company. Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.

Furthermore, even if your company phases out, say, Latin, you’re stuck either employing your Latin translators for the rest of eternity or converting all Latin records into a new language, unless you are willing to discard those records.

Compare this to a company which standardized on a single language from the start. Every single form of communication is easier, and every message can be consumed many times at zero extra cost. Although there is an up-front cost in the sense that new employees must already know the language or be trained in it, the payoff is huge and permanent.

Having no standardized way of defining data across an organization presents a similar problem. It may be fine in the short term, but it quickly causes unnecessary difficulties, and it just doesn’t scale when you imagine multiplying by potentially thousands of different categories of data.

Temp-o-meter – A Tale of Woe

Let’s illustrate with a simple example. Say that your company has built a smartphone app called temp-o-meter which collects temperature data and sends it back to HQ. Version 0.1 was built in a hurry and produces simple comma-separated data points with the format:

“device_id, temp_celsius, timestamp, latitude, longitude”

A typical data point might look like this:


Pretty reasonable, but time passes, and the team decides JSON is easier to work with, and temp-o-meter v0.2 produces data like this:

    {
        "device_id": 123,
        "temperature": 212,
        "latitude": 37.386052,
        "longitude": -122.083851
    }

The problem is, some stage(s) in the downstream pipeline must now have logic to differentiate between CSV and JSON, and this logic must be aware that in the CSV format, temperature readings are in Celsius, but temperature stored in JSON is in Fahrenheit. What’s more, there may be some users who never upgrade their app, so the different versions of this data will continue to be published indefinitely.
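The branching this paragraph describes might look roughly like the following Python sketch. The function name `normalize_reading` and the trick of detecting JSON by a leading brace are assumptions made for illustration, not part of temp-o-meter. The timestamp is passed through untouched because its format is not specified here.

```python
import csv
import io
import json


def normalize_reading(raw):
    """Turn a v0.1 (CSV) or v0.2 (JSON) payload into one common shape.

    Hypothetical helper: the output always reports temperature in Celsius.
    """
    text = raw.strip()
    if text.startswith("{"):
        # v0.2: JSON payload, temperature reported in Fahrenheit.
        record = json.loads(text)
        return {
            "device_id": int(record["device_id"]),
            "temp_celsius": (float(record["temperature"]) - 32.0) * 5.0 / 9.0,
            "timestamp": record.get("timestamp"),  # None if the field is absent
            "latitude": float(record["latitude"]),
            "longitude": float(record["longitude"]),
        }
    # v0.1: CSV in the order device_id, temp_celsius, timestamp, latitude, longitude.
    device_id, temp_celsius, timestamp, latitude, longitude = next(
        csv.reader(io.StringIO(text), skipinitialspace=True)
    )
    return {
        "device_id": int(device_id),
        "temp_celsius": float(temp_celsius),
        "timestamp": timestamp,  # kept as the raw string
        "latitude": float(latitude),
        "longitude": float(longitude),
    }
```

Every consumer that reads this data needs some version of this function, and every one of them has to be updated whenever a producer changes its format or units.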

Granted, this example is a bit contrived – clearly, it’s not great to represent a given type of data with a mix of formats such as CSV, JSON, and XML. However, just standardizing on a format such as JSON is not enough. Standardizing on JSON without using schemas is a little like standardizing on the Roman alphabet without standardizing on a language – everyone can easily read and write individual letters, but that still doesn’t guarantee they can read the messages!

Let’s go a little further with the temp-o-meter example and pretend that we live in a science-fiction world where not only phones, but even things like watches can produce streams of data. temp-o-meter needs to be ported, but luckily for the watch team, JSON is now the standard, and the format of temperature data was loosely documented on an obscure wiki page.

Here’s what the temp-o-meter watch team came up with:

    {
        "device_id": "watch_345",
        "timestamp": "Tue 05-17-2015 6:00",
        "temperature": 212,
        "latitude": 37.386052,
        "longitude": -122.083851
    }

Not so bad on its own. However, although the format is now JSON, and although the field names are identical, the watch team used slightly different data formats in a few of the fields. “device_id” is no longer parseable as an integer, and “timestamp” is a completely different format.

Why is this a problem? Suppose there is an application which consumes data produced by temp-o-meter v0.2 – it might have a chunk of code like this:

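    import json

    # Consumer parses a chunk of data from temp-o-meter v0.2.
    # data_chunk is the raw JSON string received from the producer.
    data = json.loads(data_chunk)

    # Each cast hard-codes an assumption about the field's format.
    device_id = int(data['device_id'])
    timestamp = float(data['timestamp'])
    temperature = float(data['temperature'])
    latitude = float(data['latitude'])
    longitude = float(data['longitude'])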

This consumer must be upgraded before the watch team releases; otherwise it will break completely when it encounters the (unintentionally) new data format. Although the watch and phone teams now both use JSON, different components of the system are tightly coupled because the data’s ‘schema’ is embedded in both the producers and the consumers of this data. The ability to evolve different components of the system safely and independently has been hamstrung.
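To make the failure concrete, here is a small sketch that feeds the watch team's payload to the same casts the v0.2 consumer uses. The wrapper function `parse_v02` is a hypothetical name introduced for this illustration.

```python
import json

# The watch payload from the example above.
watch_chunk = """
{
    "device_id": "watch_345",
    "timestamp": "Tue 05-17-2015 6:00",
    "temperature": 212,
    "latitude": 37.386052,
    "longitude": -122.083851
}
"""


def parse_v02(data_chunk):
    """Apply the same field-by-field casts as the v0.2 consumer."""
    data = json.loads(data_chunk)
    return {
        "device_id": int(data["device_id"]),
        "timestamp": float(data["timestamp"]),
        "temperature": float(data["temperature"]),
        "latitude": float(data["latitude"]),
        "longitude": float(data["longitude"]),
    }


error = None
try:
    parse_v02(watch_chunk)
except ValueError as exc:
    # int("watch_345") fails before the timestamp is even examined.
    error = exc
    print(f"consumer failed: {exc}")
```

The JSON itself parses without complaint. The breakage only appears at the first cast, which is exactly the kind of error that surfaces in production rather than at publish time.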

Had this company been using schemas, the watch and phone teams could have simply reused the same schema, avoiding the need for the watch team to reinvent the wheel, and preventing subtle incompatibilities which ultimately break a bunch of downstream consumers. By sharing the schema between watch and phone apps, this unintentional data evolution would have easily been avoided.

DRY Your Data Definitions With Schemas

At this stage in our little story, the ‘definition’ of temp-o-meter’s temperature data is decidedly un-DRY: it is encoded informally in temp-o-meter v0.1, temp-o-meter v0.2, some wiki pages, the watch app, and in all the various consumers which later parse and analyze this data.

On the other hand, by using schemas, the data definition for a particular kind of data exists in a single place. What’s more, schemas serve as self-contained and automatically enforceable contracts between writers and readers of data. Though they don’t remove the need for testing, schemas make testing data compatibility significantly simpler and can nip an entire class of problems in the bud by preventing corrupt, malformed or incompatible data from ever being published in the first place.
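As a rough sketch of what an "automatically enforceable contract" can mean, the following Python defines the data shape once and refuses to publish records that violate it. Everything here is hypothetical: `TEMPERATURE_SCHEMA`, `validate`, and `publish` are illustrative names, the field types are assumptions, and a plain dictionary stands in for a real schema language such as Avro or JSON Schema.

```python
# Hypothetical schema: field name -> tuple of accepted Python types.
# Field names come from the temp-o-meter examples; the types are assumed.
NUMBER = (int, float)

TEMPERATURE_SCHEMA = {
    "device_id": (int,),
    "timestamp": NUMBER,
    "temperature": NUMBER,
    "latitude": NUMBER,
    "longitude": NUMBER,
}


def validate(record, schema=TEMPERATURE_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, allowed in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            expected = " or ".join(t.__name__ for t in allowed)
            problems.append(
                f"{field}: expected {expected}, got {type(value).__name__}"
            )
    for field in record:
        if field not in schema:
            problems.append(f"{field}: not in schema")
    return problems


def publish(record, sink, schema=TEMPERATURE_SCHEMA):
    """Append the record to sink only if it satisfies the schema."""
    problems = validate(record, schema)
    if problems:
        raise ValueError("record rejected: " + "; ".join(problems))
    sink.append(record)
```

With a check like this on the producer side, the watch team's string `device_id` and free-form `timestamp` would be rejected before reaching any consumer, and both teams would be working from the same single definition.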

For additional compelling reasons to use schemas, it’s worth revisiting this post on Stream Data Platforms. In the next post on schemas, I’ll talk more about how schemas can provide a powerful tool to help evolve data formats in a sane and compatible way.
