Technonews: Data Serialization Comparison: JSON, YAML, BSON, MessagePack

JSON is the de facto standard for data exchange on the web, but it has its drawbacks, and there are other formats that may be more suitable for certain scenarios. I’ll compare the pros and cons of the alternatives, including ease of use and performance.

Note: I won’t cover implementation details here, but if you’re a Ruby programmer, check out this article, where Dhaivat writes about implementing some serialization formats in Ruby.

What Is Data Serialization

According to Wikipedia, serialization is:

the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

Let’s say you want to collect certain data about a group of people: name, last name, nickname, date of birth, and the instruments they play. You could easily set up a spreadsheet, define some columns, and make every row an entry. You could go a little further and specify that the date of birth column must be a number and that the instruments column holds a list of options. It’d look like this:

name	last name	dob	nickname	instruments
William	Bailey	1962	Axl Rose	vocals, piano
Saul	Hudson	1965	Slash	guitar

More or less, what you did there was to define a data structure; and you’ll do just fine if you only need this on a spreadsheet format. The problem is that, if you ever want to exchange this information with a database or a website, the mechanics by which these data structures are implemented on these other platforms — even if the underlying semantics are overall the same — will be dramatically different. You can’t just plug-n-play a spreadsheet into a web application, unless the application has been specifically designed for it. And you can’t transfer that info from the website to the database unless you have some sort of export tool or gateway for it.

Let’s assume that our website already has these data structures implemented in its internal logic, and that it just cannot deal with a spreadsheet format. In order to solve these problems, you can translate these data structures into a format that can be easily shared across different applications, architectures, or what have you: you serialize them. And by doing so, you ensure not only that you can transfer this data across platforms, but that they can be reconstructed in the reverse process called deserialization. Furthermore, if exchanged back from the website to the spreadsheet, you’ll get a semantically identical clone of the original object — that is, a row that looks exactly the same as the one you originally sent.

In short: serializing data is finding some sort of universal format that can be easily shared across different applications.
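
A minimal sketch of that round trip in Python, using only the standard library. The tab-separated text mirrors the spreadsheet above, JSON stands in for the "universal format", and the helper name rows_to_records is made up for this illustration:

```python
import csv
import io
import json

# The spreadsheet from above, as tab-separated text.
SHEET = (
    "name\tlast name\tdob\tnickname\tinstruments\n"
    "William\tBailey\t1962\tAxl Rose\tvocals, piano\n"
    "Saul\tHudson\t1965\tSlash\tguitar\n"
)


def rows_to_records(sheet_text):
    """Turn spreadsheet rows into dicts with typed fields (hypothetical helper)."""
    records = []
    for row in csv.DictReader(io.StringIO(sheet_text), delimiter="\t"):
        row["dob"] = int(row["dob"])  # the dob column must be a number
        row["instruments"] = [item.strip() for item in row["instruments"].split(",")]
        records.append(row)
    return records


records = rows_to_records(SHEET)
wire_text = json.dumps(records)   # serialize: data structure -> text
restored = json.loads(wire_text)  # deserialize: text -> data structure

print(restored == records)  # True: the clone is semantically identical
```

The text in wire_text is what would travel between the spreadsheet, the website, and the database; each side only needs to agree on the format, not on the other side's internals.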

The Formats

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It’s easy for humans to read and write; it’s easy for machines to parse and generate.

JSON is the most widespread format for data serialization, and it has the following features:

(Mostly) human readable code: even if the code has been obscured or minified, you can always indent it with tools such as JSONLint and make it readable again.

Very simple and straightforward specification: a summary of the whole spec fits on a single page (as displayed on the JSON site).

Widespread support: not only does virtually every programming language or IDE come with JSON support, but many web service APIs also offer JSON as a means of data interchange.

As a subset of JavaScript, it supports the following JavaScript data types:
- string
- number
- object
- array
- true and false
- null

This is how our previous spreadsheet would look, after being serialized in JSON:

[
  {
    "name": "William",
    "last name": "Bailey",
    "dob": 1962,
    "nickname": "Axl Rose",
    "instruments": [
      "vocals",
      "piano"
    ]
  },
  {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": [
      "guitar"
    ]
  }
]
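
To make the data types and the "you can always indent it again" point concrete, here is a small sketch with Python's standard json module. The minified string is the same data as above; the variable names are arbitrary:

```python
import json

# The same two records, minified onto a single line.
MINIFIED = (
    '[{"name":"William","last name":"Bailey","dob":1962,'
    '"nickname":"Axl Rose","instruments":["vocals","piano"]},'
    '{"name":"Saul","last name":"Hudson","dob":1965,'
    '"nickname":"Slash","instruments":["guitar"]}]'
)

people = json.loads(MINIFIED)

# Each JSON type arrives as a native Python type:
# string -> str, number -> int or float, array -> list, object -> dict.
type_names = {key: type(value).__name__ for key, value in people[0].items()}
print(type_names)

# Re-indenting minified JSON is a one-liner, no external tool required.
pretty = json.dumps(people, indent=2)
print(pretty)
```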

BSON

BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. … It also contains extensions that allow representation of data types that are not part of the JSON spec.

JSON is a plain text format, and while binary data can be encoded in text, this has certain limitations and can make JSON files very big. BSON comes in to deal with these problems.

It has the following features:

convenient storage of binary information: better suited for exchanging images and attachments

designed for fast in-memory manipulation

simple specification: like JSON, BSON has a very short and simple spec

primary data representation for MongoDB: BSON is designed to be traversed easily

extra data types:
- double (64-bit IEEE 754 floating point number)
- date (integer number of milliseconds since the Unix epoch)
- byte array (binary data)
- BSON object and BSON array
- JavaScript code
- MD5 binary data
- regular expressions
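
The cost of carrying binary data inside plain JSON, mentioned at the start of this section, can be sketched with the standard library alone. The payload here is stand-in data rather than a real image, and Base64 is assumed as the text encoding (a common choice, not the only one):

```python
import base64
import json

# 3072 bytes of stand-in binary data (imagine an image or an attachment).
payload = bytes(range(256)) * 12

# JSON has no byte-array type, so raw bytes are rejected outright.
try:
    json.dumps({"attachment": payload})
    raw_bytes_accepted = True
except TypeError:
    raw_bytes_accepted = False

# The usual workaround: encode the bytes as Base64 text first.
encoded = base64.b64encode(payload).decode("ascii")
document = json.dumps({"attachment": encoded})

print(raw_bytes_accepted)          # False
print(len(payload), len(encoded))  # 3072 4096: about one third larger
```

A format with a native byte-array type, such as BSON, can store those bytes as they are, without the Base64 detour.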

MessagePack

MessagePack: It’s like JSON. But fast and small.

MessagePack (also msgpack) is another binary format for serialization. It’s not as well known as BSON, but it’s worth a look.

Among its features:

designed for efficient transmission over the wire

better JSON-compatibility than BSON: as explained by Sadayuki Furuhashi in this Stack Overflow post

smaller than BSON: it has a smaller overhead than BSON, and can serialize smaller objects most of the time

type checking: it supports static-typing

streaming API: support for streaming deserializers, which is useful for network communication.
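
To see where the size savings come from, here is a hand-rolled sketch of a tiny subset of the MessagePack wire format. It is not the official library and it deliberately covers only nil, booleans, small non-negative integers, short strings, and short lists and maps; the function name pack is chosen for this illustration:

```python
import json
import struct


def pack(value):
    """Encode a small subset of values in MessagePack's wire format (sketch only)."""
    if value is None:
        return b"\xc0"                             # nil
    if isinstance(value, bool):                    # check bool before int
        return b"\xc3" if value else b"\xc2"       # true / false
    if isinstance(value, int) and 0 <= value <= 0xFFFF:
        if value <= 0x7F:
            return bytes([value])                  # positive fixint: one byte
        if value <= 0xFF:
            return b"\xcc" + bytes([value])        # uint 8
        return b"\xcd" + struct.pack(">H", value)  # uint 16, big-endian
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: length in the tag byte
        if len(raw) <= 0xFF:
            return b"\xd9" + bytes([len(raw)]) + raw  # str 8
    if isinstance(value, list) and len(value) <= 15:
        items = b"".join(pack(item) for item in value)
        return bytes([0x90 | len(value)]) + items  # fixarray
    if isinstance(value, dict) and len(value) <= 15:
        pairs = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + pairs  # fixmap
    raise ValueError(f"outside this sketch's subset: {value!r}")


small = {"compact": True, "schema": 0}
packed = pack(small)
as_json = json.dumps(small, separators=(",", ":"))

print(packed.hex())                # 82a7636f6d70616374c3a6736368656d6100
print(len(packed), len(as_json))   # 18 27
```

Most of the savings come from folding type and length into a single tag byte, so there are no quotes, colons, commas, or brackets to transmit. For real work, use a maintained MessagePack library rather than a sketch like this.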

YAML

YAML: YAML Ain’t Markup Language.
What It Is: YAML is a human friendly data serialization standard for all programming languages.

Back to plaintext formats, YAML is an alternative to JSON:

(truly) human readable code: YAML is so readable that even its front-page content is displayed in YAML to make this point

compact code: whitespace indentation is used to denote structure, with no need for quotes or brackets

syntax for relational data: to allow internal references with anchors (&) and aliases (*)

especially suited for viewing/editing of data structures: such as configuration files, dumping during debugging, and document headers

a rich set of language independent types:
- collections:
  - unordered set of key: value pairs without duplicates (!!map)
  - ordered sequence of key: value pairs without duplicates (!!omap)
  - ordered sequence of key: value pairs allowing duplicates (!!pairs)
  - unordered set of non-equal values (!!set)
  - sequence of arbitrary values (!!seq)
- scalar types:
  - null values (~, null)
  - decimal (1234), hexadecimal (0x4D2) and octal (02333) integers
  - fixed (1_230.15) and exponential (12.3015e+02) floats
  - infinity (.inf, -.Inf) and not-a-number (.NAN)
  - true (Y, true, Yes, ON) and false (n, FALSE, No, off)
  - binary (!!binary) with base64 encoding
  - timestamps (!!timestamp).

This is how our little spreadsheet looks when serialized in YAML:

- name: William
  last name: Bailey
  dob: 1962
  nickname: Axl Rose
  instruments:
    - vocals
    - piano

- name: Saul
  last name: Hudson
  dob: 1965
  nickname: Slash
  instruments:
    - guitar
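
The "syntax for relational data" feature listed above is easier to appreciate by looking at what JSON does without it. The sketch below uses only Python's standard library; the band field and its value are hypothetical illustration data, and the YAML text is shown as a plain string rather than parsed, since YAML support is not part of the standard library:

```python
import json

# One object shared by two records (hypothetical example data).
shared_band = {"name": "Example Band"}
members = [
    {"nickname": "Axl Rose", "band": shared_band},
    {"nickname": "Slash", "band": shared_band},
]

# JSON has no way to express "the same object again", so the shared
# object is written out twice and comes back as two separate copies.
as_json = json.dumps(members)
restored = json.loads(as_json)

same_value = restored[0]["band"] == restored[1]["band"]   # equal contents
same_object = restored[0]["band"] is restored[1]["band"]  # but not shared

print(as_json.count("Example Band"), same_value, same_object)  # 2 True False

# In YAML the relationship itself can be written down: an anchor (&)
# names a node, and an alias (*) refers back to it.
YAML_SKETCH = """\
- nickname: Axl Rose
  band: &shared_band
    name: Example Band
- nickname: Slash
  band: *shared_band
"""
print(YAML_SKETCH)
```

A YAML loader can typically resolve the alias back to the anchored node, so the relationship survives the round trip instead of being flattened into duplicates.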

Other Formats

There are a number of other formats for serialization, such as Protocol Buffers (protobuf, also binary), that I’ve (in a rather discretionary manner) left out. If you just want to know every possible format, go and have a look at Wikipedia’s comparison of data serialization formats.

… HDF5?

We’ll get a bit off-topic here, but just slightly. The Hierarchical Data Format version 5 (HDF5) isn’t really for serialization, but rather for storage, and it’s taking data science and other industries by storm. It’s a very fast and versatile format that can be used not only to store a number of data structures, but even as a replacement for relational databases.

To conclude this intermission, let’s just mention that if you’re into binary formats such as BSON and MessagePack for storing/exchanging big volumes of information, you may very well want to have a look at HDF5.

Benchmarks and Comparisons

A pattern that emerges from the benchmarks linked below is that BSON can be more expensive than JSON when serializing but faster when deserializing, and that MessagePack is faster than both on every operation measured. Also, because of its overhead and in spite of being a binary format, BSON files can occasionally be bigger than JSON ones when storing non-binary data. Some links to have a look at:

Serialization Performance comparison (C#/.NET) by Maxim Novak on M@X on DEV.

Protocol Buffers, Avro, Thrift & MessagePack by Ilya Grigorik on igvita.com.

Binary Serialization Tour Guide by Karlin Fox at Atomic Object.

Efficiently Store Pandas DataFrames by Matthew Rocklin.

MessagePack vs JSON vs BSON by Wesley Tanaka.

It’s also worth noting that the performance could change depending on the serializer and the parser you choose, even for the same format.
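
If you would rather measure than trust someone else's numbers, a small harness like the following sketch is enough to get started. It uses only the standard library, the helper name time_codec is hypothetical, and only the built-in json module is measured here; any other serializer can be plugged in as long as it offers a dumps/loads pair:

```python
import json
import timeit


def time_codec(name, dumps, loads, data, repeat=200):
    """Time one (dumps, loads) pair on the given data (hypothetical helper)."""
    encoded = dumps(data)
    encode_seconds = timeit.timeit(lambda: dumps(data), number=repeat)
    decode_seconds = timeit.timeit(lambda: loads(encoded), number=repeat)
    return {
        "codec": name,
        "size": len(encoded),
        "encode_s": encode_seconds,
        "decode_s": decode_seconds,
    }


record = {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": ["guitar"],
}
sample = [record] * 500


def compact_dumps(data):
    """Same format, different serializer settings: no spaces after separators."""
    return json.dumps(data, separators=(",", ":"))


results = [
    time_codec("json (default)", json.dumps, json.loads, sample),
    time_codec("json (compact)", compact_dumps, json.loads, sample),
]
for result in results:
    print(result)
```

Timings depend on the machine, the data shape, and the library version, so treat the output as a relative comparison on your own workload rather than as a universal ranking.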

Remarks and Commentary

As silly as it may sound, BSON has the advantage of the name: people automatically link the format developed by MongoDB (BSON) to the standard (JSON), although the two are not associated with one another. So when searching for a binary alternative to JSON, you may also want to consider other options.

In fact, MessagePack seems to beat BSON in every possible aspect: it’s faster, smaller, and even more compatible with JSON than BSON is. (If you’re already working with JSON, MessagePack is almost a drop-in optimization.) Maybe as a “reporter” I should be more balanced, but as a developer, this is a no-brainer.

Still, BSON is MongoDB’s format to store and represent data, so if you’re working with this NoSQL DB, that’s a reason to stick with it.

Of course, serialization is not all about storing binary data. Admittedly, JSON has a different goal in mind — that of being “human readable”. But it doesn’t take much effort to notice that YAML does a significantly better job at it.

However, the YAML spec is awfully big, especially when compared to that of JSON. But arguably, it must be, as it comes with more data types and features.

On the other hand, it can’t be ignored that the simplicity of JSON played a key role in its adoption over other serialization formats. It relies on an already existing, widespread language, JavaScript, and if you know or are exposed to JS (which, if you are in the web development industry, you are), you already know JSON.

Then why not adopt YAML right now? In many cases it isn’t that easy. JSON still has a place in web APIs, as you can easily embed JSON in HTTP requests (both for GET, as in URLs, and POST, as in sending a form), and the format will let you know if the transmission was suddenly cut, since the truncated text is no longer valid JSON, which may not be the case with YAML and other competing plaintext formats. Also, you’ll still need to interact at one point or another with JSON-based APIs and legacy code, and it’s always a pain maintaining two pieces of code (JSON and YAML methods) for the same purpose (data serialization).
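
The point about cut transmissions is easy to check with Python's standard json module. In this sketch the helper name is_valid_json is made up, and the document is one of the records from earlier wrapped in an array:

```python
import json

document = json.dumps(
    [{"name": "Saul", "nickname": "Slash", "instruments": ["guitar"]}]
)


def is_valid_json(text):
    """Return True if text parses as a complete JSON document (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Cut the document at every possible position and keep the cuts that still parse.
surviving_cuts = [
    cut for cut in range(len(document)) if is_valid_json(document[:cut])
]

print(is_valid_json(document), surviving_cuts)  # True []
```

This holds because the top-level value is an array or an object, which is only complete once its closing bracket arrives. A document consisting of a bare number would not give the same guarantee, since a truncated number can still be a valid, shorter number.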

But then again, these are partly the same arguments that push us backwards and prevent us from adopting newer and more efficient technologies (e.g., Python 3 over Python 2). And I thought for a minute that we, programmers and entrepreneurs, were innovators. Aren’t we?

http://www.epaperindia.in/2016/11/data-serialization-comparison-json-yaml-bson-messagepack/
Wednesday, 9 November 2016