Wednesday, 9 November 2016

Data Serialization Comparison: JSON, YAML, BSON, MessagePack

JSON is the de facto standard for data exchange on the web, but it has its drawbacks, and other formats may be more suitable for certain scenarios. I’ll compare the pros and cons of the alternatives, including ease of use and performance.


Note: I won’t cover implementation details here, but if you’re a Ruby programmer, check out this article, where Dhaivat writes about implementing some serialization formats in Ruby.



What Is Data Serialization


According to Wikipedia, serialization is:


the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.


Let’s say you want to collect certain data about a group of people: name, last name, nickname, date of birth, and the instruments they play. You could easily set up a spreadsheet, define some columns, and make every row an entry. You could go a little further and specify that the date of birth column must be a number, and that the instruments column holds a list of options. It’d look like this:


















name      last name   dob    nickname   instruments
William   Bailey      1962   Axl Rose   vocals, piano
Saul      Hudson      1965   Slash      guitar

More or less, what you did there was define a data structure, and you’ll do just fine if you only need it in a spreadsheet. The problem is that, if you ever want to exchange this information with a database or a website, the mechanics by which these data structures are implemented on those platforms will be dramatically different, even if the underlying semantics are broadly the same. You can’t just plug a spreadsheet into a web application unless the application has been specifically designed for it. And you can’t transfer that info from the website to the database unless you have some sort of export tool or gateway for it.


Let’s assume that our website already has these data structures implemented in its internal logic, and that it just cannot deal with a spreadsheet format. To solve these problems, you can translate these data structures into a format that can be easily shared across different applications and architectures: you serialize them. By doing so, you ensure not only that you can transfer this data across platforms, but also that it can be reconstructed by the reverse process, called deserialization. Furthermore, if the data is sent back from the website to the spreadsheet, you’ll get a semantically identical clone of the original object: a row that looks exactly the same as the one you originally sent.


In short: serializing data is finding some sort of universal format that can be easily shared across different applications.
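
As a minimal sketch of that round trip, the snippet below uses Python's standard `json` module on the two spreadsheet rows. The variable names (`people`, `payload`, `restored`) are illustrative only; any language with a JSON library would follow the same two steps.

```python
import json

# The spreadsheet rows from above, held as native Python objects.
people = [
    {"name": "William", "last name": "Bailey", "dob": 1962,
     "nickname": "Axl Rose", "instruments": ["vocals", "piano"]},
    {"name": "Saul", "last name": "Hudson", "dob": 1965,
     "nickname": "Slash", "instruments": ["guitar"]},
]

# Serialize: in-memory structure -> text that can be stored or sent.
payload = json.dumps(people)

# Deserialize: text -> a new, equivalent in-memory structure.
restored = json.loads(payload)

print(type(payload).__name__)  # str
print(restored == people)      # True
```

The restored list is a distinct object that compares equal to the original, which is the "semantically identical clone" described above.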



The Formats


JSON


JSON (JavaScript Object Notation) is a lightweight data-interchange format. It’s easy for humans to read and write; it’s easy for machines to parse and generate.


JSON is the most widespread format for data serialization, and it has the following features:



  • (Mostly) human readable code: even if the code has been obscured or minified, you can always indent it with tools such as JSONLint and make it readable again.

  • Very simple and straightforward specification: a summary of the whole spec fits on a single page (as displayed on the JSON site).

  • Widespread support: virtually every mainstream programming language has JSON support built in or readily available, and many web service APIs offer JSON as a means of data interchange.

  • As a subset of JavaScript, it supports the following JavaScript data types:
    • string

    • number

    • object

    • array

    • true and false

    • null


This is how our previous spreadsheet would look, after being serialized in JSON:



[
  {
    "name": "William",
    "last name": "Bailey",
    "dob": 1962,
    "nickname": "Axl Rose",
    "instruments": [
      "vocals",
      "piano"
    ]
  },
  {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": [
      "guitar"
    ]
  }
]
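
To illustrate the point about minified JSON remaining recoverable, here is a small sketch using Python's standard `json` module in place of an online tool such as JSONLint. The sample string is one row of the data above, written without whitespace.

```python
import json

# One row of the example data, minified (no optional whitespace).
minified = ('[{"name":"Saul","last name":"Hudson","dob":1965,'
            '"nickname":"Slash","instruments":["guitar"]}]')

data = json.loads(minified)

# Re-indent for human readers.
pretty = json.dumps(data, indent=2)

# Strip the optional whitespace again for transmission.
compact = json.dumps(data, separators=(",", ":"))

print(pretty)
print(len(pretty), len(compact))
```

Both strings carry exactly the same data; only the optional whitespace differs.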

BSON


BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. … It also contains extensions that allow representation of data types that are not part of the JSON spec.


JSON is a plain text format, and while binary data can be encoded in text, this has certain limitations and can make JSON files very big. BSON comes in to deal with these problems.
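
A rough sketch of that limitation, using only Python's standard library: JSON has no byte type, so binary data is usually wrapped in Base64 first, which grows it by about a third. The 3072-byte buffer below is a made-up stand-in for an image or attachment.

```python
import base64
import json

raw = bytes(range(256)) * 12  # 3072 bytes standing in for an image

# JSON has no byte type, so the standard encoder refuses raw bytes.
try:
    json.dumps({"attachment": raw})
except TypeError as exc:
    print("cannot serialize raw bytes:", exc)

# The usual workaround: Base64 text, 4 output characters per 3 input bytes.
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"attachment": encoded})

print(len(raw))      # 3072
print(len(encoded))  # 4096

# The receiver has to know that this particular string is Base64 and decode it.
recovered = base64.b64decode(json.loads(document)["attachment"])
```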


It has the following features:



  • convenient storage of binary information: better suited for exchanging images and attachments

  • designed for fast in-memory manipulation

  • simple specification: like JSON, BSON has a very short and simple spec

  • primary data representation for MongoDB: BSON is designed to be traversed easily

  • extra data types:
    • double (64-bit IEEE 754 floating point number)

    • date (integer number of milliseconds since the Unix epoch)

    • byte array (binary data)

    • BSON object and BSON array

    • JavaScript code

    • MD5 binary data

    • regular expressions
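
To make the layout more concrete, the following is a hand-rolled sketch of a BSON encoder covering only two element types (UTF-8 strings and 32-bit integers) in a flat document. It follows my reading of the BSON spec and is not a substitute for a real driver; `bson_encode` is a hypothetical helper name, not a library function.

```python
import json
import struct

def bson_encode(doc):
    """Encode a flat dict of str -> str / int32 values as a BSON document."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # 0x02 = string: int32 byte length (terminator included) + bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # 0x10 = 32-bit integer, little-endian
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported in this sketch: {type(value).__name__}")
    # document = int32 total size (counting itself) + elements + 0x00 terminator
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

doc = {"hello": "world"}
as_bson = bson_encode(doc)
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")

print(as_bson.hex(" "))
print(len(as_bson), len(as_json))  # 22 17
```

The length prefixes let a reader skip over values without parsing them, which helps traversal, but they also cost bytes: for this tiny text-only document the BSON form is larger than the compact JSON form.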


MessagePack


It’s like JSON. But fast and small.


MessagePack (also msgpack) is another binary format for serialization. It’s not as well known as BSON, but it’s worth a look.


Among its features:



  • designed for efficient transmission over the wire

  • better JSON-compatibility than BSON: as explained by Sadayuki Furuhashi in this Stack Overflow post

  • smaller than BSON: it has a smaller overhead than BSON, and can serialize small objects more compactly most of the time

  • type checking: it supports static typing, and some implementations offer type-checking APIs

  • streaming API: support for streaming deserializers, which is useful for network communication.
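
To show where the size savings come from, here is a hand-written sketch that encodes a small subset of MessagePack (nil, booleans, integers 0 to 127, short strings, short arrays and short maps), following the format bytes in the MessagePack spec as I understand them. `msgpack_encode` is a hypothetical helper for illustration; real projects would use an existing MessagePack library.

```python
import json

def msgpack_encode(value):
    """Encode a tiny subset of MessagePack; anything else raises TypeError."""
    if value is None:
        return b"\xc0"                               # nil
    if value is True:
        return b"\xc3"                               # true
    if value is False:
        return b"\xc2"                               # false
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                        # positive fixint: the value itself
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:
            return bytes([0xA0 | len(data)]) + data  # fixstr: length lives in the tag byte
    if isinstance(value, list) and len(value) <= 15:
        items = b"".join(msgpack_encode(v) for v in value)
        return bytes([0x90 | len(value)]) + items    # fixarray
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])             # fixmap
        for k, v in value.items():
            out += msgpack_encode(k) + msgpack_encode(v)
        return out
    raise TypeError(f"outside the subset handled by this sketch: {value!r}")

obj = {"compact": True, "schema": 0}
packed = msgpack_encode(obj)
as_json = json.dumps(obj, separators=(",", ":"))

print(packed.hex(" "))
print(len(packed), len(as_json))  # 18 27
```

Small values carry their type and length in a single tag byte, with no quotes, colons or commas, which is why the same object takes 18 bytes here against 27 characters of compact JSON.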

YAML


YAML: YAML Ain’t Markup Language.
What It Is: YAML is a human friendly data serialization standard for all programming languages.


Back to plaintext formats, YAML is an alternative to JSON:



  • (truly) human readable code: YAML is so readable that even its front-page content is displayed in YAML to make this point

  • compact code: whitespace indentation is used to denote structure, with no need for quotes or brackets

  • syntax for relational data: to allow internal references with anchors (&) and aliases (*)

  • especially suited for viewing/editing of data structures: such as configuration files, dumping during debugging, and document headers

  • a rich set of language independent types:
    • collections:
      • unordered set of key: value pairs (!!map)

      • ordered sequence of key: value pairs without duplicates (!!omap)

      • ordered sequence of key: value pairs allowing duplicates (!!pairs)

      • unordered set of non-equal values (!!set)

      • sequence of arbitrary values (!!seq)


    • scalar types:
      • null values (~, null)

      • decimals (1234), hexadecimal (0x4D2) and octal (02333) integers

      • fixed (1_230.15) and exponential (12.3015e+02) floats

      • infinity (.inf, -.Inf) and not-a-number (.NAN)

      • true (Y, true, Yes, ON) and false (n, FALSE, No, off)

      • binary (!!binary) with base64 encoding

      • timestamps (!!timestamp).



This is how our little spreadsheet looks when serialized in YAML:



- name: William
  last name: Bailey
  dob: 1962
  nickname: Axl Rose
  instruments:
    - vocals
    - piano

- name: Saul
  last name: Hudson
  dob: 1965
  nickname: Slash
  instruments:
    - guitar

Other Formats


There are a number of other formats for serialization, such as Protocol Buffers (protobuf, also binary), that I’ve (in a rather discretionary manner) left out. If you just want to know every possible format, go and have a look at Wikipedia’s comparison of data serialization formats.



… HDF5?




We’ll get a bit off-topic here, but just slightly. The Hierarchical Data Format version 5 (HDF5) isn’t really for serialization, but rather for storage, and it’s taking data science and other industries by storm. It’s a very fast and versatile format that can be used not only to store a number of data structures, but even as a replacement for relational databases.


To conclude this intermission, let’s just mention that if you’re into binary formats such as BSON and MessagePack for storing/exchanging big volumes of information, you may very well want to have a look at HDF5.



Benchmarks and Comparisons


A pattern that emerges from the benchmarks linked below is that BSON can be more expensive than JSON when serializing but faster when deserializing, while MessagePack tends to be faster than both at either operation. Also, because of its overhead and in spite of being a binary format, BSON files can occasionally be bigger than JSON ones when storing non-binary data. Some links to have a look at:



  • Serialization Performance comparison (C#/.NET) by Maxim Novak on M@X on DEV.

  • Protocol Buffers, Avro, Thrift & MessagePack by Ilya Grigorik on igvita.com.

  • Binary Serialization Tour Guide by Karlin Fox at Atomic Object.

  • Efficiently Store Pandas DataFrames by Matthew Rocklin.

  • MessagePack vs JSON vs BSON by Wesley Tanaka.

It’s also worth noting that the performance could change depending on the serializer and the parser you choose, even for the same format.
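
As a rough sketch of how such a comparison can be set up, the harness below times a `dumps`/`loads` pair with Python's `timeit` and reports the payload size. It only exercises the standard `json` module with three different settings, so it says nothing about BSON or MessagePack; the `measure` helper and the 1,000-row dataset are made up for illustration, and the timings will differ from machine to machine.

```python
import json
import timeit

record = {"name": "Saul", "last name": "Hudson", "dob": 1965,
          "nickname": "Slash", "instruments": ["guitar"]}
dataset = [record] * 1000

def measure(label, dumps, loads, obj, repeat=20):
    """Return payload size and average seconds per dumps/loads call."""
    payload = dumps(obj)
    assert loads(payload) == obj  # make sure the round trip is lossless
    ser = timeit.timeit(lambda: dumps(obj), number=repeat) / repeat
    de = timeit.timeit(lambda: loads(payload), number=repeat) / repeat
    return {"label": label, "size": len(payload), "dumps_s": ser, "loads_s": de}

results = [
    measure("json default", json.dumps, json.loads, dataset),
    measure("json compact",
            lambda o: json.dumps(o, separators=(",", ":")), json.loads, dataset),
    measure("json indent=2",
            lambda o: json.dumps(o, indent=2), json.loads, dataset),
]

for r in results:
    print(f'{r["label"]:<14} {r["size"]:>7} chars  '
          f'dumps {r["dumps_s"] * 1e3:.3f} ms  loads {r["loads_s"] * 1e3:.3f} ms')
```

Passing another library's encode and decode functions to `measure` would extend the comparison to other formats, provided the payload type supports `len()`.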



Remarks and Commentary


As silly as it may sound, BSON has the advantage of the name: people automatically link the format developed by MongoDB (BSON) to the standard (JSON), although the two are not associated with one another. So when searching for a binary alternative to JSON, you may also want to consider other options.


In fact, MessagePack seems to beat BSON in every possible aspect: it’s faster, smaller, and even more compatible with JSON than BSON is. (If you’re already working with JSON, MessagePack is almost a drop-in optimization.) Maybe as a “reporter” I should be more balanced, but as a developer, this is a no-brainer.


Still, BSON is MongoDB’s format to store and represent data, so if you’re working with this NoSQL DB, that’s a reason to stick with it.


Of course, serialization is not all about storing binary data. Admittedly, JSON has a different goal in mind — that of being “human readable”. But it doesn’t take much effort to notice that YAML does a significantly better job at it.


However, the YAML spec is awfully big, especially when compared to that of JSON. But arguably, it must be, as it comes with more data types and features.


On the other hand, it can’t be ignored that the simplicity of JSON played a key role in its adoption over other serialization formats. It relies on an existing, widespread language, JavaScript, and if you know or are exposed to JS (which, if you are in the web development industry, you are), you already know JSON.


Then why not adopt YAML right now? In many cases it isn’t that easy. JSON still has a place in web APIs, as you can easily embed JSON in HTTP requests (both for GET, as in URLs, and POST, as in sending a form). The format will also let you know if the transmission was suddenly cut, because the truncated text is no longer valid JSON, which may not be the case with YAML and other competing plaintext formats. Also, you’ll still need to interact at one point or another with JSON-based APIs and legacy code, and it’s always a pain maintaining two pieces of code (JSON and YAML methods) for the same purpose (data serialization).
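
The truncation point can be checked with a short sketch using Python's standard `json` module. The `parses` helper is a made-up name, and slicing the string simply simulates a connection that dropped partway through.

```python
import json

complete = json.dumps({"name": "Saul", "instruments": ["guitar"]})
truncated = complete[: len(complete) // 2]  # simulate a dropped connection

def parses(text):
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(parses(complete))   # True
print(parses(truncated))  # False

# Every proper prefix of this object is rejected, because the closing
# brace that ends the document is its very last character.
any_prefix_valid = any(parses(complete[:i]) for i in range(len(complete)))
print(any_prefix_valid)   # False
```

An indentation-based document cut off at a line boundary, by contrast, can still look like a shorter but complete document, so the receiver has no syntactic hint that something is missing.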


But then again, these are partly the same arguments that hold us back and prevent us from adopting newer and more efficient technologies (e.g. Python 3 over Python 2). And for a minute I thought that we, programmers and entrepreneurs, were innovators. Aren’t we?


http://www.epaperindia.in/2016/11/data-serialization-comparison-json-yaml-bson-messagepack/
#Bson, #Data_Exchange, #Data_Serialization, #Json, #MessagePack, #Yaml
