Serialization formats

Serialization formats help to avoid messy custom grammars, and are used extensively in computing – whenever there is a need to store data, to pass it from one program to another, or to show it to or read it from a user.

Different formats come with their pros and cons, and often affect the structures one creates with the intent to serialize them – since the formats often lack a specified way to support some constructs (most notably, sum types), and their supported primitive types vary (e.g., Unicode strings and binary data are not always supported, and number representations are a mess). Wikipedia has a comparison of data serialization formats, but I'd rather compare a different set of features here – the ones that seemed important to me at one point or another. Here they are:

Since it's mostly about shades of grey, common formats are (rather subjectively) rated on a scale from 0 to 4:

       sum  txt  sch  desc  prim  str  simp
json   0    4    3    4     3     0    2
yaml   0    4    0    4     2     3    0
xml    0    3    4    4     3     3    1
dsv    2    3    2    0     4     4    3
posix  1    3    3    0     2     3    1
s-exp  3    4    0    4     0     0    2
n3     -    2    4    4     =xml  4    3

And a bit of explanation:

JSON is still my go-to format, but the situation with streaming and sum types is annoying. S-expressions would be more usable if they were standardised, and DSV has its pros and cons compared to those. Maybe an extended version of DSV, with additional delimiters for opening and closing parens, deserves to exist.
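
To make the JSON complaint concrete, here is a minimal sketch of two common workarounds, not anything JSON itself specifies: a sum type is encoded as an object carrying an explicit tag, and streaming is approximated by writing one JSON document per line. The Circle and Rect types, the "tag" key and the function names are all made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class Circle:
    radius: float


@dataclass
class Rect:
    width: float
    height: float


# A sum type: a Shape is either a Circle or a Rect (hypothetical example types).
Shape = Union[Circle, Rect]


def encode(shape: Shape) -> str:
    """Encode one value as a single-line JSON object with an explicit tag."""
    if isinstance(shape, Circle):
        obj = {"tag": "circle", "radius": shape.radius}
    elif isinstance(shape, Rect):
        obj = {"tag": "rect", "width": shape.width, "height": shape.height}
    else:
        raise TypeError(f"not a Shape: {shape!r}")
    # Without indent, json.dumps emits no raw newlines, so one value = one line.
    return json.dumps(obj)


def decode(line: str) -> Shape:
    """Decode one line, dispatching on the tag chosen by convention above."""
    obj = json.loads(line)
    tag = obj.get("tag")
    if tag == "circle":
        return Circle(obj["radius"])
    if tag == "rect":
        return Rect(obj["width"], obj["height"])
    raise ValueError(f"unknown tag: {tag!r}")


def decode_stream(lines: Iterable[str]) -> Iterator[Shape]:
    """Yield values one at a time, so the input never has to be read whole."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield decode(line)
```

Both the tag name and the line-based framing are conventions that the writer and the reader have to agree on out of band, which is what makes the situation annoying: nothing in the format says which key is the tag or where one document ends.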



There are jq for JSON and Miller for CSV, which are even nicer in some respects, and those formats are still textual, more or less. But these tools are not particularly common, and are format-specific.