Serialization formats help to avoid messy custom grammars, and
are used extensively in computing – whenever there is a need to
store the data, to pass it from one program to another, to show
it to or read it from a user.
Different formats come with their pros and cons, and often do affect the
modules/structures one creates with an intent to serialize them – since
the formats often lack a specified way to support some constructs (most
notably, sum types), and their supported primitive types vary (e.g.,
unicode strings and binary data are not always supported, number
representations are a mess). There is a Wikipedia comparison
of data serialization formats, but I'd rather compare a different set
of features here – the ones that seemed important to me at one point or
another. Here they are:
- Sum types
-
Without those, one ends up with tons of optional fields, which don't
describe the structures precisely, or nested structures with
different fields in top-level ones. Not expecting any advanced static
typing from serialization formats, but there should at least be a
standard way to encode sum types.
- Textual
-
Readable and writable using regular text editing/processing tools (that
could still include some control characters though). Binary formats with
specific tools (akin to XML tools, JSON's jq, CSV's miller, etc) or
editing modes tend to complicate the process, often unnecessarily.
- Schemas
-
Language-agnostic specifications such as XML DTD, RELAX NG, JSON schema,
OpenAPI specification. The ones that can be used to describe and
validate the data in a non-ambiguous and standard way, and preferably
easily. Establishing the structures and their serialization is painful
enough even without possible miscommunication.
- Descriptive
-
Field names are rather helpful for tasks such as manual editing or
reading. A specification may be unavailable, and even if it is there –
the names still provide a sort of textual UI, even if rather redundant.
Comments are helpful, too, though they can be encoded as strings using
regular constructs.
- Primitive types
-
Unicode strings should be supported, and the rest is rather arguable.
Integer and rational numbers are often used, but JSON gets away with
just IEEE floating-point ones: parsers can (and do) fail if they get a
non-integer number while expecting an integer. But in the same way, they
could parse those numbers from strings, as they usually do with dates.
Apparently syntactically distinguishing numbers from strings in
serialized data is not particularly useful. Specifying the formats to
prevent the use of e.g. incompatible localised formats should be useful
though.
- Streaming
-
Pretty much any format with arrays/lists can be read lazily, sort of
turning it into streams (assuming that appropriate flushing is going
on), or multiple objects can be read in a loop. But without additional
conventions, it may not be easy to figure out, and common libraries may not
provide the required functionality. Besides, common stream processing
tools assume particular kinds of formats, and wouldn't apply to others.
- Simplicity
-
This is a property that is usually desired in all kinds of
things, and is often understood differently. Fortunately,
simplicity of a grammar can be more or less defined
with the Chomsky hierarchy, as well as by its apparent
redundancy.
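To make the sum-type point concrete, here is a minimal Python sketch of the two encodings in JSON. The `Circle`/`Rect` types, the `tag` and `contents` field names, and the helper functions are all made up for illustration; JSON prescribes none of them, and a tagged object is only one convention among several.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class Circle:
    radius: float


@dataclass
class Rect:
    width: float
    height: float


Shape = Union[Circle, Rect]  # the sum type: a Shape is a Circle or a Rect

# Hypothetical mapping from constructor tags to classes.
CONSTRUCTORS = {"Circle": Circle, "Rect": Rect}


def encode_tagged(shape: Shape) -> str:
    """Encode a shape with an explicit constructor tag."""
    name = type(shape).__name__
    if name not in CONSTRUCTORS:
        raise TypeError(f"not a Shape: {shape!r}")
    return json.dumps({"tag": name, "contents": vars(shape)})


def decode_tagged(text: str) -> Shape:
    """Decode the tagged encoding; the tag alone selects the variant."""
    obj = json.loads(text)
    try:
        ctor = CONSTRUCTORS[obj["tag"]]
    except KeyError:
        raise ValueError(f"unknown or missing tag in {text!r}") from None
    return ctor(**obj["contents"])


def decode_optional_fields(text: str) -> Shape:
    """The optional-fields alternative: guess the variant from the fields present."""
    obj = json.loads(text)
    if obj.get("radius") is not None:
        return Circle(obj["radius"])
    if obj.get("width") is not None and obj.get("height") is not None:
        return Rect(obj["width"], obj["height"])
    raise ValueError(f"cannot tell which variant this is: {text!r}")
```

With optional fields, an object such as `{"radius": 1.0, "width": 2.0, "height": 3.0}` is accepted and silently read as a circle, since nothing in the structure says that the variants exclude each other. The tagged form avoids that, but only between parties that agreed on the same tagging convention.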
Since it's mostly about shades of grey, common formats are (rather
subjectively) rated on a scale from 0 to 4:
format | sum | txt | sch | desc | prim | str | simp |
json | 0 | 4 | 3 | 4 | 3 | 0 | 2 |
yaml | 0 | 4 | 0 | 4 | 2 | 3 | 0 |
xml | 2 | 3 | 4 | 4 | 3 | 3 | 1 |
dsv | 2 | 3 | 2 | 0 | 4 | 4 | 3 |
posix | 1 | 3 | 3 | 0 | 2 | 3 | 1 |
s-exp | 3 | 4 | 0 | 4 | 0 | 0 | 2 |
n3 | - | 2 | 4 | 4 | 3 | 4 | 3 |
And a bit of explanation:
- JSON
-
Allows different ways to encode sum types, but no standard way, so it's
a mess. Actually that's a good [anti-]example of "less is more": by
simply adding objects (which don't seem to be much more useful than
alists) to something that would not be very different from s-expressions
otherwise, encoding of more fundamental types can be made so much more
awkward. It's textual and easily editable, its primitive types are
pragmatic and not too bad. Streaming is not a part of the standard, and
though the format is pretty simple, it's not as simple as some of the
others.
- YAML
-
Introduces streams, but bloats types and syntax, so the simplicity
goes away.
- XML
-
Not particularly user-friendly (for either reading or writing/editing,
though few are) and rather verbose, but machine-friendly, has a lot of
tooling and things based on it. Suitable for streaming. Does not play
nicely with standard unix tools (though very few do). Apparently plenty
of effort was put into planning of XML and related technologies
(starting from SGML). Similarly to JSON, provides different options for
sum encoding. Unambiguous mixing of vocabularies (using namespaces) is
useful sometimes. As a markup language, particularly suited for document
markup.
- DSV
-
Can be seen as a 2-dimensional array, or as a mere tokenization helper.
I think there's just one sane way to encode sum types with it (by
providing a constructor name), and assuming that the types are just
strings (or that it's simply not typed), it may seem pretty nice,
simple, streaming- and processing-friendly. On the other hand, it needs
additional conventions. Such conventions make DSV harder to compare to
others, and some of the simplicity goes away as one tries to cram
complex structures into it. Coalpit implements DSV with such
conventions, as an example.
- POSIX file format notation
-
Relatively simple grammar, but the types are rather complex. It's not
exactly a format, but perhaps a family of formats, which includes
complex formats as well; so it isn't necessarily streaming- or
editing-friendly, but may be.
- S-expressions
-
There's a draft from 1997, an attempt to standardise them, and it
doesn't default to unicode strings.
- N-Triples
-
This is one of the RDF serialization formats, and RDF has a bunch of
cool properties by itself. But for serialization of arbitrary data
structures, it'd require converting those into RDF first, so may not be
very convenient if those additional properties are not needed. And this
makes it hard to compare RDF serialization formats to general ones
without more context.
- ASN.1
-
Seems to be used rarely, rather complicated, and more of a family of
formats – so it's not rated here, but perhaps still worth mentioning.
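As an illustration of the DSV remark about constructor names, here is a minimal Python sketch in which the first field of each record names the constructor and the remaining fields are its arguments. The `VARIANTS` table, the `:` delimiter, and the function names are assumptions made for this example; this is not a description of Coalpit's actual conventions.

```python
import csv
import io

# Hypothetical variants: constructor name -> its field names, in order.
VARIANTS = {
    "circle": ("radius",),
    "rect": ("width", "height"),
}


def parse_record(fields):
    """Turn one DSV record into (constructor name, {field name: string value})."""
    if not fields:
        raise ValueError("empty record")
    name, *args = fields
    if name not in VARIANTS:
        raise ValueError(f"unknown constructor: {name!r}")
    names = VARIANTS[name]
    if len(args) != len(names):
        raise ValueError(f"{name} expects {len(names)} field(s), got {len(args)}")
    return name, dict(zip(names, args))


def parse_dsv(text, delimiter=":"):
    """Parse delimiter-separated records, one per line, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [parse_record(row) for row in reader if row]
```

For example, `parse_dsv("circle:1.5\nrect:2:3\n")` yields one `circle` and one `rect` record. All values stay strings, matching the "simply not typed" reading; interpreting them as numbers would be one more convention layered on top.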
JSON was my go-to format for a while, but the situation with streaming and
sum types is annoying. S-expressions would be more usable if they were
standardised, and DSV has its pros and cons compared to those. But maybe
XML is good enough for most purposes. Regardless of serialisation format
choice, it is always possible to mess up underlying data models, or to
compose and serialise those in a sensible way.
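The streaming complaint about JSON can be illustrated with a small Python sketch that reads several concatenated JSON values in a loop, using the standard library's `json.JSONDecoder.raw_decode`. The function name `iter_json_values` is made up here, and whitespace-separated values are just one possible convention (others delimit by newlines or wrap everything in an array).

```python
import json


def iter_json_values(text):
    """Yield successive JSON values from a string holding several of them."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Returns the decoded value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

This only works on text that is already in memory. Reading from an actual stream would additionally need buffering and a way to tell an incomplete value from a malformed one, which is where the lack of a standard convention starts to hurt.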
It's at least entertaining to muse on making a format from
scratch, and perhaps useful to consider the choices one would
make if it was practical to compose such a format.
I would try to pick a model for information encoding, before its
serialization: to be generally useful, it should be the least
arbitrary. There are a few description logics specifically for
knowledge representation, and (G)ADTs usable with
constructive/intuitionistic type theory and logic, which is
usable as a foundation of mathematics. There are alternatives,
and they are tricky to compare, but I'm fairly certain that any
decent model would be quite usable, even if not the only (or the
best) solution to knowledge representation.
Then there's the rabbit hole of composing a language for all
sorts of things (which is also exciting, but likely less
practical; see "formal human languages" for more musings on
that). So perhaps it is a good idea to just pick a seemingly
practical logic.
The serialization itself should then be as simple as possible,
according to the Chomsky hierarchy, and with as few rules as
possible. And preferably individual schemas should be extensible
by different parties without conflicts and confusion (as done in
XML and n-triples).
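As a rough illustration of conflict-free extension in XML, the following Python sketch uses the standard `xml.etree.ElementTree` module on a document where two parties each define their own `id` element. The namespace URIs, the element names, and the `shop`/`ship` prefixes are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies mixed in one document; both define an element named "id".
DOC = """\
<order xmlns="http://example.org/shop" xmlns:ship="http://example.net/shipping">
  <id>42</id>
  <ship:id>ZX-7</ship:id>
</order>
"""

root = ET.fromstring(DOC)

# The prefixes used for lookup are local to this code; only the URIs matter.
ns = {"shop": "http://example.org/shop", "ship": "http://example.net/shipping"}

shop_id = root.find("shop:id", ns).text
ship_id = root.find("ship:id", ns).text
```

The two `id` elements stay distinct because each name is qualified by a URI that its author controls, so neither party has to coordinate naming with the other.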
Out of the listed formats, JSON and YAML quite clearly don't fit
the description, perhaps POSIX file format notation doesn't
either; s-expressions, XML, n-triples, and possibly just DSV
seem fairly close, or at least usable for pretending that they
are.