Serialization formats help to avoid messy custom grammars
(though so do non-messy custom grammars), and are used
extensively in computing – whenever there is a need to store
data, to pass it from one program to another, to show it to or
read it from a user.
Different formats come with their pros and cons, and often do affect the
modules/structures one creates with an intent to serialize them – since
the formats often lack a specified way to support some constructs (most
notably, sum types), and their supported primitive types vary (e.g.,
unicode strings and binary data are not always supported, number
representations are a mess). There is a
Wikipedia comparison
of data serialization formats, but I'd rather compare a different set
of features here – the ones that seemed important to me at one point or
another. Here they are:
- Sum types
-
Without those, one ends up with additional optional fields,
which do not describe the structures precisely, or nested
structures with different fields in top-level ones. Not
expecting any advanced static typing from serialization
formats, but a standard way to encode sum types is preferable.
- Textual
-
Readable and writable using regular text editing or processing
tools (that could still include some control characters
though), like sed, awk, grep, and text
editors. Format-specific tools (akin to XML tools, JSON's jq,
CSV's miller) do not count.
- Binary
-
Ability to read and write data in a given format without
explicit textual parsing, using basic memory
operations. Usually good for efficiency, and often for
simplicity, but likely to clash with textual representation.
- Schemas
-
Language-agnostic specifications such as XML DTD, RELAX NG,
JSON schema, OpenAPI specification. The ones that can be used
to describe and validate the data in a non-ambiguous and
standard way, and preferably easily. Establishing the
structures and their serialization is painful enough even
without possible miscommunication.
- Descriptive
-
Field names are rather helpful for tasks such as manual editing or
reading. A specification may be unavailable, and even if it is there –
the names still provide a sort of textual UI, even if rather redundant.
Comments are helpful, too, though they can be encoded as strings using
regular constructs.
- Primitive types
-
Unicode strings should be supported, and the rest is rather
arguable. Integer and rational numbers are often used, but
JSON gets away with just IEEE floating ones: parsers can (and
do) fail if they get a non-integer number while expecting an
integer. But in the same way, they could parse those numbers
from strings, as they usually do with dates. Apparently
syntactically distinguishing numbers from strings in
serialized data is not particularly useful. Format
specifications for such strings may be useful though, but then
it is similar to an embedded serialization format.
- Streaming
-
Virtually any format with arrays (or lists) can be read
lazily, turning those into streams, or multiple objects can be
read in a loop. But additional conventions are needed (for
appropriate flushing, for instance), and the used libraries
should provide such functionality, at which point is must be a
part of the standard anyway. So some means of streaming are
expected in a standard.
- Simplicity
-
This is a property that is commonly desired, and is
often understood differently. Fortunately, simplicity of a
grammar can be more or less defined with Chomsky hierarchy, as
well as with an apparent redundancy.
Since it is mostly about shades of gray, common formats are
(rather subjectively) rated on the scale from 0 to 4:
| sum | txt | bin | sch |
desc | prim | str | simp |
json | 0 | 3 | 0 | 3 | 4 | 3 | 0 | 2 |
yaml | 0 | 3 | 0 | 0 | 4 | 2 | 3 | 0 |
xml | 2 | 2 | 0 | 4 | 4 | 3 | 3 | 1 |
dsv | 2 | 2 | 1 | 2 | 0 | 4 | 4 | 3 |
posix | 1 | 2 | 1 | 3 | 0 | 2 | 3 | 1 |
s-exp | 3 | 3 | 0 | 0 | 4 | 0 | 0 | 2 |
n3 | - | 1 | 0 | 4 | 4 | 3 | 4 | 3 |
cfg/bnf | 4 | 4 | 1 | 3 | 2 | 4 | 2 | 4 |
Expanding on those:
- JSON
-
Allows different ways to encode sum types, but no standard way, so it's
a mess. Actually that's a good [anti-]example of "less is more": by
simply adding objects (which don't seem to be much more useful than
alists) to something that would not be very different from s-expressions
otherwise, encoding of more fundamental types can be made so much more
awkward. It's textual and easily editable, its primitive types are
pragmatic and not too bad. Streaming is not a part of the standard, and
though the format is pretty simple, it's not as simple as some of the
others.
- YAML
-
Introduces streams, but bloats types and syntax, so the simplicity
goes away.
- XML
-
Primarily a markup language, so in some cases it is not
particularly convenient, but machine-friendly, has a lot of
tooling and things based on it. Suitable for streaming. Does
not play nicely with standard unix tools (though very few
do). Apparently plenty of effort was put into planning of XML
and related technologies (starting from SGML). Similarly to
JSON, provides different options for sum encoding. Unambiguous
mixing of vocabularies (using namespaces) is useful sometimes.
- DSV
-
Can be seen as a 2-dimensional array, or as a mere tokenization helper.
I think there's just one sane way to encode sum types with it (by
providing a constructor name), and assuming that the types are just
strings (or that it's simply not typed), it may seem pretty nice,
simple, streaming- and processing-friendly. On the other hand, it needs
additional conventions. Such conventions make DSV harder to compare to
others, and some of the simplicity goes away as one tries to cram
complex structures into it. Coalpit implements DSV with such
conventions, as an example.
- POSIX file format notation
-
Relatively simple grammar, but the types are rather complex. It's not
exactly a format, but perhaps a family of formats, which includes
complex formats as well; so it isn't necessarily streaming- or
editing-friendly, but may be.
- S-expressions
-
There's a draft from 1997, an attempt to standardise them, and it
doesn't default to unicode strings.
- N-Triples
-
This is one of the RDF serialization formats, and RDF has a bunch of
cool properties by itself. But for serialization of arbitrary data
structures, it'd require to convert those into RDF first, so may not be
very convenient if those additional properties are not needed. And this
makes it hard to compare RDF serialization formats to general ones
without more context.
- ASN.1
-
Seems to be used rarely, rather complicated, and more of a family of
formats – so it's not rated here, but perhaps still worth mentioning.
- Custom context-free languages, with (A/E)BNF
- Usually not listed among serialization formats, but
simplifying/generalizing serialization formats may lead to
these. Fairly widely known and supported, and a fine way to
encode data. They tend to produce particularly clean data
encodings, though an arbitrary programmer is more likely to
manage to call a JSON parsing function (and maybe even grab
correct values out of the resulting structures) than to parse
a given context-free grammar, so in some cases it won't be
suitable.
JSON was my go-to format for a while, but the situation with streaming and
sum types is annoying. S-expressions would be more usable if they were
standardised, and DSV has its pros and cons comparing to those. But maybe
XML is good enough for most purposes. Regardless of serialisation format
choice, it is always possible to mess up underlying data models, or to
compose and serialise those in a sensible way,
It is entertaining to muse on making a format from scratch, and
perhaps useful to consider the choices one would make if it was
practical to compose such a format.
Context-free grammars
I would try to pick a model for information encoding, before its
serialization: to be generally useful, it should be least
arbitrary. There is a few descriptive logics specifically for
knowledge representation, and (G)ADTs usable with
constructive/intuitionistic type theory and logic, which is
usable as a foundateion of mathematics. There are alternatives,
and they are tricky to compare, but I'm fairly certain that any
decent model would be quite usable, even if not the only (or the
best) solution to knowledge representation.
Then there's the rabbit hole of composing a language for all
sorts of things (which is also exciting, but likely less
practical; see "formal human languages" for more musings on
that). So perhaps it is a good idea to just pick a seemingly
practical logic.
The serialization itself should then be as simple as possible,
according to Chomsky hierarchy, and with as few rules as
possible. And preferably individual schemas should be extensible
by different parties without conflicts and confusion (as done in
XML and n-triples).
Out of the listed formats, JSON and YAML quite clearly don't fit
the description, perhaps POSIX file format notation doesn't
either; s-expressions, XML, n-triples, and possibly just DSV
seem fairly close, or at least usable for pretending that they
are.
At some point in 2022 I tried to sketch such a format, aiming
ADTs encoded into DSV (akin to Coalpit, but with a basic
specification instead of Haskell: relying on prefix notation
with known/fixed arity), but that looked much like BNF. Since
then context-free grammars with BNF are included into the
comparison.
This reminds me of similarly simplifying and generalizing
mathematical logic (proof assistance), producing metamath:
similarly focusing on textual representation, and its
rewriting. Likewise, rewriting systems are quite commonly
(though often implicitly) used in programming and CS, and it
might be interesting to attempt integrating those applications
better than they commonly are combined. Also related are
transition systems, action languages.
Textual and binary
One can compose a format scoring well in the comparison table
above by encoding algebraic data types using a format readable
as both text and binary data, using fixed-length (powers of 2)
textual strings for identifiers, hexadecimal numbers. Perhaps it
can be made DSV-compatible by separating values with
delimiters. ADTs would also provide a way to encode lists,
though for optimization (and as a syntactic sugar) something
akin to netstrings or quotation with escaping can be
used. Compatibility with POSIX file format and s-expressions may
be achievable in some cases.