Serialisation formats

Serialization formats help to avoid messy custom grammars, and are used extensively in computing – whenever there is a need to store data, to pass it from one program to another, or to show it to (or read it from) a user.

Different formats come with their pros and cons, and often affect the structures one creates with the intent to serialize them, since the formats often lack a specified way to support some constructs (most notably, sum types), and their supported primitive types vary (e.g., Unicode strings and binary data are not always supported, and number representations are a mess). There is a Wikipedia comparison of data serialization formats, but I'd rather compare a different set of features here – the ones that seemed important to me at one point or another. Here they are:

Sum types
Without those, one ends up with tons of optional fields, which don't describe the structures with any precision, or with nested structures with different fields in the top-level ones. I'm not expecting any advanced static typing from serialization formats, but there should at least be a standard way to encode sum types.
Textual
Readable and writable using regular text editing/processing tools (the data could still include some control characters though). Binary formats with specific tools (akin to XML tools, JSON's jq, CSV's miller, etc.) may seem cool, but they complicate the process, often unnecessarily.
Schemas
Language-agnostic specifications such as XML DTD, JSON Schema, or the OpenAPI specification: ones that can be used to describe and validate the data in an unambiguous and standard way, and preferably easily. Establishing the structures and their serialization is painful enough even without possible miscommunication.
Descriptive
Field names are rather helpful for tasks such as manual editing or reading. A specification may be unavailable, and even if it is there – the names still provide a sort of textual UI, even if rather redundant. Comments are helpful, too, though they can be encoded as strings using regular constructs.
Primitive types
Unicode strings should be supported, and the rest is rather arguable. Integer and rational numbers are often used, but JSON gets away with just IEEE floating-point ones: parsers can (and do) fail if they get a non-integer number while expecting an integer. But in the same way, they could parse those numbers from strings, as they usually do with dates. Apparently, syntactically distinguishing numbers from strings in serialized data is not particularly useful. Specifying the formats, to prevent the use of e.g. incompatible localised ones, should be useful though.
Streaming
Pretty much any format with arrays/lists can be read lazily, sort of turning it into a stream (assuming that appropriate flushing is going on), or multiple objects can be read in a loop. But without additional conventions, it may not be easy to figure out how, and common libraries may not provide the required functionality. Besides, common stream processing tools assume particular kinds of formats, and wouldn't apply to others.
Simplicity
This is a property that is usually desired in all kinds of things, and is often understood differently. Fortunately, simplicity of a grammar can be more or less defined with the Chomsky hierarchy, as well as with its apparent redundancy.
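To illustrate the streaming point above, here is a minimal sketch of one such additional convention: one JSON value per line (often called JSON Lines or NDJSON, which is not part of the JSON standard itself). The function name iter_json_lines is made up for this example.

```python
import io
import json


def iter_json_lines(stream):
    """Lazily yield one parsed JSON value per non-empty line of a text stream.

    Assumes the newline-delimited convention: every value is serialized
    without raw newlines, and values are separated by "\n".
    """
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between values
            yield json.loads(line)


# Usage: any iterable of text lines works (a file, a pipe, a socket wrapper).
stream = io.StringIO('{"a": 1}\n\n[1, 2]\n"x"\n')
for value in iter_json_lines(stream):
    print(value)
```

Since each value sits on its own line, line-oriented tools (grep, head, tail and so on) apply as well, but this only holds as long as both sides agree on the convention.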

Since it's mostly about shades of grey, common formats are (rather subjectively) rated on a scale from 0 to 4, with the columns in the same order as the features above:

        sum  txt  sch  desc  prim  str  simp
json    0    4    3    4     3     0    2
yaml    0    4    0    4     2     3    0
xml     0    3    4    4     3     3    1
dsv     2    3    2    0     4     4    3
posix   1    3    3    0     2     3    1
s-exp   3    4    0    4     0     0    2
n3      -    2    4    4     3     4    3

And a bit of explanation:

JSON
Allows different ways to encode sum types, but no standard way, so it's a mess. Actually that's a good [anti-]example of "less is more": by simply adding objects (which don't seem to be much more useful than alists) to something that would not be very different from s-expressions otherwise, encoding of more fundamental types can be made so much more awkward. It's textual and easily editable, its primitive types are pragmatic and not too bad. Streaming is not a part of the standard, and though the format is pretty simple, it's not as simple as some of the others.
YAML
Introduces streams, but bloats types and syntax, so the simplicity goes away.
XML
Not particularly user-friendly (for either reading or writing/editing) and rather verbose, but machine-friendly, and has a lot of tooling and things based on it. Suitable for streaming, but does not play nicely with standard Unix tools (though very few formats do). Apparently plenty of effort was put into the planning of XML and related technologies (starting from SGML).
DSV
Can be seen as a 2-dimensional array, or as a mere tokenization helper. I think there's just one sane way to encode sum types with it (by providing a constructor name), and assuming that the types are just strings (or that it's simply not typed), it may seem pretty nice, simple, streaming- and processing-friendly. On the other hand, it needs additional conventions, which makes it harder to compare to others, and some of the simplicity goes away as one tries to cram complex structures into it.
POSIX file format notation
Relatively simple grammar, but the types are rather complex. It's not exactly a format, but perhaps a family of formats, which includes complex formats as well; so it isn't necessarily streaming- or editing-friendly, but may be.
S-expressions
There's a draft from 1997, an attempt to standardise them, but it doesn't default to Unicode strings, which makes it rather appalling. Otherwise they are not quite specified, which makes them tricky to use.
N-Triples
This is one of the RDF serialization formats, and RDF has a bunch of cool properties by itself. But to serialize arbitrary data structures, one would have to convert them into RDF first, so it may not be very convenient if those additional properties are not needed. And this makes it hard to compare RDF serialization formats to general ones without more context.
ASN.1
Seems to be used rarely, rather complicated, and more of a family of formats – so it's not rated here, but perhaps still worth mentioning.
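To make the sum type remarks above more concrete, here is a rough sketch of the same value encoded in JSON in two common (but non-standard) ways, and in DSV with a leading constructor name. The Shape type, the "tag" field name, and the ":" delimiter are all assumptions made up for this example, and no escaping is handled.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class Circle:
    radius: float


@dataclass
class Rect:
    width: float
    height: float


Shape = Union[Circle, Rect]  # the sum type: a shape is one or the other


def to_tagged_json(shape: Shape) -> str:
    """Convention A: a "tag" field next to the payload fields."""
    if isinstance(shape, Circle):
        return json.dumps({"tag": "Circle", "radius": shape.radius})
    return json.dumps({"tag": "Rect", "width": shape.width, "height": shape.height})


def to_single_key_json(shape: Shape) -> str:
    """Convention B: an object whose only key names the constructor."""
    if isinstance(shape, Circle):
        return json.dumps({"Circle": {"radius": shape.radius}})
    return json.dumps({"Rect": {"width": shape.width, "height": shape.height}})


def to_dsv(shape: Shape, delimiter: str = ":") -> str:
    """DSV: constructor name first, then positional fields."""
    if isinstance(shape, Circle):
        fields = ["Circle", repr(shape.radius)]
    else:
        fields = ["Rect", repr(shape.width), repr(shape.height)]
    return delimiter.join(fields)


def from_dsv(line: str, delimiter: str = ":") -> Shape:
    """Dispatch on the constructor name and check the number of fields."""
    name, *args = line.split(delimiter)
    if name == "Circle" and len(args) == 1:
        return Circle(float(args[0]))
    if name == "Rect" and len(args) == 2:
        return Rect(float(args[0]), float(args[1]))
    raise ValueError(f"unknown constructor or wrong arity: {line!r}")


print(to_tagged_json(Circle(1.5)))
print(to_single_key_json(Circle(1.5)))
print(to_dsv(Rect(2.0, 3.0)))
```

A reader of the JSON has to know which of the two conventions the writer picked, while the DSV line leaves little room for alternatives, at the cost of positional, unnamed fields.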

JSON is still my go-to format, but the situation with streaming and sum types is annoying. S-expressions would be more usable if they were standardised, and DSV has its pros and cons compared to those. Maybe an extended version of DSV – with additional delimiters for opening and closing parens – deserves to exist.
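As one hypothetical reading of that last idea, a parser for such an extended DSV could look roughly like the sketch below. The choice of ":", "(" and ")" as delimiters, the function name parse_line, and the complete lack of escaping are assumptions of this sketch, not a specification.

```python
def parse_line(line, sep=":", opening="(", closing=")"):
    """Parse one line of paren-extended DSV into nested lists of strings.

    Fields are separated by `sep`; a field may instead be a nested record
    wrapped in `opening` ... `closing`.
    """
    stack = [[]]         # records under construction, innermost last
    token = ""           # characters of the current plain field
    just_closed = False  # True right after a nested record was closed
    for ch in line:
        if ch == opening:
            if token or just_closed:
                raise ValueError("a nested record must start a field")
            stack.append([])
        elif ch == closing:
            if len(stack) == 1:
                raise ValueError("unbalanced closing delimiter")
            if not just_closed:
                stack[-1].append(token)
            token = ""
            finished = stack.pop()
            stack[-1].append(finished)
            just_closed = True
        elif ch == sep:
            if not just_closed:
                stack[-1].append(token)
            token = ""
            just_closed = False
        else:
            if just_closed:
                raise ValueError("expected a delimiter after a nested record")
            token += ch
    if len(stack) != 1:
        raise ValueError("unbalanced opening delimiter")
    if not just_closed:
        stack[-1].append(token)
    return stack[0]


print(parse_line("Rect:(Point:1:2):(Point:3:4)"))
```

Lines without parens parse the same way as in plain DSV, so the nesting would only cost something where it is actually used.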