Flat structures

This is slightly related to an older note, "Serialisation formats".

APIs, and particularly their data structures, are among the worst things when it comes to work projects: when they are already defined, they tend to be messy (almost always worse than the ones that can be found on public servers or in larger FLOSS projects), and sometimes poorly defined; when they are getting established, the process is also chaotic: there are deadlines, different views on what an API should be like, changing requirements. One can easily end up parsing some language-specific serialisation format from another language, or handling numbers that are sometimes encoded into strings, and generally writing wrappers around those APIs to get usable functions and structures – not because the underlying one is low-level, but because it's a mess.

Different applications using the same API would prefer different structures and methods (and endpoints, in the case of HTTP REST APIs), of course, but that's a reason to make them simple and flexible, with orthogonal methods and data representations as canonical as possible, and not to tailor them for particular use cases, given that those tend to vary and change.

One particular issue I keep encountering (and have encountered once again recently, which made me write this down) is unnecessary structure hierarchies, particularly when it comes to time series data: in a fairly classic setting, there is time (a point or an interval, doesn't matter here), a few parameters indicating where the data comes from (location or device ID, sometimes sensor orientation, a sensor array number or ID for devices with multiple sensors – things like that), and a set of observations. Now, there are different ways to represent it (mostly from experience, even though they may look made up; parens denote products, square brackets – arrays/sets/collections/lists of those, and only key/field names are mentioned):

[time, [device, [orientation, [observation₁, …]]]]
[time, [orientation, [device, [observation₁, …]]]]
[device, [time, [orientation, [observation₁, …]]]]
[time, device, [orientation, [observation₁, …]]]
((device, orientation), [time, observation₁, …])
([chunk_id, time], [chunk_id, device, orientation, observation₁, …])
([chunk_id, time, device], [chunk_id, orientation, observation₁, …])
([chunk_id, time, device, orientation], [chunk_id, observation₁, …])
[time, device, orientationₓ-observation₁, …]

And a flat structure (well, a sane flat one; the last one above is also flat):

[time, device, orientation, observation₁, …]

Even the stranger approaches are shown here in their mild forms. Many other permutations are possible, but this should give a rough idea of the zoo. Larger and older systems tend to have the same data represented in multiple ways, and one has to convert the data between those – often^{[no citation needed]} using a flat structure as an intermediate representation: you flatten it, then sort, then group as you wish.
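The flatten, sort, group pipeline can be sketched roughly as follows. This is only an illustration under assumptions: the field types, the `Flat` record, the `ByTime` and `ByDevice` aliases and the `sample` data are all made up here, and only the first and the third of the nested layouts listed above are covered.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical field types, chosen only for illustration.
type Time = Int
type Device = String
type Orientation = String
type Observation = Double

-- The flat form: [time, device, orientation, observation₁, …]
data Flat = Flat
  { time :: Time
  , device :: Device
  , orientation :: Orientation
  , observations :: [Observation]
  } deriving (Eq, Show)

-- [time, [device, [orientation, [observation₁, …]]]]
type ByTime = [(Time, [(Device, [(Orientation, [Observation])])])]

-- [device, [time, [orientation, [observation₁, …]]]]
type ByDevice = [(Device, [(Time, [(Orientation, [Observation])])])]

-- Flatten: one record per innermost (orientation, observations) entry.
flatten :: ByTime -> [Flat]
flatten nested =
  [ Flat t d o obs
  | (t, devices) <- nested
  , (d, orientations) <- devices
  , (o, obs) <- orientations
  ]

-- Sort by (device, time), then group adjacent records.
-- groupBy only yields non-empty groups, so the (x : _) patterns always match.
regroup :: [Flat] -> ByDevice
regroup flat =
  [ ( device d0
    , [ (time t0, [ (orientation r, observations r) | r <- ts ])
      | ts@(t0 : _) <- groupBy ((==) `on` time) ds
      ]
    )
  | ds@(d0 : _) <- groupBy ((==) `on` device) sorted
  ]
  where
    sorted = sortOn (\r -> (device r, time r)) flat

-- Made-up sample data.
sample :: ByTime
sample =
  [ (1, [("a", [("north", [0.5]), ("south", [0.7])]), ("b", [("north", [1.5])])])
  , (2, [("a", [("north", [0.6])])])
  ]
```

Here `regroup (flatten sample)` turns the time-first layout into the device-first one; any other grouping only needs a different sort key and a different pair of `groupBy` predicates, while `flatten` stays the same.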

That is reason enough for me to choose a flattened structure as the canonical one, for use in interfaces.

Another prominent reason to stick to flat structures is stream processing: while it is possible with grouped data as well, translation from one grouping into another in general can't be done in a streaming manner. That, in turn, leads to software requiring far more resources than necessary (and, in some cases, more than are available) to function.

Flat structures are redundant, but much less so than the regular named fields and textual data representations, and they can be compressed. Flattening goes against normalization, but excessive normalization adds more overhead than it saves by avoiding redundancy (and is just not that useful for some kinds of data). Premature optimisation, as the saying goes, is the root of all evil. The cost of grouping is negligible if the data is already sorted (just a single run through the list, which happens anyway if that list gets processed, and which is friendly to stream processing), and sorting should be defined independently.
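To illustrate the single-run grouping of sorted data, here is a minimal sketch that relies on the laziness of `groupBy` from `Data.List`. The `Row` type and the unbounded `stream` are hypothetical, and the grouping is only correct under the assumption that the input is already sorted by time.

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- Hypothetical flat record: (time, device, observation).
type Row = (Int, String, Double)

rowTime :: Row -> Int
rowTime (t, _, _) = t

-- Group adjacent rows that share a time stamp, in a single pass.
-- Assumes the input is already sorted by time.
groupByTime :: [Row] -> [[Row]]
groupByTime = groupBy ((==) `on` rowTime)

-- An unbounded, time-sorted stream: two devices reporting at every tick.
stream :: [Row]
stream = [ (t, d, fromIntegral t) | t <- [0 ..], d <- ["a", "b"] ]

-- Only as much of the stream is consumed as the first two groups require.
firstGroups :: [[Row]]
firstGroups = take 2 (groupByTime stream)
```

Since each group is emitted as soon as the time stamp changes, this works on an endless input; regrouping by device instead would require seeing the whole input first.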

That is to say, flattening does not necessarily imply avoidance of nested structures (even of recursive ones, like lists) here: usually it does not make any difference for data processing if related fields are grouped into structures, and even small array fields (that are "atomic", in that they are not regrouped) are fine. It is mostly the groupings like those listed above that tend to lead to multiple representations.

As an additional benefit, flatter structures are easier to fit into DB tables and DSV formats.

Something similar happens with alternatives (choices, options, branching, sum types). Consider the following example with algebraic data types (in Haskell):

data Choice = Option1 | Option2
data Foo = Foo1 Int Choice | Foo2 Int

With n being the number of Int values, this has 2n + n = 3n inhabitants, so it can be simplified to a type that has 3n inhabitants more directly:

data Choice = Option1 | Option2 | Option3
data Foo = Foo Int Choice

This is then easier to parse, has a constant number of elements, and is overall a better candidate for a canonical form, unless there are additional reasons to use the former option.
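That the two variants carry the same information can be checked by writing the conversions in both directions. The sketch below renames the types and constructors of the first variant (`ChoiceN`, `FooN`, `OptionN1`, `OptionN2`, `FooN1`, `FooN2` are hypothetical names) so that both fit into one module, and it assumes that `Option3` is the value standing in for the `Foo2` case.

```haskell
-- The nested variant, renamed to avoid clashing with the flat one.
data ChoiceN = OptionN1 | OptionN2
  deriving (Eq, Show)

data FooN = FooN1 Int ChoiceN | FooN2 Int
  deriving (Eq, Show)

-- The flat variant, as in the text.
data Choice = Option1 | Option2 | Option3
  deriving (Eq, Show)

data Foo = Foo Int Choice
  deriving (Eq, Show)

-- Assumption: Option3 encodes "no choice given", that is, the FooN2 case.
toFlat :: FooN -> Foo
toFlat (FooN1 n OptionN1) = Foo n Option1
toFlat (FooN1 n OptionN2) = Foo n Option2
toFlat (FooN2 n) = Foo n Option3

toNested :: Foo -> FooN
toNested (Foo n Option1) = FooN1 n OptionN1
toNested (Foo n Option2) = FooN1 n OptionN2
toNested (Foo n Option3) = FooN2 n
```

Both functions are total and inverse to each other, so nothing is lost by choosing the flat variant as the canonical one.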

Sometimes it is faster and easier (and even necessary, in the presence of deadlines) to do whatever grouping is wanted on the other side(s) of the interface than to argue about it, but hopefully at least defaulting to flatter structures would lead to tidier APIs.