Complexity of data models

Example and motivation

To begin with, an illustration of an awkward serialised structure (in JSON):

{ "date": "1/2/3"
, "type": "2"
, "key1": "foo"
, "value_1": "yes"
, "key2": "baz"
, "value_2": 0
, ... }

Similar structures are easy to find in the wild, and usually they are specified only informally, if at all. For example, the informal specification for the one above could state that "value_1" is optional but must be present if the optional "key1" is present, that whether "key2" is allowed depends on the presence of "key1", and so on. After a while somebody specifies "value_1" without "key1" anyway and declares new semantics for that case, which leads to a mismatch between expectations and the actual handling in all the other software, and then possibly to hacks in that software to cope with it. The specification can also require the "keyN" properties to be strings, and restrict the "value_N" ones to the fixed string "yes", the number 0, or null (with the same semantics as 0). The example also features inconsistent naming, a custom date format without a time zone, and a non-descriptive constant (in the "type" property).

In some schema languages and programming languages such a structure would be impossible to specify; in others it would merely be complicated, and not straightforward to translate to and from more suitable types. But the information it conveys isn't that complex, and doesn't have to be that hard to validate and work with.
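As a rough sketch of how little information the example above carries, the numbered properties can be converted into a plain list of key/flag pairs. Everything named here is an assumption made for illustration: the Entry and Record types, the parse_flag and parse_record functions, the reading of "yes" as true and of 0 or null as false, and the rule that a value without its key is rejected. The date and type are left as raw strings, since their intended meaning cannot be recovered from the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Entry:
    key: str
    flag: bool  # assumed semantics: "yes" -> True; 0 or null -> False


@dataclass(frozen=True)
class Record:
    date: str  # left unparsed: the original format is ambiguous
    type: str  # left as the opaque constant from the original
    entries: List[Entry]


def parse_flag(value: Any) -> bool:
    """Map the three allowed raw values onto a boolean."""
    if value == "yes":
        return True
    # JSON null arrives as None; booleans are excluded explicitly,
    # because False == 0 holds in Python.
    if value is None or (value == 0 and not isinstance(value, bool)):
        return False
    raise ValueError(f"unexpected value: {value!r}")


def parse_record(raw: Dict[str, Any]) -> Record:
    """Convert the flat keyN/value_N layout into a list of entries."""
    entries = []
    n = 1
    while f"key{n}" in raw:
        # Hypothetical rule: every key must come with its value.
        if f"value_{n}" not in raw:
            raise ValueError(f"key{n} has no value_{n}")
        key = raw[f"key{n}"]
        if not isinstance(key, str):
            raise ValueError(f"key{n} is not a string")
        entries.append(Entry(key, parse_flag(raw[f"value_{n}"])))
        n += 1
    # Anything left over is a value without a key, a pair after a gap
    # in the numbering, or an unknown property; this sketch rejects all.
    known = {"date", "type"}
    known |= {f"key{i}" for i in range(1, n)}
    known |= {f"value_{i}" for i in range(1, n)}
    stray = sorted(set(raw) - known)
    if stray:
        raise ValueError(f"unexpected properties: {stray}")
    return Record(raw["date"], raw["type"], entries)
```

Once the data is in this shape, the cross-property rules disappear: an entry cannot have a key without a value, and there is no numbering to get out of order.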

In some cases it's not so clear whether a model is overly complicated, or it is hard to reach agreement on that. So each time I see data models (data types, structures) that I find overly complicated, I wish there were some justified and well-specified metrics to apply.

Descriptive complexity

Descriptive complexity seems like a fine approach. I don't necessarily mean the exact definition of Kolmogorov complexity, which is tricky to apply, but rather the complexity of a structure's definition, measured for instance in the number of terms used and/or in the kinds of typing and logic required to specify it. Usually the types that are easy to define in a given language are also easy to work with, to analyse, and to reason about. Such a measure also complicates the use of custom string formats (or makes it impossible, depending on the language), which is often a good thing.
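To make "the number of terms used" concrete, here is a minimal sketch that counts the nodes of a type description. The tiny type language (Prim, Opt, ListOf, Product), the term_count measure, and both models are assumptions made for illustration, not an established metric. The flat model follows the numbered-property layout of the JSON example; the nested one replaces it with a list of key/flag pairs.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Prim:
    name: str  # a primitive type such as "string" or "bool"


@dataclass(frozen=True)
class Opt:
    item: "Type"  # an optional value


@dataclass(frozen=True)
class ListOf:
    item: "Type"  # a homogeneous list


@dataclass(frozen=True)
class Product:
    fields: Tuple[Tuple[str, "Type"], ...]  # a record with named fields


Type = Union[Prim, Opt, ListOf, Product]


def term_count(t: Type) -> int:
    """Count every constructor used in the type description."""
    if isinstance(t, Prim):
        return 1
    if isinstance(t, (Opt, ListOf)):
        return 1 + term_count(t.item)
    if isinstance(t, Product):
        return 1 + sum(term_count(ft) for _, ft in t.fields)
    raise TypeError(f"not a type description: {t!r}")


def flat_model(n: int) -> Product:
    """The numbered-property layout, with room for n key/value pairs."""
    fields = [("date", Prim("string")), ("type", Prim("string"))]
    for i in range(1, n + 1):
        fields.append((f"key{i}", Opt(Prim("string"))))
        fields.append((f"value_{i}", Opt(Prim("flag"))))
    return Product(tuple(fields))


# The same information as a list of key/flag pairs.
nested_model = Product((
    ("date", Prim("string")),
    ("type", Prim("string")),
    ("entries", ListOf(Product((
        ("key", Prim("string")),
        ("flag", Prim("bool")),
    )))),
))

print(term_count(nested_model))   # 7
print(term_count(flat_model(2)))  # 11
print(term_count(flat_model(5)))  # 23
```

The flat description grows with the number of pairs it has to allow, and the count still leaves out the rules tying each value to its key, which this type language cannot express at all. The nested description stays at the same size for any number of entries.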

Dependence on the choice of language is problematic: although particularly complicated models are likely to be filtered out with any common language (or schema), the results could differ in more subtle cases. So it seems to work as a rough filter and a rough estimate. Some use (one-tape, two-symbol) Turing machines to make the choice of language somewhat less controversial.

I'm inclined to use functional programming languages and the calculus of constructions for modelling, which seems fairly practical and not too arbitrary. But that choice is debatable, and it wouldn't work as an argument in a discussion, since it involves relatively obscure type theory, which would likely sound like gibberish to those who are not familiar with it and/or don't like to deal with typing and schemata in general.

Occam's razor

As with scientific modelling, forming one's worldview, or investigating anything in general, it seems useful to employ Occam's razor and avoid introducing unnecessary entities if a model can be more "compact" without them. This applies especially to "made up" entities: those hypothesized or otherwise introduced just to fit the others. I think it helps to avoid things like supernatural beliefs and poorly justified conspiracy theories, as well as poorly justified and redundant data structures. Informally, I picture it as taking the available pieces and spending more time on finding how to put them together neatly, rather than declaring that impossible, or putting them together arbitrarily and then making up additional ad hoc pieces to support the result. The appearance of new pieces will sometimes require redoing such models if one also tries to keep them neat, but it may be better to refactor occasionally than to give up and live with a mess.

Anti-patterns

Lists of anti-patterns are likely to be incomplete and even more debatable, but they tend to be easier to convey. For instance, by searching for "JSON anti-patterns" (since a search for "data model anti-patterns" leads to database-specific cases), I found a Stack Overflow question describing one of the issues illustrated in the example above, with a reference to "Arrject" on The Daily WTF.

Perhaps these can be combined with justifications via descriptive complexity, with some examples of manipulations in different languages.