Complexity of data models

Ease of use is a desirable property of data models, the lack of which is particularly annoying in APIs and other kinds of shared models. But programmers tend to view types and their serialisation through the tools they use to manipulate them, and often only in application to the task at hand, so it can be hard to settle on models, and easy to end up with large amounts of code translating and validating many partially compatible ones. Here is an outline of the problems that I keep encountering, and of potential solutions.

Example and motivation

An illustration of an awkward serialised structure (in JSON) to begin with:

{ "date": "1/2/3"
, "type": "2"
, "key1": "foo"
, "value_1": "yes"
, "key2": "baz"
, "value_2": 0
, ... }

Similar structures are easy to find in the wild, and usually they are specified only informally (if at all): e.g., an informal specification for the above one could state that "value_1" is optional but must be present if the optional "key1" is present, that whether "key2" is allowed depends on the presence of "key1", and so on. It could also restrict the types of the "keyN" properties to strings, and the "value_N" ones to the fixed string "yes", the number 0, or null (with the same semantics as 0). Naming inconsistency, a custom date format without a time zone, and a non-descriptive constant (in the "type" property) are also featured.

In some schema languages and programming languages such a structure would be impossible to specify precisely; in others it would be merely complicated, and not straightforward to translate to and from more suitable types. But the information it conveys isn't that complex, and doesn't have to be that hard to validate and work with.
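For instance, here is a minimal Haskell sketch of a type carrying roughly the same information; the names, the event kinds, and the reading of 0 and null as a single negative case are assumptions, since the format above doesn't spell them out:

import Data.Time.Calendar (Day)

-- "yes", 0, and null collapse into two cases (the names are assumptions).
data Flag = Yes | No
  deriving (Show, Eq)

-- Hypothetical descriptive names for whatever the numeric "type" codes mean.
data EventKind = KindA | KindB
  deriving (Show, Eq)

data Event = Event
  { eventDate    :: Day               -- an unambiguous calendar date
  , eventKind    :: EventKind
  , eventEntries :: [(String, Flag)]  -- "keyN"/"value_N" pairs as a plain list
  } deriving (Show)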

In some cases it's not so clear whether a model is overly complicated, or it is simply hard to reach an agreement. So each time I see data models (data types, structures) that I find overly complicated, I wish there were some justified and well-specified metrics to apply.

Descriptive complexity

Descriptive complexity seems like a fine approach: not necessarily Kolmogorov complexity by its exact definition, which is tricky to apply, but the complexity of a structure definition, measured in the number of terms used, for instance, or in the kinds of typing and logic required to specify it. Usually the types that are easy to define are also easy to work with in a given language, to analyse, and to reason about. Such a metric also complicates (or rules out, depending on the language) the use of custom string formats, which is often a good thing.
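As a rough illustration of such counting, a more literal Haskell encoding of the example above (one assumed reading of the format) takes noticeably more terms than a list-of-pairs model would, and still only approximates the presence dependencies:

data RawValue = RawYes | RawZero | RawNull

-- One constructor per supported number of pairs, since whether "key2" is
-- allowed depends on the presence of "key1", and so on.
data RawPairs
  = NoPairs
  | OnePair  String RawValue
  | TwoPairs String RawValue String RawValue
  -- ... one more constructor for each further "keyN"/"value_N"

data RawEvent = RawEvent
  { rawDate  :: String   -- "1/2/3": custom format, no time zone
  , rawType  :: String   -- "2"
  , rawPairs :: RawPairs
  }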

The dependence on the choice of language is problematic: any common language (or schema) is likely to filter out particularly complicated models, but the results could differ in more subtle cases. So it works as a rough filter and a rough estimate.

I'm inclined to use functional programming languages and the calculus of constructions for modelling, which seems fairly practical and not too arbitrary, but that choice is arguable, and it wouldn't work as an argument in a discussion, since it involves relatively obscure type theory: it would likely sound like gibberish to those who are not familiar with it, or who don't like dealing with typing and schemata in general.
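Just as a sketch of what that looks like (approximated here with promoted data kinds in Haskell, rather than the calculus of constructions proper), the dependency between a key and its value can be pushed into the type itself:

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Presence = Present | Absent

-- A key/value slot indexed by whether the key is present at all:
-- a value can only be given together with its key.
data Slot (p :: Presence) where
  Filled :: String -> Bool -> Slot 'Present   -- key name and its flag value
  Empty  :: Slot 'Absent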

Anti-patterns

Lists of anti-patterns are likely to be incomplete and even more arguable, but they tend to be easier to convey. For instance, by searching for "JSON anti-patterns" (since a search for "data model anti-patterns" leads to database-specific cases), I've found a Stack Overflow question describing one of the issues illustrated in the example above, with a reference to "Arrject" on The Daily WTF.
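The numbered "keyN"/"value_N" properties above are a variation of that anti-pattern: a list encoded as object fields. Here is a sketch of the translation code it tends to force, assuming the fields have already been parsed generically into an association list of strings:

-- Collect "key1"/"value_1", "key2"/"value_2", ... back into the list they
-- encode, stopping at the first missing pair.
collectPairs :: [(String, String)] -> [(String, String)]
collectPairs fields = go 1
  where
    go :: Int -> [(String, String)]
    go n =
      case ( lookup ("key" ++ show n) fields
           , lookup ("value_" ++ show n) fields ) of
        (Just k, Just v) -> (k, v) : go (n + 1)
        _                -> []

E.g., collectPairs [("key1", "foo"), ("value_1", "yes"), ("date", "1/2/3")] yields [("foo", "yes")]; with a list-of-pairs model no such reconstruction (or the matching validation rules) would be needed.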

Perhaps such lists can be combined with justifications via descriptive complexity, and with some examples of manipulations in different languages.