Data wrangling

Here are my notes and observations on ways to make mappings between data models less awkward or painful.


Models approximate the things they model in different ways and focus on different aspects of them, and often they are explicitly inaccurate or imprecise. Conversions between them are therefore often necessarily lossy, and on top of that accidentally lossy because of poorly specified models, poorly designed ones (see "complexity of data models"), or poor mappings. As usual with the necessary/accidental distinction, necessary data loss must be accepted, but accidental data loss should be resisted.
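
One way to keep accidental loss visible is to have a conversion report what it could not carry over. The following is only a minimal sketch under my own assumptions: the field names, the FIELD_MAP table, and the convert_record function are all hypothetical.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical mapping from source field names to target field names.
FIELD_MAP = {"name": "title", "created": "date"}

def convert_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Set[str]]:
    """Convert a record and also return the names of the fields that were dropped."""
    converted = {dst: record[src] for src, dst in FIELD_MAP.items() if src in record}
    dropped = set(record) - set(FIELD_MAP)
    return converted, dropped

converted, dropped = convert_record(
    {"name": "Report", "created": "2020-01-01", "author": "A."}
)
```

Here the target model has no place for "author", so it ends up in the dropped set, where the caller can log it or decide whether the loss is acceptable.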

Many-to-many mappings

As systems grow, requirements for many-to-many mappings between models pop up: I usually encounter requirements to work with multiple databases on one side (different versions, or different projects), and multiple kinds of devices (working over different protocols, providing or consuming slightly different data) on the other. But similar things happen with document converters, network protocol bridges, and so on.

A good conversion needs general enough structures to pass the data through, likely (though not necessarily) the most general ones in the system.
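
A rough sketch of what such a structure can look like, assuming two made-up device protocols ("a" and "b") and two made-up database schema versions ("v1" and "v2"); the Reading type and all field names are hypothetical. The point is that each side needs only one function per format, rather than one per pair of formats.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Reading:
    """Hypothetical general intermediate structure."""
    device_id: str
    timestamp: float                  # seconds since the epoch
    values: Dict[str, float]
    extras: Dict[str, Any] = field(default_factory=dict)  # data without a dedicated slot

def from_protocol_a(msg: dict) -> Reading:
    return Reading(msg["id"], msg["ts"], {"temp_c": msg["temp"]})

def from_protocol_b(msg: dict) -> Reading:
    # Protocol "b" is assumed to report milliseconds, Fahrenheit, and a firmware tag.
    return Reading(
        device_id=msg["dev"],
        timestamp=msg["ms"] / 1000.0,
        values={"temp_c": (msg["temp_f"] - 32) * 5 / 9},
        extras={"firmware": msg.get("fw")},
    )

def to_db_v1(r: Reading) -> dict:
    # Necessarily lossy: the assumed v1 schema has whole seconds and no metadata.
    return {"device": r.device_id, "time": int(r.timestamp), "temp": r.values["temp_c"]}

def to_db_v2(r: Reading) -> dict:
    return {"device": r.device_id, "time": r.timestamp,
            "values": dict(r.values), "meta": dict(r.extras)}

PARSERS: Dict[str, Callable[[dict], Reading]] = {"a": from_protocol_a, "b": from_protocol_b}
WRITERS: Dict[str, Callable[[Reading], dict]] = {"v1": to_db_v1, "v2": to_db_v2}

def convert(source: str, target: str, msg: dict) -> dict:
    """Convert between any source and target by going through Reading."""
    return WRITERS[target](PARSERS[source](msg))
```

Adding a third protocol then means writing one more parser, not one more converter per database version.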


When there is some control over some of the involved data models (e.g., a database for storage), it is tempting to make them general enough to have lossless conversions into, or to cover all the possibilities on conversions from. That can take considerable effort, and is easily ruined by changes in the data models, which seem to happen regularly in evolving systems.

Another approach to data preservation, if lossy conversions are a concern, is storage of the data in its raw form before a conversion, for possible future re-reading and re-conversion.
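
As an illustration only, here is a small sketch of that idea for standalone chunks of data: the raw input is written to a file named after its SHA-256 digest before conversion, and can be converted again later by an improved converter. The JSON input, the two converter versions, and the directory layout are my assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List

def store_raw(raw: bytes, directory: Path) -> Path:
    """Write the unmodified input to a content-addressed file and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / hashlib.sha256(raw).hexdigest()
    if not path.exists():
        path.write_bytes(raw)
    return path

def convert_v1(raw: bytes) -> dict:
    # The first converter version ignores everything except the name.
    return {"name": json.loads(raw)["name"]}

def convert_v2(raw: bytes) -> dict:
    # A later version also keeps tags, which v1 dropped.
    doc = json.loads(raw)
    return {"name": doc["name"], "tags": doc.get("tags", [])}

def reconvert_all(directory: Path, converter: Callable[[bytes], dict]) -> List[dict]:
    """Re-read every stored raw input and convert it again."""
    return [converter(p.read_bytes()) for p in sorted(directory.iterdir())]

with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    raw = b'{"name": "sensor-1", "tags": ["lab"]}'
    store_raw(raw, raw_dir)
    first = convert_v1(raw)                      # what was stored at the time
    later = reconvert_all(raw_dir, convert_v2)   # retroactive re-conversion
```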

This is straightforward with files and similar standalone chunks of data, but potentially less so with network protocols: while their sessions can be stored in pcap files, for instance, that is less suitable for long-running sessions, those unbounded in time. At that point a continuous log is more suitable: not exactly of raw I/O events and transmitted data, since that data can be encrypted with session keys that are thrown away afterwards, but of readable packets, basically an audit trail. Event sourcing is somewhat similar, though apparently it focuses more on restoring an application state, not necessarily on things like parsing a part of a time series database anew.
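
A minimal sketch of such a log, assuming packets are already decrypted and decoded into dictionaries, and using one JSON object per line; the entry layout and the function names are hypothetical. An in-memory buffer stands in for a file opened in append mode.

```python
import io
import json
import time
from typing import IO, Iterator

def log_packet(log: IO[str], direction: str, packet: dict) -> None:
    """Append one decoded packet to the log as a single JSON line."""
    entry = {"time": time.time(), "dir": direction, "packet": packet}
    log.write(json.dumps(entry) + "\n")
    log.flush()

def replay(log: IO[str]) -> Iterator[dict]:
    """Yield the logged entries in order, for re-parsing or debugging."""
    for line in log:
        if line.strip():
            yield json.loads(line)

log = io.StringIO()  # stands in for open("audit.log", "a+")
log_packet(log, "in", {"type": "reading", "temp": 21.5})
log_packet(log, "out", {"type": "ack"})
log.seek(0)
entries = list(replay(log))
```

Since entries are only ever appended, the log can keep growing for as long as the session lives, and can be rotated by time or size like any other log.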

Any such source data storage comes with caveats, though: it requires additional storage and maintenance. If the source data must be stored before processing, that introduces an additional point of failure on the main data path; if the rest of the system can keep going when source data storage fails, the stored source data cannot be relied upon to be complete.
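
The trade-off can be sketched as a single flag. This is an illustration with hypothetical names (process, strict, raw_stored), not a recommendation of either mode.

```python
from typing import Callable

def process(raw: bytes,
            store: Callable[[bytes], None],
            convert: Callable[[bytes], dict],
            strict: bool) -> dict:
    """Store the raw input, then convert it; `strict` decides what a storage failure means."""
    try:
        store(raw)
        stored = True
    except OSError:
        if strict:
            raise        # storage is on the main data path: its failure stops processing
        stored = False   # processing continues, but the raw archive now has a gap
    result = convert(raw)
    result["raw_stored"] = stored
    return result

def failing_store(raw: bytes) -> None:
    raise OSError("disk full")  # simulated storage failure

def measure(raw: bytes) -> dict:
    return {"length": len(raw)}

lenient = process(b"abc", failing_store, measure, strict=False)
try:
    process(b"abc", failing_store, measure, strict=True)
    strict_failed = False
except OSError:
    strict_failed = True
```

Recording whether the raw input was stored at least makes the gaps in the non-strict mode discoverable later.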


I think a viable approach consists of doing best-effort conversions between data models, taking into account that they change (so not spending much time or effort on adjusting some of them to fit others: yet other models will appear and nullify those efforts), and possibly employing source data storage (audit trails) to allow for retroactive adjustments, as well as to help with debugging.