Picking databases

From time to time programmers investigate which databases are around, especially when requirements change, and here are some of the results of my recent investigation, as well as some older and generic observations.

1 Avoiding workarounds

With sufficient confidence and ignorance, anything that stores data and allows to retrieve it afterwards can be used for virtually any storage-related task, which would be a special case of the golden hammer. On the other hand, it is not always easy (or possible) to find a perfect solution, so it is mostly about looking for a solution that would minimize the effort required to use it, but not by the means of the first case (i.e., implementing missing features improperly). What may be hard to avoid, since it is hard to know when you are doing something wrong, if you are doing it wrong already (and it can always be arguable, too).

2 CAP and type

There is a bunch of database properties which may be required, but those are probably the primary two in most cases. As mentioned above, the goal is to find a DB engine that would fulfill both, or, if there is no such database, then something that will require minimal workarounds to make it work as required.

2.1 SQL vs NoSQL

The choice is not really between SQL and NoSQL, but between relational, key-value, document-oriented, column-oriented, graph-oriented, or any other types of databases; usually relational ones are "SQL", while others are "NoSQL". But it does not seem to be important at all (in most cases) whether they support SQL or not: SQL is mostly used by programmers to hardcode queries (well, hopefully templates with placeholders, but not a big difference), or even by ORMs/libraries to compose them – not by non-programmers to retrieve data, as it supposed to be used, and rarely even by programmers, analysts, or administrators – to retrieve it. It is both funny and sad how various technologies intended to be used by non-programmers get to be used almost exclusively by programmers, while non-programmers use even worse tools.

2.2 Structure

It is pretty much what defines a required database type, yet depends on various factors itself. One approach is to define structure first and write queries later, another one is to consider queries first and model a structure to fit them then, and optimizing structure for queries is somewhere in the middle. Different tasks require different approaches, of course, and there also are biases.

Yet, a rough picture can be derived at this point: say, key-value storage will not be suitable for time series.

2.3 CAP

The CAP theorem should also be kept in mind: those are the requirements which are hard to make proper workarounds for – e.g., turning a CA database into an AP one is not a trivial task, unless there exists a replicator already. Though it is also criticized sometimes, what should also be kept in mind.

2.4 CAP + type

Here is a nice illustration; it is not quite comprehensive, yet covers some of the popular databases.

Though it is also a bit simplified here: say, MySQL can provide AP. And there are some third-party replication tools, too.

3 Other factors

Storage limits, reliability, performance, tools, community, maintenance activity, language bindings, and many other things should be considered, of course. As usual, it is about picking a tool for a task, and fighting biases; but to be able to do that efficiently, one should learn about different options and try them.

There's plenty of comparisons around; here is one for nosql, and there always are Wikipedia comparisons, wikivs, and search engines.

4 A bit of personal experience

  • MySQL (and MariaDB now) is the most popular one, and is often used by default, even if not really good for a task at hand. Not sure if there are any tasks for which it is really good though, and it seems a bit like PHP of databases. Related: Do Not Pass This Way Again. Though it works, and not that bad.
  • PostgreSQL: never had a chance to use it as a primary DBMS in a big project, though used for various smaller ones; it has neat features, and appears to be generally nicer than MySQL.
  • SQLite is rather simple, but nice for embedded databases (which is its purpose).
  • Firebird was slow and weird when I had to work with it. Maybe it was just a poor database structure though, and never used it afterwards.
  • MS SQL, Oracle: proprietary, avoiding whenever possible.
  • MongoDB didn't seem to be bad, though it was a relief to abandon it. Maybe other factors played a role back then: either my program was buggy, or its C bindings were; but either way, I am not too eager to try it again. Also, there's an article: Broken by Design: MongoDB Fault Tolerance.
  • Cassandra seems to be pretty nice (apart from being written in Java; not that it is necessarily a bad thing, but somehow I had mostly bad experience with those few programs written in it which I have used, and partially blamed the language – that's how biases arise): initially found it when looked for a reliable and fast database that would handle big amount of time series data, and then discovered once again when looked for an AP database. Maybe it is not mature enough yet, and I'm finding Java + DataStax (and maybe even Apache, which writes pretty cool things, yet there's often something weird around it – like Java) to be a bit appaling, had only pleasant experience with it after a few months in production. Upd: consumes relatively much memory by default, maybe better to avoid if it's not really needed to handle a lot of data.
  • CouchDB provides very nice and easy to use replication: AP, bi-directional and revisions-based (git-alike), with filtering functions (somewhat like views), pull/push (from local to remote, or from remote to local servers). There also is a web interface, which is handy to edit JSON. Somewhat appaling things about it include Erlang, HTTP (+ curl and JSON for administration; could be worse, and not hard to write wrappers, but still weird), JS for filters, and maybe Apache (though the latter is just what makes me to expect the other things, not bad by itself, and I am not entirely serious about those things being bad anyway), but overall it seems to be good for JSON documents when AP and customizable replication are required. Though have not yet used it in production at the time of writing.
  • File systems are like hierarchical key-value stores, and various types of synchronization (aka replication) can be set between them, too. Some of them provide interesting features and good optimizations, and sometimes there's no reason to add unnecessary abstraction layers on top of that. It is also relatively easy to migrate from one to another, and even to replicate data between different ones. Nice ones currently include Btrfs (which I'm using for a few years now) and ZFS.