Cassandra Gotchas / Best Practices / Flaws
Cassandra is a distributed NoSQL database.
One of the most annoying things about introducing new developers to Apache Cassandra has been the counter-intuitiveness of parts of the design. Another annoying thing has been introducing them to the myriad tools and commands needed for managing Cassandra.
I plan to summarise here the major problems new developers seem to have when first encountering the database. First the obvious, and often mentioned, ones. Then the gotchas which I keep having to explain afterwards.
To TLDR myself: if you/someone in your company doesn’t already know the things I’m highlighting, think about choosing a different database system.
“Cassandra” does not refer to a single product. There are, at the time of writing this, 3 products I would consider flavours of Cassandra:
- Apache Cassandra: The original, open-source, database. If you’re not sure which you’re looking for, use this one.
- DataStax Enterprise (DSE): A paid, enterprise-focused fork of Apache Cassandra, providing extra functionality mainly useful to enterprise customers: metrics, a dashboard for monitoring and alerting, and so on. It tends to be quite expensive, and for small projects/companies the free open-source version will work fine.
- ScyllaDB: This is meant to be a drop-in replacement for Cassandra, written in C++. At the time of writing, I do not consider this database production-ready, with many needed features still unavailable.
Choosing one does not lock you out of switching later; however, migrating becomes more and more expensive in downtime and resources as time goes on.
Due to the design of the database, as a distributed key-value/tabular database system, Cassandra does not support any form of JOINs, subqueries, or aggregates. This is sometimes touted as an overwhelmingly positive feature of the database, however I class it as a flaw.
Tables cannot be joined, therefore “denormalisation” happens at the data-model level, and the choices made there must be made by someone with intricate knowledge of Cassandra, to prevent the creation of data structures that require dozens of queries to retrieve all the data.
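As an illustration, the standard workaround is one table per query pattern; the table and column names below are hypothetical:

```cql
-- Without JOINs, the same user data is duplicated into one table per lookup:
CREATE TABLE users_by_id (
    user_id uuid PRIMARY KEY,
    email   text,
    name    text
);

CREATE TABLE users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
);
-- Every write must now update both tables, and keeping them in sync
-- is the application's problem.
```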
Subqueries do not exist, therefore any slightly complex operations must happen at the service-level outside of the database. This implies a huge amount of data being transferred over the network for even simple purposes, on top of the extra latency.
Aggregates do not exist (besides COUNT in a limited form), therefore data must be retrieved into application memory in the service and the aggregate must be run on the data there.
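In practice this means every “aggregate” becomes a fetch-then-compute loop in the service. A minimal sketch of what would be a one-line AVG in SQL, with a plain list standing in for the result of paging through a full CQL SELECT:

```python
# Client-side aggregation sketch: what would be AVG(value) in SQL.
# `rows` stands in for the result of paging through a full CQL SELECT,
# meaning every row crosses the network before any maths happens.

def average_value(rows):
    total, count = 0.0, 0
    for row in rows:
        total += row["value"]
        count += 1
    return total / count if count else None

rows = [{"value": 10.0}, {"value": 20.0}, {"value": 30.0}]
print(average_value(rows))  # 20.0
```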
While some claim that this simplifies the database, and forces the logic to be moved to the service level (where it should be), the trade-off is a significant increase in the minimum level of Cassandra proficiency required for any developer to create or modify queries or database layout.
Cassandra does not natively use SQL, instead opting for a similarly designed language dubbed Cassandra Query Language or CQL. This language is similar enough to SQL that familiar people will be able to write simple queries.
It is also different enough that knowledge of Cassandra internals is required for even minor operations.
Creation of tables (or column families, as they are called) requires the user to set various parameters, such as compaction, so that they do not default to pre-defined values which are not ideal for a considerable proportion of use-cases.
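For instance, a table definition overriding the default compaction strategy might look like this (the table itself is made up; LeveledCompactionStrategy is one real alternative to the default SizeTieredCompactionStrategy):

```cql
-- Hypothetical table, choosing leveled compaction for a read-heavy workload
-- instead of the default size-tiered strategy.
CREATE TABLE sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {'class': 'LeveledCompactionStrategy'};
```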
Simple queries will seem to succeed, yet have unintended consequences or return seemingly incorrect data, because the CONSISTENCY level must be set appropriately for your desired behaviour.
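In cqlsh, for example, the consistency level is session state set separately from the query itself; QUORUM here is one common choice, and the table and key are placeholders:

```cql
CONSISTENCY QUORUM;
SELECT * FROM sensor_readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;
```

Forget the first line and the query silently runs at the default consistency level instead, returning whatever a single replica happens to hold.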
To make things worse, there are numerous seemingly innocuous queries which can take down the entire cluster of servers, with no warning to those unfamiliar with Cassandra internals, such as changing the compaction parameter on a production table. The same query might well succeed in development environments, where load is smaller.
One of the biggest gripes with the database is a well documented one, which is the inconsistency of concurrent queries.
To put it simply, updates are applied to columns separately (depending on how the query was written), which can cause two separate UPDATE queries to take effect at once, one winning for some columns and the other for the rest.
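The resolution rule behind this is per-column last-write-wins on the write timestamp. A pure-Python sketch of the merge (column names and timestamps are illustrative, not a driver API):

```python
# Sketch of Cassandra's per-column last-write-wins conflict resolution.

def merge_updates(*updates):
    """Each update maps column -> (timestamp, value). Cassandra keeps,
    for every column independently, the value with the highest timestamp."""
    merged = {}
    for update in updates:
        for column, (ts, value) in update.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {col: val for col, (ts, val) in merged.items()}

# Two "concurrent" UPDATEs touching overlapping columns:
update_a = {"name": (100, "alice"), "email": (100, "a@example.com")}
update_b = {"name": (99,  "bob"),   "city":  (99,  "Berlin")}

row = merge_updates(update_a, update_b)
print(row)  # name/email come from A, city from B: a row neither client wrote
```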
The way Cassandra is installed and configured is fairly straight-forward. The service is installed on your chosen servers, with somewhat sane defaults being automatically chosen for most configuration options.
Setting up a single-node cluster is trivial, and setting up a 3-node cluster is simple with the help of plentiful documentation. Almost none of this documentation includes information on critically relevant options which influence the future growth of the cluster.
There are a number of processes (compaction, compression, tombstone thresholds) which are Cassandra-specific and the correct values for those have to be chosen taking into consideration: the data size of the database (current and future), the hardware of the server, the desired throughput of the database, the planned load of each server.
There are also a number of options which must be configured exactly the same on each node in order to take effect and work correctly including through faults (seed list, snitch, bootstrapping). These are not highlighted in any useful way in existing documentation.
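As an illustration, the handful of cassandra.yaml settings that must agree across every node might look like this (all values are placeholders):

```yaml
# cassandra.yaml -- settings that must match on every node (values illustrative)
cluster_name: 'my_cluster'          # must be identical cluster-wide
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"  # same seed list on all nodes
endpoint_snitch: GossipingPropertyFileSnitch  # never mix snitches in one cluster
auto_bootstrap: true   # not present in the default file; false only for initial seeds
```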
To make things worse, the documentation is misleading in suggesting certain values that then cause problems once the cluster reaches a certain size (e.g. the snitch), even though recommending the better values would have no possible ill effect for users with no knowledge of Cassandra internals.
Even if the person setting up the cluster is knowledgeable enough to skip over all of these issues, the actual values for certain options are dependent on data that is not available via stress testing, and are essentially intuition-based, which is not a property you want in a database.
One of the least-understood Cassandra specifics is the arcane mechanism called ‘compaction’. The best reference I have found on the subject is a DataStax post on compaction which highlights the difference between two general-use compaction strategies.
I cannot overstate how little information and warning there is about this subject, considering the huge implications for any production database.
Compaction directly affects:
- Data size on disk
- Query latency
- Input/output hardware requirements
- Fault tolerance in the case of extended unplanned downtime
- Chance of sudden failure due to regular background compactions
The effect on all of these is also abstract and only vaguely measurable by people very knowledgeable about Cassandra internals. Even simple changes to compaction configuration can cause sudden unplanned downtime as a compaction ends up using up enough CPU that the entire cluster spirals into downtime.
The explanation for all of this, from every Cassandra user I have discussed it with, is that you are meant to run enough nodes that an on-call engineer can react before any downtime spiral completes, or, better yet, that the chance of one becomes ‘so low as to be negligible’.
I expect I do not have to highlight how this is a problem, especially as this is not spelled out anywhere.
An off-shoot of the aforementioned compaction woes, tombstones are Cassandra’s mechanism for deleting rows. When a row is deleted, it is not actually removed until a compaction covering it occurs; instead, a deletion marker (the tombstone) is written.
Depending on your compaction strategy (and innumerable other factors), this might mean that the very first row you insert into your database, then delete, could theoretically still have both its value and its deletion marker stored in two separate files for the entire lifetime of the cluster.
More importantly, Cassandra has major issues with delete-heavy workloads. Any data that exists temporarily in the database, or which is deleted in-between every update, is a ticking bomb waiting to explode.
Once a certain number of tombstones is reached (configured in the node configuration, also not highlighted in the documentation at all), a warning is emitted in the logs, to which an experienced Cassandra developer will react by changing values, queries, or table layout as appropriate.
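The thresholds in question live in cassandra.yaml; the defaults, shown below, apply per query:

```yaml
# cassandra.yaml tombstone thresholds (defaults shown)
tombstone_warn_threshold: 1000       # log a warning when a single query scans this many tombstones
tombstone_failure_threshold: 100000  # abort the query entirely past this point
```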
The reasons ever-increasing tombstone numbers are a problem are many-fold:
- Compaction configuration failure directly affects tombstone deletion. If compaction fails, tombstone numbers skyrocket.
- High tombstone numbers indirectly affect compaction load. With enough tombstones, compaction can become expensive enough that its configuration must be changed, which in turn affects tombstone numbers in a nigh-unpredictable way.
- Extended unplanned downtime or partial data loss can cause tombstoned rows to be resurrected. I cannot emphasise how much of a problem this is for inexperienced Cassandra users, especially with barely any useful mention of tombstones in the documentation. Old data suddenly existing, and rows changing values unpredictably, is such a major failure that it can imply restoring to potentially day-old backups to revert.
In order to actually manage Cassandra, there are three available tools generally, each with their own specific capabilities:
- CQL: Modify data layout, change data, etc.
- nodetool: Directly query node or cluster information and change values or run commands. See section below for specifics.
- JMX: A low-level access to Cassandra nodes. Tends to be used solely for on-the-fly changes of internal state.
The process of setting up database nodes, spawning a new cluster, and setting up the system involves not only configuring the individual nodes, but also running at least two of these tools for various commands.
The separation of responsibility between them (up to and including how you authenticate with each) is sometimes unclear and even counter-intuitive. The documentation is thankfully clear about which commands to use with which tool, however confusion still occurs often.
As mentioned above, nodetool is used to talk directly to a node or cluster in order to get information, change values, or run commands.
The documentation of the available commands is extensive, however only in a purely mechanical manner. There is nearly no documentation as to which commands are useful for which purposes. To make things worse, even the explanations are ambiguous sometimes, to the point of being unclear as to whether a command would affect a single node or the entire cluster.
For how important this tool ends up being for managing Cassandra, it is also terribly user-unfriendly, with most users ending up with scripts that run common diagnostics in a row constantly in order to receive streams of updates on the cluster/node status.
As is to be expected given the breadth of its capabilities, this tool also has massive potential for taking down the entire cluster, from setting innocuous local variables to killing a node or starting an unsustainable cluster-wide repair. There are no checks on the feasibility of a command (it does not warn when you try to repair across 5 datacenters scattered around the globe) or on the intent of the user (killing a node is a single command with no confirmation).
Another of the biggest problems, yet least explained mechanisms, that Cassandra has is the concept of a ‘repair’. Despite the name, this is not an emergency tool or even automated process.
Repairing in Cassandra is literally a mechanism that allows nodes to “double-check” their local data against other nodes. This is necessary due to the inherent inconsistency and distributed nature of the database. Without dwelling on the absurdity of requiring this brute-force approach to guarantee a decent level of service, I will instead focus on the two biggest flaws of Cassandra that it reveals.
Repairs have to be run regularly by the user.
This process is absolutely required for any sane production environment, yet the user must schedule a task to run it themselves, with barely a mention of this in the documentation.
Repairs can take down the entire cluster.
A repair is insanely expensive, and can literally lock up the entire cluster as it re-checks all of its available data. This is probably part of why the default setup does not configure a cron job to run repairs: the user must schedule them according to their own database setup and traffic patterns, at the lowest-load point of the day.
Besides this, repairs increase data size unexpectedly, to the point where having only 49% of disk space free is enough to cause a repair to fail, and thus lock up the node, because it may need a full 50% of the disk free to store the remote copies of its own data. This is not mentioned anywhere, to my knowledge.
To make things even worse, repairs are node-specific, and scheduling becomes a nightmare, as a node running a repair becomes inordinately slow to query. To the point where if enough nodes repair at the same time, the cluster could feasibly stop functioning in a spiral of death.
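The usual workaround is a per-node cron entry, staggered so that no two nodes repair at once. A sketch, assuming nodetool is on the PATH and 03:00 Sunday is this node’s off-peak slot (-pr restricts the repair to the node’s primary token ranges):

```cron
# /etc/cron.d/cassandra-repair -- illustrative; stagger the day/hour per node
0 3 * * 0  cassandra  nodetool repair -pr
```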
Despite claiming to be fault tolerant, Cassandra is tremendously intolerant of faults. While it is rare for a node outage to kill the entire cluster, a single node outage is enough to invalidate most of the dataset of the cluster in a default configured 3-node cluster accessed by clients with default parameters.
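The consistency arithmetic makes this concrete: reads only reliably see the latest write when the read and write replica counts overlap (R + W > RF), and the default client consistency level of ONE (or LOCAL_ONE, depending on the driver) does not satisfy this. A quick sketch:

```python
# Sketch: when do Cassandra reads reliably see the latest write?
# Strong consistency requires R + W > RF, so that every read replica set
# overlaps every write replica set.

def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    return read_replicas + write_replicas > replication_factor

RF = 3
# Default client consistency level of ONE for both reads and writes:
print(is_strongly_consistent(1, 1, RF))  # False -> stale reads are possible
# QUORUM reads and writes (2 of 3 replicas each) do overlap:
quorum = RF // 2 + 1
print(is_strongly_consistent(quorum, quorum, RF))  # True
```

With one of three nodes down and consistency ONE, nothing forces the surviving replicas consulted on a read to be the ones that saw the last write.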
The reasons for this have been outlined above (tombstones, repairs), however there is one more important reason.
Cassandra’s general approach to fault tolerance is to try to prevent faults and, when that fails, to fail ungracefully. A node that has been down for long enough (the time period is configured in every other node’s config files), then comes back online normally, will continue serving its completely stale data.
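The window in question is, if I understand the mechanism correctly, the hinted-handoff window, configured in cassandra.yaml (default shown):

```yaml
# cassandra.yaml -- how long other nodes store hints for a dead node
max_hint_window_in_ms: 10800000  # 3 hours; beyond this, only a repair resynchronises it
```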
If there is an outage, undetected, or impossible to deal with for whatever reason, then there is a chance your dataset may be permanently tainted as services grab stale data, then re-save it or try and modify it based on stale data.
This is not fault tolerance in any form.
Cassandra is a very good database for very specific purposes, in the hands of very skilled people who have learned Cassandra internals, quirks and workarounds over years of working with it. For everything else, it is a minefield.
If you are looking to use Cassandra for your projects, and you do not have someone who is already knowledgeable with Cassandra (and knows all of the aforementioned things), then be prepared for constant surprises which will not appear until your product reaches enough users that the problems become apparent.
That is to say, the problems will show up exactly when you want them the least.