Apache Cassandra is a column-family NoSQL data store designed for write-heavy persistent storage in Python web applications and data projects.
Cassandra is commonly used with Python for write-heavy application demands. The following tutorials walk through several of the helper libraries that can be used to interact with Cassandra, with and without web frameworks such as Django.
DataStax's Python Cassandra driver can be installed as an application dependency to make it easier to access and work with Cassandra in your Python applications.
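As a minimal sketch of what working with the driver can look like, the snippet below connects to a cluster and inserts one row. The contact point `127.0.0.1`, the keyspace `demo`, the table `events` and its columns are assumptions made up for illustration, as are the helper names `build_insert` and `insert_event`. The driver import is deferred into the function so the module can be loaded without the driver installed or a cluster running.

```python
import uuid


def build_insert(table, columns):
    """Build a parameterized CQL INSERT using %s placeholders.

    The DataStax driver uses %s placeholders for non-prepared statements.
    """
    cols = ", ".join(columns)
    marks = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({marks})"


def insert_event(contact_points=("127.0.0.1",), keyspace="demo"):
    """Insert one row into the hypothetical `events` table."""
    # Imported lazily; requires `pip install cassandra-driver`.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(contact_points))
    session = cluster.connect(keyspace)
    try:
        query = build_insert("events", ["id", "kind"])
        session.execute(query, (uuid.uuid4(), "signup"))
    finally:
        # Close all connections held by this Cluster object.
        cluster.shutdown()
```

Calling `insert_event()` only succeeds against a reachable cluster in which the assumed keyspace and table already exist.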
Async Python and Cassandra with Gevent explains how to monkeypatch gevent into a Python 2.7 application and work with Cassandra using gevent's coroutines. Note that the post could instead have been written with asyncio had it targeted Python 3.
How to Install and Use Cassandra on Django shows how to use Cassandra with Django 1.8, but it should still be relevant for newer Django versions.
Using Cassandra with Python and uWSGI gives some short example code for connecting to a Cassandra cluster outside the HTTP request-response cycle to prevent timeouts and blocking issues with WSGI servers.
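One way to keep connection setup out of the request path is to create a single session per worker process and reuse it for every request. The sketch below is not the code from the post; it is a standard-library-only illustration in which `get_session` and the `connect` callable are hypothetical names. In a real application `connect` might wrap `Cluster([...]).connect()` from the DataStax driver.

```python
import os

# Maps process id -> session, so each forked worker gets its own session
# instead of inheriting one that was opened in the parent process.
_sessions = {}


def get_session(connect):
    """Return this process's session, creating it on first use.

    `connect` is a hypothetical zero-argument callable that opens a
    connection and returns a session-like object.
    """
    pid = os.getpid()
    if pid not in _sessions:
        # Runs once per worker process, not once per request.
        _sessions[pid] = connect()
    return _sessions[pid]
```

Calling `get_session` from a worker start-up hook, rather than from the first request handler, moves the connection cost entirely outside the request-response cycle. Request handlers then call `get_session` again and receive the cached object.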
The Stack Overflow thread asking about the best Cassandra library/driver for Python? has a good answer on why to use the datastax/python-driver project due to its CQL support and active development.
Cassandra performance in Python: Avoid namedtuple covers the performance penalty of using the namedtuple type with the DataStax Cassandra Python driver and how you can work around it.
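To get a feel for where such a penalty can come from, the following standard-library sketch times three ways of turning raw rows into Python objects. It does not use the driver and is not the benchmark from the post. The column names, row data, and function names are invented for illustration, and the functions only mimic the `(colnames, rows)` shape of the driver's row factories.

```python
from collections import namedtuple
import timeit

COLUMNS = ("id", "kind", "created_at")
ROWS = [(i, "signup", 1700000000 + i) for i in range(1000)]


def as_namedtuples(columns, rows):
    # A new namedtuple class is built on every call; generating the
    # class is costly compared with creating plain tuples or dicts.
    Row = namedtuple("Row", columns)
    return [Row(*r) for r in rows]


def as_dicts(columns, rows):
    return [dict(zip(columns, r)) for r in rows]


def as_tuples(columns, rows):
    # Rows are already tuples, so this is just a copy of the list.
    return list(rows)


if __name__ == "__main__":
    for fn in (as_namedtuples, as_dicts, as_tuples):
        seconds = timeit.timeit(lambda: fn(COLUMNS, ROWS), number=200)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

The driver ships `tuple_factory` and `dict_factory` in `cassandra.query` as alternatives to its default `named_tuple_factory`. Whether switching helps depends on your workload, so measure before changing anything.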
These resources are written by engineering teams at organizations that have large scale Cassandra deployments. The posts cover topics such as monitoring, scaling and usage with billions of records.
How Discord Stores Billions of Messages talks about the evolution of Discord's very large scale message store system from a MongoDB instance to Cassandra for storing messages in a distributed, replicated cluster.
Monitoring Cassandra at Scale explains how the Yelp engineering team uses Cassandra to complement their MySQL and Elasticsearch instances. The post does a nice job of enumerating the warning signs to monitor and provides a short example of a replication issue that their approach could catch.
How Uber Manages A Million Writes Per Second Using Mesos And Cassandra Across Multiple Datacenters shows why Uber needs accurate real-time data at large scale to make their driver and passenger operations run properly. The post goes into the overall architecture they use including cluster size, tolerable latency and other libraries in their stack.
Apache Cassandra can be used independently of Python applications for data storage and querying. The learning curve for getting started is similar to other NoSQL data stores, but scaling, performance and monitoring can be challenging. The following resources focus on addressing those issues, drawing on the experience of teams that have felt the pain and often released their resulting tools as open source projects.
The official getting started documentation for Cassandra provides installation, configuration, and basic querying information.
How Not To Use Cassandra Like An RDBMS (and what will happen if you do) gives examples in Cassandra's query language CQL of operations that are typical with relational databases but go terribly wrong with Cassandra, due to its NoSQL architecture that is optimized for other types of operations.
Cassandra Query Language (CQL) Tutorial explains the concepts and syntax behind the data management language that is Cassandra's equivalent to relational database SQL.
Backup and Recovery for Apache Cassandra and Scale-Out Databases covers issues encountered when trying to take snapshot backups of Cassandra due to partitions and consistency lag time that occur with just about every Cassandra setup.
Getting the Most Out of Cassandra is a video on data modeling and application development for developers new to Cassandra.
The Total Newbie’s Guide to Cassandra compares Cassandra to traditional relational databases.
On Cassandra Collections, Updates, and Tombstones and Undetectable tombstones in Apache Cassandra explain how developers often use Cassandra collections incorrectly when they are not experienced with how the data store operates.
When to use Cassandra and when to steer clear explains the advantages Cassandra provides, such as availability and high throughput on writes (versus reads). The disadvantages are also covered, such as the lack of strong consistency and of typical relational database-style (ACID) transactions, and the difficulty of reading a record when you do not know its primary key. These are common database tradeoffs you need to understand based on your workload and decide upon before you build out your whole data architecture!
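The primary key tradeoff can be pictured with a toy in-memory model. This is only an analogy, not how Cassandra is implemented: the dictionary stands in for a table partitioned by a hypothetical `user_id` key, and all function names are made up for illustration.

```python
from collections import defaultdict

# Toy stand-in for a table partitioned by user_id (hypothetical schema).
partitions = defaultdict(list)


def write(user_id, message):
    # Writes append to the partition that owns the key.
    partitions[user_id].append(message)


def read_by_key(user_id):
    # Key known: only one partition is touched.
    return list(partitions.get(user_id, []))


def read_by_content(text):
    # Key unknown: every partition has to be scanned.
    return [m for msgs in partitions.values() for m in msgs if text in m]
```

In this model `read_by_key` does the same amount of work no matter how many users exist, while the cost of `read_by_content` grows with the total amount of stored data. That is the kind of access pattern to check your workload for before choosing a data store.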
Analyzing Cassandra Performance with Flame Graphs and Garbage Collection Tuning for Apache Cassandra are two posts in a series on how to debug issues in operational Cassandra deployments using appropriate data visualization, especially when the issue is due to the Java Virtual Machine (JVM)'s garbage collection methods.