Caching and Scaling for Web Frameworks (Examples with Django)

Hi! Caching is the most important aspect of scaling, alongside database indices. Done right, either one can bring speed improvements of up to 99%. I've been using Django for over 6 years, and I'll be sharing what I learned when I had to scale past millions of users. The techniques and insights here are by no means limited to Python or Django; they are universal. For this tutorial, Django's built-in in-memory cache will be enough; you don't need to set up a Redis or Memcached server.

You can find all the code I use at https://github.com/EralpB/djangocachingshowcase; the different versions are marked as different releases.

Should you cache?

The downside of caching is that it introduces hard-to-reproduce errors and bugs, and sometimes inconsistencies like stale or already-deleted objects/data. Considering this, I wouldn't keep a cache that only brings a 20% improvement, or one that has a 20% hit rate. I'd aim for 80%.

Is this cache helping?

A cache well designed for one workload might bring the website down under another. This is why, when designing a cache, you must absolutely focus on real use cases and request patterns, not artificial tests.

Example Scenario

Let's start with a bookshop scenario, where we have Book and Author models and two views, BookDetail and AuthorDetail.

from django.db import models

class Author(models.Model):
	name = models.CharField(max_length=255)

class Book(models.Model):
	title = models.CharField(max_length=255)
	author = models.ForeignKey(Author, on_delete=models.CASCADE)
	purchase_count = models.IntegerField()

Imagine that on the BookDetail page we want to show the author's trending/top books as recommendations. The same logic can obviously go on AuthorDetail too. To share this logic between both views, it's best placed as a method on the Author model. You'll see that caching also pushes you toward best practices like DRY (Don't Repeat Yourself).

import time

class Author(models.Model):
	...
	def get_top_books(self):
		time.sleep(5)  # make the calculation slower for the sake of argument
		return list(Book.objects.filter(author=self).order_by('-purchase_count')[:10])

If you don't cache this function, you don't just risk slower response times, which is awful by itself, BUT you also risk clogging all your web workers and returning 504s to every other request. If you have 30 gunicorn workers, your web server can handle 30 simultaneous requests; each request stuck here holds a worker for 5 seconds, so your throughput ceiling drops to 6 requests per second, and if 30 requests hit this function at the same time, your whole website is down for 5 seconds. This can obviously cascade into worse problems.

You can find the source code for this version at this tag; it's marked v0 in Releases.

https://github.com/EralpB/djangocachingshowcase/tree/b93dafaf7f7fd2962334b22e78e0d10e9e14cff0

Loadtesting Version 0 - no caching

I created a load test using Locust (the full script is at the end of this post); it randomly performs one of three actions: query the index page, the author detail page, or the book detail page. You can see the index is super fast, but the other two are very slow, because our function takes so long to calculate.

[Image: load test results, version 0]

There's no need to test this version further; it's expectedly awful.

Step 1: Manual caching, transparent to the caller

Django has excellent cache management utilities, so we will go with those.

import time

from django.core.cache import cache

class Author(models.Model):
	...
	def get_top_books(self):
		cache_key = 'Author:get_top_books:{}'.format(self.id)
		cache_value = cache.get(cache_key)
		if cache_value is not None:
			return cache_value
		time.sleep(5)  # make the calculation slower for the sake of argument
		books = list(Book.objects.filter(author=self).order_by('-purchase_count')[:10])
		cache.set(cache_key, books, 4 * 60 * 60)  # cache for 4 hours
		return books

A couple of things. How do you choose a cache key? I personally go with classname:functionname:extraparameters:objectid, or if it's about a specific task, I start with the task name or filename. The exact scheme doesn't really matter; the only thing that matters is that it's unique. We don't want to store two authors' top books under the same cache key.
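As a minimal sketch of that convention (make_cache_key is a hypothetical helper of my own, not something from the repo):

def make_cache_key(cls_name, func_name, obj_id, *extra_params):
	# join the parts with ':' so keys stay readable in the cache backend
	parts = [cls_name, func_name] + [str(p) for p in extra_params] + [str(obj_id)]
	return ':'.join(parts)

For example, make_cache_key('Author', 'get_top_books', 42) yields 'Author:get_top_books:42', the same key the code above builds by hand.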

What this function does is:

1) Check if any previously cached value exists in the cache backend.
2a) If yes, return it.
2b) If no, calculate the expected value.
3b) Store the calculated value in the cache.
4b) Return the calculated value.

This is transparent to the caller! A very good feature indeed, aside from the cache latency. You can make this one change and suddenly the whole codebase benefits from the improvement; how good is that? :)

Is this cache helpful?

I have no idea. Rule 0 is that caching is about the workload. If an author gets one query every 6 hours, this cache actually degrades performance, because the hit rate is 0, and instead of just calculating, you now also make extra cache database calls. Although... there's still some advantage. It might be slower under the regular workload, but it gives you burst-request or DDoS protection: if an attacker sends requests to this function, all subsequent requests will be very, very fast (a couple of milliseconds instead of 5 seconds). It doesn't even have to be an attack; a viral effect or a marketing campaign could do the same thing at the moment you least expect it.

Another thing to keep in mind is that this cache favors popular authors. If an author gets 1 req/s (request per second), this cache will be very helpful for their page, whereas an author in the long tail will see a performance drop. You will have to weigh the ups and downs to make the final decision, and perhaps change the 4 hours to 24 hours, or to 1 hour. How to keep track of these statistics is another post's topic, but you definitely have to have a feeling for the curve and the request pattern.
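For completeness, here is a minimal sketch of one way to get that feeling, reusing the Django cache itself as a crude hit/miss counter (the stats key naming is my own invention, not from the repo):

from django.core.cache import cache

def record_cache_result(prefix, hit):
	# e.g. prefix='Author:get_top_books'; keeps rough hit/miss totals
	counter_key = '{}:{}'.format(prefix, 'hits' if hit else 'misses')
	try:
		cache.incr(counter_key)
	except ValueError:  # counter doesn't exist yet
		cache.set(counter_key, 1, None)  # timeout=None means never expire

Calling record_cache_result('Author:get_top_books', cache_value is not None) right after the cache.get lets you eyeball the hit rate later by reading the two counters.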

Loadtesting Version 1

[Image: load test results, version 1]

Here you can see that response times are initially high, but they all drop in a short time, and our server can handle many, many more requests per second thanks to caching.

You can find the source code for this version at this tag; it's marked v1 in Releases.

https://github.com/EralpB/djangocachingshowcase/tree/f28e8098074f0b7a65db7c2de7474fafbea77004

Step 2: Fixing stale/non-existing objects

Now, with this version, you might get an unexpected call: some pages not opening, old prices showing, and so on! Can you guess what's wrong? Let's go back to the drawing board:

[Image: drawing board]

Imagine Redis (or whichever backend) is storing the cached array as [Book ID #1, Book ID #5]. The danger is that if an admin deletes Book ID #5, you are suddenly listing a non-existent book in the top-books module. This can create very weird problems; Django might think DB integrity is broken if it cannot resolve a foreign key, and so on. Or you might correct a typo or make an important price change, and the cached value would still be old and stale. The best solution is to cache object IDs and fetch fresh objects from the database using those cached IDs. Almost always, the hard and slow part is executing the logic that decides which objects to show; if you have the IDs ready, fetching from the database takes a couple of milliseconds. Databases are optimized for this!

import time

from django.core.cache import cache

class Author(models.Model):
	...
	def get_top_books(self):
		cache_key = 'Author:get_top_books:{}'.format(self.id)
		cache_value = cache.get(cache_key)
		if cache_value is not None:
			return list(Book.objects.filter(id__in=cache_value))
		time.sleep(5)  # make the calculation slower for the sake of argument
		books = list(Book.objects.filter(author=self).order_by('-purchase_count')[:10])
		# cache.set(cache_key, books, 4 * 60 * 60)  # before: cached the objects themselves
		cache.set(cache_key, [book.id for book in books], 4 * 60 * 60)  # now: cache only the IDs, for 4 hours
		return books

I changed two lines: the first return and the cache-setting line. Now the function executes super fast and no longer has the stale-object or update problem!
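One caveat worth knowing: filter(id__in=cache_value) does not guarantee that rows come back in the cached order, so the top-10 ranking can return shuffled. If order matters on your page, a small fix is to re-sort against the cached ID list, for example with Django's in_bulk (a sketch; the rest of the method stays the same):

# inside get_top_books, instead of the plain id__in filter:
if cache_value is not None:
	books_by_id = Book.objects.in_bulk(cache_value)  # {id: Book}
	# rebuild the list in the cached (ranked) order, skipping deleted books
	return [books_by_id[pk] for pk in cache_value if pk in books_by_id]

As a bonus, the "if pk in books_by_id" check quietly drops books that were deleted after the IDs were cached.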

Loadtesting Version 2

There's no need to load test version 2; we haven't made any performance-related changes. Fetching fresh objects slows the endpoint down by a couple of milliseconds at most.

You can find the source code for this version at this tag; it's marked v2 in Releases.

https://github.com/EralpB/djangocachingshowcase/tree/508653ab990a1cae2095609cfcd595073f361693

Step 3: Making the cache transparent to the developer

Now, the bad thing about this function is that caching logic and application logic are mixed together, and that's never good news in programming, so I will separate the two.

import time

from django.core.cache import cache

class Author(models.Model):
	...
	# the leading _ indicates this function is for internal use only
	def _get_top_books(self):
		time.sleep(5)  # make the calculation slower for the sake of argument
		return list(Book.objects.filter(author=self).order_by('-purchase_count')[:10])

	def get_top_books(self):
		cache_key = 'Author:get_top_books:{}'.format(self.id)
		cache_value = cache.get(cache_key)
		if cache_value is not None:
			return list(Book.objects.filter(id__in=cache_value))
		books = self._get_top_books()
		cache.set(cache_key, [book.id for book in books], 4 * 60 * 60)  # cache for 4 hours
		return books

This looks much better! Imagine a developer is tasked with ordering by like or favorite count instead of purchase count. That developer only needs to inspect two lines of code, and shouldn't have to care about the caching logic at all. This reduces the relevant complexity from 8 lines to 2. In the first step we made caching transparent to the caller; now we've made it transparent to the logic developer.
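As a tiny illustration, assuming a hypothetical favorite_count field on Book (it doesn't exist in the showcase repo), the entire change that developer would make is this; the caching wrapper above stays untouched:

	# favorite_count is a hypothetical field added just for this example
	def _get_top_books(self):
		return list(Book.objects.filter(author=self).order_by('-favorite_count')[:10])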

What’s next?

Our code looks lovely, but one thing is still not ideal: 7 lines of caching overhead per function. If the Author model had 5 functions we wanted to cache, we would be repeating ourselves many times. In part 2 I will be using the library I wrote to put the caching logic into a function wrapper, so you can enable it with 1 line in a very unintrusive way.
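As a rough preview of the idea (this is my own minimal sketch, not the actual library, and it hardcodes Book for brevity):

from functools import wraps

from django.core.cache import cache

def cached_book_ids(timeout):
	def decorator(func):
		@wraps(func)
		def wrapper(self):
			cache_key = '{}:{}:{}'.format(type(self).__name__, func.__name__, self.id)
			ids = cache.get(cache_key)
			if ids is not None:
				return list(Book.objects.filter(id__in=ids))
			books = func(self)  # run the real (slow) logic
			cache.set(cache_key, [book.id for book in books], timeout)
			return books
		return wrapper
	return decorator

class Author(models.Model):
	...
	@cached_book_ids(timeout=4 * 60 * 60)
	def get_top_books(self):
		return list(Book.objects.filter(author=self).order_by('-purchase_count')[:10])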

Loadtesting code:

	from locust import HttpLocust, TaskSet, between

	def get_index(l):
		l.client.get("/")

	def get_author(l):
		l.client.get("/authors/1")

	def get_book(l):
		l.client.get("/books/1")

	class UserBehavior(TaskSet):
		tasks = {get_index: 1, get_author: 1, get_book: 1}

	class WebsiteUser(HttpLocust):
		task_set = UserBehavior
		wait_time = between(5, 10)

To run this, install Locust and then run the command locust -f filename.py; you can then manage your load test in your browser.



Congratulations! Keep caching like a champion, and please follow me on @EralpBayraktar :)