Django Database Performance Tips
I've wanted to share this for some time now, but felt that it would be too large an undertaking. So I will keep it straightforward and avoid diving into metrics: I encourage you to test out the suggestions for your use cases.
Note that I presume familiarity with web development and databases: think of this list as my step-by-step guide for diagnosing an issue, not a primer on database indices or schema design.
Diagnosing
To identify database performance issues, I will first reach for the Django Debug Toolbar. At work, I even created a dummy view where I can plug in scripts and tasks (e.g. non-view-accessible code), just to use the debug toolbar.
Sometimes, however, the overhead of the debug toolbar is too great, and a request won't finish in a reasonable amount of time (or in my lifetime!). In those cases, I use a context manager that prints out the SQL executed inside it. I run this from a Jupyter notebook, which makes it easy to iterate.
In either case, after this step I'll have pinpointed the slowdown to a single slow query or an N+1 issue.
Profiling
In the case of a slow query, the next step is to check the query plan. I use a local copy of pev2, a visualizer for PostgreSQL EXPLAIN output. Follow its instructions to run the query with EXPLAIN - the exact SQL should be available from the previous step. I use DBeaver or pgAdmin to interact directly with the database.
Caveat: using EXPLAIN against a local database may not produce the same plan as in production. In my experience, ensuring that I have a similar amount of data helps, but if I can check in production, that's ideal.
Common Issues
- Missing index: adding one may eliminate a full table scan.
- Too many joins: try `prefetch_related()` instead (see below).
- Slow join / subquery: experiment with rewriting the subquery as a join, or vice versa.
Sometimes, the best solution is to just Write Another Query. It may feel awkward, and I'm sure there's some level of raw SQL-fu that could make it performant in a single query - but I stick to the ORM.
N+1
N+1 is the most common performance issue. To resolve it, I'll reach for `select_related()` and `prefetch_related()`. Simple enough, most of the time, but beware `select_related()`: adding joins can cause a query to slow to a crawl. In such circumstances, I will use `prefetch_related()` instead - yes, even for a "*-to-one" relationship! A second query may be orders of magnitude faster than a join.
Rare: Avoid Loading More Than I Need
Sometimes, I'll find that the problem is that the query is simply loading too much data. Think of what has to happen to get a field from the database to an object in Python: the data is read from disk by the database server, deserialized into memory, serialized over the wire, deserialized again by the database driver in the webserver, and finally loaded into Django ORM objects. The cost adds up.
Experimenting with `only()` and `defer()` can lead to surprising performance improvements.
Caveat: `only()` and `defer()` are always maintenance burdens. Any field that's not initially included will require a new database query to fetch. I've seen it happen repeatedly: a field is added to the serializer but not to `only()` (typically, they're defined in `serializers.py` and `views.py` respectively, so no locality of behavior to help!). Thus the performance boon becomes a performance penalty.
Rare: Don't Pay the De/serialization Cost
Suppose that I am trying to load a significant number of objects from the database. The threshold depends on scale, but let's say about 10k. The price of creating ORM objects can be the difference between finishing the request in 30s or not (yes, still using Heroku, with its 30-second request timeout).
In such a case, dropping to `values()` can be great - the queryset will return dictionaries with the specified properties. The performance gains can be substantial, even orders of magnitude.
Caveat: `values()` is only an option if I don't actually need model object instances.
Summary
Follow these steps:
- Diagnose
- Profile
Then try the common solutions:
- Add missing database indices
- Split one query into multiple
- `select_related()`, `prefetch_related()`
- `only()`, `defer()`
- `values()`