Faster SQLAlchemy with Memcached

SQLAlchemy provides a great Pythonic way of interacting with SQL; however, it does introduce some overhead that can slow down your application, especially if you're used to the speed of raw SQL. Read on to see how I combined SQLAlchemy and Memcached to retrieve up-to-date results quickly.

Background

Very shortly I will be launching an update to Deal Shrimp that changes the data store to PostgreSQL. Deal Shrimp currently stores all of its data in MongoDB, which itself replaced PostgreSQL back in the early days of the site.

This post isn't about Postgres vs Mongo or the reasons for choosing a document store over a relational database, but to sum up the Deal Shrimp history in a few sentences: I switched to Mongo soon after learning about it, as the schemaless storage seemed like magic. Its speed was also a huge bonus; in fact, it was fast enough that I no longer needed Memcached. Document stores, however, can be difficult or time-consuming to query unless you've stored the data in exactly the way you want to retrieve it, and wanting the flexibility to extend the types of data stored led me back to PostgreSQL.

Mongo is Fast

The switch to PostgreSQL has been feature-complete for a couple of weeks now, but there's been one thing holding me back from releasing it: speed. Pages just didn't render as quickly when pulling data from PostgreSQL. As a starting point, here's a simple ab test retrieving the main page from dealshrimp.co.nz (currently running MongoDB):

$ ab -n 10 -c 1 http://dealshrimp.co.nz/
…
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       24   30   4.2     29      37
Processing:   503  744 252.6    640    1200
Waiting:       96  138  52.3    119     248
Total:        531  774 253.1    676    1224
…

(c=1 is chosen to test a single page render at a time and should provide consistent results)

The median time to download the main page is 676ms.

Compare that to the same test against the development version of Deal Shrimp, running PostgreSQL:

$ ab -n 10 -c 1 http://127.0.0.1/
…
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   3.1      0      10
Processing:   712 1053 412.0    943    2034
Waiting:      706 1046 411.7    937    2027
Total:        712 1054 411.3    943    2035
…

A median of 943ms, which is not good considering that this is a development server with exactly one user.

Finding bottlenecks by profiling

There are a few profiling middlewares for Django available online; the one I used came from djangosnippets (http://www.djangosnippets.org/snippets/186/). It uses the Python hotshot module to display the number of calls to each function, the total time for those calls to execute, and the average time per call. The problem presents itself immediately:

ncalls     tottime  percall  cumtime  percall filename:lineno(function)
9890/3348    0.083    0.000    0.239    0.000 site-packages/sqlalchemy/orm/loading.py:323(_instance)

The SQLAlchemy ORM was making a lot of calls (over 9000) to instantiate Python objects from the data retrieved from PostgreSQL. Compare that with how long it takes to execute the query itself:

ncalls     tottime  percall  cumtime  percall filename:lineno(function)
…
        1    0.051    0.051    0.051    0.051 site-packages/sqlalchemy/engine/default.py:323(do_execute)

51ms for PostgreSQL to execute the SELECT vs 239ms for SQLAlchemy to turn the result into Python objects.
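As an aside, the linked snippet relies on hotshot, which no longer ships with Python 3; a roughly equivalent old-style Django middleware can be sketched with cProfile instead (illustrative only, not the snippet itself):

import cProfile
import io
import pstats

class ProfileMiddleware(object):
    def process_view(self, request, view_func, view_args, view_kwargs):
        # Profile the view call itself when ?prof is in the query string.
        if 'prof' in request.GET:
            self.profiler = cProfile.Profile()
            return self.profiler.runcall(view_func, request, *view_args, **view_kwargs)

    def process_response(self, request, response):
        if 'prof' in request.GET and hasattr(self, 'profiler'):
            out = io.StringIO()
            # Prints the same ncalls/tottime/cumtime columns shown above.
            pstats.Stats(self.profiler, stream=out).sort_stats('cumulative').print_stats(30)
            response.content = '<pre>%s</pre>' % out.getvalue()
        return response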

The first step in reducing this time might be to reduce the number of objects being instantiated, which first means understanding why there are so many of them in the first place. At the time of writing there are 675 deals in the database, which might suggest something is going wrong to cause so many objects to be created. However, a deal can be available in multiple regions. Instead of storing the Deal multiple times (once per Region), Deal Shrimp (in true relational database fashion) has a many-to-many relationship between Deal and Region, via a table that also stores the URL for the deal in that region.

Super-complicated many-to-many Deal/Region relationship.

A count() of the DealRegion table reveals that there are 3194 Deal-Region relationships. Without digging too deep, 9890 _instance calls doesn't sound too bad to instantiate 675 Deals, 3194 DealRegions, and a number of related Site and Region objects.
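For reference, that relationship corresponds roughly to an association-object mapping like the one below (column names are illustrative guesses, and the Site model is omitted):

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Deal(Base):
    __tablename__ = 'deal'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    active = Column(Boolean, default=True)
    # one DealRegion row per region the deal is available in
    regions = relationship('DealRegion', backref='deal')

class Region(Base):
    __tablename__ = 'region'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class DealRegion(Base):
    # Association object: one row per (Deal, Region) pair, carrying the
    # region-specific URL for that deal.
    __tablename__ = 'deal_region'
    deal_id = Column(Integer, ForeignKey('deal.id'), primary_key=True)
    region_id = Column(Integer, ForeignKey('region.id'), primary_key=True)
    url = Column(String)
    region = relationship('Region')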

Caching SQLAlchemy objects

A good place to start speeding up page retrieval seemed to be caching the results of this query. Enter Memcached and dogpile.cache (the recommended package for caching with SQLAlchemy). dogpile.cache provides cache_on_arguments, a decorator that automatically caches a method's return value based on the arguments it was given.
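Setting up a Memcached-backed region and decorating a query function looks roughly like this (the backend choice, expiry time and the active_deals function are illustrative assumptions, not Deal Shrimp's actual setup):

from dogpile.cache import make_region

# Region backed by a local Memcached instance; entries expire after an hour.
cache_deals_region = make_region().configure(
    'dogpile.cache.memcached',
    expiration_time=60 * 60,
    arguments={'url': '127.0.0.1:11211'},
)

@cache_deals_region.cache_on_arguments()
def active_deals(limit):
    # dogpile builds the cache key from the function name and its positional
    # arguments, so repeated calls with the same limit are served straight
    # from Memcached. `session` and `Deal` are the application's SQLAlchemy
    # session and model.
    return session.query(Deal).filter(Deal.active == True)[:limit]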

For testing, I created a profilable script that retrieves all of the active deals. Without caching, the script takes 611ms to run.

Adding the cache_on_arguments decorator to retrieve the deals from the cache instead dropped execution time to 242ms. Nice! There are two problems with this caching strategy though. The first becomes obvious when examining the signature of the deal retrieval function:

    def deals(self, user=None, from_date=None, limit=None, region=None, before_date=None, tags=None)

It is hard to generate cache keys that apply to many different requests based on this method's arguments, and the from_date argument is the main culprit. One of Deal Shrimp's favourite features is that it remembers the last time you visited the site and then only shows you deals found or updated since then. Because of the huge range of values from_date can take, caching on it is somewhat pointless. I contemplated chunking from_date into, say, two-hour increments (sketched below), but it seemed like a lot of work and just not right. I also played with the idea of retrieving all deals from Memcached and then filtering in Python, but that just seemed evil.
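For what it's worth, the chunking idea would have looked something like this (a hypothetical sketch; chunked_from_date is not a real Deal Shrimp function):

def chunked_from_date(from_date, chunk_hours=2):
    # Round the timestamp down to a two-hour boundary so that many similar
    # requests share a single cache key, at the cost of returning slightly
    # more deals than strictly necessary.
    # e.g. deals(user=some_user, from_date=chunked_from_date(some_user.last_visit))
    return from_date.replace(
        hour=from_date.hour - from_date.hour % chunk_hours,
        minute=0, second=0, microsecond=0,
    )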

The second issue with this caching strategy is revealed when examining the profiling data.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3882    0.106    0.000    0.159    0.000 site-packages/sqlalchemy/orm/state.py:232(__setstate__)

A large portion of the time is spent recreating the SQLAlchemy ORM objects after retrieving them from Memcached (__setstate__ is called when unpickling). The bottleneck here is the Python code itself, so I needed a caching solution that both:

  • let me retrieve up-to-date data from the database without caching on dumb values; and
  • was quick at getting the data I needed into Python objects.

SQLAlchemy ORM without the O

Avoiding SQLAlchemy's ORM entirely wasn't something I wanted to do, as converting my queries to raw SQL would have been a lot of work, and the initial profiling showed that the time taken to actually execute the query was relatively low (51ms). It turns out that an ORM query can be modified with the with_entities method, which takes the columns to return and causes the query to yield plain tuples instead of ORM objects. A couple of adjustments to my deals getter method gave me this:

    def deals(self, user=None, from_date=None, limit=None, region=None, before_date=None, tags=None, ids_only=False):
        deal_query = self.session.query(Deal).filter(Deal.active == True)
        …
        # snip out some query filtering stuff
        …
        if ids_only:
            # return just the id column as tuples instead of full Deal objects
            deal_query = deal_query.with_entities(Deal.id)
        deals = deal_query[:limit]

        return deals

My test script, returning only IDs, now executes in just 65ms! This solves the first issue: caching on arguments isn't necessary when it's fast enough to execute the query and retrieve real-time results. Note that with_entities(Deal.id) returns one-element result tuples rather than bare integers, which is why the code below pulls the ID out of each result with _id[0]. However, after getting the IDs, the full deal records still need to be retrieved from the database or the cache. Something like this should do the trick:

    def deals_for_ids(self, id_list):
        keys_list = [self.cache_key_for_deal_id(_id[0]) for _id in id_list]
        cached_deals = cache_deals_region.get_multi(keys_list)  # get_multi returns a list of values, NO_VALUE for misses

        missing_deals = []
        missing_deals_original_index = []

        for index, deal in enumerate(cached_deals):
            if deal == NO_VALUE:
                # build a list of deals that weren't cached and retrieve from DB
                missing_deals.append(id_list[index][0])  # unwrap the one-element id tuple
                missing_deals_original_index.append(index)

        if missing_deals:
            uncached_deals = list(self.session.query(Deal).filter(Deal.id.in_(missing_deals)))

            for index, deal in enumerate(uncached_deals):
                cache_deals_region.set(self.cache_key_for_deal_id(deal.id), deal)
                original_index = missing_deals_original_index[index]
                cached_deals[original_index] = deal

        return cached_deals

Looks good, right? Wrong!

Go native

The above code would still have to unserialize a number of SQLAlchemy objects after retrieving the binary data with get_multi. I didn't profile the above code because, to be honest, foreseeing the unserializing issue, I didn't actually write it. What I chose to do instead was create a dictionary representation of each Deal before storing it in the cache, and then retrieve that instead. Being made of native Python types, it is quick to unserialize.

The final retrieval methods look like this:

    def deals_for_ids(self, id_list):
        keys_list = [self.cache_key_for_deal_id(_id[0]) for _id in id_list]
        cached_deals = cache_deals_region.get_multi(keys_list)

        missing_deals = []
        missing_deals_original_index = []

        for index, deal in enumerate(cached_deals):
            if deal == NO_VALUE:
                missing_deals.append(id_list[index][0])
                missing_deals_original_index.append(index)

        if missing_deals:
            uncached_deals = self.query_deals_for_ids(missing_deals)

            for index, deal in enumerate(uncached_deals):
                original_index = missing_deals_original_index[index]
                cached_deals[original_index] = deal

        return cached_deals


    def query_deals_for_ids(self, id_list):
        deals = [deal.public_dict for deal in self.session.query(Deal).filter(Deal.id.in_(id_list))]
        cache_deals_region.set_multi({DealManager.cache_key_for_deal_id(deal['id']): deal for deal in deals})

        return deals
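
The public_dict property used in query_deals_for_ids isn't shown here; on the illustrative Deal model sketched earlier it might look something like this (the fields included are assumptions, not Deal Shrimp's real schema):

    @property
    def public_dict(self):
        # A plain dict of just the fields the page needs; native types are
        # far cheaper to unpickle than full ORM instances.
        return {
            'id': self.id,
            'title': self.title,
            'active': self.active,
        }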

These retrieval methods would then be used something like this:

deal_ids = deal_manager.deals(user=some_user, from_date=some_user.last_visit, ids_only=True)
deals = deal_manager.deals_for_ids(deal_ids)

Results

When the Deal dictionaries aren’t cached, my test script takes 632ms to execute - about the same amount of time as my original implementation. With caching turned on, retrieving the list of deals takes just 101ms - approximately a 6x speedup over the original execution time.

This performance increase is reflected in the ab results:

$ ab -n 10 -c 1 http://127.0.0.1/
…
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   222  232  17.4    227     280
Waiting:      216  227  17.6    223     276
Total:        222  232  17.4    227     280
…

With a median download time of 227ms, the page is now being delivered about four times faster than it was initially. Result!

Without caching it takes about 2.6 seconds to build the page, so the first visitor to the site after a deal update might see some slowdown. Luckily, this can largely be avoided because Deal Shrimp is basically read-only from the user's point of view: the Deal Reader is the only thing that writes data, it knows which deals have been added, removed or updated, and it can flush and repopulate the cache accordingly (even going as far as pre-rendering each deal's HTML block, which is also cached), meaning we can get close to a 100% cache hit rate. A sketch of that refresh step follows.
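
A hypothetical sketch of that refresh step, reusing the key and dictionary helpers from above (refresh_deal_cache is not Deal Shrimp's actual function):

def refresh_deal_cache(updated_deals, removed_deal_ids):
    # Re-cache dictionaries for deals that were added or changed...
    cache_deals_region.set_multi({
        DealManager.cache_key_for_deal_id(deal.id): deal.public_dict
        for deal in updated_deals
    })
    # ...and drop entries for deals that no longer exist.
    cache_deals_region.delete_multi(
        [DealManager.cache_key_for_deal_id(_id) for _id in removed_deal_ids])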

Conclusions

  • Profile first. Make sure you’re making enhancements in the right place.
  • Your SQL database is fast. It can take a long time for SQLAlchemy to build instances.
  • Pickling can take a long time; storing native Python types speeds things up.
  • Having SQLAlchemy return just IDs and pulling the data from Memcached is a good way of combining up-to-date results with fast retrieval.
