Back when we all used SQL databases, it was common to paginate through large result sets by appending LIMIT <offset>, <rows per page> to a SELECT query. Depending on the schema, data volume and database engine, this was inefficient to varying degrees. On smaller result sets and with the right indexes, it was… possibly OK.

Databases like DynamoDB prevent this inefficiency by handling pagination differently. You can page through a pre-sorted table by selecting a partition and optionally a range within the sort key. After DynamoDB has returned a page of results and there are more to follow, it provides you with LastEvaluatedKey which you can pass to the next iteration of the query as ExclusiveStartKey in order to get the next page.
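
As a minimal sketch of that loop with boto3 (the table name, key name and page size are placeholders, not part of the example that follows):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("comments")  # hypothetical table name
PAGE_SIZE = 20

def fetch_page(partition_value, exclusive_start_key=None):
    kwargs = {
        "KeyConditionExpression": Key("PK").eq(partition_value),
        "Limit": PAGE_SIZE,
    }
    if exclusive_start_key:
        # Resume exactly where the previous page left off.
        kwargs["ExclusiveStartKey"] = exclusive_start_key
    response = table.query(**kwargs)
    # LastEvaluatedKey is absent once there is nothing more to fetch.
    return response["Items"], response.get("LastEvaluatedKey")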

Random access to page n is not possible unless you know the keys that coincide with a page break. You’ve got to fetch the pages in order. Users cannot jump to page 292 or only show the last page. This is entirely acceptable in many cases as modern user interfaces often provide infinite scrolling or a show next 20 link.

As the ExclusiveStartKey is embedded in the URL, the next link is not stable. If the table changes, the link to retrieve page 2 changes. If the results are scanned in reverse (for instance, most recent first), the next link will change every time a new record is added. This generates a lot of URLs, which, although I am not an expert, can hinder SEO efforts on public sites where effective indexing is important.

Some may also say that ?page=2 looks nicer than an encoded exclusive start key (which itself may consist of several key-value pairs, possibly as a base64-encoded structure), but this might just be vanity.

What if, for whatever reason, we wanted to bring back those old-school, things-were-better-back-in-the-old-days page numbers?

Pattern: map exclusive start keys to a numeric index

An exclusive start key is just a map containing the keys needed to resume the query and grab the next n items. It’s a reference point.

Rather than conveying it as part of the URL, we could instead store the key components needed to generate an exclusive start key from a numeric index. A page or skip query parameter would be included in the URL. A lookup for item 20 will yield the keys needed to construct an exclusive start key to skip results by arbitrary intervals.

The drawback of this approach is the added complexity around building, maintaining, storing and serving this numeric index of rows. To avoid a confusing user experience, it is vital that the system of record and numeric index are kept consistent. Allowing users to filter the results will invalidate any pre-calculated page numbers, so additional indexes will need to be maintained. Only low cardinality, coarse filters are likely to be feasible in order to minimize the number of page indexes that need to be built.

The next section details some theoretical (read: half-baked and unproven) approaches. Note that there may be WTFs, misunderstandings or things I have not considered. Consider them to be non-production ready musings.

Approaches

Our example scenario is an e-commerce site that has product pages where users can comment on a product.

The following approaches assume a comments DynamoDB table with the string keys PK, SK and PK2, where PK is a randomly generated comment ID, SK is a datetime of when the comment was posted, and PK2 is a grouping key: the SKU of the product the comment relates to. A global secondary index on PK2, SK is used by the application to show sets of comments in reverse order.

PK      SK                PK2 (GSI PK)      other attributes
abc281  2021-08-25 13:00  DAFT_PUNK_TSHIRT
abd212  2021-08-25 13:30  DAFT_PUNK_TSHIRT
abc912  2021-08-25 13:42  DAFT_PUNK_TSHIRT
ccc232  2021-08-25 13:55  DAFT_PUNK_TSHIRT

Redis sorted sets

The first (and possibly simplest) approach is to load the partition and sort keys into Redis sorted sets. This could use a managed service like Amazon ElastiCache for Redis.

A sorted set provides numeric index-based access to the keys (referred to as the rank of a set member), which can then be used to construct an ExclusiveStartKey to pass to DynamoDB. As the name of the type implies, Redis maintains the ordering (by score, aka creation date) on our behalf.

Assuming the table outlined above, we use PK2 as the Redis key for a sorted set, PK as the member and SK (converted to UNIX time) as the score.

For example, ZADD <PK2> to_unixtime(<SK>) <PK> would be sent to Redis by a Lambda function connected to a DynamoDB Stream on the table (it would also need to send ZREM/ZADD to handle any deletions and changes).
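
A hedged sketch of that Lambda function, assuming the redis-py client, a placeholder ElastiCache endpoint and a stream view type that includes both old and new images:

import os
from datetime import datetime, timezone

import redis  # redis-py; the endpoint below is a placeholder

r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

def to_unixtime(sk):
    # SK is stored as "YYYY-MM-DD HH:MM"; treat it as UTC to get a numeric score.
    return int(datetime.strptime(sk, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc).timestamp())

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # ZADD <PK2> <score> <PK>; a MODIFY that changes PK2 or SK would
            # also need a ZREM against the old image.
            r.zadd(image["PK2"]["S"], {image["PK"]["S"]: to_unixtime(image["SK"]["S"])})
        elif record["eventName"] == "REMOVE":
            image = record["dynamodb"]["OldImage"]
            r.zrem(image["PK2"]["S"], image["PK"]["S"])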

To get the exclusive start key for any page, the Redis command ZREVRANGE <PK2> <start> <end> WITHSCORES, where start and end are both the index of the item to retrieve the keys of, is sent to Redis. This yields a list response in which element 0 is <PK> and element 1 is the score, which should be converted back from UNIX time to the <SK> datetime string. This is all that is needed to construct an ExclusiveStartKey.

It is possible to get the total cardinality for a grouping key with ZCARD <PK2>, which is needed to calculate the total number of pages.
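
Putting the two commands together, a rough sketch of the lookup, again assuming redis-py, a placeholder endpoint and a page size of 20:

import math
import os
from datetime import datetime, timezone

import redis

r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)
PAGE_SIZE = 20  # assumed page size

def exclusive_start_key(pk2, page):
    if page <= 1:
        return None  # the first page needs no ExclusiveStartKey
    # The exclusive start key is the last item of the previous page, newest first.
    rank = (page - 1) * PAGE_SIZE - 1
    member, score = r.zrevrange(pk2, rank, rank, withscores=True)[0]
    sk = datetime.fromtimestamp(int(score), tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
    # A query against the PK2, SK GSI needs both the table keys and the index keys.
    return {"PK": member.decode(), "SK": sk, "PK2": pk2}

def total_pages(pk2):
    return math.ceil(r.zcard(pk2) / PAGE_SIZE)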

Storing a large number of sorted sets with millions of members could get expensive due to how a sorted set is implemented by Redis: a map and a skip list. It is also quite annoying to have to pay for a lot of RAM for items that won’t be frequently accessed.

However, this may be a reasonable trade-off, as it is a very simple solution that is likely to deliver predictable, consistently high performance.

Relational

The sorted sets approach could also be achieved with a relational database. This could use a managed service like Amazon RDS in its various flavours.

The sorted sets would live in a single table with a covering index on PK2 ASC, SK DESC. Instead of a ZREVRANGE Redis command, SELECT PK, SK FROM pages WHERE PK2=? ORDER BY SK DESC LIMIT n, 1 would be used. Despite using an offset-based LIMIT, performance is expected to be reasonable due to the small row size. Instead of ZCARD, a SELECT COUNT(*) FROM pages WHERE PK2=? query would be used, but it would be worth understanding its performance characteristics, despite an index being present.
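
As a rough sketch of this, with the sqlite3 module standing in for whichever engine RDS would be running (the table name, index name and page size are made up; PK is included in the index to make it covering):

import sqlite3

conn = sqlite3.connect("pages.db")  # stand-in for an RDS connection
conn.execute("CREATE TABLE IF NOT EXISTS pages (PK TEXT, SK TEXT, PK2 TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS pages_pk2_sk ON pages (PK2 ASC, SK DESC, PK)")

def exclusive_start_key(pk2, page, page_size=20):
    if page <= 1:
        return None
    offset = (page - 1) * page_size - 1  # the last row of the previous page
    pk, sk = conn.execute(
        "SELECT PK, SK FROM pages WHERE PK2=? ORDER BY SK DESC LIMIT 1 OFFSET ?",
        (pk2, offset),
    ).fetchone()
    return {"PK": pk, "SK": sk, "PK2": pk2}

def total_pages(pk2, page_size=20):
    (count,) = conn.execute("SELECT COUNT(*) FROM pages WHERE PK2=?", (pk2,)).fetchone()
    return -(-count // page_size)  # ceiling division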

A similar Lambda function would keep this table in-sync with the DynamoDB table.

Files on disk, EBS, EFS or even S3

If you don’t want to run Redis or a relational database, the bang for buck option is good ol' files. There are several options here ranging from very fast instance-connected SSDs to EBS, EFS and object store services like S3.

This approach works by defining a fixed size C-style structure and writing the corresponding bytes to a file. Random access is made possible by calculating the offset within the file based on the consistent size of a structure. You can then seek to the relevant offset (calculated from the index * SIZE) and read that number of bytes. Alternatively you can read the most recent by seeking to the end and reading backwards with a negative offset.

With this pattern, the grouping key PK2 is used to name the file. If a lot of keys are expected, a small optimisation would be to shard the keys into a fixed number of subdirectories. As with the prior approaches, a Lambda function that consumes a DynamoDB (or Kinesis) stream would write to these files.

Assuming a 24-character value for PK, and SK converted to an integer UNIX time, the code to read an entry at a given index would be something along these lines.

import struct

STRUCT_DEF = "24s i"
SIZE = struct.calcsize(STRUCT_DEF) # 28 bytes (24+4)

PK2 = "DAFT_PUNK_TSHIRT"

with open(f"{PK2}.pag", "rb") as file:
    file.seek(SIZE * index) # zero indexed; each entry occupies SIZE bytes
    values = struct.unpack(STRUCT_DEF, file.read(SIZE))
    print(f"PK: {values[0]}, SK: {values[1]}, PK2: {PK2}")

This is likely to perform well with low latency on EC2 with instance storage or EBS. Non-scientific tests showed it worked far better than expected on an EFS mount in a Lambda function. If some additional latency can be tolerated, the bargain basement solution is to read just the individual record out of the much larger index object stored on S3. This is achieved by passing a range header, similar to the file offset above.

import boto3

page_index = boto3.resource('s3').Object('pagination-indexes', f"{PK2}.pag")
start = SIZE * index
end = start + SIZE - 1 # S3 byte ranges are inclusive
res = page_index.get(Range=f"bytes={start}-{end}")
values = struct.unpack(STRUCT_DEF, res["Body"].read())
print(f"PK: {values[0]}, SK: {values[1]}, PK2: {PK2}")

A crude way to find the number of entries is to divide the file size by the struct size. This is an inexpensive operation with some pitfalls that are beyond the scope of this write-up. An alternative approach would be to maintain a header struct, again of fixed size, at the head of the index file.
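
A minimal sketch of that division, reusing the SIZE constant from the snippet above and assuming a page size of 20:

import math
import os

def entry_count(pk2):
    # Every entry is exactly SIZE bytes, so the file length gives the count.
    return os.path.getsize(f"{pk2}.pag") // SIZE

def total_pages(pk2, page_size=20):
    return math.ceil(entry_count(pk2) / page_size)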

So, we’ve established that the read path is simple and fast. The write path is more complex. The model of appending bytes to a file does not work if we want to maintain order and cannot say with 100% certainty that records won’t appear out of order. Perhaps strict ordering is not necessary, but it would be confusing to have a comment from 2018 appearing alongside one from 2021. Additionally, as the index increases in size, it will need to be rewritten in sorted order. If S3 is being used as a storage backend, this would mean a PutObject of several megabytes to add a single entry. The shape of your workload will dictate whether or not this is reasonable.

A simple remedy is not to write to the index file directly at all, instead sending the change commands to an append-only write-ahead log (WAL). At a timed interval (or when the WAL reaches a certain size), a commit process could run, apply these changes to the ordered index file and discard the WAL. This reduces the amount of work needed to sort and rewrite the entire index, particularly if the indexes are stored on networked storage or S3. Deletions are fast in Redis because the member value is also indexed, which adds to the in-memory storage footprint of a sorted set. This file-based approach does not bother to do that; members marked for deletion are simply skipped when the file is rewritten. This keeps the files smaller to transmit and rewrite.
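
A rough, assumption-heavy sketch of such a commit step. The WAL format here is imagined as tab-separated ADD/DEL lines carrying the PK and UNIX-time SK, and the index is the same fixed-size struct file as before:

import os
import struct

STRUCT_DEF = "24s i"
SIZE = struct.calcsize(STRUCT_DEF)

def commit(pk2):
    # Load the existing index into memory as {padded PK bytes: unix-time SK}.
    entries = {}
    if os.path.exists(f"{pk2}.pag"):
        with open(f"{pk2}.pag", "rb") as index:
            while chunk := index.read(SIZE):
                pk, sk = struct.unpack(STRUCT_DEF, chunk)
                entries[pk] = sk

    # Apply the staged changes; deleted members are simply dropped.
    with open(f"{pk2}.wal") as wal:
        for line in wal:
            op, pk, sk = line.rstrip("\n").split("\t")
            key = pk.encode().ljust(24, b"\x00")
            if op == "DEL":
                entries.pop(key, None)
            else:
                entries[key] = int(sk)

    # Rewrite the index in sort-key order, then discard the WAL.
    with open(f"{pk2}.pag.tmp", "wb") as out:
        for pk, sk in sorted(entries.items(), key=lambda e: e[1]):
            out.write(struct.pack(STRUCT_DEF, pk, sk))
    os.replace(f"{pk2}.pag.tmp", f"{pk2}.pag")
    os.remove(f"{pk2}.wal")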

The clear cost to this approach is that changes won’t immediately appear.

If a degree of latency is acceptable, this is not a bad trade-off. A complementary hack would be to not consult the pagination index at all when querying the first n pages, and simply limit in your client. For example, instead of setting the DynamoDB query Limit to 20, set it to 200 and take a slice of the returned items to deliver up to page 10. This will increase read costs but ensures the newest comments are always visible, with the potentially acceptable risk of some comments being shown again on page 11.
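
A sketch of that over-fetch, assuming the boto3 table resource and a hypothetical GSI named comments-by-product over PK2, SK:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("comments")  # table name assumed

def early_page(pk2, page, page_size=20):
    # Over-fetch the first 10 pages in one query and slice locally.
    response = table.query(
        IndexName="comments-by-product",   # hypothetical GSI name
        KeyConditionExpression=Key("PK2").eq(pk2),
        ScanIndexForward=False,            # newest comments first
        Limit=page_size * 10,
    )
    start = (page - 1) * page_size
    return response["Items"][start:start + page_size]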

The index files can be generated in any language. In Python, the struct module is one way to achieve this - likewise in Go, the encoding/binary package and ordinary Go structs work as you would expect (although check how much reflection is being used to decode the structs: avoiding this can make a staggering difference to performance.)

This portability provides interesting options for backfilling indexes as an independent batch process, for instance with Apache Spark or Apache Beam. Data from an operational store or data warehouse could be used to cheaply generate the index files in parallel. Changes happening beyond what is stored within the batch source would fill the write-ahead logs. Once the batch operation is complete, the commit process discussed above can be enabled to catch up the indexes.

There will probably be other edge cases and discussions to be had around locking. It’s important not to end up writing your own database here, and it would also be a bad idea to allow consumers even read-only access to the files. As this is a fairly low-level approach, an API or Lambda function should be considered as the interface to the files.

Despite the odd looks you will probably get for suggesting this as a general approach, I quite like it for its simplicity, low cost, portability and high performance.

An interesting hybrid of this and the previously discussed relational approach would be to use SQLite as an alternative to fopen. Instead of dropping structs into a file, a SQLite database file would be used, providing a stable file format and the beautiful, highly performant SQLite engine for ordering. The storage footprint, due to the covering index, is likely to be much larger.

Sorted sets on disk with a Redis interface

I took the file system approach a bit further and built a RESP service that implemented a few commands, including ZCARD, ZRANGE, ZADD and ZREM. This worked pretty well on a t4g.micro EC2 instance. The ZADD and ZREM commands stage the changes to a separate file, as discussed above. A custom command, ZCOMMIT, can be issued for a sorted set key. This will apply any pending changes to the index. When committing a few changes to an index file containing 2.3m items, the rewrite took about 1.3 seconds. Too expensive to run on every change, but if rewrites are only done occasionally, this could be acceptable in some scenarios. Each sorted set has its own index file, so a large number of indexes could be rebuilt in parallel with no other state or coordination required.

The redis-benchmark tool showed about 60000 rps (without pipelining) for the commands besides ZCOMMIT. This is obviously much slower than proper Redis, but acceptable.

RESP service implementation alongside the standard Redis client

There is nothing clever going on here. No log-structured merge trees or efficient sorting algorithms, just a Frankenstein of the Go standard library with some RESP help from the excellent tidwall/redcon.

If you don’t mind your sorted sets being occasionally out of date, you can host a lot of keys on disk using a very cheap EC2 instance. The set with 2.3m members occupied less than 65MB of disk space. This will cost far, far less than keeping them all in RAM.

As ever, it depends on whether your workload can tolerate these delayed commits and what the risk profile is with regard to potential data loss. There are other downsides - you will need to pick a file system that can efficiently store a large number of small files. Use of a SQLite container file (or a small number of them, each representing a shard) may actually work out better. Perhaps something like RocksDB could be used to store the structs. Then there are considerations about how to run multiple replicas, which strays into writing-a-sketchy-database-of-our-own territory. Nevertheless, I think it’s a very interesting area to continue some research in.

DynamoDB

Instead of adding another data store, it is possible to stamp a numeric page marker attribute onto every nth item in the table, with an ascending page number. The oldest record lives on page 1.

A sparsely populated GSI would use this attribute as its sort key (plus other keys) so that only page start items are included.

Page marker index

PK      SK                PK2 (GSI PK)      page (GSI SK)
abc912  2021-08-25 13:42  DAFT_PUNK_TSHIRT  2
ccc232  2021-08-25 14:55  DAFT_PUNK_TSHIRT  3

A partition header item contains the current page number (ascending) and running count of items remaining for the current page.

Table

PK     SK                current_page  remaining  card
STATS  DAFT_PUNK_TSHIRT  3             2          7

When an item is added, the above STATS item is consulted. If it flows onto the next page, a page marker attribute is added to it and values within the STATS item for that product are incremented/reset within a transaction.
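
As a sketch only, that write path might look something like this with the low-level client. The page size, table name and the lost-update condition check are all assumptions, and retry handling is omitted:

import boto3

ddb = boto3.client("dynamodb")
PAGE_SIZE = 20  # assumed page size

def add_comment(pk, sk, sku):
    stats = ddb.get_item(
        TableName="comments",  # table name assumed
        Key={"PK": {"S": "STATS"}, "SK": {"S": sku}},
    ).get("Item", {"current_page": {"N": "0"}, "remaining": {"N": "0"}, "card": {"N": "0"}})

    new_page = int(stats["remaining"]["N"]) == 0
    current_page = int(stats["current_page"]["N"]) + (1 if new_page else 0)
    remaining = PAGE_SIZE - 1 if new_page else int(stats["remaining"]["N"]) - 1

    item = {"PK": {"S": pk}, "SK": {"S": sk}, "PK2": {"S": sku}}
    if new_page:
        # This comment starts a new page, so it carries the sparse GSI sort key.
        item["page"] = {"N": str(current_page)}

    ddb.transact_write_items(TransactItems=[
        {"Put": {"TableName": "comments", "Item": item}},
        {"Update": {
            "TableName": "comments",
            "Key": {"PK": {"S": "STATS"}, "SK": {"S": sku}},
            "UpdateExpression": "SET #cp = :cp, #rem = :rem, #card = :card",
            # Reject the transaction if another writer moved the counters first.
            "ConditionExpression": "attribute_not_exists(#rem) OR #rem = :prev",
            "ExpressionAttributeNames": {"#cp": "current_page", "#rem": "remaining", "#card": "card"},
            "ExpressionAttributeValues": {
                ":cp": {"N": str(current_page)},
                ":rem": {"N": str(remaining)},
                ":card": {"N": str(int(stats["card"]["N"]) + 1)},
                ":prev": {"N": stats["remaining"]["N"]},
            },
        }},
    ])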

To paginate in reverse sort order (for instance, latest items first), get the PK: STATS, SK: DAFT_PUNK_TSHIRT item to find the current page. Assuming there are 10 pages and page 3 is requested: (10+1)-3 = 8, leading us to key PK2: DAFT_PUNK_TSHIRT, page: 8 on the page marker index. This item is retrieved to form an exclusive start key to be used in a query.
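
A sketch of that reverse lookup, with a made-up name for the page marker GSI:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("comments")  # table name assumed

def reverse_page_start_key(sku, requested_page):
    stats = table.get_item(Key={"PK": "STATS", "SK": sku})["Item"]
    forward_page = int(stats["current_page"]) + 1 - requested_page
    # Fetch the page marker item from the sparse GSI on PK2, page.
    marker = table.query(
        IndexName="page-markers",   # hypothetical GSI name
        KeyConditionExpression=Key("PK2").eq(sku) & Key("page").eq(forward_page),
    )["Items"][0]
    # The marker item supplies the keys needed for the ExclusiveStartKey.
    return {"PK": marker["PK"], "SK": marker["SK"], "PK2": sku}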

Handling changes other than serial appends economically is the big challenge here. If an item needs to be removed, subsequent page markers need to be updated. Depending on table size, this could result in a large number of operations and therefore increase read/write costs.

You may wonder why the oldest comments live on page 1 and the newest live on the highest page number. This is because our access pattern states that we must show the most recent comments on the first page, so we would be continually updating page markers if comments were being added to what the index considers to be page 1. In other words, we are optimising for changes being more likely in newer items.

This approach may only be appropriate for slow moving data, when changes are very rare or cannot happen, append only data, or when the table is materialised from scratch from another source.

Conclusion

When something seemingly simple appears convoluted to achieve with your current technology choices, you’ve got to consider whether it is a good return on investment and wise to even try to make it work.

Perhaps this is a solved problem in some database you don’t use but maybe should do. Maybe you just need to use whatever you are already using correctly. In this age of polyglot persistence, Kafka and so on, data has become liberated and can be streamed into multiple stores, each filling a particular niche. However, this is still operational overhead.

Before making the leap, consider whether the approaches discussed in this post are a case of YAGNI. In your context, is it really the best user experience to present users with page 1 of 392716? Must they be able to randomly jump to page 392701? Could your user interface slim down the result set more intuitively, so that using your application is less database-y? For example, infinite scrolling (think Twitter) is simpler for the user and seems more native these days. “Guys, we’re doing pagination wrong” is a great post that delves into the details further.

We don’t live in a one-size-fits-all world and sometimes creative solutions cannot be avoided. Workloads have varying levels of tolerance to eventual consistency and degrees of acceptable correctness. I’d be interested to hear thoughts on these approaches and whether you’ve solved this problem in a similar or entirely different way.
