Paging with the Bucket Pattern - Part 1

Justin LaBreck

#data model

Have you ever noticed that moving through pages of results slows down as you view higher and higher page numbers? Application designers deal with paging frequently, yet this problem persists. So what's the solution? Use a flexible, easy-to-use data model. MongoDB provides powerful ways of modeling data that make paging fast and efficient. Today, we're going to explore paging through large amounts of data quickly and easily.

First, we need to understand the problem. Any time a full data set doesn't fit on a screen, paging is necessary. Most developers limit a display to 20, 50, or 100 items before requiring a "next" button. The most common implementation of paging uses sort, skip, and limit at the database level. But "skip and limit" has a problem.

Why do page loads slow down as the page number grows? It's a common problem caused by the way skip and limit works. Imagine a web page of a customer's stock trades that displays 1000 of the most recent trades per page. Querying a history collection to generate the trades list works like this:

db.history.find({ customerId: 7000000 })
          .sort({ date: 1 })
          .skip(0)
          .limit(1000)

The database quickly uses the index { customerId: 1, date: 1 } to find 1000 documents and returns 1000 documents. Everything is sorted by date. Simple enough.

The next page works similarly, except instead of skipping 0, we skip 1000. The database happily finds 2000 documents and returns 1000. Wait ... the database finds 2000? Yes, it finds 2000 documents to return 1000. That's how skip and limit works. Imagine viewing page 5000. That's skip 5,000,000 and limit 1000. The database must find 5,001,000 documents to return 1000 documents. No wonder it takes so long! There is a better way.
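
For illustration, the page-5000 query under skip and limit would look like this, following the numbers above:

// Page 5000: the server must examine 5,001,000 documents to return 1000.
db.history.find({ customerId: 7000000 })
          .sort({ date: 1 })
          .skip(5000000)
          .limit(1000)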

Skipping documents is time consuming, so the obvious remedy is to not skip at all. Remember that first page we loaded? We retrieved 1000 results and prepared them for display, iterating through 1000 documents, each of which carries a date. Conveniently, we also sorted by date. By holding on to the last date we displayed (via a session variable or query string, for example), we can modify the query so that no skipping is required.

db.history.find({
    customerId: 7000000,
    date: { $gt: ISODate("2018-07-21T12:01:35") }
})
.sort({ date: 1 })
.limit(1000)
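
Where does that date come from? Here's a minimal shell sketch of rendering page one and capturing the value (the page and lastDate variables are hypothetical; in practice the value would travel via a session variable or query string):

// Render page one and remember the last date displayed.
var page = db.history.find({ customerId: 7000000 })
                     .sort({ date: 1 })
                     .limit(1000)
                     .toArray();
var lastDate = page[page.length - 1].date;  // feeds the next query's $gt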

This second query involves no skipping and efficiently uses our index. That's an improvement! But we still have a problem.

Viewing page 5000 is significantly faster using this method, but we have no way to jump directly from page 1 to page 5000. Why? This method modifies the query itself to find results quickly. It requires tracking the last result from the previous page to modify the query. Displaying the documents on page 5000 requires loading the last document from page 4999, which requires loading the last document from page 4998, which requires loading the last document from page 4997, and so on.

The whole point of using another method is to load a page of values without first loading everything that precedes it. This solution requires tracking the last document viewed in order to find the next set of documents. It only works if we don't give the user the option of jumping to specific pages.

There is a better way. We can use the bucket pattern.

First, a quick recap of the bucket pattern. Bucketing is most useful whenever a list of similar things all relate to a central entity. Capturing data points over time falls into this category. And importantly, most data sets that require pagination can use this pattern.

Our previous example worked with a collection that looks like this:

{
    "ticker": "MDB",
    "customerId": 7000000,
    "type": "buy",
    "qty": 419,
    "date": ISODate("2018-10-26T15:47:03.434Z")
},
{
    "ticker": "MDB",
    "customerId": 7000000,
    "type": "sell",
    "qty": 29,
    "date": ISODate("2018-10-30T09:32:57.765Z")
}

And here is that same data set using the bucket pattern:

{
  "_id": "7000000_1540568823",
  "customerId": 7000000,
  "count": 2,
  "history": [
    {
      "type": "buy",
      "ticker": "MDB",
      "qty": 419,
      "date": ISODate("2018-10-26T15:47:03.434Z")
    },
    {
      "type": "sell",
      "ticker": "MDB",
      "qty": 29,
      "date": ISODate("2018-10-30T09:32:57.765Z")
    }
  ]
}

Using the bucket pattern, the two trade documents condense into a single document holding an array of trades. The array contains two objects because the original design had two documents. A field that repeats across the original documents (customerId) moves to the root of our single document, while the fields that vary per trade stay inside the history array.

There are lots of benefits to this schema design pattern, but let's focus on the benefits for paging. Note also the addition of a count field, which records how many trades appear in the history array. The count field becomes important later.

How does all this relate to paging? The most efficient way to gather information for display is to store information as it's needed for display. MongoDB excels at this by allowing you to store data exactly as needed in rich and complex ways. For paging, data is needed in buckets of 20, 50, 100, etc. and the bucket pattern allows us to represent each page as a single document.

Let's look at this same concept another way. When rendering a page using the old "find with skip and limit" method, each page load iterates through multiple documents. Displaying 20 trades per page requires iterating 20 times over a cursor, retrieving 20 documents from the server. With the bucket approach to paging, each page load requires only a single document to generate the entire page!

Now let's look deeply at storing the information for display.

Notice the value stored in _id. In our example, _id is a compound value: a string that concatenates the customerId and the time of the bucket's first trade, in seconds since the epoch. There is a reason for this.
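
As a sketch, assuming the document layout shown above, the compound _id could be assembled like this in the shell:

// Concatenate customerId and the bucket's first trade time (epoch seconds).
var customerId = 7000000;
var firstTradeDate = ISODate("2018-10-26T15:47:03.434Z");
var bucketId = customerId + "_" + Math.floor(firstTradeDate.getTime() / 1000);
// bucketId => "7000000_1540568823"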

Our web page is designed to display the stock trade history made by a single customer. Creating a compound _id starting with customerId effectively "groups" all the objects in the history array field by customerId. Using a regular expression anchored to the start of the string, which lets MongoDB walk the _id index efficiently, we can quickly find our first complete set of results:

db.history.find({ "_id": /^7000000_/ })
	.sort({ _id: 1 })
	.limit(1)

A single document will be returned. The document contains a history array with multiple stock trades ready to display!
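
Rendering the page is then a single loop over the embedded array. A sketch in the shell, where print stands in for whatever the display layer actually does:

// Fetch the first bucket and walk its embedded trades.
var bucket = db.history.find({ "_id": /^7000000_/ })
                       .sort({ _id: 1 })
                       .limit(1)
                       .next();
bucket.history.forEach(function (trade) {
    print(trade.date + "  " + trade.type + "  " + trade.ticker + "  x" + trade.qty);
});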

Now imagine there are more than two trades. Let's look at an example with 1000 trades. How does this pattern work?

Going back to the idea that data should be stored as it is needed for display, each bucket should hold exactly enough trades to render a complete page. If the display shows 20 trades per page, then store 20 trades per bucket; our 1000 trades become 50 bucket documents (1000 trades / 20 trades per bucket = 50 buckets).

To display page one, fetch the first bucket from the server. To display page two, skip the first bucket with .skip(1) and fetch the second bucket. To display page three, use .skip(2) and fetch the third. To jump straight to page forty, use .skip(39) and fetch the fortieth bucket, as sketched below. It's that easy!
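
In query form, jumping to any page becomes a one-document fetch (pageNumber is a hypothetical variable):

// Jump straight to page 40: skip 39 one-page buckets, fetch one bucket.
var pageNumber = 40;
db.history.find({ "_id": /^7000000_/ })
          .sort({ _id: 1 })
          .skip(pageNumber - 1)
          .limit(1)

We still skip, but each skipped document now represents an entire page of results: jumping to page forty skips 39 bucket documents instead of 780 individual trades.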

Done, right? Not exactly.

In the next part, we'll examine how to create and optimize this pattern for even more efficient and powerful pagination.