Dealing with Large Data Sets

Large data sets are increasingly becoming a reality of front-ends, GUIs and web apps alike. Users are getting more comfortable with tons of data, and are generating a lot of it too. Large data sets are as much a reality as multi-threading—they’re both incredibly annoying to deal with, and neither can be ignored with today’s computers (and users).

Strategies

There are four basic strategies for dealing with large data sets: Ignoring the problem, Filtering, Paging, and Data Virtualization.

Ignoring the problem

Don’t do this unless your users never deal with more than ten things at a time. (Incidentally, I have never met anyone with an e-mail account with fewer than ten e-mails, or anyone with an iPod with only ten songs—have you?)

Instant F. Why’d you bother writing the app in the first place? Your users are furious because they’re either spending large quantities of time staring at progress bars, or worse, a frozen screen.

Filtering

Filtering sometimes avoids the need for data virtualization, but this is usually nothing more than a band-aid for a broken arm—watch as your first user types in a query that somehow manages to return 90% of your data set and bring your app to its knees. In order for filtering to be effective, your user must be able and willing to provide enough criteria to narrow down the data set into something that’s not a “large data set”. From the user’s point of view, there is a very high up-front cost in using the UI—more often than not, “filtering” means twenty text boxes/dropdowns/checkboxes, all using names for fields that no one really understands. (“Hey, Bob—what do you suppose the “Type” dropdown is for? It’s got ‘PCX’, ‘AVW’, and ‘BOO’ in it.”) Those cryptic fields are not self-describing because the only thing that could help—the data itself—is locked behind a dizzying array of options.

This’ll get you anything from a C+ to an F, depending on the data set. Everything will be fine until That User types in “T” for a search query. (By the way, did you know “T” is the ticker symbol for AT&T?)

Paging

Implementing paging shows that you care about your users’ time. The GUI doesn’t feel overloaded or clunky and it’s a familiar concept to the user: everyone knows what “Page 5 of 752 (81–100, 15,035 items total)” means. It’s not a coincidence that most (all?) webmail apps show things in pages of 20 or 50 or so e-mails; the browser would choke trying to render a <table> with thousands of rows, and that would make for a very nasty user experience.
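
As a concrete aside, here’s a minimal sketch of the arithmetic behind that status line; the Paging/Describe names and the page size of 20 are just illustrative:

using System;

public static class Paging
{
    // Builds a "Page X of Y" label from a 1-based page number,
    // a page size, and the total number of items.
    public static string Describe(int page, int pageSize, int totalItems)
    {
        int totalPages = (totalItems + pageSize - 1) / pageSize; // ceiling division
        int first = (page - 1) * pageSize + 1;
        int last = Math.Min(page * pageSize, totalItems);
        return string.Format("Page {0} of {1} ({2}–{3}, {4:N0} items total)",
            page, totalPages, first, last, totalItems);
    }
}

// Paging.Describe(5, 20, 15035) returns "Page 5 of 752 (81–100, 15,035 items total)"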

It’s not perfect, though:

  • Changing data. If you’re dealing with a data set where rows are frequently added/removed while the user is looking at it, paging becomes a mediocre solution. You can either keep the page boundaries stable (which requires the server, at minimum, to remember the full data set as of the initial query) or ignore the problem (which causes rows to randomly disappear or appear twice as the user Next Pages through your data set).
  • Slower user experience. Most operations are a request-reply back to the server—sorting, grouping, filtering, etc. Most clicks make the user wait.
  • No table scanning. There is something to be said for quickly scrolling through thousands of items, scanning for something when a search isn’t giving me back what I want. (Was it “Bob’s Restaurant”? Oh! I see it there on Row 452; I was actually looking for “Bill’s Burgers and Fries”.)
  • n-dimensional data. Paging is very linear. If you have more than one dimension of data, paging is not very useful. I’m referring to every data set that looks better as a pretty graph.

This can get you as high as an A– (it definitely works for all the search engines) down to a C– (can you imagine what working with iTunes would be like if they made you “Click here for the next 20 results”?).

Data Virtualization

Make the user think you loaded everything. When they click the sort headers, they don’t know that you’re actually pulling objects back off the disk. When they’re whipping through the scrollbar, they have no idea that you’re loading the next 100 rows into memory.

The only downside? It’s obnoxiously difficult on most development platforms to get right.

What is virtualization?

Any developer who has worked with a grid control knows how important virtualization is for performance. The basic idea behind virtualization is very simple: if the user can’t see it, then the computer doesn’t need it. You can present the user with a giant scrollable expanse of cells; the grid will handle creating/dropping/recycling grid cells as necessary. Any grid control worth anything supports virtualization; even the basic, built-in ListBox and ListView in WPF (and Silverlight 3.0) support row-based virtualization. But in most simple binding situations, you generally need a collection of all of the data that you want to render for your grid loaded in memory.
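
As a tiny sketch of how little this costs you in WPF, row virtualization comes down to an attached property on the built-in controls (it’s already on by default for ListBox; the snippet just makes it explicit, and the VirtualizationSetup name is mine):

using System.Windows.Controls;

public static class VirtualizationSetup
{
    public static void EnableRowVirtualization(ListBox listBox)
    {
        // Only the visible rows (plus a small buffer) get UI containers;
        // the panel creates and drops them as the user scrolls.
        VirtualizingStackPanel.SetIsVirtualizing(listBox, true);
    }
}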

For applications that don’t delve into large collections of data, there isn’t much point. The added complexity isn’t worth it. But that being said, a lot of commonly-used types of applications benefit from data virtualization:

  • Music library applications—think iTunes with tens of thousands of files of music, movies, etc.
  • Desktop mail applications—the inbox with thousands of e-mails
  • Google Maps—possibly the only web app that I can think of that shows off what data virtualization can truly look like

It would be really neat if Google could take the same approach with Mail as they do in Maps—a virtual table that doesn’t actually download all the mail, but instead lets you pan through the inbox using a fake scrollbar as if everything were already downloaded. But they don’t, because it’s a pain in the neck and paging usually works just as well. (Or maybe they just haven’t thought of it yet. If you see that feature within the next few months or years, you can thank me for giving them the idea.)

A+, if you pull off the illusion flawlessly.

How do I implement it?

Most grid controls let you hook into their virtualization, so you could conceivably provide your own data virtualization to go along with the UI virtualization the control already performs, but there isn’t much out of the box in .NET to help you with the data side. There is no built-in “virtual” ICollection<T>, although there is nothing to stop someone from implementing one. Indeed, Beatriz Costa wrote about some data virtualization techniques that can be used in WPF and Silverlight, but they aren’t perfect.
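
To give a flavor of what such a collection might look like, here is a minimal sketch of a read-only virtual IList<T>. All of the names here (IItemProvider, FetchCount, FetchPage) are made up for illustration, and a real implementation would also need cache eviction, thread safety, and change notification:

using System;
using System.Collections;
using System.Collections.Generic;

// Supplies the two queries a virtual list needs: how many items exist,
// and a contiguous page of them.
public interface IItemProvider<T>
{
    int FetchCount();                          // e.g. SELECT COUNT(*) ...
    IList<T> FetchPage(int start, int count);  // e.g. SELECT ... LIMIT count OFFSET start
}

public class VirtualList<T> : IList<T>
{
    private readonly IItemProvider<T> provider;
    private readonly int pageSize;
    private readonly Dictionary<int, IList<T>> pages = new Dictionary<int, IList<T>>();
    private int count = -1;

    public VirtualList(IItemProvider<T> provider, int pageSize)
    {
        this.provider = provider;
        this.pageSize = pageSize;
    }

    // Controls ask for Count up front; everything else loads lazily.
    public int Count
    {
        get
        {
            if (count < 0) count = provider.FetchCount();
            return count;
        }
    }

    public T this[int index]
    {
        get
        {
            int page = index / pageSize;
            IList<T> items;
            if (!pages.TryGetValue(page, out items))
            {
                items = provider.FetchPage(page * pageSize, pageSize);
                pages[page] = items; // a real version would evict old pages here
            }
            return items[index % pageSize];
        }
        set { throw new NotSupportedException(); }
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < Count; i++) yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // The remaining IList<T> members don't apply to this read-only sketch.
    public bool IsReadOnly { get { return true; } }
    public int IndexOf(T item) { return -1; }
    public bool Contains(T item) { return false; }
    public void CopyTo(T[] array, int arrayIndex) { throw new NotSupportedException(); }
    public void Add(T item) { throw new NotSupportedException(); }
    public void Insert(int index, T item) { throw new NotSupportedException(); }
    public bool Remove(T item) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
}

Hand a VirtualList<T> to a virtualized control as its items source and, as far as the control can tell, the entire data set is sitting in memory.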

Although .NET is a great platform for quickly building apps, Cocoa is a great platform for quickly building apps that can handle lots and lots of data. Cocoa on the Mac (and, as of iPhone OS 3.0, Cocoa Touch) supports data virtualization out-of-the-box using Core Data. Core Data provides a mechanism for defining a data model, and then it takes care of persistence of objects for you. On Cocoa Touch, NSFetchedResultsController acts as an intermediary between Core Data and your controls, essentially handling loading and unloading of data as the UI requires it. Given the memory constraints on the iPhone, Apple really had to provide a solid solution for data virtualization—keeping a list of even a few hundred moderately complicated objects could cause your app to run out of memory and crash.

Core Data uses SQLite under the hood*, and SQLite is also a very good foundation for rolling your own data virtualization scheme. SQLite is an embedded database engine: there are no servers to install, no processes to run—it’s just a library that allows SQL access to a file on the file system. It has the semantics of a database and the semantics of a file, depending on which is more convenient. System.Data.SQLite is an excellent ADO.NET provider that you can use in .NET (unfortunately, because it uses native code, it’s off-limits in Silverlight):

using System;
using System.Data;
using System.Data.SQLite;
 
public static class SQLiteTest
{
    public static void Main()
    {
        // colors.db is just a file that will be created in the
        // current directory if it doesn't exist
        using (var conn = new SQLiteConnection("Data Source=colors.db"))
        {
            conn.Open();
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "CREATE TABLE Colors(id INTEGER PRIMARY KEY, name TEXT)";
                cmd.ExecuteNonQuery();
 
                // note: SQLite uses single quotes for string literals
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Red')";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Green')";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Blue')";
                cmd.ExecuteNonQuery();
            }
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT name FROM Colors";
                using (IDataReader reader = cmd.ExecuteReader())
                {
                    // Read() advances to the next row; GetString(0) reads
                    // the first column (name) of the current row
                    while (reader.Read())
                    {
                        Console.WriteLine(reader.GetString(0));
                    }
                }
            }
        }
    }
}

Because the database is local, queries are fast from start to finish—no network latency issues here. And because it’s just a file, you can throw it away when you’re done. SQLite makes an excellent backbone to any data virtualization scheme because of its hybrid semantics (simplified random access like a database, convenience like a flat file).
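
Tying the two together, here is what an item provider for the hypothetical VirtualList<T> sketched earlier might look like when backed by System.Data.SQLite (ColorProvider is a made-up name, and error handling is omitted):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;

public class ColorProvider : IItemProvider<string>
{
    private readonly SQLiteConnection conn;

    public ColorProvider(SQLiteConnection conn)
    {
        this.conn = conn;
    }

    public int FetchCount()
    {
        using (IDbCommand cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT COUNT(*) FROM Colors";
            return Convert.ToInt32(cmd.ExecuteScalar());
        }
    }

    public IList<string> FetchPage(int start, int count)
    {
        var page = new List<string>(count);
        using (IDbCommand cmd = conn.CreateCommand())
        {
            // LIMIT/OFFSET gives cheap random access into the result set,
            // which is exactly what a virtual list needs
            cmd.CommandText = string.Format(
                "SELECT name FROM Colors ORDER BY id LIMIT {0} OFFSET {1}",
                count, start);
            using (IDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    page.Add(reader.GetString(0));
            }
        }
        return page;
    }
}

Something like new VirtualList<string>(new ColorProvider(conn), 100) could then scroll through millions of rows while only ever holding a few pages of them in memory.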

*I’m purposely leaving out Core Data’s binary and XML serialization because they require the entire object graph to be loaded into memory, and if you’re going to do that, then what’s the point?

But large data sets aren’t that important…

Every mature application that I have ever been a part of has had to tackle the issue of larger data sets at some point. In my personal experience, thousands of entities (hundreds, if you’re talking about images) are enough to be considered a “large data set” in that your application begins to visibly suffer.

I’d love to recommend data virtualization as the route to go for handling large data sets in .NET, but there really isn’t enough in the way of frameworks that provide an out-of-the-box solution. Implement paging unless you really need data virtualization. If you’re working on iPhone apps, drop everything and learn Core Data if you haven’t already. You’d be surprised at how much you don’t need to worry about. But regardless of what platform you’re developing a front-end on, you should always at least ask yourself how your GUI might respond to thousands of x, because one day it’s going to have to. —DKT

5 comments

  1. Nice article, Davin.

    At a previous job, I wound up rewriting a severe example of your ‘ignoring’ category into your ‘virtualization’ category (using SQLite, incidentally).

    The folks before me had written this very email-like application that needed to be very secure in its network use and disk storage. So they stored all data as encrypted blobs in a generic ‘blob’ table! For example, if you wanted to see all the messages in a particular folder, you had to scan through that whole table, decrypting everything as you went to determine whether or not it belonged in the folder you’d clicked on. I loaded it up with a million messages, and then showed my manager that it took an hour to display an empty folder. 🙂

    With a slight change to the encryption scheme, I managed to patch SQLite’s low-level disk driver to handle the encryption/decryption itself. After that (with a real data/index schema defined) I wrote a lazy-loading listview control to only query the visible range of records (or requery the visible range if a trigger indicated that visible data may have changed). So I got to say that I improved the performance of that typical use case from 1 hour to less than 1 second.

    There are some programming language lessons in there too, I think. Lazy-evaluation in principle can make that kind of incremental I/O easy, although that’s also affected largely by data structures. FRP is another angle (though I haven’t worked on that as much).

  2. Thanks! I don’t know that there is a faster way to learn about a lot of the icky details of programming than to try to implement a working data virtualization scheme. It’s one of those problems that is so tempting to ignore because it’s both difficult to solve and, I think, historically widely accepted by users. (“Well, that’s just the way it’s gotta be, I guess.”) But as people see more programs out there that are capable of handling large data, perceptions of “acceptably slow” are likely to change.

    I had no idea you could play those kinds of tricks with SQLite, but I guess I shouldn’t be surprised; I’ve always found it to be a very flexible library that has sometimes bent, but never broken over what I’ve needed it for. —DKT

  3. I would think that a natural progression of the System.Data.DataSet class would be to allow it to behave more like a file-based data repository. I’ve been thinking about static extensions for this purpose.
    Cheers

  4. It would be very possible to build a paging system with extensions and DataSets…it might be much less possible, unfortunately, to completely fake out a DataSet so that it looks like it’s holding more data in memory than it really is. The DataSet was really meant to be more of an in-memory cache of data as opposed to providing a view over a much larger data set. It’d be interesting to see what could be done, though… —DKT
