Dealing with Large Data Sets

Large data sets are increasingly becoming a reality of front-ends, GUIs and web apps alike. Users are getting more comfortable with tons of data, and are generating a lot of it too. Large data sets are as much a reality as multi-threading—they’re both incredibly annoying to deal with, and neither is ignorable with today’s computers (and users).

Strategies

There are four basic strategies to dealing with large data sets: Ignoring the problem, Filtering, Paging, and Data Virtualization.

Ignoring the problem

Don’t do this unless your users never deal with more than ten things at a time. (Incidentally, I have never met anyone with an e-mail account holding fewer than ten e-mails, or anyone with an iPod with only ten songs—have you?)

Instant F. Why’d you bother writing the app in the first place? Your users are furious because they’re either spending large quantities of time staring at progress bars, or worse, a frozen screen.

Filtering

Filtering sometimes avoids the need for data virtualization, but this is usually nothing more than a band-aid for a broken arm—watch as your first user types in a query that somehow manages to return 90% of your data set and bring your app to its knees. In order for filtering to be effective, your user must be able and willing to provide enough criteria to narrow down the data set into something that’s not a “large data set”. From the user’s point of view, there is a very high up-front cost in using the UI—more often than not, “filtering” means twenty text boxes/dropdowns/checkboxes, all using names for fields that no one really understands. (“Hey, Bob—what do you suppose the “Type” dropdown is for? It’s got ‘PCX’, ‘AVW’, and ‘BOO’ in it.”) Those cryptic fields are not self-describing because the only thing that could help—the data itself—is locked behind a dizzying array of options.

This’ll get you anything from a C+ to an F, depending on the data set. Everything will be fine until That User types in “T” for a search query. (By the way, did you know “T” is the ticker symbol for AT&T?)

Paging

Implementing paging shows that you care about your users’ time. The GUI doesn’t feel overloaded or clunky, and it’s a familiar concept to the user: everyone knows what “Page 5 of 752 (81–100, 15,035 items total)” means. It’s not a coincidence that most (all?) webmail apps show things in pages of 20 or 50 or so e-mails; the browser would choke trying to render a <table> with thousands of rows, making for a very nasty user experience.
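The bookkeeping behind that status line is simple arithmetic, which is part of paging’s appeal. A minimal sketch (the numbers match the example above; none of the names come from any particular framework):

using System;

public static class PagingMath
{
    public static void Main()
    {
        const int totalItems = 15035, pageSize = 20, page = 5; // 1-based page number

        int totalPages = (totalItems + pageSize - 1) / pageSize; // round up: 752
        int first = (page - 1) * pageSize + 1;                   // 81
        int last = Math.Min(page * pageSize, totalItems);        // 100

        // Prints: Page 5 of 752 (81-100, 15,035 items total)
        Console.WriteLine("Page {0} of {1} ({2}-{3}, {4:N0} items total)",
            page, totalPages, first, last, totalItems);
    }
}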

It’s not perfect, though:

  • Changing data. If you’re dealing with a data set where rows are frequently added or removed while the user is looking at it, paging becomes a mediocre solution. You can either keep the page boundaries stable (which requires the server, at minimum, to remember the full data set as it stood at the time of the initial query) or ignore the problem (and watch rows randomly disappear or appear twice as the user Next Pages through your data set).
  • Slower user experience. Most operations are a request-reply back to the server—sorting, grouping, filtering, etc. Most clicks make the user wait.
  • No table scanning. There is something to be said for quickly scrolling through thousands of items, scanning for something when a search isn’t giving me back what I want. (Was it “Bob’s Restaurant”? Oh! I see it there on Row 452; I was actually looking for “Bill’s Burgers and Fries”.)
  • n-dimensional data. Paging is very linear. If you have more than one dimension of data, paging is not very useful. I’m referring to every data set that looks better as a pretty graph.

This can get you as high as an A– (it definitely works for all the search engines) down to a C– (can you imagine what working with iTunes would be like if they made you “Click here for the next 20 results”?).

Data Virtualization

Make the user think you loaded everything. When they click the sort headers, they don’t know that you’re actually pulling objects back from disk. When they’re whipping through the scrollbar, they have no idea that you’re loading the next 100 rows into memory.

The only downside? It’s obnoxiously difficult on most development platforms to get right.

What is virtualization?

Any developer who has worked with a grid control knows how important virtualization is for performance. The basic idea behind virtualization is very simple: if the user can’t see it, then the computer doesn’t need it. You can present the user with a giant scrollable expanse of cells; the grid will handle creating/dropping/recycling grid cells as necessary. Any grid control worth anything supports virtualization; even the basic, built-in ListBox and ListView in WPF (and Silverlight 3.0) support row-based virtualization. But in most simple binding situations, you still need the entire collection of data you want to render loaded in memory; the control virtualizes the UI, not the data.
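The arithmetic behind “if the user can’t see it, the computer doesn’t need it” is refreshingly simple. Here is a sketch of the visible-range calculation a virtualizing panel effectively performs (fixed row heights assumed; the names are mine, not WPF’s):

using System;

public static class VisibleRange
{
    public static void Main()
    {
        // Hypothetical numbers: a million 20px rows, a 600px viewport,
        // scrolled 250,000px down.
        double scrollOffset = 250000, viewportHeight = 600, rowHeight = 20;
        int totalRows = 1000000;

        // Only rows [first, last] need live UI elements right now;
        // everything else can be created later, or recycled.
        int first = Math.Max(0, (int)(scrollOffset / rowHeight));
        int last = Math.Min(totalRows - 1,
                            (int)((scrollOffset + viewportHeight) / rowHeight));

        Console.WriteLine("Realize rows {0}-{1} of {2}", first, last, totalRows);
        // Realize rows 12500-12530 of 1000000
    }
}

Thirty-odd rows of UI instead of a million is the entire trick.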

For applications that don’t delve into large collections of data, there isn’t much point. The added complexity isn’t worth it. But that being said, a lot of commonly-used types of applications benefit from data virtualization:

  • Music library applications—think iTunes with tens of thousands of files of music, movies, etc.
  • Desktop mail applications—the inbox with thousands of e-mails
  • Google Maps—possibly the only web app that I can think of that shows off what data virtualization can truly look like

It would be really neat if Google took the same approach with Mail as they do in Maps: a virtual table that doesn’t actually download all the mail, but instead lets you pan through the inbox with a fake scrollbar as if everything were already downloaded. They don’t, because it’s a pain in the neck and paging usually works just as well. (Or maybe they just haven’t thought of it yet. If you see that feature within the next few months or years, you can thank me for giving them the idea.)

A+, if you pull off the illusion flawlessly.

How do I implement it?

Most grid controls allow you to hook into their virtualization so that you can provide your own data virtualization alongside the UI virtualization native to the control, but there isn’t much out of the box in .NET to help you with the data side. There is no built-in “virtual” ICollection<T>, although there is nothing to stop someone from implementing one. Indeed, Beatriz Costa wrote about some data virtualization techniques that can be used in WPF and Silverlight, but they aren’t perfect.
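To make that concrete, here is a minimal sketch of what such a “virtual” IList<T> could look like: it reports the full Count to whatever binds to it, but keeps only one page of real objects in memory and fetches pages on demand. The one-page cache and the fetch delegate are my own simplifications, not anything out of WPF or Bea’s samples:

using System;
using System.Collections;
using System.Collections.Generic;

// A read-only list that pretends to hold Count items but only ever
// keeps a single page of them in memory.
public class VirtualList<T> : IList<T>
{
    private readonly Func<int, int, IList<T>> fetchPage; // (startIndex, count) -> rows
    private readonly int pageSize;
    private IList<T> cachedPage;
    private int cachedPageIndex = -1;

    public VirtualList(int count, int pageSize, Func<int, int, IList<T>> fetchPage)
    {
        this.Count = count;
        this.pageSize = pageSize;
        this.fetchPage = fetchPage;
    }

    public int Count { get; private set; }
    public bool IsReadOnly { get { return true; } }

    public T this[int index]
    {
        get
        {
            int pageIndex = index / pageSize;
            if (pageIndex != cachedPageIndex)
            {
                // Cache miss: throw away the old page and load the new one.
                // (The delegate is expected to clamp the final, short page.)
                cachedPage = fetchPage(pageIndex * pageSize, pageSize);
                cachedPageIndex = pageIndex;
            }
            return cachedPage[index % pageSize];
        }
        set { throw new NotSupportedException(); }
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < Count; i++) yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // This is a read-only view, so the mutating members don't apply.
    public void Add(T item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public void Insert(int index, T item) { throw new NotSupportedException(); }
    public bool Remove(T item) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
    public bool Contains(T item) { return false; }
    public int IndexOf(T item) { return -1; }
    public void CopyTo(T[] array, int arrayIndex) { throw new NotSupportedException(); }
}

Bind a control’s ItemsSource to one of these and the grid is none the wiser; scrolling just turns into page fetches.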

Although .NET is a great platform for quickly building apps, Cocoa is a great platform for quickly building apps that can handle lots and lots of data. Cocoa on the Mac (and, as of iPhone OS 3.0, Cocoa Touch) supports data virtualization out-of-the-box using Core Data. Core Data provides a mechanism for defining a data model, and then it takes care of persistence of objects for you. On Cocoa Touch, NSFetchedResultsController acts as an intermediary between Core Data and your controls, essentially handling loading and unloading of data as the UI requires it. Given the memory constraints on the iPhone, Apple really had to provide a solid solution for data virtualization—keeping a list of even a few hundred moderately complicated objects could cause your app to run out of memory and crash.

Core Data uses SQLite under the hood*, and SQLite is also a very good foundation for rolling your own data virtualization scheme. SQLite is an embedded database engine. There are no servers to install, no processes to run—it’s just a library that allows SQL access to a file on the file system. It has the semantics of a database and the semantics of a file, depending on which one is more convenient. System.Data.SQLite is an excellent ADO.NET provider that you can use in .NET (unfortunately, because it uses native code, it’s off-limits in Silverlight):

using System.Data;
using System.Data.SQLite;
 
public static class SQLiteTest
{
    public static void Main()
    {
        // colors.db is just a file that will be created in the
        // current directory if it doesn't exist
        using (var conn = new SQLiteConnection("Data Source=colors.db"))
        {
            conn.Open();
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "CREATE TABLE Colors(id INTEGER PRIMARY KEY, name TEXT)";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES(\"Red\")";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES(\"Green\")";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES(\"Blue\")";
                cmd.ExecuteNonQuery();
            }
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT name FROM Colors";
                using (IDataReader reader = cmd.ExecuteReader())
                {
                    // Read() advances to the next row; GetString(0) pulls
                    // the "name" column out of the current row.
                    while (reader.Read())
                    {
                        System.Console.WriteLine(reader.GetString(0));
                    }
                }
            }
        }
    }
}

Because the database is local, queries are fast from start to finish—no network latency issues here. And because it’s just a file, you can throw it away when you’re done. SQLite makes an excellent backbone to any data virtualization scheme because of its hybrid semantics (simplified random access like a database, convenience like a flat file).
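And tying it back to the virtual-list idea above: the fetch can be a windowed query, so SQLite never hands back more than one page’s worth of objects. A sketch against the Colors table (the helper name is mine):

using System.Collections.Generic;
using System.Data.SQLite;

public static class ColorPages
{
    // Fetch rows [start, start + count) and nothing else.
    public static IList<string> FetchPage(SQLiteConnection conn, int start, int count)
    {
        var page = new List<string>();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "SELECT name FROM Colors ORDER BY id LIMIT @count OFFSET @start";
            cmd.Parameters.AddWithValue("@count", count);
            cmd.Parameters.AddWithValue("@start", start);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    page.Add(reader.GetString(0));
            }
        }
        return page;
    }
}

Plug that into the VirtualList<T> sketch from earlier—new VirtualList<string>(totalCount, 100, (s, c) => ColorPages.FetchPage(conn, s, c))—and you have the bones of a data virtualization scheme.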

*I’m purposely leaving out Core Data’s binary and XML serialization because they require the entire object graph to be loaded into memory, and if you’re going to do that, then what’s the point?

But large data sets aren’t that important…

Every mature application that I have ever been a part of has had to tackle the issue of larger data sets at some point. In my personal experience, a few thousand entities (a few hundred if you’re talking about images) is enough to qualify as a “large data set”: your application begins to visibly suffer.

I’d love to recommend data virtualization as the route to go for handling large data sets in .NET, but there really isn’t enough in the way of frameworks to provide an out-of-the-box solution. Implement paging unless you really need data virtualization. If you’re working on iPhone apps, drop everything and learn Core Data if you haven’t already. You’d be surprised at how much you don’t need to worry about. But regardless of what platform you’re developing a front-end on, you should always at least ask yourself how your GUI might respond to thousands of x, because one day it’s going to have to. —DKT