Dealing with Large Data Sets

Large data sets are increasingly becoming a reality of front-ends, GUIs and web apps alike. Users are getting more comfortable with tons of data, and are generating a lot of it too. Large data sets are as much a reality as multi-threading—they’re both incredibly annoying to deal with, and neither can be ignored on today’s computers (or by today’s users).

Strategies

There are four basic strategies to dealing with large data sets: Ignoring the problem, Filtering, Paging, and Data Virtualization.

Ignoring the problem

Don’t do this unless your users never deal with more than ten things at a time. (Incidentally, I have never met anyone with an e-mail account holding fewer than ten e-mails, or anyone with an iPod with ten songs—have you?)

Instant F. Why’d you bother writing the app in the first place? Your users are furious because they’re either spending large quantities of time staring at progress bars, or worse, a frozen screen.

Filtering

Filtering sometimes avoids the need for data virtualization, but this is usually nothing more than a band-aid for a broken arm—watch as your first user types in a query that somehow manages to return 90% of your data set and bring your app to its knees. In order for filtering to be effective, your user must be able and willing to provide enough criteria to narrow down the data set into something that’s not a “large data set”. From the user’s point of view, there is a very high up-front cost in using the UI—more often than not, “filtering” means twenty text boxes/dropdowns/checkboxes, all using names for fields that no one really understands. (“Hey, Bob—what do you suppose the “Type” dropdown is for? It’s got ‘PCX’, ‘AVW’, and ‘BOO’ in it.”) Those cryptic fields are not self-describing because the only thing that could help—the data itself—is locked behind a dizzying array of options.

This’ll get you anything from a C+ to an F, depending on the data set. Everything will be fine until That User types in “T” for a search query. (By the way, did you know “T” is the ticker symbol for AT&T?)

Paging

Implementing paging shows that you care about your users’ time. The GUI doesn’t feel overloaded or clunky, and it’s a familiar concept to the user: everyone knows what “Page 5 of 752 (81–100, 15,035 items total)” means. It’s not a coincidence that most (all?) webmail apps show things in pages of 20 or 50 or so e-mails; the browser would choke trying to render a <table> with thousands of rows, making for a very nasty user experience.
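Under the hood, a page is usually nothing more than a LIMIT/OFFSET query plus a COUNT for the header text. A minimal sketch (in Python with the standard sqlite3 module so it runs without a server; the `Items` table and the function names are illustrative, not from any framework):

```python
import sqlite3

def fetch_page(conn, page, page_size=20):
    """Return (rows, total) for a 1-based page of the Items table."""
    total = conn.execute("SELECT COUNT(*) FROM Items").fetchone()[0]
    rows = conn.execute(
        "SELECT id, name FROM Items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    ).fetchall()
    return rows, total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items(id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Items(name) VALUES (?)",
                 [(f"item {i}",) for i in range(1, 101)])

rows, total = fetch_page(conn, page=5, page_size=20)
# Page 5 of a 100-row table covers rows 81-100.
```

The “Page 5 of 752 (81–100, 15,035 items total)” label falls straight out of `page`, `page_size`, and `total`.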

It’s not perfect, though:

  • Changing data. If you’re dealing with a data set where rows are frequently added/removed while the user is looking at it, paging becomes a mediocre solution. You can then either keep the page boundaries stable (which requires at least the server to remember the full data set at the time of the initial query) or ignore the problem (which causes rows to randomly disappear or appear twice as the user Next-Pages through your data set).
  • Slower user experience. Most operations are a request-reply back to the server—sorting, grouping, filtering, etc. Most clicks make the user wait.
  • No table scanning. There is something to be said for quickly scrolling through thousands of items, scanning for something when a search isn’t giving me back what I want. (Was it “Bob’s Restaurant”? Oh! I see it there on Row 452; I was actually looking for “Bill’s Burgers and Fries”.)
  • n-dimensional data. Paging is very linear. If you have more than one dimension of data, paging is not very useful. I’m referring to every data set that looks better as a pretty graph.
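One way to keep page boundaries stable without making the server snapshot the whole result set is keyset (a.k.a. seek) paging: instead of an offset, each request remembers the last key the client saw, so inserts and deletes elsewhere in the table can’t shift the boundaries. A sketch of the idea (Python/sqlite3; not from the post above, just one way to handle the changing-data bullet):

```python
import sqlite3

def fetch_after(conn, last_id, page_size=20):
    """Fetch the next page strictly after the last key the client saw."""
    return conn.execute(
        "SELECT id, name FROM Items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items(id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Items(name) VALUES (?)",
                 [(f"item {i}",) for i in range(1, 61)])

page1 = fetch_after(conn, last_id=0)
# A row deleted behind the cursor doesn't shift the next page:
conn.execute("DELETE FROM Items WHERE id = 3")
page2 = fetch_after(conn, last_id=page1[-1][0])
```

The trade-off is that you lose “jump to page 37”; you can only walk forward (or backward, with a reversed query) from where you are.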

This can get you as high as an A– (it definitely works for all the search engines) down to a C– (can you imagine what working with iTunes would be like if they made you “Click here for the next 20 results”?).

Data Virtualization

Make the user think you loaded everything. When they click the sort headers, they don’t know that you’re actually pulling objects back from disk. When they’re whipping through the scrollbar, they have no idea that you’re loading the next 100 rows into memory.

The only downside? It’s obnoxiously difficult on most development platforms to get right.

What is virtualization?

Any developer who has worked with a grid control knows how important virtualization is for performance. The basic idea behind virtualization is very simple: if the user can’t see it, then the computer doesn’t need it. You can present the user with a giant scrollable expanse of cells; the grid will handle creating/dropping/recycling grid cells as necessary. Any grid control worth anything supports virtualization; even the basic, built-in ListBox and ListView in WPF (and Silverlight 3.0) support row-based virtualization. But in most simple binding situations, you generally need a collection of all of the data that you want to render for your grid loaded in memory.
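The arithmetic behind “if the user can’t see it, the computer doesn’t need it” is tiny: map the scroll offset to a window of row indices, and only materialize those. A rough sketch (in Python for brevity; the fixed row height and the names are illustrative, not any particular grid control’s API):

```python
def visible_rows(scroll_offset, viewport_height, row_height, total_rows, overscan=2):
    """Indices of rows the user can see, plus a little overscan for smooth scrolling."""
    first = max(0, scroll_offset // row_height - overscan)
    last = min(total_rows, (scroll_offset + viewport_height) // row_height + 1 + overscan)
    return range(first, last)

# A 600px viewport over 100,000 rows of 20px each only ever needs 35 rows alive:
window = visible_rows(scroll_offset=50_000, viewport_height=600,
                      row_height=20, total_rows=100_000)
```

Everything outside `window` can be dropped or recycled; that’s all “row-based virtualization” means.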

For applications that don’t delve into large collections of data, there isn’t much point. The added complexity isn’t worth it. But that being said, a lot of commonly-used types of applications benefit from data virtualization:

  • Music library applications—think iTunes with tens of thousands of files of music, movies, etc.
  • Desktop mail applications—the inbox with thousands of e-mails
  • Google Maps—possibly the only web app that I can think of that shows off what data virtualization can truly look like

It would be really neat if Google took the same approach with Mail as they do in Maps—a virtual table that doesn’t actually download all the mail, but instead lets you pan through the inbox using a fake scrollbar as if everything were already downloaded. They don’t, because it’s a pain in the neck to build and paging usually works just as well. (Or maybe they just haven’t thought of it yet. If you see that feature within the next few months or years, you can thank me for giving them the idea.)

A+, if you pull off the illusion flawlessly.

How do I implement it?

Most grid controls allow you to hook into their virtualization so that you can provide your own data virtualization alongside the UI virtualization native to the control, but there isn’t much out of the box in .NET to help you with the data side. There is no built-in “virtual” ICollection<T>, although there is nothing to stop someone from implementing one. Indeed, Beatriz Costa wrote about some data virtualization techniques that can be used in WPF and Silverlight, but they aren’t perfect.
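The shape of such a “virtual” collection is straightforward even though .NET doesn’t ship one: report the full count up front, but fetch items a page at a time on first access and cache the pages. A language-neutral sketch in Python (the page size and the `fetch_page` callback signature are my own assumptions for illustration, not any framework’s API):

```python
class VirtualList:
    """List-like facade: len() is cheap, items are fetched a page at a time."""
    def __init__(self, count, fetch_page, page_size=100):
        self._count = count
        self._fetch_page = fetch_page   # fetch_page(start, size) -> list of items
        self._page_size = page_size
        self._pages = {}                # page index -> cached list of items

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        page, offset = divmod(i, self._page_size)
        if page not in self._pages:
            self._pages[page] = self._fetch_page(page * self._page_size, self._page_size)
        return self._pages[page][offset]

calls = []
def fake_fetch(start, size):
    calls.append(start)   # record which pages actually get loaded
    return [f"row {n}" for n in range(start, start + size)]

items = VirtualList(count=10_000, fetch_page=fake_fetch, page_size=100)
first, nearby = items[0], items[42]   # one fetch serves both
far = items[9_999]                    # a second fetch for the last page
```

A real implementation would also evict cold pages and invalidate the cache when the underlying data changes, but the skeleton is the same.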

Although .NET is a great platform for quickly building apps, Cocoa is a great platform for quickly building apps that can handle lots and lots of data. Cocoa on the Mac (and, as of iPhone OS 3.0, Cocoa Touch) supports data virtualization out-of-the-box using Core Data. Core Data provides a mechanism for defining a data model, and then it takes care of persistence of objects for you. On Cocoa Touch, NSFetchedResultsController acts as an intermediary between Core Data and your controls, essentially handling loading and unloading of data as the UI requires it. Given the memory constraints on the iPhone, Apple really had to provide a solid solution for data virtualization—keeping a list of even a few hundred moderately complicated objects could cause your app to run out of memory and crash.

Core Data uses SQLite under the hood*, and SQLite is also a very good foundation for rolling your own data virtualization scheme. SQLite is an embedded database engine: there are no servers to install, no processes to run—it’s just a library that allows SQL access to a file on the file system. It has the semantics of a database and the semantics of a file, depending on which one is more convenient. System.Data.SQLite is an excellent ADO.NET provider that you can use in .NET (unfortunately, because it uses native code, it’s off-limits in Silverlight):

using System.Data;
using System.Data.SQLite;
 
public static class SQLiteTest
{
    public static void Main()
    {
        // colors.db is just a file that will be created in the
        // current directory if it doesn't exist
        using (var conn = new SQLiteConnection("Data Source=colors.db"))
        {
            conn.Open();
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "CREATE TABLE Colors(id INTEGER PRIMARY KEY, name TEXT)";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Red')";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Green')";
                cmd.ExecuteNonQuery();
 
                cmd.CommandText = "INSERT INTO Colors(name) VALUES('Blue')";
                cmd.ExecuteNonQuery();
            }
            using (IDbCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT name FROM Colors";
                using (IDataReader reader = cmd.ExecuteReader())
                {
                    // Read() advances to the next row; GetString(0) reads the name column
                    while (reader.Read())
                    {
                        System.Console.WriteLine(reader.GetString(0));
                    }
                }
            }
        }
    }
}

Because the database is local, queries are fast from start to finish—no network latency issues here. And because it’s just a file, you can throw it away when you’re done. SQLite makes an excellent backbone to any data virtualization scheme because of its hybrid semantics (simplified random access like a database, convenience like a flat file).
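Wiring a data virtualization scheme to SQLite is mostly a one-liner: the page fetch becomes a LIMIT/OFFSET query against the local file, which is fast enough to hide behind a scrollbar. A sketch using Python’s built-in sqlite3 bindings (an in-memory database stands in for the throwaway file; the `fetch_page` shape is my own, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # or a throwaway file on disk
conn.execute("CREATE TABLE Colors(id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Colors(name) VALUES (?)",
                 [("Red",), ("Green",), ("Blue",)])

def fetch_page(start, size):
    """The page-fetch half of a data virtualization scheme: no network, just a local read."""
    return conn.execute(
        "SELECT name FROM Colors ORDER BY id LIMIT ? OFFSET ?",
        (size, start),
    ).fetchall()

total = conn.execute("SELECT COUNT(*) FROM Colors").fetchone()[0]
page = fetch_page(0, 2)
```

Plug a fetcher like this into a virtual collection and the UI never holds more than a window of rows in memory.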

*I’m purposely leaving out Core Data’s binary and XML serialization because they require the entire object graph to be loaded into memory, and if you’re going to do that, then what’s the point?

But large data sets aren’t that important…

Every mature application that I have ever been a part of has had to tackle the issue of larger data sets at some point. In my personal experience, thousands of entities (hundreds if you’re talking about images) is enough to be considered a “large data set” in that your application begins to visibly suffer.

I’d love to recommend data virtualization as the route to go for handling large data sets in .NET, but there really isn’t enough in the way of frameworks to provide an out-of-the-box solution. Implement paging unless you really need data virtualization. If you’re working on iPhone apps, drop everything and learn Core Data if you haven’t already. You’d be surprised at how much you don’t need to worry about. But regardless of what platform you’re developing a front-end on, you should always at least ask yourself how your GUI might respond to thousands of x, because one day it’s going to have to. —DKT

A Hello World Order Entry app in WPF

So you’ve decided that Windows Forms is more trouble than it’s worth. You’re all set to start writing a WPF app. You fire up Visual Studio 2008, create a new solution, and see absolutely nothing about this fancy MVVM that seems to have taken the world by storm. No standard template—just a “WPF Application” that, when you start it, gives you the same blank Window that you used to get when you were creating Windows Forms apps. Really?

To be fair, I don’t think even the gang in Redmond realized just how far the community would run with MVVM, and a lot of what we now think of as staples of MVVM design simply hadn’t been thought of when Microsoft was trying to rush VS 2008 out the door. Visual Studio 2010 will provide better out-of-the-box guidance for creating a well-factored WPF app, but that’s not here quite yet.

Until then, give Order Entry a whirl. It’s a simple app with a simple purpose—provide a simple order entry ticket (buy/sell, symbol, price, and quantity). It sets up the basics of defining a view and a viewmodel that exposes the properties of a model in a way more conducive to consumption by the view.

It illustrates:

  1. Binding a view to a viewmodel
  2. Separation of the view (visual elements) from the viewmodel (data, non-visual resources)
  3. A simple ICommand implementation that disables/enables the Submit button

It’s not perfect because it lacks two things that every real GUI has (which we’ll fix in later posts):

  1. Threading. When you press Submit, all the “work” happens on the GUI thread.
  2. Validation. When you type something in the view that is not physically storable in the viewmodel (like entering in “Abracadabra” as a quantity), the screen changes to indicate an error on the field, but it doesn’t stop the user from submitting (viewmodel and its ICommand don’t know that the view is in an error state—only the view knows).

Validation and WPF data binding are tricky to get right. We’ll talk about that in later posts too. But for now, here’s version 1 of the Order Entry app to chew on. —DKT

Windows Forms is a Maintenance Risk

People say that WPF is a pretty big change from the Windows Forms technology of yesteryear, but there are also people who have invested considerable time and money into Windows Forms solutions and who aren’t eager to rewrite their apps yet again. I can understand that—it takes time to develop things, and rewriting things for the sake of using the latest and greatest is not usually something that keeps projects on time and on budget.

But any new development should definitely be written in WPF.

Here’s a snippet of Windows Forms code that keeps a TextBox on the screen synchronized with a Quantity field on an Order object:

MyEditor.cs
public class MyEditor : UserControl
{
    private System.Windows.Forms.TextBox quantityTextBox;
    private Order _order;
 
    public MyEditor()
    {
        InitializeComponent();
    }
 
    private void InitializeComponent()
    {
        quantityTextBox = new System.Windows.Forms.TextBox();
        quantityTextBox.TextChanged += new System.EventHandler(quantityTextBox_TextChanged);
        // ...more auto-generated noise
    }
 
    public void SetOrder(Order order)
    {
        _order = order;
        _order.PropertyChanged += order_PropertyChanged;
        quantityTextBox.Text = _order.Quantity.ToString();
    }
 
    private void quantityTextBox_TextChanged(object sender, EventArgs e)
    {
        long qty;
        if (long.TryParse(quantityTextBox.Text, out qty))
        {
            _order.Quantity = qty;
            quantityTextBox.BackColor = Color.White;
        }
        else
        {
            quantityTextBox.BackColor = Color.Pink;
        }
    }
 
    private void order_PropertyChanged(object sender, PropertyChangedEventArgs e)
    {
        if (e.PropertyName == "Quantity")
        {
            quantityTextBox.Text = _order.Quantity.ToString();
        }
    }
}

And in WPF:

MyEditor.xaml
<UserControl x:Class="MyEditor"
             xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
             xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
    <TextBox Text="{Binding Path=Quantity}"/>
</UserControl>
MyEditor.xaml.cs
public partial class MyEditor
{
    public MyEditor()
    {
        InitializeComponent();
    }
 
    public void SetOrder(Order order)
    {
        this.DataContext = order;
    }
}

Which one would you rather maintain? —DKT

PowerShell Deployment Options are Awful

I once looked into using PowerShell as a foundation for an application that was to be very heavily powered by scripts underneath. It seemed like an idea with a lot of promise—most of the libraries it was to work with were written in .NET, the users were comfortable with scripting, and it would have been easy to expose GUI components in a scriptable way (think VBA for Office)—however, actually getting the thing deployed would have been a totally different matter.

Most firms lock down end-user machines. This makes for a more secure and easier-to-manage network, but it makes deployment an absolute nightmare when an application requires a non-trivial installer and administrative privileges to use.

And that brings me to the title of this post. PowerShell deployment options are awful. Considering that PowerShell is written largely (exclusively?) in .NET, it’s unbelievable that an installer has to be run, files need to be installed in system directories, and registry keys need to be written. About 99% of what it does can be done in pure .NET, and any app written in pure .NET can be copied to a directory and run as long as its dependencies are in the same directory or in the GAC. Life would have been much simpler for me if I could just include the PowerShell .NET assemblies, deploy them with my application, and run PowerShell scripts.

And then let’s say I do end up being able to deploy PowerShell properly, with all its funky registry keys and whatever else that installer does; what am I supposed to do with any PSSnapIns that I’m dependent on?

Somewhere on the Microsoft campus, there needs to be a giant poster called “Lessons Learned”. It should include at least the following bullet points:

  • No Clippy. Cartoons don’t help people get work done—functional software does.
  • People like XCOPY deploy—installers are obnoxious.

Well, they removed Clippy from Office 2007. Right now, .NET apps are so simple to deploy it’s criminal. So what’s the deal with PowerShell? —DKT