The 2010 Census collected a variety of demographic information for all of the more than 300 million people in the USA. A subset of this data has been pre-processed by the Cooper Center at the University of Virginia, which produced an online map of the population density and the racial/ethnic makeup of the USA. Each dot in that map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision.)
Using Datashader on Big Data
The Cooper Center website delivers pre-rendered image tiles to your browser, which is fast, but limited to the plotting choices they made. What if you want to look at the data a different way - filter it, combine it with other data or manipulate it further? You could certainly re-do the steps they did, using their Python source code, but that would be a big project. Just running the code takes "dozens of hours" and adapting it for new uses requires significant programming and domain expertise. However, the new Python datashader library from Continuum Analytics makes it fast and fully interactive to do these kinds of analyses dynamically, using simple code that is easy to adapt to new uses. The steps below show that using datashader makes it quite practical to ask and answer questions about big data interactively, even on your laptop.
Load Data and Set Up
First, let's load the 2010 Census data into a pandas dataframe:
import pandas as pd
%%time
df = pd.read_hdf('data/census.h5', 'census')
df.race = df.race.astype('category')
CPU times: user 13.9 s, sys: 35.7 s, total: 49.6 s
Wall time: 1min 7s
df.tail()
|           | meterswest | metersnorth | race |
|-----------|------------|-------------|------|
| 306674999 | -8922890.0 | 2958501.2   | h    |
| 306675000 | -8922863.0 | 2958476.2   | h    |
| 306675001 | -8922887.0 | 2958355.5   | h    |
| 306675002 | -8922890.0 | 2958316.0   | h    |
| 306675003 | -8922939.0 | 2958243.8   | h    |
Loading the data from the HDF5-format file takes about a minute, as you can see, which is by far the most time-consuming step. The output of .tail() shows that there are more than 300 million datapoints (one per person), each with a location in Web Mercator coordinates, and that the race/ethnicity of each datapoint has been encoded as a single character ('w' for white, 'b' for black, 'a' for Asian, 'h' for Hispanic and 'o' for other, typically Native American).
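As a quick sanity check on these category codes (a small aside, not part of the original analysis), you can count how many people fall into each category directly from the dataframe:

df.race.value_counts()  # number of people per racial/ethnic category code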
Let's define some geographic ranges to look at later and a default plot size.
USA = ((-13884029, -7453304), (2698291, 6455972))
LakeMichigan = ((-10206131, -9348029), (4975642, 5477059))
Chicago = (( -9828281, -9717659), (5096658, 5161298))
Chinatown = (( -9759210, -9754583), (5137122, 5139825))
NewYorkCity = (( -8280656, -8175066), (4940514, 4998954))
LosAngeles = ((-13195052, -13114944), (3979242, 4023720))
Houston = ((-10692703, -10539441), (3432521, 3517616))
Austin = ((-10898752, -10855820), (3525750, 3550837))
NewOrleans = ((-10059963, -10006348), (3480787, 3510555))
Atlanta = (( -9448349, -9354773), (3955797, 4007753))
x_range,y_range = USA
plot_width = int(900)
plot_height = int(plot_width*7.0/12)
Population Density
For our first examples, let's ignore the race data, focusing on population density alone.
Datashader works by aggregating an arbitrarily large set of data points (millions, for a pandas dataframe, or billions+ for a dask dataframe) into a fixed-size buffer that's the shape of your final image. Each of the datapoints is assigned to one bin in this buffer, and each of these bins will later become one pixel. In this case, we'll aggregate all the datapoints from people in the continental USA into a grid containing the population density per pixel:
import datashader as ds
import datashader.transfer_functions as tf
from datashader.colors import Greys9, Hot, colormap_select as cm
def bg(img): return tf.set_background(img,"black")
%%time
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')
CPU times: user 3.97 s, sys: 12.2 ms, total: 3.98 s
Wall time: 3.98 s
Computing this aggregate grid will take some CPU power (4-8 seconds on this MacBook Pro), because datashader has to iterate through the hundreds of millions of points in this dataset, one by one. But once the agg array has been computed, subsequent processing will be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.
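As an aside, the same aggregation API also accepts a dask dataframe, which partitions the work across all of your cores (or a cluster) when the data is too large for a single in-memory pandas dataframe. The following sketch assumes the same HDF5 file and an arbitrary chunk size, and is only meant to illustrate that the datashader call itself does not change:

import dask.dataframe as dd

# Hypothetical out-of-core version: read the same file lazily in chunks
ddf = dd.read_hdf('data/census.h5', 'census', chunksize=1000000)
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(ddf, 'meterswest', 'metersnorth')  # same call as with the pandas dataframe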
The aggregate grid now contains a count of the number of people in each location. We can visualize this data by mapping these counts into a grayscale value, ranging from black (a count of zero) to white (maximum count for this dataset). If we do this colormapping linearly, we can very quickly and clearly see...
%%time
bg(tf.interpolate(agg, cmap = cm(Greys9), how='linear'))
CPU times: user 25.6 ms, sys: 4.77 ms, total: 30.4 ms
Wall time: 29.8 ms
...almost nothing. The amount of detail visible is highly dependent on your monitor and its display settings, but it is unlikely that you will be able to make much out of this plot on most displays. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago and a few other places. We fed in 300 million datapoints, but so far we're getting almost nothing back in terms of visualization.
The first thing we can do is prevent undersaturation: in the plot above, there is no way to distinguish pixels that are part of the background from those that have low but nonzero counts, because both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white. To do so, we'll discard the first 25% of the gray colormap and linearly interpolate the population densities over the remaining range:
bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'))
The above plot at least reveals that the data has been measured only within the political boundaries of the continental United States, and that many areas in the mountainous West are so sparsely populated that many pixels contain not even a single person. (In datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the specified colormap is shown for pixels that do have data.) Some additional population centers are now visible, at least on some monitors. But mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out of 300 million datapoints!
The problem is that nearly all pixels end up at the very bottom of the available intensity range in this gray colormap, with only a few urban areas using any other shades. Thus, both of the above plots convey very little information. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice for real-world data that spans multiple orders of magnitude:
bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='log'))
Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England, to the many towns stretched out along roadsides in the Midwest (especially those leading to Denver, the hotspot just east of the Rocky Mountains).
Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Yet the choice of 'log' was purely arbitrary, and one could easily imagine that other nonlinear functions would show other interesting patterns. Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities: small towns show up clearly even if they are just slightly more dense than their immediate rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.
With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. Given a set of raw counts, we can map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed data, regardless of the distribution:
bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='eq_hist'))
Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes involved. In this representation, you can clearly see the effects of geography (rivers, coastlines and mountains) on the population density, as well as history (denser near the longest-populated areas) and even infrastructure (with many small towns located at crossroads).
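To make that percentile idea concrete, here is a minimal numpy sketch of the principle on a toy array of counts; it is only an illustration, not datashader's actual eq_hist implementation:

import numpy as np

counts = np.array([0, 1, 1, 2, 5, 100, 1000])     # hypothetical per-pixel counts
ranks = np.searchsorted(np.sort(counts), counts)  # rank of each value among all values
equalized = ranks / float(ranks.max())            # rescale ranks to 0..1 for colormapping

The enormous gap between 5 and 1000 in the raw counts collapses to adjacent steps in the equalized values, so both low-density rural areas and dense urban cores occupy visibly different parts of the colormap.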
Given the very different results from these different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data. The linear version preserves the magnitudes but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If the two plots look similar, then the data is distributed nearly uniformly across its range. Much more commonly, though, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data, i.e. its lowest and highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen, which is why eq_hist is the default colormapping.
Because we are only plotting a single dimension (the population count per pixel), we can use the colors of the display to reach a higher effective dynamic range, mapping different ranges of data values into different color ranges. Here, we'll use a colormap whose colors are interpolated between the named colors shown:
print(cm(Hot,0.2))
bg(tf.interpolate(agg, cmap = cm(Hot,0.2)))
['darkred', 'red', 'orangered', 'darkorange', 'orange', 'gold', 'yellow', 'white']
Such a representation can provide additional detail in each range, while still accurately conveying the overall distribution.
Because we can control the colormap, we can use it to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus, if we want to highlight, for example, the top 1% of pixels (by population density), we can use a colormap divided into 100 ranges and simply change the top one to a different color:
import numpy as np
grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)]) + ["red"]
bg(tf.interpolate(agg, cmap = grays2))
The above plot now conveys nearly all the information available in the original linear plot - that only a few pixels have the very highest population densities - while also conveying the structure of the data at all population density ranges via histogram equalization.
Categorical Data (Race)
Since we've got the racial/ethnic category for every datapoint, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles as shown above. To do this, we first need to set up a color key for each category label:
color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }
We can now aggregate the counts per race into grids using ds.count_cat, instead of just a single grid with the total counts (which is what happens with the default aggregate reducer ds.count). We then generate an image by colorizing each pixel using the aggregate information from each category for that pixel's location:
def create_image(x_range, y_range, w=plot_width, h=plot_height, spread=0):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
    img = tf.colorize(agg, color_key, how='eq_hist')
    if spread: img = tf.spread(img, px=spread)
    return tf.set_background(img, "black")
The result shows that the USA is overwhelmingly white, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast and a few isolated areas of category "other" in the West (primarily Native American reservation areas).
create_image(*USA)
Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:
create_image(*LakeMichigan)
With sufficient zoom, it becomes clear that Chicago (like most large US cities) has both a wide diversity of racial groups, and profound geographic segregation:
create_image(*Chicago)
Eventually, we can zoom in far enough to see individual datapoints. Here we can see that the Chinatown region of Chicago has, as expected, very high numbers of Asian residents, and that other nearby regions (separated by features like roads and highways) have other races, varying in how uniformly segregated they are:
create_image(*Chinatown,spread=plot_width//400)
Note that we've used the tf.spread function here to enlarge each point to cover multiple pixels, so that each point is clearly visible.
Other Cities, for Comparison
Different cities have very different racial makeup, but they all appear highly segregated:
create_image(*NewYorkCity)
create_image(*LosAngeles)
create_image(*Houston)
create_image(*Atlanta)
create_image(*NewOrleans)
create_image(*Austin)
Analyzing Racial Data Through Visualization
The racial data and population densities are visible in the original Cooper Center map tiles, but because we aren't working with just static images here, we can look at any aspect of the data we like, with results coming back in a few seconds rather than days. For instance, if we switch back to the full USA and then select only the black population, we can see that blacks predominantly reside in urban areas, except in the South and along the East Coast:
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
bg(tf.interpolate(agg.sel(race='b'), cmap=cm(Greys9,0.25), how='eq_hist'))
(Compare to the all-race eq_hist plot at the start of this post.)
Or we can show only those pixels where there is at least one resident from each of the racial categories - white, black, Asian and Hispanic - which mainly highlights urban areas (compare to the first racial map shown for the USA above):
agg2 = agg.where((agg.sel(race=['w', 'b', 'a', 'h']) > 0).all(dim='race')).fillna(0)
bg(tf.colorize(agg2, color_key, how='eq_hist'))
In the above plot, the colors still show the racial makeup of each pixel, but the pixels have been filtered so that only those with at least one datapoint from every race are shown.
We can also look at all pixels where there are more black than white datapoints, which highlights predominantly black neighborhoods of large urban areas across most of the USA, but also some rural areas and small towns in the South:
bg(tf.colorize(agg.where(agg.sel(race='w') < agg.sel(race='b')).fillna(0), color_key, how='eq_hist'))
Here the colors still show the predominant race in each pixel, which is black for most of them; in Southern California, however, it looks like there are several large neighborhoods where blacks outnumber whites but both are outnumbered by Hispanics.
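To check that impression directly, one more aggregate query (a quick sketch in the same style as the queries above, not from the original post) keeps only the pixels where Hispanics outnumber blacks, who in turn outnumber whites:

sel = (agg.sel(race='h') > agg.sel(race='b')) & (agg.sel(race='b') > agg.sel(race='w'))
bg(tf.colorize(agg.where(sel).fillna(0), color_key, how='eq_hist'))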
Notice how each of these queries takes only a line or so of code, thanks to the xarray multidimensional array library that makes it simple to do operations on the aggregated data. Anything that can be derived from the aggregates is visible in milliseconds, not the days of computing time that would have been required using previous approaches. Even calculations that require reaggregating the data only take seconds to run, thanks to the optimized Numba and dask libraries used by datashader.
Using datashader, it is now practical to try out your own hypotheses and questions, whether for the USA or for your own region. You can try posing questions that are independent of the number of datapoints in each pixel, since that varies so much geographically, by normalizing the aggregated data in various ways. Now that the data has been aggregated but not yet rendered to the screen, there is an infinite range of queries you can pose!
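For example, as a hedged sketch of one such normalization (not something shown in the original post, and assuming that NaN values are treated as empty pixels, as in current datashader), you could plot what fraction of each pixel's population is Hispanic, independently of how many people live there:

total = agg.sum(dim='race')                                     # total population per pixel
hispanic_fraction = agg.sel(race='h') / total.where(total > 0)  # NaN where nobody lives at all
bg(tf.interpolate(hispanic_fraction, cmap=cm(Greys9, 0.25), how='linear'))

Because the values are already fractions between 0 and 1, a linear colormapping is a reasonable choice here.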
Interactive Bokeh Plots Overlaid with Map Data
The above plots all show static images on their own. Datashader can also be combined with plotting libraries, in order to add axes and legends, support zooming and panning (crucial for a large dataset like this one!), and/or combine datashader output with other data sources, such as map tiles. To start, we can define a Bokeh plot that shows satellite imagery from ArcGIS:
import bokeh.plotting as bp
from bokeh.models.tiles import WMTSTileSource
bp.output_notebook()
def base_plot(tools='pan,wheel_zoom,reset', webgl=False):
    p = bp.figure(tools=tools,
                  plot_width=int(900), plot_height=int(500),
                  x_range=x_range, y_range=y_range, outline_line_color=None,
                  min_border=0, min_border_left=0, min_border_right=0,
                  min_border_top=0, min_border_bottom=0, webgl=webgl)
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p
p = base_plot()
url="http://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.png"
tile_renderer = p.add_tile(WMTSTileSource(url=url))
tile_renderer.alpha=1.0
We can then add an interactive plot that uses a callback to a datashader pipeline. In this pipeline, we'll use the tf.dynspread function to automatically increase the plotted size of each datapoint, once you've zoomed in so far that datapoints no longer have nearby neighbors:
from datashader.bokeh_ext import InteractiveImage
def image_callback(x_range, y_range, w, h):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
    img = tf.colorize(agg, color_key, 'log')
    return tf.dynspread(img, threshold=0.75, max_px=8)
InteractiveImage(p, image_callback)
The above image will just be a static screenshot, but in a running Jupyter notebook you will be able to zoom and pan interactively, selecting any region of the map to study in more detail. Each time you zoom or pan, the entire datashader pipeline will be re-run, which takes a few seconds. At present, datashader does not use caching, tiling, partitioning or any of the other optimization techniques that would make plots even more responsive, so you can expect further improvements as the library matures. But the library is already fast enough to provide interactive plotting of all but the very largest datasets, allowing you to change any aspect of your plot on the fly as you interact with the data.
To learn more about datashader, check out our extensive set of tutorial notebooks, then install it using conda install -c bokeh datashader and start trying out the Jupyter notebooks from GitHub yourself! You can also watch my datashader talk from SciPy 2016 on YouTube.