What Do 220,000,000,000 Data Points Look Like?

Update (Nov. 3, 2014): The map now contains 160 million activities and 375,000,000,000 points.

Recently we released a global heatmap of 77,688,848 rides and 19,660,163 runs from the Strava dataset. Creating a visualization of that size was more of an engineering challenge than anything else, but the map has still raised many questions about how and where people run and ride. Some of these can only be answered using the raw data, which is what our Metro product addresses.

The code to generate the map is the grandchild of a heatmap I built almost two years ago. Last year the code was cleaned up and became the Personal Heatmaps feature on Strava. This time it has been refactored to handle the large dataset by reading from memory-mapped files stored on disk.

To start out, the world is broken up into regions represented by zoom level 8 tiles. Each of these regions has a file containing a sorted set of key/value pairs, where the key is the pixel zoom and quadkey and the value is the count of GPS points at that pixel. The quadkey ordering ensures all the data for a tile is stored sequentially in the file. Pixels with no GPS points are excluded, and only every 3rd zoom level is stored in the file. The values for the missing zooms can be reconstructed by adding up the 4 or 16 values from the deeper zoom levels that are stored. Skipping zoom levels saves a bit of disk space, but it also means that reading one zoom level preloads into memory the region of the file needed for deeper zooms.
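To make the scheme concrete, here is a minimal sketch of the quadkey idea. All names are mine, not Strava's, and the flat dict stands in for the on-disk file: interleaving the x/y bits of a pixel coordinate produces a key whose sort order keeps all pixels of a tile contiguous, and a count at a missing zoom level is recovered by summing the stored children.

```python
def quadkey(x, y, zoom):
    """Interleave the bits of x and y into a single integer key."""
    key = 0
    for i in range(zoom):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def children(x, y):
    """The four child pixels of (x, y) one zoom level deeper."""
    return [(2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]

def count_at(counts, x, y, zoom, stored_zooms):
    """Reconstruct a count at a missing zoom by summing stored children.

    counts maps (zoom, quadkey) -> GPS point count; a gap of one level
    sums 4 children, a gap of two levels sums 16 grandchildren.
    """
    if zoom in stored_zooms:
        return counts.get((zoom, quadkey(x, y, zoom)), 0)
    return sum(count_at(counts, cx, cy, zoom + 1, stored_zooms)
               for cx, cy in children(x, y))
```

Because child keys share their parent's bit prefix, sorting by quadkey is what makes one sequential read per tile possible in the real file.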

This results in about 11,000 files (6,300 for rides, 4,700 for runs) that are all opened as memory-mapped files when the server starts. When a request for a tile comes in, the server finds the corresponding file handle and does a binary search on the keys. Since the info for the tile is stored sequentially in the file, it can do a fast read and build a 2D array of the number of GPS points in each pixel of the tile. Those values then need to be normalized to a value between 0 and 1 and colorized.
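The lookup might look something like the following sketch (again, my own simplification, with a sorted in-memory list of (key, count) records standing in for the mmapped file). Since every pixel of a tile shares the tile's quadkey prefix, the tile occupies one contiguous key range: binary-search to its start, then read sequentially.

```python
import bisect

def deinterleave(key):
    """Split an interleaved quadkey back into (x, y) coordinates."""
    x = y = i = 0
    while key:
        x |= (key & 1) << i
        y |= ((key >> 1) & 1) << i
        key >>= 2
        i += 1
    return x, y

def tile_counts(records, tile_quadkey, tile_zoom, pixel_zoom):
    """Binary-search to the tile's key range, then read it sequentially.

    records is a list of (quadkey, count) pairs at pixel_zoom, sorted by
    key; a real server would index the mmapped bytes directly instead.
    """
    shift = 2 * (pixel_zoom - tile_zoom)
    lo_key = tile_quadkey << shift        # first possible key in the tile
    hi_key = (tile_quadkey + 1) << shift  # first key past the tile
    keys = [k for k, _ in records]
    lo = bisect.bisect_left(keys, lo_key)
    hi = bisect.bisect_left(keys, hi_key)

    size = 1 << (pixel_zoom - tile_zoom)  # pixels per tile edge
    grid = [[0] * size for _ in range(size)]
    for key, count in records[lo:hi]:
        x, y = deinterleave(key - lo_key)  # local pixel coordinates
        grid[y][x] = count
    return grid
```

Missing keys simply leave zeros in the grid, matching the fact that empty pixels are never written to the file.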

The normalization is very local, taking into account the 8 neighboring tiles. For each tile, the 95th-percentile non-zero GPS count is computed. These values are averaged into 4 corner “heatmax” values for the current tile. The count for every pixel is divided by the bilinear interpolation of those corner values, capped at 1. This [0,1] value is then used to color each pixel using a gradient function. You can see the local effect on roads that branch away from popular routes.
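The normalize-and-color step can be sketched like this. It is my own simplification: the corner heatmax values are taken as given (assumed positive), the tile is assumed larger than one pixel, and the two-stop gradient is a stand-in for whatever gradient function the real map uses.

```python
def bilinear(tl, tr, bl, br, fx, fy):
    """Blend four corner values at fractional position (fx, fy) in [0, 1]."""
    top = tl + (tr - tl) * fx
    bottom = bl + (br - bl) * fx
    return top + (bottom - top) * fy

def normalize(grid, corners):
    """Divide each pixel count by the interpolated heatmax, capped at 1.

    corners = (top_left, top_right, bottom_left, bottom_right) heatmax
    values; assumes they are positive and the grid is at least 2x2.
    """
    tl, tr, bl, br = corners
    n = len(grid)
    return [[min(1.0, c / bilinear(tl, tr, bl, br, x / (n - 1), y / (n - 1)))
             for x, c in enumerate(row)]
            for y, row in enumerate(grid)]

def gradient(v, lo=(0, 0, 128), hi=(255, 255, 0)):
    """Map a [0, 1] value to an RGB color between two hypothetical stops."""
    return tuple(round(a + (b - a) * v) for a, b in zip(lo, hi))
```

Because each tile's divisor varies smoothly between its own corners, and neighboring tiles share corner values, the brightness stays continuous across tile boundaries.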

This is all done on the fly (minus memcache and a CDN) and takes about 200 milliseconds per tile. Why serve on the fly? Well, the ride map alone has 106,991,000 unique tiles; multiply by three colors, add the run map and both versions, and you’ve got a lot of S3 objects. Serving on the fly avoids that step and lets me update parts of the map and tweak or add colors as needed.

Everything is hosted on a single c1.xlarge EC2 instance, which maxes out at about 150 tiles a second. Because the files are memory mapped, the OS does all the caching. Still, the process is IO bound, since users access the map all over the world and no small working set stays hot in the page cache. Using an SSD-backed instance would solve the IO issues, but they’re way more expensive. Given that the CDN serves most of the tiles, I figured small slowdowns from IO bottlenecks are okay. They only happen when we have huge traffic, like when being featured on a Belgian national news site.

There have been a lot of suggestions for different types of maps to create, but I’m not really sure what’s next for this map, or the code. I think incorporating direction of travel could look really cool, but right now I’m more interested in using the map data itself. If you think about it, the heatmap is just a density distribution of GPS points, so a "noisy" GPS stream could be corrected using these probabilities. The Slide Tool represents some of my initial thinking in this direction.