Serving Maps for the World's Athletes

Introduction

The Strava mobile apps and website render millions of map images every day to provide geographic context to activities, routes, segments, and more. We use and love Mapbox for our interactive maps on the web, but because of our high volume of requests, we came to a point where it was more economical to build our own static map service. Several considerations spurred this decision. We wanted to leverage the power of open source mapping technology and have access to a service that catered to our unique use cases. Additionally, there was internal interest in adding Strava-specific entities to our maps, such as segments and Topstops. Lastly, we valued ease of integration with our testing, monitoring, and deployment infrastructure. Based on these factors, we decided to start work on our own map service, which we call Meridian.

Meridian is written in the Go programming language, originally developed at Google. Go was a great fit for the project because of how easy it is to use with C and C++; several popular low-level mapping libraries are written in these languages (specifically Mapnik). Being able to write type-safe code in a modern language, while still leveraging open source libraries, allowed us to dig into the problems of deploying a map service at scale, rather than replicating drawing libraries.

The service was the first production system serving critical user-facing traffic on our Mesos cluster. A previous blog post describes our Mesos architecture. At a high level, Apache Mesos is “an open source cluster manager” that allows resources like memory and CPU to be allocated to instances of your application, called tasks. We use Marathon to manage the deployment and configuration of long-running tasks. One exciting part of the project was keeping Meridian scalable and fault tolerant as nodes in our cluster were added to and removed from the resource pool.

If you are interested in working with our team on the infrastructure described in this post, we're hiring! Check out our open Platform Engineer role.

The life of a Meridian request

Meridian exposes HTTP endpoints to mobile and web clients that return rendered map images. When a request is made, it first reaches an AWS CloudFront distribution that caches individual requests. If a request misses the CloudFront cache, it is forwarded, via Route 53, to an AWS Elastic Load Balancer (ELB) that points at our set of Mesos agents. A subset of these agents run a HAProxy wrapper named Bamboo, which routes the request to a Meridian task on Mesos. That task then handles the rendering of the map.

Before a map is rendered, information about what to show on it must be retrieved from a datasource. Thankfully, a dedicated worldwide community curates and freely distributes this data as part of OpenStreetMap (OSM). To facilitate querying OSM, the first phase of Meridian used a PostgreSQL-backed datasource. This version of the map service worked well, although it had several downsides. To make a Postgres server performant at scale, data must be cached in memory to prevent slow reads from disk. For a map service that responds to queries for information anywhere in the world, we needed large EC2 instances (r3.8xlarge) to hold that cache. Additionally, optimizing queries on the data is a time-intensive process.

Luckily, some inventive thinking by members of the OSM community, namely Michal Migurski, led to the idea of vector tiles. Vector tiles are a denormalized and encoded version of OSM information that eliminates the permanent need for a set of large databases by storing query results in the MBTiles format. Because a vector datasource can be a single file with all of the OSM information needed to render maps, this approach also makes it trivial to horizontally scale capacity with growing request volume. We implemented a lightweight vector datasource in Go that can serve these tiles in milliseconds and at thousands of queries per second.
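One detail worth noting if you implement an MBTiles reader yourself: the MBTiles spec stores tile rows in the TMS scheme, which counts rows from the bottom of the tile grid, so a slippy-map (XYZ) row has to be flipped before lookup. A minimal sketch of that flip (illustrative, not Meridian's actual code):

```go
package main

import "fmt"

// tmsRow converts a slippy-map (XYZ) tile row to the TMS row used by the
// MBTiles spec, which numbers rows from the bottom of the grid rather than
// the top: row' = 2^z - 1 - row.
func tmsRow(y, z int) int {
	return (1 << uint(z)) - 1 - y
}

func main() {
	// The row to query in an MBTiles file's tiles table for XYZ row 791
	// at zoom 11.
	fmt.Println(tmsRow(791, 11)) // prints 1256
}
```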



An example request for a static map includes the following parameters: 1) the map style to use; 2) the center of the map to be rendered, along with the zoom level; and 3) image options (dimensions, format, and, if necessary, retina).

Map requests naturally describe a certain view of the world, complete with image options (width, height, and format), as well as information about the geographic location to be displayed. When a map request reaches Meridian, it is decomposed into several tiles that can be rendered in parallel. Each tile describes a geographic bounding box using a set of X and Y coordinates at a specific zoom level (Z), according to the spherical Mercator projection, and can be cached based on these parameters.
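The mapping from a geographic point to a tile coordinate is the standard slippy-map formula used across the OSM ecosystem. A sketch of that conversion in Go (illustrative, not Meridian's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// tileXY converts a WGS84 latitude/longitude to slippy-map tile coordinates
// at zoom level z using the spherical Mercator projection. At zoom z the
// world is a 2^z by 2^z grid of tiles, with (0, 0) at the top-left.
func tileXY(lat, lon float64, z int) (x, y int) {
	n := math.Exp2(float64(z))
	x = int((lon + 180.0) / 360.0 * n)
	latRad := lat * math.Pi / 180.0
	y = int((1.0 - math.Log(math.Tan(latRad)+1.0/math.Cos(latRad))/math.Pi) / 2.0 * n)
	return x, y
}

func main() {
	// Roughly downtown San Francisco at zoom 11.
	x, y := tileXY(37.77, -122.40, 11)
	fmt.Println(x, y)
}
```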

As mentioned earlier, the first step in the map rendering process is to fetch the underlying data. Once all of the vector data for the tiles has been retrieved, each tile is placed on a queue. For each map style, there is a pool of workers waiting to process map requests. These workers wrap the Mapnik bindings and handle other service-level functions, like sending statistics to StatsD and writing logs. Each request to render a tile is taken off the queue by one of these workers and rendered asynchronously. After all tiles are rendered successfully, they are stitched together and written to the response.

Throughout this process, we use the handy Go net/context package to keep track of whether any step needs to be canceled and cleaned up. For example, if the request to get the vector data for one tile fails, it is impossible to render a complete map, so all other in-flight requests for tiles should also be canceled.

To speed up future map renders, the individually rendered tiles are written to Memcached. Reading these from Memcached provides a significant performance boost: fetching data and rendering a tile normally takes around 80 milliseconds, whereas reading a tile from Memcached takes under 10 milliseconds.

Optimizing Meridian

During the second phase of Meridian, we spent time achieving performance parity with the first release. The 90th percentile latency of unoptimized vector map requests was 800ms, whereas the same statistic for database-backed maps was 600ms. The first step in the Go optimization process was running a 1% experiment and observing production traffic using pprof.

It was expected that garbage collection (GC) would be a major source of latency when generating vector maps - several rendered tiles are stitched together in memory for every request. Initial profiles indeed showed that a large share of time was spent in runtime.mallocgc. Conveniently, Go 1.7 had just been released, complete with improvements to the garbage collector. Simply updating Meridian to the new version of Go shaved 150ms off response times.

Image decoding and stitching were the next biggest users of CPU time. Looking at the standard library code for getting and setting pixels, it was clear that the generic nature of these functions was resulting in a significant number of interface conversions. Additionally, pixel values were not being set directly in the result image when decoding and stitching - an intermediate struct containing the pixel values was allocated for every pixel. To remove the interface conversions and allocations, the code was refactored to access pixels directly: given a color model, it is possible to get the memory locations of a color's components from the image struct. Once the color's components are known, they can be assigned directly to their location in the final map image.

After these optimizations, the 90th percentile of vector map requests was completed in 450ms. Once the tile cache was filled, the 90th percentile of total request time was 125ms.

Now that the second phase of the Meridian service is complete, Strava maps are faster, cheaper, and more reliable. Making a scalable map service using cutting-edge technology has proven to be an interesting and exciting challenge. To learn more about some of the other technical problems we are tackling, follow @StravaEng on Twitter and visit our careers page!