Gaining elevation

A/B Street is both a game and a very powerful tool that allows you to download a detailed street map of an area and test out changes to the configuration of roads:

It uses OpenStreetMap as a data source. OSM has impressively rich data on things like the number of lanes a street has, where bike lanes and turn restrictions go, and so on, but one thing it lacks is detailed elevation data. The developers noticed that it was suggesting absurd routes for cyclists in Seattle, including routes that no-one who’s actually tried cycling would ever repeat because of the hills involved. So they brought me in to figure out how to add elevation data.

I wrote a simple Python tool that reads in paths as plain lists of coordinates, and writes out statistics for each one: the start and elevations, and cumulative elevation gain and loss along each path. A/B Street incorporates this into a data load by exporting each road segment as a series of points every metre along the way, getting as fine-grained a picture of a route’s hilliness as the source data allows.

By default, elevation_lookups uses data from the Shuttle Radar Topography Mission, because that source provides fairly high quality data for most of the world’s land. I learned about SRTM while making this tool, and I’m still sort of star-struck that we have a global dataset like this, freely available to anyone who has a use for it. But it is also relatively low resolution, so I built in the ability to override the data source with higher resolution sources where known. It comes preconfigured with examples of both raster (LIDAR source) and vector (contour source) datasets for the Seattle area.

This is an open-source project. I hope it can be useful for other applications, and I have more ideas for it then I have time to implement. I’d love contributions from anyone this appeals to, and have some suggested starting points (not all requiring programming skills!).

Along the way, I ran into some unexpected challenges with parallelising lookups. I’ll briefly describe them in case anyone who gets stuck with the same things happens across this post. The solution did not seem obvious to me, but is quite simple to implement once you know what to do.

I used Python’s native multiprocessing library to parallelise. It’s generally fairly straightforward to use, but both Geopandas (used for vector lookups) and Rasterio (used for raster lookups) have quirks that made my first attempt fail. The underlying problem is that I thought of the loaded data in both modules as read-only. If it were so, it would make sense to load the data once, and then share that between each parallel thread. But this turns out to not quite be true in either module.

For Geopandas, the issue is not the loaded data itself, but the spatial index. Lookups are slow until the index is built, but it turns out that each thread has to build its own copy of the index. The documentation does mention that “the spatial index may not be fully initialized until the first use”, but it took me a while to appreciate the implication there: two threads could end up both trying to initialize the same index at the same time. Fortunately, building an index is much faster than loading the data in the first place, so simply moving that into the parallel function hasn’t caused any problems.

For Rasterio, the issue is that the dataset itself can’t be passed to child threads. I’m not entirely sure why, but I suppose it must be because we could in theory change that dataset. Conveniently, actually loading a raster file is very quick even for relatively large ones like the detailed LIDAR data for Seattle + adjacent areas. So much like the vector equivalent, once I finally understood what the problem was, moving the data load into the parallel function was an easy solution.

I’ve benefited a lot from breadcrumbs that other people have left around after debugging technical issues; I hope this helps someone else in turn.