I’ve been playing with an interesting dataset recently, and it got me thinking about challenges in effectively visualizing geospatial data. Specifically, how do you best display a continuous variable whose values span several orders of magnitude?
The dataset I’m working with comes from the Arctic Monitoring and Assessment Program. It’s an estimate of global anthropogenic emissions of mercury per 0.5 x 0.5 degree grid square. One important reason AMAP generated these data (how they did it is an interesting problem and the topic for another post) was to help atmospheric transport modelers, who need to know where on earth emissions are coming from. But the data also allow for a nice visualization of global sources of mercury pollution that goes beyond simple maps showing emissions by country.
I’ll present two options here, and I’d love feedback on what works best. I think there are also trade-offs depending on the purpose of the visualization (presentation vs. exploration) and the scale. Both are made on CartoDB. You can zoom, scroll, and click on a point to see the data. Check out the full-screen option, which I think is pretty cool.
The first is perhaps the flashier one. It uses yellow circles whose sizes are proportional to mercury emissions. There is a multiply effect, so areas of overlap appear orange-red.
The second is a more traditional choropleth approach, using an orange-red scale to represent the magnitude of emissions over each grid square.
Some technical notes:
The dataset contains around 45,000 grid squares (areas with no anthropogenic emissions, like oceans, are no data) with mercury emissions ranging from about 10^-5 to 12,000 kg. That’s around 9 orders of magnitude. Some quick exploration of the data revealed that almost all the mercury emissions come from less than 10 percent of the model area.
Cumulative sum of mercury emissions (normalized to 1) as a function of magnitude of emissions in each cell. Almost all emissions are from cells with greater than 10 kg emissions. Note log scale on x axis.
Most areas have very small emissions, but a few have very high emissions. The data look like this because the emissions estimates combine point sources, “area” sources like artisanal mining, and population as a proxy for some general emissions. In any case, to facilitate visualization I removed the grid squares with very low emissions. The remaining ~5000 squares account for ~93% of total emissions. These data still have a Pareto-like distribution spanning almost three orders of magnitude, but they are much easier to display on a map.
Cumulative sum of mercury emissions (normalized to 1) as a function of magnitude of emissions in each cell. Cells with < 50 kg Hg removed. Note log scale on x axis.
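For anyone who wants to try the same kind of check, here is a minimal sketch of the cumulative-share calculation and the threshold filter. It is not the code I used for the plots; the function names and the `toy` values are made up for illustration and are not the AMAP data.

```python
def cumulative_share(emissions):
    """Sort cell emissions ascending and return (values, cumulative share of total)."""
    ordered = sorted(emissions)
    total = sum(ordered)
    running = 0.0
    shares = []
    for value in ordered:
        running += value
        shares.append(running / total)
    return ordered, shares


def share_above(emissions, threshold_kg):
    """Return (number of cells kept, their share of total emissions)."""
    total = sum(emissions)
    kept = [v for v in emissions if v >= threshold_kg]
    return len(kept), sum(kept) / total


# Illustrative, made-up values (kg per cell) with a heavy right tail.
toy = [0.001] * 50 + [0.5] * 30 + [20.0] * 15 + [400.0] * 4 + [12000.0]

ordered, shares = cumulative_share(toy)
n_kept, kept_share = share_above(toy, threshold_kg=50.0)
print(f"{n_kept} of {len(toy)} cells hold {kept_share:.1%} of emissions")
```

Plotting `shares` against `ordered` with a log-scaled x axis gives the kind of curve shown in the figures above.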
Note that the maps display mercury emissions per square km for each cell, not total mercury emissions. That is because the areas of the 0.5 x 0.5 degree grid cells vary with latitude. Cells closest to the equator are largest, and cells closer to the poles are smaller. So normalizing by cell area makes for a more accurate display.
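As a rough sketch of that normalization, the area of a latitude/longitude cell on a sphere is R² · Δλ · (sin φ₂ − sin φ₁), where φ₁ and φ₂ are the cell's southern and northern latitudes and Δλ is its width in longitude, both in radians. The code below assumes a perfectly spherical Earth with a mean radius of 6371 km, which is my simplification and not necessarily how the cell areas were computed for the maps; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def cell_area_km2(lat_south_deg, dlat_deg=0.5, dlon_deg=0.5):
    """Area of a lat/lon grid cell whose southern edge is at lat_south_deg."""
    lat1 = math.radians(lat_south_deg)
    lat2 = math.radians(lat_south_deg + dlat_deg)
    dlon = math.radians(dlon_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))


def emission_density(emission_kg, lat_south_deg):
    """Emissions per square km for a 0.5 x 0.5 degree cell."""
    return emission_kg / cell_area_km2(lat_south_deg)


equator_area = cell_area_km2(0.0)   # cell just north of the equator
polar_area = cell_area_km2(89.5)    # cell touching the north pole
print(f"equator: {equator_area:.0f} km^2, pole: {polar_area:.1f} km^2")
```

Under these assumptions a cell at the equator covers roughly 3,090 square km while one next to the pole covers only about 13, so the same emission total means a very different density depending on where the cell sits.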
An important factor in the visual appearance of continuous data like these is where to place the breaks that separate data points into different colors or sizes. This is especially difficult with Pareto or power-law distributions. CartoDB has several built-in options for binning data. After playing around with them I chose head/tail breaks, which seems to work well on this type of distribution. CartoDB also allows you to easily change the breaks manually with CartoCSS. It was a challenge to find a binning and color/size scheme that portrayed the data accurately while also maintaining a clear and striking appearance.
Color on the choropleth comes from ColorBrewer.
For more information on the development of the emissions model, see this paper.
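To give a feel for how head/tail breaks behaves, here is a minimal sketch of the idea as it is commonly described: split the data at its mean, keep the values above the mean (the "head"), and repeat on the head for as long as the head stays a small minority. The 40 percent stopping rule and the function name are my assumptions; CartoDB's built-in implementation may differ in its details.

```python
def head_tail_breaks(values, max_head_share=0.4):
    """Return class breaks (successive means) for heavy-tailed data."""
    breaks = []
    current = list(values)
    while len(current) > 1:
        mean = sum(current) / len(current)
        head = [v for v in current if v > mean]
        # Stop when nothing lies above the mean, or when the head is no
        # longer a small minority (assumed stopping rule: 40 percent).
        if not head or len(head) / len(current) > max_head_share:
            break
        breaks.append(mean)
        current = head
    return breaks


# Illustrative, made-up values with a long right tail.
toy_values = [1] * 8 + [10] * 4 + [100] * 2 + [1000]
toy_breaks = head_tail_breaks(toy_values)
print(toy_breaks)
```

Because each break is the mean of an ever smaller head, the classes get wider toward the high end, which is why the scheme suits data where a few cells dominate.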