On a recent trip I witnessed a huge flock (?) of fruit bats take flight over Dar Es Salaam. It was pretty amazing, so I made a video:
Bonus to whoever recognizes the music
The other day I learned that wordpress.com now supports embeds of CartoDB maps. This is pretty cool, and it inspired me to finish up a little project that I’ve been tinkering with for a while, in order to try out the new feature.
By the way, CartoDB is a web mapping tool that I think is one of the best interfaces available for creating interactive maps. You can make great looking maps quickly and easily, but there is also enough functionality to do more advanced stuff, like mess around with the CSS code.
This map shows estimates of how much mercury is on site at chlor-alkali plants in each country. It distinguishes between countries that ban the export of mercury and those that don’t. This is important because chlor-alkali plants often contain hundreds of tons of mercury. When the facilities close, the mercury can enter the commodity market, where it can be used in artisanal gold mining.
The size of the bubbles reflects how many tons of mercury are estimated to be in chlor-alkali facilities in each country. Scroll, zoom, hover, or click for more details. The data are from the UNEP Global Mercury Partnership chlor-alkali inventory.
Technical CartoDB note: In order to distinguish (by bubble color) countries with and without export bans, I made two layers from the data table. However, because each set had a different range of values, the scale for the bubble size was different for each color. To fix this I manually changed the bubble size distribution cutoffs in the CSS tab. Is there an easier solution that I am missing?
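For anyone curious what that manual fix looks like, here is a rough sketch of the kind of CartoCSS I mean. The layer names, the column name (hg_tons), and the cutoff values are all hypothetical; the point is just that both layers get identical bucket boundaries so the bubble sizes are comparable across colors:

```css
/* Illustrative only: force the same size buckets on both layers
   so bubbles are comparable between export-ban and no-ban countries.
   Column name and cutoffs are made up for this example. */
#export_ban_layer {
  marker-width: 8;
  [hg_tons > 50]  { marker-width: 14; }
  [hg_tons > 200] { marker-width: 22; }
  [hg_tons > 500] { marker-width: 32; }
}
#no_ban_layer {
  marker-width: 8;
  [hg_tons > 50]  { marker-width: 14; }  /* same cutoffs as above */
  [hg_tons > 200] { marker-width: 22; }
  [hg_tons > 500] { marker-width: 32; }
}
```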
Oh yeah, this is how you do the embed.
It’s been quite some time since my last post. I have been busy with a young child, new job, and an international move. But I’m hoping to get back into posting and making visualizations on a regular basis.
The reason for this post is that I came across an interesting resource called the International Environmental Agreements Database Project, hosted at the University of Oregon. The database contains information on about 1100 multilateral environmental agreements (MEAs) dating back to 1857. The data include the title, type (an original agreement or a protocol or amendment to an existing agreement), dates of signature and entry into force, and the parties. For some agreements there are even data on performance, as well as coding that allows for comparison of the actual legal components.
As an initial exploration, I simply looked at how many agreements were concluded over time. The plot below shows the results for the last 100 years. Click for the interactive and shareable plot.ly version.
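The counting step itself is simple. I did the real plot with the IEA database export, but the idea can be sketched like this (a toy stand-in table with hypothetical column names, not the actual IEA schema):

```python
import pandas as pd

# Toy stand-in for the IEA database export; column names are hypothetical.
meas = pd.DataFrame({
    "title": ["Agreement A", "Agreement B", "Agreement C", "Agreement D"],
    "signature_year": [1991, 1991, 1992, 2005],
})

# Count how many agreements were concluded in each year.
per_year = meas.groupby("signature_year").size()
print(per_year.loc[1991])  # two agreements signed in 1991 in this toy data
```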
There is a pretty interesting pattern. From the early 20th century until the 1950s there are not that many MEAs. The pace then picks up mid-century, peaks in the early 1990s, and declines considerably after that.
What’s going on? Have all the easy agreements been reached and there is nothing more for countries to negotiate about? Maybe that’s part of it, but I think it has something to do with an event that coincided with the peak in MEAs – The 1992 Earth Summit and the resulting Rio Declaration on Environment and Development.
The Earth Summit was a huge event in the global environmental community, and occurred at a high point of optimism about multilateralism. There was a flurry of MEA activity around this time. But there was also a building movement to ensure that international environmental diplomacy was benefiting the poor, and in particular, developing countries.
The Rio Declaration enshrined the principle of common but differentiated responsibilities. This is the idea that while all nations have a responsibility to protect the global environment, rich nations should shoulder a greater share of the burden.
It is a noble sentiment, and one that in my view makes a lot of sense. But it had the effect of making it more difficult to reach agreements in international environmental negotiations. Developing countries started going into the negotiations expecting more support, in the form of funding, reduced obligations, or technology transfer, from the developed world. Common but differentiated responsibilities is at the root of a major sticking point in global climate talks. Should China, India, and other rapidly developing nations have the same stringent obligations as more mature economies?
I certainly don’t think this is the only cause of the decline in new MEAs in the last 20 years. And neither can I claim to be the first to think about the Rio Declaration’s impact on MEAs. There’s an entire literature on it. For example, Richard Benedick discussed this theme at length in reference to the Montreal Protocol and its aftermath in his book Ozone Diplomacy.
As a final disclaimer, for this analysis it would be best to filter the IEA database to exclude those MEAs that only have a few parties. That way you could really focus on the rate of global or large regional MEAs over time. Perhaps I’ll do that next.
But in any case, it’s an interesting dataset and an interesting pattern. And a good excuse to step back and think about the big picture in global environmental politics.
Recently there has been a bit of buzz about a study claiming that female-named hurricanes cause more fatalities, on average, than male-named ones. The authors suggested that the discrepancy is attributable to gender bias: female-named hurricanes do not seem as threatening to people, so presumably they take fewer precautions. From the start this seemed pretty far-fetched, and in fact a number of problems have been found with the study.
But it got me thinking about hurricane names. A more likely effect of a hurricane’s name would be to discourage parents from giving their children that name, if the hurricane is associated with death and destruction. Fortunately, there is readily available data with which to test this hypothesis. For hurricanes, I used the same data as the hurricane gender study described above (they may have had some problems with their methodology, but at least they released their data). It contains data on 92 Atlantic hurricanes that made landfall in the U.S. since 1950*. For baby names I turned to the Social Security Administration. There is a great R package called babynames that makes the yearly SSA data available in a readily accessible format for use in R. As an aside, the SSA baby names data is the source of all sorts of interesting visualizations and analyses, such as the baby name voyager and this article from fivethirtyeight.com on predicting a person’s age based on their name.
The tricky part of this analysis is deciding how to define a decrease in name usage after a hurricane. The simplest way would be to look at how many times a name was given in the year of a hurricane versus how many times that name was given the following year. For example, compare how many baby Katrinas there were in 2005 versus 2006. However, this method does not take into account that most names are either decreasing or increasing in popularity as part of a longer-term trend. So you have to look at how the popularity of a name was changing before the hurricane as well. To see why, look at this plot of the number of babies named Katrina over time.
Katrina peaked in popularity in 1980 and has been declining ever since. But from 2004 to 2005 the number of Katrinas actually increased about 13%. From 2005 to 2006, however, it decreased dramatically, by 26%. It’s a pretty good bet that this rapid decrease was due to the hurricane.
To quantify the change in a name’s usage after a hurricane, I made the assumption that the best predictor of how a name’s popularity will change in a given year is how it changed the year before. To calculate the post-hurricane change in name usage I subtracted the percent change in name usage in the year before the hurricane from the percent change after the hurricane. In the Katrina example the post-hurricane change would be (-26%) - (13%) = -39%. This post-hurricane percent change value is what I use in the analysis below.
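The calculation is only a couple of lines. The original analysis was done in R, but here is a quick Python sketch; the counts below are illustrative values shaped like the Katrina example (up about 13%, then down about 26%), not the actual SSA figures:

```python
def post_hurricane_change(counts):
    """counts: name counts for (year before, hurricane year, year after).
    Returns the post-hurricane change: the percent change after the
    hurricane minus the percent change before it."""
    before, during, after = counts
    pct_before = (during - before) / before * 100
    pct_after = (after - during) / during * 100
    return pct_after - pct_before

# Illustrative counts mimicking the Katrina pattern: +13% into the
# hurricane year, -26% out of it, giving roughly (-26%) - (13%) = -39%.
print(round(post_hurricane_change((1000, 1130, 836)), 1))
```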
Before we get to the results, let’s take a look at the fascinating case of Carla:
Hurricane Carla was an extremely intense storm that hit Texas in 1961, killing 43. The name “Carla” had been surging in popularity, but after 1961 it started a decline in popularity from which it never recovered. It seems a pretty good bet that the hurricane had a major role in Carla’s decline. Interestingly, the first live television broadcast of a hurricane was of Carla, with a young Dan Rather himself reporting from Galveston. Could the shock of the American TV-viewing public seeing footage of the storm in their living rooms have contributed to the demise of Carla as a name?
Back to the analysis. Indeed, the hurricane baby name effect seems real. After running the numbers, I found that names associated with a landfalling hurricane were about 15 percent less common in the year after the hurricane. Out of the 93 hurricanes in the data set, 65 were associated with a decrease in the popularity of their names, and only 21 were followed by increasing name usage. (Seven hurricane names were not found in the SSA data in their landfall year).
So far this is pretty intuitive. Of course people are less likely to name their dear infant after a natural disaster. Based on this reasoning, you’d expect that the more fatalities caused by a hurricane, the greater the baby name effect. Let’s test that.
The effect is quite small. When we take Katrina out (a massive outlier in terms of fatalities), it’s smaller still:
So the correlation between change in baby name usage and hurricane fatalities is quite weak. Finally, I had to see if the gender of the hurricane name affected this relationship. Were deadly female-named hurricanes more or less likely than male-named ones to affect baby name popularity? Maybe I’d even find that male baby name usage goes up with hurricane fatalities because parents associate the names with strength? I can see the Slate headline now! Alas, there is no significant difference:
By the way, there are more female names because from 1950 to 1979 all Atlantic hurricanes were given female names.
There’s an almost endless amount of interesting things to glean from the baby names data. My ultimate dream is an algorithm to determine the perfect name for your baby based on a number of criteria chosen by the expectant parents. It would really take the stress out of the naming process. One of the criteria would certainly be that the name is not on the World Meteorological Organization’s list of tropical storm names!
Data and code available on github.
* The authors of the hurricane fatalities study did not include Katrina in their data set. I added it in with data from Wikipedia.
This post is intended to illustrate the cool things you can do with plot.ly’s API for R. Plot.ly is a web-based tool for making interactive graphs. It uses the D3.js visualization library, and lets you create very attractive plots that can be easily shared or embedded in a web page. With the R API you can manipulate data in R and then send it over to plot.ly to create an interactive graph. There’s also a function that lets you create a plot in R using ggplot2, and then shoot the result directly over to plot.ly (summarized nicely here).
I have a great little free app on my iPhone called Pedometer++ that keeps track of how many steps I take each day. I exported the data, plotted up a time series with ggplot2, and used the API to make the graph in plot.ly. It worked quite nicely. The only hiccup was that plot.ly did not recognize the local regression curve, so I had to add that separately.
You can see from the plot that I’m not consistently meeting my 10,000 step goal. In fact, I averaged 7,002 steps over this period. That still comes out to a total of 1,470,463 steps. From October through February my step count was trending slightly downward, but since then it’s picked up. Maybe that had something to do with the cold winter. Hopefully as the weather (and my motivation) improves, I’ll hit my goal.
And here’s a bonus box plot showing steps taken by day of the week (also using the R API):
If there are any pedometer users out there who are interested, let me know and I can post the code.
One of the first posts on this blog was about using Tableau to visualize data on global emissions of mercury. I’ve gotten suggestions from a few people and given the graphic a bit of a face lift. Click on the image to see the interactive viz:
I also used the same dataset to make some static graphics using ggplot2 and the ggthemes package. I’d love any input on how to improve the look and feel of both these and the Tableau viz. I’ve always found picking good colors very challenging, so thoughts on the palettes are especially welcome.
It’s no secret that interest in data visualization has been growing in recent years. Don’t believe me? Let me show you a graph:
Sure, humans have been presenting information graphically for hundreds, if not thousands, of years, with increasing sophistication. We still study the work of people like John Snow, William Playfair and Florence Nightingale for their innovations in graphical presentation. Today, however, with the increasing availability of large, rich, and easily accessible datasets, and the proliferation of software tools for creating graphics, we are seeing an explosion in the number of data visualizations. This is a great development. I obviously think so, since I jumped on the bandwagon.
The recent ubiquity of the data visualization brings with it a new subgenre, the meta-visualization. Visualizations about visualizations. Some of these describe what data visualization is, or should be. Some present information about common types or characteristics of visualizations. Still others poke fun at cliches, poor practices, and the very pervasiveness of visualization as a medium for communicating information. Let’s take a look at some examples.
First, here’s the Infographic of Infographics:
Then there’s this periodic table of types of visualizations:
Continuing with the periodic table theme, here is a periodic table of periodic tables. This is very meta.
But does this periodic table of periodic tables contain itself? (It does.) And, more importantly, should a periodic table of all periodic tables that do not contain themselves contain itself?
Some graphics attempt to illustrate what characteristics a good data visualization should have. Like this 4-set Venn diagram, for example:
Or like this Venn-like diagram, which I’m not quite sure how to read:
Now if you really want to turn it up to 11, or more accurately, up to seven, you could employ this epic 7-set Venn diagram:
Another category of meta-visualizations contains humorous or satirical ones. These are not literally visualizations of other visualizations, but they are about visualization as a medium. These are funny, self-aware takes on the cliches and excesses in the field. Pie charts that skewer the graphical form of the pie chart itself are practically a sub-subgenre in themselves:
Really, nobody seems to have any love for the pie chart.
And on the topic of maps, here’s a gem from xkcd:
Finally, we venture into silliness with one of my all-time favorites, All You Need to Know about Lady Gaga’s Hit “Bad Romance” in One Chart:
To sum up, here is a word cloud visualization of this post:
A while ago I wrote a post suggesting that Ukraine’s propensity for revolution might have something to do with its high level of government corruption in combination with its relatively well-developed civil society. As evidence for this, I showed that Ukraine (together with Kyrgyzstan and Moldova, two countries that have also recently experienced political unrest) was an outlier among post-Soviet states with respect to the relationship between corruption perceptions and authoritarianism. This finding was interesting, but by no means robust enough to warrant broad generalizations about corruption and democracy and revolution.
Since then, a few others chimed in with some ideas. Ben Jones suggested looking at corruption and authoritarianism in countries that experienced revolutions over time. Cavendish McCay looked at corruption and authoritarianism data from the same sources but over the entire globe, and produced a very cool visualization. He also pointed me to the World Bank’s Worldwide Governance Indicators, which contains measures of democracy, corruption, and political stability. Perhaps it would be possible to test my hypothesis empirically using these data. This could be done for individual regions or for the whole world, and could also have a temporal component (the indicators have been published since 1996).
In order to determine if such an analysis is feasible, I decided to take a closer look at the dataset (which is free and downloadable from the website). The Worldwide Governance Indicators (WGI) project is an ambitious one. The authors compile data from 31 different sources (such as think tanks, NGOs, and private firms) and produce annual scores for every country for six indicators of the quality of governance. The indicators are: Voice and Accountability; Political Stability and Absence of Violence; Government Effectiveness; Regulatory Quality; Rule of Law; and Control of Corruption.
First off, we can look at the data on a map. Fortunately the WGI website has a series of nice Tableau interactive graphics, including maps:
Looking at the indicators geographically is helpful. But to evaluate whether they can be used to test the hypothesis, I want to see how each indicator is correlated with all the others. For this, we’ll turn to R. Here is a correlation matrix of the six indicators as calculated for 2012. Coefficients range from -1 to 1; the closer the number is to one, the stronger the correlation. As you can see, all the indicators are positively correlated with each other, some very strongly. This is not surprising. We would expect well-governed countries to get high marks for rule of law, regulatory quality, control of corruption, etc. One interesting observation here is that Control of Corruption actually has the lowest correlations of all the indicators. A scatter plot matrix is a good way to look at the data in more detail:
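The correlation-matrix step itself is a one-liner. I did this in R with cor(), but the same computation can be sketched in Python; the six columns below are random stand-ins for the 2012 indicator values, not the real WGI data:

```python
import numpy as np

# Build six correlated synthetic "indicator" columns from a common base.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
indicators = np.column_stack(
    [base + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.7, 0.9, 1.1, 1.3)]
)

# 6 x 6 correlation matrix, one row/column per indicator.
corr = np.corrcoef(indicators, rowvar=False)

print(corr.shape)                        # (6, 6)
print(np.allclose(np.diag(corr), 1.0))   # each indicator correlates 1 with itself
```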
The idea for this variation on the scatter plot matrix comes from Winston Chang’s R Graphics Cookbook. Its structure is similar to the correlation matrix in that all of the indicators are plotted against each other. The lower panels show scatter plots with LOESS regression lines for each indicator pair. This plot has some extra bells and whistles thrown in – histograms of the distribution of each indicator in the diagonal panels and correlation coefficients (just like the correlation matrix) in the upper panels. The scatter plots show the strong to moderate correlations that we already saw in the correlation matrix, but allow us to make out some curious features of the data, like the non-linear relationship between Voice and Accountability and many of the other indicators.
The indicator values are in units of a standard normal distribution. A value of zero is the mean, while a value of one is one standard deviation higher than the mean. Given the distributions, the indicator values range from about -2.5 to 2.5. Positive values represent better governance, negative represent worse. Because each indicator is measured on the same scale, we can simply sum all six to determine the overall “best governed” country. The top six are:
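Because every indicator shares the same standard-normal scale, the ranking really is just a row sum. Here is a minimal Python sketch of that step; the country scores below are made-up illustrative values, not the actual WGI numbers:

```python
import pandas as pd

# Hypothetical indicator scores on the WGI standard-normal scale.
wgi = pd.DataFrame(
    {"voice": [1.6, -1.4], "stability": [1.4, -2.2],
     "effectiveness": [1.9, -1.6], "regulatory": [1.9, -1.4],
     "rule_of_law": [2.0, -1.6], "corruption": [2.3, -1.4]},
    index=["New Zealand", "Country X"],
)

# Overall score is simply the sum of the six standardized indicators.
wgi["total"] = wgi.sum(axis=1)
print(wgi["total"].idxmax())  # best-governed country in this toy table
```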
NEW ZEALAND 10.83
And the bottom six:
CONGO, DEM. REP. -9.76
SYRIAN ARAB REPUBLIC -9.53
KOREA, DEM. REP. -9.35
I got a bit carried away examining the correlations between the governance indicators, but in a subsequent post I hope to look closer at the democracy – corruption – stability hypothesis. I’m still not quite sure what statistical tests to use and how to apply them, and I’d welcome any ideas. Data and code are posted on Github (github.com/caluchko/wgi)
In the previous post, I used Tableau Public to create a visualization of the Seafood Hg Database. That graphic showed the mean mercury content and number of samples by seafood category. But there are several other dimensions in the database, including the year of the study and the particular species of seafood sampled. I couldn’t resist playing around with the data a little more, this time using the lattice package in R.
The plot below shows the mean mercury concentration (y-axis) in studies of the 12 seafood categories with the highest median mercury concentration. The x-axis shows the date of the study. I’ve also plotted a trend line for each panel. This is a nice way to visualize the data, but I wouldn’t read too much into this plot. For one thing, many of the seafood categories contain multiple species, some of which are higher in mercury than others. Also, this plot does not account for the geographical region where the fish were sampled.
We can tease a little more from the dataset by looking at the individual species within a seafood category. Here is a plot of the six tuna species with the greatest number of studies. The larger species, like bluefin, seem to have higher mercury contents than the smaller ones, like skipjack. One curious feature of the dataset is also visible here: there were very few studies of mercury in seafood in the 1980s.
I’ve written before about mercury emissions, mercury as a commodity, and mercury use in artisanal mining. But the reason we pay so much attention to mercury is because of its human health impacts, and these are primarily caused by eating contaminated seafood.
Different types of seafood have different amounts of mercury. Because mercury is bioaccumulative, organisms that are higher on the food chain tend to have greater mercury concentrations. Of course, the particular environment where the organism lives also plays a big part.
Scientists have been interested in the mercury content of seafood for decades. Recently, a group of researchers undertook the herculean task of aggregating data from almost 300 studies. The result is the Seafood Hg Database (and an accompanying paper). The database contains the mean mercury concentrations measured in each study for one or more of 62 seafood categories. Overall, the database represents over 62,000 individual measurements from around the world.
It’s a great dataset to play around with and experiment with visualizations. In the graphic below, I plot mercury concentrations for a subset of common seafood types. Each circle represents the mean concentration measured in one study, and the size of the circle is proportional to the number of samples in that study. I’ve overlaid box plots for each seafood category that show the median of all the means, as well as first and third quartiles (whiskers go to 1.5x the IQR).
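The box-plot statistics behind that overlay are easy to reproduce. The plot itself was built with other tools, but here is a Python sketch of the quartile and whisker arithmetic, using a made-up set of per-study mean Hg concentrations (ppm) for a single seafood category:

```python
import numpy as np

# Hypothetical per-study mean mercury concentrations for one category.
study_means = np.array([0.08, 0.12, 0.15, 0.22, 0.35, 0.41, 0.60, 1.10])

# Median and quartiles of the study means (the box).
q1, median, q3 = np.percentile(study_means, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme data points within 1.5 x IQR.
whisker_low = study_means[study_means >= q1 - 1.5 * iqr].min()
whisker_high = study_means[study_means <= q3 + 1.5 * iqr].max()

# The 1.10 ppm study falls beyond the upper whisker, so it would be
# drawn as an outlier point rather than covered by the whisker.
print(median, whisker_low, whisker_high)
```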
I think this is much more instructive than simply plotting the grand mean (average of all the study averages) for each seafood category. For one thing, you lose a lot of information on how much mercury concentration varies within a category. Take tilefish, for example. This is one of the species that EPA and FDA advise pregnant women not to eat. But there are relatively few studies of tilefish, and the mean mercury concentrations they measured vary by an order of magnitude.
Click on the image below to bring up the full interactive Tableau Public visualization: