22 April 2016

Female ASA presidents

I recently received an e-mail from the American Statistical Association (ASA) about the Helen Walker Society. The e-mail listed the 14 “distinguished women leaders who have held the office of ASA president” going back to Helen Walker in 1944.

Really? Fourteen? All the way back to 1944? Wow.

I thought it would be interesting to visualize the percentage of female ASA presidents over time. I found a spreadsheet of past ASA presidents online and used the data to create a timeline. (R script shown.)

# download file
url <- "http://www.amstat.org/about/pdfs/History_of_ASA Presidents-JSMs.xlsx" 
local <- tempfile() 
download.file(url, local, mode='wb') 
wb <- loadWorkbook(local) 
dat1 <- readWorksheet(wb, sheet=1)

# keep first three columns and valid rows
dat2 <- dat1[, 1:3]
names(dat2) <- c("Year", "Number", "President")
dat2$Year <- as.numeric(dat2$Year)
dat2 <- dat2[!is.na(dat2$Year), ]

# fill in empty rows
dat2 <- na.locf(dat2, fromLast=TRUE)

# add sex
dat2$Sex <- ifelse(dat2$Year %in% c(1944, 1952, 1956, 1980, 1987, 
  1989, 1992, 1996, 2006, 2007, 2009, 2011, 2013, 2016), 
  "Female", "Male")

Most recent 10 years of ASA presidents.

Year Number President Sex
2017 112th Nussbaum, Barry Male
2016 111th Utts, Jessica Female
2015 110th Morganstein, David Male
2014 109th Schenker, Nathaniel Male
2013 108th Davidian, Marie Female
2012 107th Rodriguez, Robert N. Male
2011 106th Geller, Nancy Female
2010 105th Pantula, Sastry Male
2009 104th Morton, Sally C. Female
2008 103rd Lachenbruch, Peter A. Male

I then calculated the rolling 10-year percentage of ASA presidents that were female and plotted the results.

# 10-year rolling mean
mean10 <- 100*rollmean(dat2$Sex=="Female", 10, align="left", fill=NA)

# plot timeline
# par(mar=c(4, 4, 3, 1), las=1)
# plot(dat2$Year, mean10, type="n", xlab="Year", ylab="Females  (%)", 
#   main="Rolling 10-Year Percentage of Female ASA Presidents")
# abline(h=50, lty=2)
# points(dat2$Year, mean10, type="o", pch=16, col="orange")

There were three female ASA presidents between 1944 and 1956, followed by a 23-year period during which all the presidents were male (1957-1979). Since then, the percentage of female ASA presidents has risen, reaching 50% for the first time during the 10-year period from 2004 to 2013. Excellent.

09 April 2016

Wisconsin presidential primary

I was curious about the Wisconsin primary election results. I wanted to see how the votes stacked up among Democrats and Republicans. I also wanted to see how both parties compared to the number of eligible voters that didn’t vote.

I downloaded election results from WisconsinVote.org on 8 April 2016. According to Wisconsin’s Government Accountability Board, the State of Wisconsin has 4.44 million eligible voters, of which 3.39 million were registered by 1 April 2016.

Democrat Didn’t Vote Republican
Sanders 567,936 0 0
Clinton 432,767 0 0
Other Dem. 3,201 0 0
No Vote 0 2,334,973 0
Cruz 0 0 531,129
Trump 0 0 386,370
Kasich 0 0 155,200
Other Rep. 0 0 28,424

Are we so accustomed to low voter turn out, that we no longer see the elephant in the room - the large number of eligible voters that don't participate in the elections at all?

My primary (ha!) interest in this topic was not coding, but I provide the R script that I used below.

03 April 2016

Order and color in stacked bar charts

A stacked bar chart is a useful way to represent data when you want to show totals (the y-axis, usually some sort of count) for some set of categories (the columns, often time), and you also want to show how those totals are divvied up among some subgroups (the stacks).  If you just want to see the totals without the subgroups, a simple column chart (without stacks) will suffice.  If you want to show how the subgroups compare but are not particularly interested in the totals, a line chart would be better.

Let's use these data from Dutch Finance Minister Jeroen Dijsselbloem as an example.
For simplicity I share the 2010-2014 data in the R code below, recreating the right 5 columns of the original chart.  I selected similar colors from this Chart of R Colors.
div <- matrix(
  c(182, 20, 30, 76, 107, 52, 64, 128, 1, 0, 0, 0, 23,
    0, 60, 30, 97, 90, 74, 65, 64, 2, 0, 0, 0, 0,
    215, 59, 90, 108, 94, 92, 86, 83, 7, 0, 0, 0, 0,
    325, 98, 112, 135, 95, 0, 87, 71, 5, 0, 0, 0, 0,
    362, 117, 113, 139, 83, 48, 89, 32, 5, 1, 0, 0, 0),
  dimnames = list(
    c("Gasunie", "Tennet", "UCN", "Schiphol", "Staatsloterij",
      "Nederlandse Spoorwegen", "Havenbedrijf Rotterdam", "BNG Bank", "FMO",
      "De Munt", "COVRA", "Holland Casino", "NWB Bank"),
    c("2010", "2011", "2012", "2013", "2014")))

orig.col <- colors()[c(518, 558, 477, 430, 11, 131, 652, 477, 459, 153, 456,
  148, 92)]

par(mar=c(4, 4, 2, 12), yaxs="i", las=1)
barplot(div, ylim=1.05*c(0, max(apply(div, 2, sum))),
  ylab="Dividends  (in millions of Euros)",
  col=orig.col, border=NA, legend.text=TRUE,
  args.legend=list(x="right", inset=-0.55, border=NA, bty="n"))

Note that the dividends reported for Havenbedrijf Rotterdam (yellow) were greater than Nederlandse Spoorwegen (midnight blue) in 2014, although this does not agree with the chart shown in the tweet above.

The three main decisions in creating a stacked bar chart are (1) the order of the columns, (2) the order of the stacks, and (3) the color of the stacks.

Order of the Columns

In this case, ordering the columns is easy, because year is ordinal.  If this were not the case, you might want to order the columns by the totals, or some other metric important to the story you're telling with the chart.

Order of the Stacks

It's not clear to me how the companies (stacks) were ordered in the original chart.  It's not alphabetical, and it's not the average annual dividend.  What is it?

I contend that it's good practice to place the least variable stacks at the bottom of the column.  That way, the ups and downs of the more variable stacks do not disrupt the pattern of the less variable stacks.

In order to place those companies with the least variable dividends on the bottom of the bar chart, I ordered the rows by their variance and removed any rows with all zero dividends.
rowvar <- apply(div, 1, var)
div2 <- div[order(rowvar), ]
div3 <- div2[apply(div2, 1, sum)>0, ]
Color of the Stacks

It's good practice to assign colors in a meaningful way whenever possible, delivering more information than simply a random collection of colors.  Color could be used to identify the age, location, or product type of the companies, for example.  It depends on what's important to the story you are telling with the chart.

I thought it would be interesting and informative to define colors for the companies based on whether their dividends increased or decreased over time, on average.  So, I calculated the average annual change in dividends and assigned ordered positive changes to the "YlOrRd" (yellow-orange-red) color set using the R package RColorBrewer and ordered negative changes to the "Blues".
change <- apply(div3, 1, function(y) lsfit(1:dim(div3)[2], y)$coef[2])
new.col <- rep(NA, length(change))
sel <- change > 0
new.col[sel] <- brewer.pal(sum(sel), name="YlOrRd")[rank(change[sel],
new.col[!sel] <- brewer.pal(sum(!sel), name="Blues")[rank(-change[!sel],

par(mar=c(4, 4, 2, 12), yaxs="i", las=1)
barplot(div3, ylim=1.05*c(0, max(apply(div3, 2, sum))),
  xlab="Year", ylab="Dividends  (in millions of Euros)",
  main="Companies Partially Owned by the Netherlands",
  col=new.col, border=NA, legend.text=TRUE,
  args.legend=list(x="right", inset=-0.55, border=NA, bty="n"))


What a difference it makes to have the company with the most variable dividends, Gasunie, at the top of the stack rather than at the bottom.  It makes it much easier to see that the dividends of the other companies were relatively consistent during 2010-2014.

And using color to identify which companies had increasing and decreasing dividends is also very informative.  It makes it easy to see that the dividends of Gasunie increased more and those of BNG Bank declined more than the other companies during 2010-2014.