If you haven’t read “How big are the Leafs?” then I urge you to go do that. That’s the interesting post, this is just the DVD extras, and that’s nearly always just a vanity project.
How the sausage was made
This has been an enormously fun and challenging exercise for me, and I wanted to share some of the work of others that made it possible. I did this all in R, and you could, were you inclined, repeat this work in Python or any other programming language. You could use Tableau, and make charts in that blue and red theme beloved on Twitter.
If you want to do projects like this, I urge you to try it and to know that the resources are out there to make it possible for you. Here’s where you start:
Drew Hynes has a page that documents the NHL API. This API is public, but the public is left to work out for themselves what they can get from it. Hynes lists some of the projects made with the API. The most valuable to me was this live explorer app from Sebastien Blanchet, who is clearly a Habs fan, so merci, Sebastien.
The NHL API spits out JSON, and if you don’t know what that is, that’s where you have to go next. If you do, and you want to play with it in R, then HockeyR, the package by Daniel Morse, is a good place to begin.
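If JSON is new to you, here is a rough idea of what "playing with it" means, sketched in Python rather than R. The payload and its field names (`roster`, `heightInches`, `weightPounds`) are made up for illustration and are not the NHL API's real schema, and `flatten_roster` is a hypothetical helper, not something from HockeyR. The point is only that JSON arrives as nested text and the first job is to flatten it into rows you can count and average.

```python
import json

# Hypothetical payload: these field names are invented for illustration
# and are NOT the NHL API's real schema.
SAMPLE = """
{"roster": [
  {"name": "Player A", "position": "F", "heightInches": 72, "weightPounds": 195},
  {"name": "Player B", "position": "D", "heightInches": 75, "weightPounds": 210},
  {"name": "Player C", "position": "G", "heightInches": 74, "weightPounds": 200}
]}
"""


def flatten_roster(raw_json):
    """Parse JSON text and return one (name, position, height, weight) tuple per player."""
    payload = json.loads(raw_json)
    rows = []
    # .get() with a default means a payload without a roster yields no rows
    # instead of raising a KeyError.
    for player in payload.get("roster", []):
        rows.append((
            player["name"],
            player["position"],
            player["heightInches"],
            player["weightPounds"],
        ))
    return rows


rows = flatten_roster(SAMPLE)
for row in rows:
    print(row)
```

Once the data is in flat rows like this, everything else (tables, means, plots) is ordinary data work in whatever language you prefer.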
HockeyR is where I started, and I quickly learned that most scraping projects involving NHL data are aimed at getting the live play-by-play data and making shot plots from it, or at gathering it in the aggregate and creating sites like Evolving Hockey, HockeyViz, Moneypuck or any of the long list of others that have come and gone. I love those sites, and I love a good shot plot, but I don’t see any need for new wheels in the marketplace.
While HockeyR was fantastic, I did end up writing my own custom functions to do everything I wanted in this project and others. How I got to the point of being able to do that began in high school with a computer that ran BASIC, and it involves years of learning since then. If you’re really curious, I’ll tell you the formal classes I took.
Can smart young people learn this on the fly without formal training? You bet, but I’m going to tell you the thing you don’t want to hear: Take a proper programming class, and try to find one that other people complain is too hard. If you do, you will be so far ahead of the game, you will be able to Stack Overflow your way to whatever you want to do from there on out. If you just want to noodle around, and have the patience for YouTube tutorials (I really don’t), you can find a lot of fun projects to give you a taste of what you can do with sports numbers.
One thing I did learn in this process: R is often presented to people like a set of magic incantations. You say the magic Tidy words and the data appears before you. And that’s great until it isn’t. Find the lessons that put away the wand and tell you why you do things, and you’ll be much less frustrated when things don’t work, and you’ll also find doors opening to you to do things you’d never considered before.
But more than anything else, do the projects that interest you, not the same thing 10 people have already done in yet another blue and red bar chart that fits in a tweet.
Trial runs and training camp
If you remember the post about handedness in the NHL, you might recall that I just happened to have data on everyone in training camp at that time.
That was 1,300 players on September 16. The reason I had that handy was because I’d run through the process of creating the size plots and tables and deciding how to present the information with that larger dataset. I saw the way the results changed from the bigger set to the smaller. This is not the proper way to answer questions about the NHL prospect pool vs the NHL rosters, so these are very much anecdotes, not least because not all teams had all or even most of their prospects on hand.
So, having said that, here’s the team plot then (not the same resolution, unfortunately) and now:
You can’t help but notice that the Leafs got taller and heavier as they cut players. In that earlier sample, the Leafs had the tallest player in the NHL (Curtis Douglas) and the smallest overall mean height and weight. The means for the league overall didn’t change much, and most of the change was for forwards. But it is clear that several teams are bigger than their prospect pool.
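For anyone wondering what that then-and-now comparison amounts to in practice, here is a minimal sketch in Python (my own work was in R). The teams, heights and weights are invented, and `team_means` and `size_change` are hypothetical helper names, not my actual functions. It averages height and weight per team in each snapshot and subtracts one from the other.

```python
from collections import defaultdict
from statistics import mean

# Invented rows for illustration only: (team, height in inches, weight in pounds).
CAMP = [
    ("AAA", 70, 180), ("AAA", 74, 200), ("AAA", 72, 190),
    ("BBB", 73, 205), ("BBB", 75, 215),
]
# Final roster: team AAA has cut its smallest player, team BBB is unchanged.
FINAL = [
    ("AAA", 74, 200), ("AAA", 72, 190),
    ("BBB", 73, 205), ("BBB", 75, 215),
]


def team_means(rows):
    """Return {team: (mean height, mean weight)} for one snapshot."""
    grouped = defaultdict(list)
    for team, height, weight in rows:
        grouped[team].append((height, weight))
    return {
        team: (mean(h for h, _ in players), mean(w for _, w in players))
        for team, players in grouped.items()
    }


def size_change(before, after):
    """Return {team: (height change, weight change)} for teams in both snapshots."""
    b, a = team_means(before), team_means(after)
    return {
        team: (a[team][0] - b[team][0], a[team][1] - b[team][1])
        for team in a
        if team in b
    }


change = size_change(CAMP, FINAL)
for team, (dh, dw) in sorted(change.items()):
    print(team, dh, dw)
```

In this toy data, cutting one small player moves team AAA's mean up by an inch and five pounds, which is the same mechanism that makes a roster look bigger than its camp list.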
Is that difference down to age, where the prospects or minor leaguers who were cut haven’t packed on the pounds in the gym yet? Is it a shift in player size in the wider talent pool that we just don’t see in the NHL as quickly? Both? That’s not something a single season of information can answer. And that irritating problem of the dynamic NHL database crops up when you want to dig into the past for weight data in ways it doesn’t for height. Historical draft data would be required to look at how the broader pool of elite hockey talent is changing or not changing over time.
One other thing I noticed is that the European players in the earlier dataset were bigger than the overall average, while in the final roster dataset, they are right on the average. This is most likely just what happens when you start chopping up relatively small datasets and looking for meaning. It should be a lesson to not do that, and not take anything from it to build narratives on. At best, it says this: In this current season, Europeans are not smaller as is anecdotally believed.
There’s a lot of that kind of anecdotal common wisdom about the NHL. The road to enlightenment doesn’t start with “Everybody knows...”; it begins with “Is that really true, though?”
If you want to know what is true, dig in.