Can We Build a Better Model to Evaluate NHL Teams than the Pythagorean Theorem?
This past March, I published an article on All Sportlytics using the Pythagorean Theorem to analyze how NHL teams had been performing this season. My original intent was to follow up with weekly articles tracking how teams changed from week to week relative to the predictions, but first I wanted to dig a little deeper and see if regression could produce a better model for predicting a team’s win/loss percentage.
To back up first, let me describe the background behind the formula used in the original article. If you aren’t aware, the Pythagorean Theorem is a tool used by sports analysts (you thought it was only used by mathematicians? Nope!) to predict a team’s winning percentage. It works for pretty much any sport, it’s simple, and it lends itself to a wide range of analyses, which makes it a very versatile and useful tool for sports analysts. So, what does the formula actually look like?
The formula goes as follows for hockey:

Win Percentage = Goals For² / (Goals For² + Goals Against²)
Quick note: the exponent can vary by sport and estimation model. I am using 2 just to demonstrate how I estimated in my original article. 2 is also the exponent used in the actual Pythagorean Theorem.
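As a concrete sketch, here’s how that exponent-2 estimate could be computed in Python (the function name and the sample goal totals are my own, made up for illustration):

```python
def pythagorean_win_pct(goals_for: float, goals_against: float, exponent: float = 2.0) -> float:
    """Pythagorean expectation: GF^k / (GF^k + GA^k), as a percentage."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return 100.0 * gf / (gf + ga)

# Example: a team that scored 250 goals and allowed 220
print(round(pythagorean_win_pct(250, 220), 1))  # → 56.4
```

Making the exponent a parameter matters because, as noted above, other estimation models tune it away from 2.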
I go into much further detail about the data and results from the past two NHL seasons in my original article. The conclusion that matters for this piece is that the Pythagorean Theorem is a pretty accurate way to predict a team’s win/loss percentage: the mean absolute deviation over the past two seasons ranged from 1.5% in the 2019–2020 season to 4.6% in the 2020–2021 season.
I wanted to go further and see whether any additional statistics could make the prediction more accurate. Using the Professional Hockey Database from Kaggle, I ran a linear regression on team win/loss percentage for the 2005–2011 seasons, testing whether the following factors had any effect, and if so, what effect: goals for, goals against, penalty minutes, bench minors, power play goals, power play chances, shorthanded goals against, power play goals against, penalty kill chances, and shorthanded goals for. You might notice that power play goals and shorthanded goals for are already counted in goals for, and that power play goals against and shorthanded goals against are already counted in goals against. The regression picked up on this overlap, and those variables were soon removed from the model (except for one of them).
When all was said and done, the model recognized the following factors as significant for predicting a team’s win/loss percentage: goals for, goals against, penalty minutes, and shorthanded goals against.
New Model = (0.201962 * GoalsFor) - (0.2293857 * GoalsAgainst) - (0.0023443 * PenaltyMinutes) + (0.2742612 * ShorthandedGoalsAgainst) + 56.89731
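Written as code, that fitted equation is just a linear combination of the four inputs. Here is a sketch using the coefficients above; the season line in the example (250 GF, 220 GA, 900 PIM, 8 SHGA) is hypothetical:

```python
def old_model_win_pct(goals_for, goals_against, penalty_minutes, sh_goals_against):
    """Win/loss percentage predicted by the first regression model."""
    return (0.201962 * goals_for
            - 0.2293857 * goals_against
            - 0.0023443 * penalty_minutes
            + 0.2742612 * sh_goals_against
            + 56.89731)

# Hypothetical season line: 250 GF, 220 GA, 900 PIM, 8 SHGA
print(round(old_model_win_pct(250, 220, 900, 8), 1))  # → 57.0
```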
Yes, this model is clearly off, but I wanted to roll with it as a test. It explains about 86% of the variance in win/loss percentage (otherwise known as its r²), so it seems to be fairly accurate. How accurate is it compared to the Pythagorean Theorem model? For a comparison of the two models over the 2005–2011 seasons, I have the following graph covering their average absolute errors over time:
As you can see, the new model appears to be a better estimator of an NHL team’s win/loss percentage, at least in the 2005–2011 seasons. To further test this hypothesis, I used recent team records (the past two seasons) from Hockey Reference and compared them to the model’s estimates. This is where the model was exposed as having some inefficiencies and errors. You can probably spot one just by looking at it: why would the regression treat shorthanded goals against as a significant indicator when those goals are already included in goals against? On top of that, it just flat-out wasn’t a good predictor: the mean absolute deviation was much higher than it had been in the 2005–2011 seasons.
So I threw out that model and set out to improve it. Again using Hockey Reference, I built a bigger database for another regression, this time covering the 2010–11 season through the 2019–20 season. That adds about a hundred more data points than the previous regression used, and it allows me to include more variables. (It also reminded me how fluid NHL standings are from season to season!)
After running a regression using goals for, goals against, strength of schedule, power play opportunities, power play opportunities against, penalty minutes per game, opposing penalty minutes per game, shots, shots against, and average team age, the model quickly found the only significant variables to be goals for and goals against.
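For readers who want to reproduce this kind of fit, a minimal least-squares sketch looks like the following. The numbers here are toy stand-ins, not the real Hockey Reference data, and checking which variables are actually significant would require p-values from a statistics package rather than plain least squares:

```python
import numpy as np

# Toy stand-in data: one row per team-season (made up for illustration).
gf = np.array([250.0, 215.0, 270.0, 190.0, 230.0])        # goals for
ga = np.array([220.0, 230.0, 205.0, 260.0, 225.0])        # goals against
win_pct = np.array([57.0, 48.1, 61.0, 39.5, 52.4])        # win/loss percentage

# Design matrix: an intercept column plus the two surviving predictors.
X = np.column_stack([np.ones_like(gf), gf, ga])
coef, residuals, rank, _ = np.linalg.lstsq(X, win_pct, rcond=None)
print(coef)  # [intercept, goals-for coefficient, goals-against coefficient]
```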
Newer Model = (0.2346285 * GoalsFor) - (0.2330846 * GoalsAgainst) + 49.6724
Funnily enough, the r² on this model was 86% again, exactly what the previous model had. Using each model to estimate the winning percentage of every team from the 2010–11 season through the 2019–20 season, the mean absolute deviation for each model was as follows:
Pythagorean Theorem: 2.39%
Old Regression Model: 2.73%
New Regression Model: 2.48%
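The comparison metric itself is easy to reproduce: mean absolute deviation is just the average of |predicted − actual| across team-seasons. A quick sketch, with made-up predictions and records for four teams:

```python
def mean_abs_deviation(predicted, actual):
    """Average absolute error between predicted and actual win percentages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predicted vs. actual win percentages for four teams
predicted = [55.0, 48.0, 61.5, 41.0]
actual    = [57.0, 48.0, 60.0, 39.5]
print(mean_abs_deviation(predicted, actual))  # → 1.25
```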
So, the Pythagorean Theorem holds up against all. That’s the only conclusion, right? Not quite.
I turned to one more test: advanced team statistics. I built a database of advanced team statistics from Natural Stat Trick, determined to build the greatest estimator model of them all. The one to rule them all.
I used data starting from the 2007–2008 season all the way to the current season, over 400 data points in total. I ran a regression of winning percentage on a host of advanced statistics: Corsi, Fenwick, expected goals, scoring chances, and so on. There were a lot.
The model narrowed down to the following:
Model with Advanced Stats = (0.2313911 * GoalsFor) - (0.2269357 * GoalsAgainst) - (0.0028384 * CF) + (0.0027777 * CA) + (0.1046442 * expectedGoalsFor) - (0.1241864 * expectedGoalsAgainst) - (0.0127441 * HighDangerChancesFor) + (0.0162828 * HighDangerChancesAgainst) + 50.29831
There are some obvious issues with this model right off the bat (similar to the first regression), but comparing it with the previous regression model and the Pythagorean Theorem, the mean absolute deviations were as follows:
Pythagorean Theorem: 2.54%
Old Regression Model: 2.80%
New Regression Model: 2.78%
When I started writing this article, I titled it “Building a Better Model to Evaluate NHL Teams,” but changed it to the current title after finding the Pythagorean Theorem to be the best fit.
So, what can we conclude from this journey?
The Pythagorean Theorem rules all, and it is a useful tool for estimating an NHL team’s winning percentage. Based on the variables tested above, it actually seems to be the best model for the job. The regression models came to similar conclusions even when I pushed them toward overfitting. And finally: don’t just shove as many variables into a regression as you can to see what sticks. It doesn’t work.