April 22, 2017

Introducing StatHeads.com

By Brandon Wirth

With so many sites offering sports statistics, you would think it would be easy to do meaningful analysis. Surprisingly, it's not.

#1: Getting data is hard

How would you answer these questions about the NBA:

  • How predictive is o-rating/d-rating of a game's final outcome?
  • Who is leading in adjusted assists?
  • What are the lineup entropies for each team? (critical for +/-)

These are relatively basic questions... and yet if you wanted to answer them, you would have to build a webscraper to get the necessary data. If you manage to find a site which generates your data, you're still going to have to dump it and parse it. Go into any analytics blog or forum and almost everyone needs to webscrape. It's 2017... that's absurd.

#2: Current tools don't allow you to calculate metrics 

Forms will allow you to filter, but they don't allow you to calculate your own metrics on the fly. This is severely limiting. However, if we can't use forms, we'll have to use something else.

Enter SQL

SQL really is the most flexible way to query data. If you want to retrieve non-standard sets of data, you have to use SQL. For example, let's say you wanted to retrieve the number of players with more than 100 3pt shots made per season. With SQL, it's easy:

SELECT season, COUNT(*) as number_of_players
FROM player_season ps
WHERE ps.three_pm > 100
GROUP BY season
ORDER BY season DESC

source

Let's take another example. Maybe you think that teams which out rebound their opponents have a higher chance of winning. Seems reasonable, but let's take a look:

SELECT ROUND(SUM(tg.won::integer)::numeric/COUNT(*), 2) as winning_percentage
FROM 
    team_game_stats tgs, /* stats for a team's particular game */
    team_game_stats_opponent tgso, /* stats for a team's opponent's game */
    team_game tg /* a team's game metadata (who played, won, home/away, ... etc */
WHERE 
    tg.game_id = tgs.game_id AND
    tg.team_id = tgs.team_id AND
    tgs.game_id = tgso.game_id AND 
    tgs.team_id = tgso.team_id AND
    (tgs.orb + tgs.drb) > (tgso.orb + tgso.drb) /* team out rebounds their opponent */

source

The result: they win 65% of the time. This is where you start to see the power of SQL.. you can do calculations inside your query. This allows you to rapidly experiment on your original premise because you can just change the rebounding condition. Maybe you want to see if out-rebounding and out-stealing the opponent leads to a much higher win probability... that's a trivial tweak and of course it does (77% chance of winning). Or how about just out-stealing (59% chance of winning).


SQL may take a bit more to master, but it's a much more powerful way to get data.


Where StatHeads comes in

We're the only place where you can write PostgreSQL (a dialect of SQL) queries online for NBA & NFL stats. We think that real sports analysis hasn't been well served by existing tools and we're hoping we can fill that gap. All our data is compiled from publicly available sources and we'll be working to improve it constantly. 


It's very much in beta, so just shoot us an email with any questions or feedback.


But most of all, happy querying!