September 7, 2017

StatHeads and R - Seeing the Game in More Detail

By Sean Rodgers

Well NFL fans, it’s that time of year. We look forward to another season of American Football waiting to see some amazing feats of athleticism and sportsmanship. It is in these moments that fans of Fantasy Football look to find a great advantage against their fellow players. Before our season begins, let's take a few moments to analyze what we saw last year and find some benchmarks to look for this season.

Where do we start?

Finding the data! - StatHeads.com is where we will pull all of our data! - Use SQL statements to pull the data we need. - Use R to analyze the data we pull.

Learning SQL: A Basic Query

Let's start with a simple query for player_id based on the player's full name:

SELECT player_id, team_id
FROM player
WHERE full_name = 'PLAYERNAME'

SELECT tells SQL what column data we want.

FROM tells SQL what TABLE we should pull this data.

WHERE allows us to tell SQL how we should filter the data.

Why does player_id matter?

  • Tables use certain values to connect each other.
  • Tables connected by player_id:
  • player, player_career, player_game, player_season, player_team
Now we see that player_id connects all these tables we can start connecting even more tables.
Knowing player_id, we can use a JOIN to see more data about a selected player

Finding Stats

Lets use this query to find Aaron Rodgers player_id and team_id:

SELECT player_id, team_id
FROM player
WHERE full_name = 'Aaron Rodgers'

Information from this query:
player_id: 993 team_id: GB


Now that we have the player_id, lets find all games for Green Bay in 2016

SELECT week, 
  season_year,
  home_team, 
  home_score, 
  away_team, 
  away_score
FROM game
WHERE season_year = 2016 AND (home_team = 'GB' OR away_team = 'GB')
Now lets pull Aaron Rodgers' passing yards and passing touchdowns for these games. How do we do that when game doesn’t have a player_id?


Enter JOIN! JOIN in SQL allows us to pull data from multiple tables where certain values are the same data.

JOIN requires two things

  • JOIN [TABLE_YOU_WANT_TO_JOIN]
  • ON [TABLE_YOU_WANT_TO_JOIN].column=[OTHER_TABLE].column
So let's change that last query just a little bit.

SELECT game.week, 
  game.season_year, 
  game.home_team, 
  game.home_score, 
  game.away_team, 
  game.away_score, 
  player_game.passing_yds, 
  player_game.passing_tds
FROM game
JOIN player_game
  ON game.game_id=player_game.game_id
WHERE game.season_year = 2016 AND player_game.player_id = 993

Because we are using two tables, remember we need to distinguish which tables you want to pull data from - [TABLE].[COLUMN]. 
Now that we have our data, time to save that data and bring it into R.

Taking this to R

Setting up libraries in R for what we are looking at:

library(dplyr)
library(ggplot2)
library(plyr)
library(stats)
library(yaml)

Simple line of code:

arodgers <- read.csv(file = "data/arodgers.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

Calculating Win Loss

Little more tricky. We will use chained if else statements:

calc_winloss <- function(x){
  winloss <- ifelse((x$home_team == x$team_id), ifelse(x$home_score - x$away_score > 0, "win", "loss"), ifelse(x$away_score - x$home_score > 0, "win", "loss"))
  return(winloss)
}
arodgers$winloss <- calc_winloss(arodgers)

ifelse in R works: ifelse([CONDITION], [IF TRUE], [IF FALSE])
This function provides us the ability to create a new column whose value is either ‘win’ or ‘loss.’ This is done by first comparing if the Home Team is the selected players team. If it is then that passes our first ifelse statement with a true and we evaluate if the Home Team score is higher than they Away Team Score. If the Home Team score is higher than we mark that as a win or if it isn’t it is marked a loss. If the first statement evaluated as false we check the opposite for which the team won. This provides us with a teams away game win or loss.

Plotting Results

Next we move to plotting our results visual graphs.

plot <- ggplot(data = arodgers, aes(x=week, y=passing_yds, fill=winloss)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = scales::pretty_breaks(n = 18)) + scale_fill_manual(values=c("firebrick2", "steelblue2", "steelblue2")) + guides(fill=FALSE)
plot + labs(title="Aaron Rodgers - Passing Yards 2016 Season", x = "Week", y="Passing Yards")
Aaron Rodgers -- Passing Yards 2016
plot <- ggplot(data = arodgers, aes(x=week, y=passing_tds, fill=winloss)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = scales::pretty_breaks(n = 18)) + scale_fill_manual(values=c("firebrick2", "steelblue2", "steelblue2")) + guides(fill=FALSE)
plot + labs(title="Aaron Rodgers - Passing Touchdowns 2016 Season", x = "Week", y="Passing Touchdowns")
Aaron Rodgers -- Passing Touchdowns 2016

While these graphs reveal seemingly little we have to take a look at them through the eyes of a Fantasy Football coach. While we created the ‘win’ or ‘loss’ column, it actually means very little in terms of Fantasy Football. Aaron Rodgers had amazing games where he threw nearly 250 yards and threw at least two touchdowns and still lost. Some of Aaron Rodgers’s weakest weeks, in teams of Fantasy points, were Green Bay wins. Now these answers bring up more questions:

  • What did the running game look like those weeks?
  • Is Aaron Rodgers efficient at getting wins and the games Green Bay lost were inefficient attempts to win?

These are the sorts of answers that I hope to be answering this season, and I hope now that you too have the ability to seek these answers for yourself.