game_data_all <- rio::import("https://raw.githubusercontent.com/nflverse/nfldata/refs/heads/master/data/games.csv") |>
  filter(season %in% c(2024, 2025) & !is.na(result))
The load_schedules() function returns a data frame with 46 variables, including game time, temperature, wind, playing surface, outdoor or dome, point spreads, and more. Run print(dictionary_schedules) to see a data frame with a data dictionary of all the fields.
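To make the filter in the import step concrete, here is a minimal sketch on a made-up data frame. The rows and the toy_schedule name are hypothetical. It assumes result is left as NA for games that haven't been played yet, which is what the !is.na(result) check implies.

```r
library(dplyr)

# Hypothetical rows standing in for the real schedule data.
# Assumption: result is NA until a game has been played.
toy_schedule <- tibble(
  game_id = c("a", "b", "c", "d"),
  season  = c(2023, 2024, 2025, 2025),
  result  = c(3, -7, 10, NA)
)

# Keep only 2024 and 2025 games that already have a result
played <- toy_schedule |>
  filter(season %in% c(2024, 2025) & !is.na(result))

played$game_id
```

Game "a" drops out because of its season, and game "d" drops out because it has no result yet.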
Now that I have the data, I need to process it. I’m going to remove some ID fields I know I don’t want and keep everything else:
cols_to_remove <- c("old_game_id", "gsis", "nfl_detail_id", "pfr", "pff",
"espn", "ftn", "away_qb_id", "home_qb_id", "stadium_id")
games <- game_data_all |>
select(-all_of(cols_to_remove))
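One thing worth knowing about all_of(): it is strict and throws an error if any listed column is missing from the data. If the upstream file ever drops one of these ID fields, any_of() is the lenient alternative. Here is a small sketch on a made-up data frame; toy_games and its values are hypothetical, not the real schedule data.

```r
library(dplyr)

# Hypothetical stand-in for game_data_all, with only two of the ID columns
toy_games <- tibble(
  game_id = c("g1", "g2"),
  gsis    = c(1, 2),
  pfr     = c("x", "y"),
  result  = c(7, -3)
)

# stadium_id is deliberately absent from toy_games
cols_to_remove <- c("gsis", "pfr", "stadium_id")

# all_of() is strict: a missing column raises an error, caught here
strict <- tryCatch(
  select(toy_games, -all_of(cols_to_remove)),
  error = function(e) "error"
)

# any_of() is lenient: it drops whichever listed columns actually exist
lenient <- select(toy_games, -any_of(cols_to_remove))
names(lenient)
```

I'd keep all_of() when a missing column should stop the script, and switch to any_of() when the column list is more of a wish list.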
Although it’s obvious from the scores which teams won and lost, there aren’t actually columns for the winning and losing teams. In my tests, the LLM didn’t always write appropriate SQL when I asked about winning percentages. Adding team_won and team_lost columns makes that clearer for a model and simplifies the SQL queries needed. Then, I save the results to a feather file, a fast format for either R or Python: