Scraping NBA data in R with rjson

I have spent a long time trying to scrape NBA data with R. Until now I worked mostly by trial and error, but I finally found this documentation. Some time ago I had problems scraping the shotchartdetail endpoint, and I worked out what was wrong when I found this.

This works

This is what I did:

shotURLtotal <- paste0("http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS=2016-17&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=0&PlusMinus=N&Position=&Rank=N&RookieYear=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&showZones=0&PlayerPosition=")

Season <- rjson::fromJSON(file = shotURLtotal, method="C")
Names <- Season$resultSets[[1]][[2]]

Season <- data.frame(matrix(unlist(Season$resultSets[[1]][[3]]), ncol = length(Names), byrow = TRUE))

colnames(Season) <- Names
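One caveat with the unlist() conversion above, offered as a hedged aside: unlist() drops NULL elements, so if the parser turns a JSON null into NULL, a row with a missing value comes out one element short and the matrix fills incorrectly. The sketch below uses a made-up mock of a parsed response (the values and the `mock` object are assumptions, not real NBA data) to show one way to replace NULL with NA before building the data frame.

```r
# Mock of a parsed response; the values are invented for illustration.
mock <- list(resultSets = list(list(
  name    = "Shot_Chart_Detail",
  headers = list("PLAYER_NAME", "SHOT_DISTANCE", "SHOT_MADE_FLAG"),
  rowSet  = list(
    list("Player A", 24, 1),
    list("Player B", NULL, 0)  # NULL stands in for a JSON null
  )
)))

headers <- unlist(mock$resultSets[[1]]$headers)
rows    <- mock$resultSets[[1]]$rowSet

# Replace NULL with NA so every row keeps the same length after unlist()
rows <- lapply(rows, function(r) {
  lapply(r, function(x) if (is.null(x)) NA else x)
})

shots <- data.frame(
  matrix(unlist(rows), ncol = length(headers), byrow = TRUE),
  stringsAsFactors = FALSE
)
colnames(shots) <- headers

# unlist() coerces everything to character, so convert columns back
shots[] <- lapply(shots, type.convert, as.is = TRUE)
str(shots)
```

The same idea should carry over to the real response, assuming its rows are lists of scalars as in the code above.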

But this does not

When I try to do the same with shotchartlineupdetail, it does not work. I suspect it has to do with the CFID parameter, whose meaning I don't know. This is what I tried:

shoturl <- "http://stats.nba.com/stats/shotchartlineupdetail/?leagueId=00&season=2016-17&seasonType=Regular+Season&teamId=0&outcome=&location=&month=0&seasonSegment=&dateFrom=&dateTo=&opponentTeamId=0&vsConference=&vsDivision=&gameSegment=&period=0&lastNGames=0&gameId=&group_id=0&contextFilter=&contextMeasure=FGA"


Season <- rjson::fromJSON(file = shoturl, method="C")
Names <- Season$resultSets[[1]][[2]]

Season <- data.frame(matrix(unlist(Season$resultSets[[1]][[3]]), ncol = length(Names), byrow = TRUE))

colnames(Season) <- Names

Expected Results

The expected result should be a dataframe with the following columns:

c("GRID_TYPE", "GAME_ID", "GAME_EVENT_ID", "GROUP_ID", "GROUP_NAME", "PLAYER_ID", "PLAYER_NAME", "TEAM_ID", "TEAM_NAME", "PERIOD", "MINUTES_REMAINING", "SECONDS_REMAINING", "EVENT_TYPE", "ACTION_TYPE", "SHOT_TYPE", "SHOT_ZONE_BASIC", "SHOT_ZONE_AREA", "SHOT_ZONE_RANGE", "SHOT_DISTANCE", "LOC_X", "LOC_Y", "SHOT_ATTEMPTED_FLAG", "SHOT_MADE_FLAG", "GAME_DATE", "HTM", "VTM")

which you can get by doing:

shoturl <- "http://stats.nba.com/stats/shotchartlineupdetail/?leagueId=00&season=2016-17&seasonType=Regular+Season&teamId=0&outcome=&location=&month=0&seasonSegment=&dateFrom=&dateTo=&opponentTeamId=0&vsConference=&vsDivision=&gameSegment=&period=0&lastNGames=0&gameId=&group_id=0&contextFilter=&contextMeasure=FGA"


Season <- rjson::fromJSON(file = shoturl, method="C")
Names <- Season$resultSets[[1]][[2]]

So Names holds the columns of the data frame. The problem is that, without the CFID, the list that should hold the data for those columns is empty. What @be_green's answer returns is the league average, and I need the team-specific data.
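To make the symptom easy to check, here is a small sketch that counts the rows in each result set of a parsed response. It runs on a made-up mock shaped like the failing case (headers present, rowSet empty); `count_rows`, the result set names, and the values are all assumptions for illustration, and it presumes the parser keeps the `headers` and `rowSet` names.

```r
# Hypothetical helper: number of rows in each result set of a parsed response
count_rows <- function(parsed) {
  vapply(parsed$resultSets, function(rs) length(rs$rowSet), integer(1))
}

# Mock response; names and values are invented, not real API output.
empty_response <- list(resultSets = list(
  list(name    = "LineupShots",
       headers = list("GRID_TYPE", "GROUP_ID", "GROUP_NAME"),
       rowSet  = list()),                       # headers but no data
  list(name    = "LeagueAverages",
       headers = list("GRID_TYPE", "FGA"),
       rowSet  = list(list("League Averages", 100)))
))

n_rows <- count_rows(empty_response)
names(n_rows) <- vapply(empty_response$resultSets,
                        function(rs) rs$name, character(1))
print(n_rows)

if (n_rows[[1]] == 0) {
  message("First result set has headers but no rows")
}
```

Running the same check on Season right after fromJSON() would show whether the request itself returned no rows, as opposed to the data frame conversion losing them.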

Ansel answered 11/12, 2017 at 1:3 Comment(22)
Could you give an example of the output you expect? - Hogtie
Hi @Hogtie, per your request I added the expected results. They are very similar to the ones in the example that worked, but they also include the players who are on the court as a variable. - Ansel
It looks like your API request returns a null rowset for those variables--that might be the issue? - Hogtie
Hi @Hogtie, that is the issue. If you try the first example I give and take out the ?CFID=33&CFPARAMS=2016-17& part, it returns an empty data frame as well. So it seems that CFID is what we need to figure out in order to get the data. Unfortunately, CFID is not documented, and I am not sure how to get that parameter. - Ansel
Oh sorry, I misunderstood the problem completely! My bad. - Hogtie
No problem @be_green, I hope you don't get discouraged by it and keep on trying :D. Get those 50 exp points!!! - Ansel
This might be part of the problem--looks like the underlying endpoints changed: github.com/seemethere/nba_py/issues/67 - Hogtie
Is there a web address on the stats.nba.com site that shows what you want? You can get the query from loading that page. - Glob
@Glob I have not found it yet - Ansel
@DerekCorcoran As near as I can tell, the API isn't really an API at all. It's clunky and seems to be designed only to serve the tables that appear on the website. It's entirely possible that the shotchartlineupdetail endpoint is deprecated if there's no page that uses it. - Glob
@DerekCorcoran to be clear, what difference do you expect between the shotchartdetail and shotchartlineupdetail? Is it just the group column? Can you use the shotchartdetail table and get the group columns from somewhere else? - Glob
@Eumenedies, it has two extra columns which identify the players present at the moment the shot was taken - Ansel
@Glob the two columns that I am missing are "GROUP_ID", "GROUP_NAME" - Ansel
@DerekCorcoran endpoint /stats/teamdashlineups has "GROUP_ID" and "GROUP_NAME" - Weisbrodt
Hi Derek - you might look at gregreda.com/2015/02/15/web-scraping-finding-the-api it is an example of figuring out API parameters. - Heterography
@JamesThomasDurant That's the sort of technique I was thinking of but, if you can't find a page that uses the endpoint you are looking for then you can't monitor the request. - Glob
@Glob - I looked too and could not find the page either. I did see this: rstudio-pubs-static.s3.amazonaws.com/… which seems to be another way to get summarized data. I also played with this: nycdatascience.com/blog/student-works/nba-lineup-data which seems to download the player and lineup data and combine them. It might be another approach - although the format of the data has changed slightly so modifications would be needed. - Heterography
@Glob I will check the last thing you said, I am trying to figure this out - Ansel
I tried emailing the NBA to see if they would provide an explanation - I guess the mere name of "Durant" would provoke a response. - Heterography
Hahahaha, I am @JamesThomasDurant brother of a former MVP, hahahhahahah, let me know if you get a response - Ansel
@JamesThomasDurant please let me know if they answer - Ansel
No responses... They can probably tell from my shooting statistics that I am not even close to related to an MVP. - Heterography

So I believe the issue here is that you need to pass a PlayerID and TeamID to the API. Using PlayerID = 2544 and TeamID = 1610612739 below as an example seems to work:

library(tidyverse)
res <- jsonlite::read_json("https://stats.nba.com/stats/shotchartdetail?AheadBehind=&ClutchTime=&ContextFilter=&ContextMeasure=PTS&DateFrom=&DateTo=&EndPeriod=&EndRange=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&PlayerID=2544&PlayerPosition=&PointDiff=&Position=&RangeType=&RookieYear=&Season=&SeasonSegment=&SeasonType=Regular+Season&StartPeriod=&StartRange=&TeamID=1610612739&VsConference=&VsDivision=")
# res %>% str(max.level = 3)

header_names <- flatten_chr(res$resultSets[[1]]$headers)
header_names
#>  [1] "GRID_TYPE"           "GAME_ID"             "GAME_EVENT_ID"      
#>  [4] "PLAYER_ID"           "PLAYER_NAME"         "TEAM_ID"            
#>  [7] "TEAM_NAME"           "PERIOD"              "MINUTES_REMAINING"  
#> [10] "SECONDS_REMAINING"   "EVENT_TYPE"          "ACTION_TYPE"        
#> [13] "SHOT_TYPE"           "SHOT_ZONE_BASIC"     "SHOT_ZONE_AREA"     
#> [16] "SHOT_ZONE_RANGE"     "SHOT_DISTANCE"       "LOC_X"              
#> [19] "LOC_Y"               "SHOT_ATTEMPTED_FLAG" "SHOT_MADE_FLAG"     
#> [22] "GAME_DATE"           "HTM"                 "VTM"

res$resultSets[[1]]$rowSet %>%
  map(`[`, 1:24) %>%
  map(~ set_names(., header_names)) %>%
  bind_rows()
#> # A tibble: 8,369 x 24
#>    GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME
#>    <chr>     <chr>           <int>     <int> <chr>         <int> <chr>    
#>  1 Shot Cha~ 002030~            20      2544 LeBron Jam~  1.61e9 Clevelan~
#>  2 Shot Cha~ 002030~            28      2544 LeBron Jam~  1.61e9 Clevelan~
#>  3 Shot Cha~ 002030~            35      2544 LeBron Jam~  1.61e9 Clevelan~
#>  4 Shot Cha~ 002030~            54      2544 LeBron Jam~  1.61e9 Clevelan~
#>  5 Shot Cha~ 002030~            67      2544 LeBron Jam~  1.61e9 Clevelan~
#>  6 Shot Cha~ 002030~            76      2544 LeBron Jam~  1.61e9 Clevelan~
#>  7 Shot Cha~ 002030~           224      2544 LeBron Jam~  1.61e9 Clevelan~
#>  8 Shot Cha~ 002030~           233      2544 LeBron Jam~  1.61e9 Clevelan~
#>  9 Shot Cha~ 002030~           235      2544 LeBron Jam~  1.61e9 Clevelan~
#> 10 Shot Cha~ 002030~           322      2544 LeBron Jam~  1.61e9 Clevelan~
#> # ... with 8,359 more rows, and 17 more variables: PERIOD <int>,
#> #   MINUTES_REMAINING <int>, SECONDS_REMAINING <int>, EVENT_TYPE <chr>,
#> #   ACTION_TYPE <chr>, SHOT_TYPE <chr>, SHOT_ZONE_BASIC <chr>,
#> #   SHOT_ZONE_AREA <chr>, SHOT_ZONE_RANGE <chr>, SHOT_DISTANCE <int>,
#> #   LOC_X <int>, LOC_Y <int>, SHOT_ATTEMPTED_FLAG <int>,
#> #   SHOT_MADE_FLAG <int>, GAME_DATE <chr>, HTM <chr>, VTM <chr>

Created on 2019-03-26 by the reprex package (v0.2.1)
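Since the point above is about which PlayerID and TeamID get passed, it may help to assemble the query from a named list rather than editing one long string. This is only a sketch: `build_stats_url` is a hypothetical helper, only a subset of the parameters from the URL above is shown (the endpoint may well require the full set), and it encodes the space in "Regular Season" as %20 where the original URL uses +.

```r
# Hypothetical helper: build a query URL from a named list of parameters
build_stats_url <- function(endpoint, params) {
  values <- vapply(params, function(p) {
    p <- as.character(p)
    # Leave empty values empty; percent-encode everything else
    if (nzchar(p)) utils::URLencode(p, reserved = TRUE) else ""
  }, character(1))
  query <- paste0(names(params), "=", values, collapse = "&")
  paste0("https://stats.nba.com/stats/", endpoint, "?", query)
}

# Subset of the parameters used in the URL above (assumption: the real
# request may need all of them, including the empty ones)
params <- list(
  ContextMeasure = "PTS",
  DateFrom       = "",
  DateTo         = "",
  LastNGames     = 0,
  LeagueID       = "00",
  Month          = 0,
  OpponentTeamID = 0,
  Period         = 0,
  PlayerID       = 2544,
  SeasonType     = "Regular Season",
  TeamID         = 1610612739
)

url <- build_stats_url("shotchartdetail", params)
print(url)
```

Changing PlayerID or TeamID in the list and rebuilding the URL then gives a request for a different player or team, which could be passed to jsonlite::read_json() as in the code above.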

Snowfall answered 26/3, 2019 at 16:14 Comment(0)
