library(httr2)
library(purrr)
library(dplyr)
library(glue)
library(tidyr)
Lately I’ve been thinking about two things: getting old and ObservableJS. Getting old is self-explanatory. ObservableJS (ojs) is maybe less so. Normally ojs is used in a JavaScript notebook environment that enables you to create amazing interactive visualizations. This is a great platform and I suggest that if you are really interested in learning about ojs, you create an account there and start practicing. But sometimes, you want to take your interactive visualizations with you. This is where quarto comes in. Quarto is the successor to rmarkdown and is what this blog is written in. Like rmarkdown you can write prose and code in the same document. The biggest improvement with quarto is that you can also (more) easily write code chunks in other language like python or ojs.
This post is about ojs and quarto using some data sourced from the National Hockey League api. So back to talking about being old – I was wondering what happens to NHL players when they get old? In particular what happens to their productivity? I was not able to find any interactive visualizations that satisfactorily provided any ability to explore this question. So then I ended up having to learn about the NHL api and how I could get what I wanted out of it. So this is also a post about wrangling data from the NHL api. Almost accidentally, this really highlights the power of quarto. I can much more readily wrangle data in R. Sure it is possible to do this solely in ojs but for me it is easier in R. But then when I want to make the visualization, I can seamlessly switch to ojs right in the same quarto doc.
Let’s start with how we get the data. We start with a pretty standard suite of tidyverse packages (and httr2).
Next we need to define some helper functions.
<- function(time_str) {
convert_time # Split the time into minutes and seconds
<- strsplit(time_str, ":")[[1]]
time_parts # Convert to integers
<- as.integer(time_parts[1])
minutes <- as.integer(time_parts[2])
seconds # Convert the time to seconds
* 60 + seconds
minutes
}
<- function(start, end) {
generate_year_string # Extract the starting and ending years
<- as.numeric(substr(start, 1, 4))
startYear <- as.numeric(substr(end, 1, 4))
endYear # Generate the sequence of years
<- seq(startYear, endYear)
years # Concatenate each year with the following one to get the year pairs
paste0(years, years + 1)
}
convert_time
turns a time string like 824:13
into the number of seconds.
convert_time("824:13")
[1] 49453
generate_year_string
is a helper function to generate the year strings that the NHL api expects.
generate_year_string(19951996, 19981999)
[1] "19951996" "19961997" "19971998" "19981999"
This is a function to help sending messages to the console. Defining it here saves lots of space where I use it later on.
<- function(..., verbose = TRUE) {
api_message if (verbose) message(...)
invisible(TRUE)
}
query_endpoint
is the core function that actually sends a data request to the NHL api. It takes an endpoint and a list of query parameters and returns a tibble of the results. It is always helpful to isolate the code that does the querying from the code that does the data wrangling.
<- function(endpoint, ...) {
query_endpoint <- "https://statsapi.web.nhl.com/"
base_url <- httr2::request(base_url) %>%
req ::req_url_path_append(endpoint) %>%
httr2::req_url_query(...)
httr2
<- httr2::req_perform(req)
resp
<- httr2::resp_body_json(resp, simplifyVector = TRUE)
parsed $copyright <- NULL
parsed::as_tibble(parsed[[1]])
dplyr }
With every request to the NHL api, the following note is returned in the response, specifying exactly who owns all the trademarks and logos. We drop it from the response because it is not really useful for our data wrangling example but I am replicating it here:
NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the property of the NHL and its teams. © NHL 2023. All Rights Reserved.
The get_points_by_age
is the workhorse function for this post but is also one of those functions that does just what you want it to do but it may not be broadly useful. For example, this function requests data for every player that is currently on an NHL roster. That’s a lot of data and the service may get cranky with you for requesting that much. The code is annotated (another cool feature of quarto) to outline the steps of this function. Click on the step and it will highlight the line of code it is referencing.
<- function(team, min_games_played, verbose = FALSE) {
get_points_by_age api_message(glue("Getting data for {team}."), verbose = verbose)
1<- query_endpoint("api/v1/teams") %>%
team_df filter(teamName == team)
2<- query_endpoint(team_df$link, expand = "team.roster")
team_plus_roster
## Drop goalies
<- team_plus_roster[["roster"]][["roster"]][[1]]
roster 3<- roster[roster$position != "G", ][["person"]]$link
person_link <- person_link[!is.na(person_link)]
person_link
4<- map_df(seq_along(person_link), ~ {
roster_with_ages query_endpoint(person_link[.x]) %>%
select(id, fullName, link, birthDate, currentAge) %>%
mutate(birthDate = as.Date(birthDate))
})
<- map_df(seq_along(roster_with_ages$link), ~ {
point_per_60_by_season <- glue("{roster_with_ages$link[.x]}/stats")
query_url
<- map_df(generate_year_string("19951996", "20222023"), \(season) {
season_stats 5<- query_endpoint(
stats
query_url,stats = "statsSingleSeason",
season = season
)
## when there are no stats, return an empty tibble
if (length(stats[["splits"]][[1]]) == 0) {
api_message(
glue("No data for {roster_with_ages$fullName[.x]} for the {season} season", verbose = verbose)
)return(tibble())
}api_message(
glue("Getting data for {roster_with_ages$fullName[.x]} for the {season} season", verbose = verbose)
)
6<- stats[["splits"]][[1]][["stat"]] %>%
games_played_threshold filter(games > min_games_played)
## when they are not above the threshold, return an empty tibble
if (nrow(games_played_threshold) == 0) {
api_message(glue("{roster_with_ages$fullName[.x]} did not play at least {min_games_played} games in the {season} season", verbose = verbose))
return(tibble())
}
%>%
games_played_threshold mutate(evenTimeOnIce = convert_time(evenTimeOnIce)) %>%
mutate(even_strength_points = points - powerPlayPoints) %>%
7mutate(even_strength_points_per_60 = (even_strength_points / evenTimeOnIce) * 3600) %>%
select(even_strength_points_per_60, shotPct) %>%
mutate(link = roster_with_ages$link[.x]) %>%
mutate(season = season) %>%
select(link, season, even_strength_points_per_60, shotPct)
})
season_stats
})
if (length(point_per_60_by_season) == 0) {
api_message(glue("No players over found for {team}.", verbose = verbose))
return(tibble())
else {
} %>%
point_per_60_by_season mutate(team = team) %>%
left_join(roster_with_ages, by = "link") %>%
8mutate(age_at_start_of_season = currentAge - (2022 - as.numeric(substr(season, 1, 4)))) %>%
filter(!is.na(fullName)) %>%
relocate(fullName, birthDate, season, team, .before = even_strength_points_per_60) %>%
9as_tibble()
} }
- 1
- Get the team id from the team name
- 2
- Get the roster for that team
- 3
- Filter out the goalies
- 4
- Query for the age of each player on the roster
- 5
- Get the stats for each player for each season when they actually played
- 6
- Filter out the players that don’t meet the minimum games played threshold
- 7
- Calculate the points per 60 minutes for each player
- 8
- Calculate the age of the player at the start of the season
- 9
- Return a tibble with the results
Use the helper functions to get the data
The way that I’ve written get_points_by_age
, you supply it a team name and it will look for stats on those players that are currently on the roster. You also are able to specify a threshold number of games played by individual players. So you would invoke it like this:
get_points_by_age("Penguins", min_games_played = 60, verbose = FALSE)
However, since we are interested in all players in the NHL, we need all teams. So first we can submit a query to the teams endpoint to get all the teams .
<- query_endpoint("api/v1/teams")
teams teams
# A tibble: 32 × 15
id name link venue$name abbreviation teamName locationName
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 New Jersey Devils /api… Prudentia… NJD Devils New Jersey
2 2 New York Islanders /api… UBS Arena NYI Islande… New York
3 3 New York Rangers /api… Madison S… NYR Rangers New York
4 4 Philadelphia Flyers /api… Wells Far… PHI Flyers Philadelphia
5 5 Pittsburgh Penguins /api… PPG Paint… PIT Penguins Pittsburgh
6 6 Boston Bruins /api… TD Garden BOS Bruins Boston
7 7 Buffalo Sabres /api… KeyBank C… BUF Sabres Buffalo
8 8 Montréal Canadiens /api… Bell Cent… MTL Canadie… Montréal
9 9 Ottawa Senators /api… Canadian … OTT Senators Ottawa
10 10 Toronto Maple Leafs /api… Scotiaban… TOR Maple L… Toronto
# ℹ 22 more rows
# ℹ 12 more variables: venue$link <chr>, $city <chr>, $timeZone <df[,3]>,
# $id <int>, firstYearOfPlay <chr>, division <df[,5]>, conference <df[,3]>,
# franchise <df[,3]>, shortName <chr>, officialSiteUrl <chr>,
# franchiseId <int>, active <lgl>
Then we use purrr::map_df
to iterate over each team and then bind the results together into a single tibble. This is the data we will be working with.
<- map_df(
age_curve_df $teamName, ~{
teamsget_points_by_age(.x, min_games_played = 40, verbose = interactive())
})
The pass off to ojs
Now that we have the data, we can pass it off to JavaScript. We do this by using the ojs_define
function. This function is available in R and python and it allows you to make data available to any ojs chunk in your quarto document. The age_curve_df
data we gathered in the previous steps will now be available for any ojs chunk as ojs_age_curve_df
regardless of where it is in the document.
ojs_define(ojs_age_curve_df = age_curve_df)
Working with Observable JavaScript
The rest of the code in this post is all written in ojs.
We do need import libraries (just like R) that don’t automatically come bundled with quarto. In this example, we are importing the arquero data wrangling library which as far as I can tell, is more or less dplyr for Observable JavaScript. Here is how they describe it:
inspired by the design of dplyr, Arquero provides a fluent API for manipulating column-oriented data frames.
import { aq, op } from '@uwdata/arquero';
Cool! I know I said that I was going to do most of my data wrangling in R but it really does help to be able to do some of it in JavaScript.
The first function that we are using is aq.from
which is the equivalent of dplyr::as_tibble
. It takes a data frame and converts it to an arquero table. We then call the view
method on the table to see what it looks like. We are also need to transpose
our data frame because arquero expects the data to be in a row-oriented format.
= aq.from(transpose(ojs_age_curve_df))
ageCurve .view() ageCurve
The question we are interested in looking at was at what age do players start to decline? Our evaluation metric will be even strength points per 60 minutes. Say, we are interested in looking at playing time for players that are 35 and older. We can filter the ageCurve
table using syntax that is very similar to dplyr. The only thing to note is what goes inside the aq.escape
function. This is because we are using a JavaScript function inside of a JavaScript function. This escapes the inner function so that it is evaluated properly.
ageCurve.filter(aq.escape(d => d.currentAge > 35))
Ok but you might asking yourself, why don’t I just do that in dplyr? The beauty of these ojs
chunks is that we let the user perform these operations on the fly using interactive tools like sliders, dropdown menus and radio buttons. For that we make use of ObservableJS Inputs
. Let’s make two of these – a slider and a dropdown. The slider (via Inputs.range
) will define the variable age_cutoff
and we can use that value to filter the ageCurve
table updating the filteredAgeCurve
data as a user interacts with the slider. We then take the filteredAgeCurve
to determine unique player names and provide those as the values in the dropdown menu (via Inputs.select
). One consequence of that is the values in the dropdown menu are dependent on the slider. Check for yourself. There are many fewer players to highlight at the age cutoff of 35.
It doesn’t get too exciting though until we start to visualize it. I am using the code-fold: true
chunk option to hide the code that generates the plot and the inputs so that you can see the plot, the slider and the dropdown more closely together. Have a look at the code comments for a better idea of what is going on.
Code
= Inputs.range([18, 40], {step: 1, value: 30})
viewof age_cutoff
// capture the value of the slider and filter the ageCurve table
= ageCurve
filteredAgeCurve .filter(aq.escape(d => d.currentAge > age_cutoff))
// get the unique player names alphabetically
= filteredAgeCurve
orderedNames .dedupe("fullName")
.orderby("fullName")
.array("fullName")
// setup the dropdown
= Inputs.select(orderedNames, {value: "Sidney Crosby"})
viewof player
// filter the filteredAgeCurve table to the selected player
= filteredAgeCurve
singlePlayer .filter(aq.escape(d => d.fullName == player))
Code
= function(data, color) {
plotMarksHelper return Plot.lineY(data, {
x: "age_at_start_of_season",
y: "even_strength_points_per_60",
z: "fullName",
stroke: color,
curve: "basis"
;
})
}
// setup the plot
.plot({
Ploty: {label: "Even Strength Points per 60"},
x: {
grid: true,
label: "Age at the start of the season"
,
}marks: [
// plot all lines above a certain age
// that filteredAgeCurve data changes as the slider changes
plotMarksHelper(filteredAgeCurve, "lightgray"),
// highlight the selected player in blue
plotMarksHelper(singlePlayer, "blue")
] })
We are using here the Plot
library which also comes bundled with ojs in quarto. I find the code to create plots in ojs pretty accessible. Allison Horst wrote a really nice transition guide from ggplot2 to Plot
that has proven to be very helpful. Plot
draws from the same grammar of graphics that ggplot2 does so it’s spirit should feel familiar. Visual properties, like colour, are mapped to variables in the data
I have not quite found the best way to write nice (read: not smelly) code in ojs but the results are undeniably useful. The plots are beautiful and interactive. Plot provides so many different opportunities for interactivity than other interactive plotting libraries like plotly or ggiraph. Those are great libraries but creating visualization from scratch in ojs feel more natural and flexible.
Conclusion
I would be remiss to just have a pretty plot and not say anything about the data. The slider and the player selector both provide a simple way to explore the data. A couple points to note:
- The overall trend is that many players start to decline at around 30. However there is some clear selection bias here. Players that are not producing at a high level are not going to be playing at 35.
- Sidney Crosby is a beast. He is still producing at a consistently high level at 35.
- The players that you would expect display truly stunning numbers. Nathan Mckinnon is just getting better and better and at 26, there is nothing to suggest that he is slowing down. Connor McDavid entered the league at a stunning pace and has consistency kept it up.
- Joel Pavelski is a complete of an outlier. He produced his best season at 38.
- Corey Perry is also an outlier but in the opposite direction. He has been in a strong decline since he was 30. And yet all he does is compete for Cups so clearly he’s still doing something right.
To wrap this up, I think that ojs is a great way to create interactive visualizations. I’ve tried to highlight the handoff from R to ojs and illustrate how you might go about creating a plot.
Photo by Jeremy Bishop on Unsplash