Talking (Hockey) Age with ObservableJS

R
JavaScript
ObservableJS
nhl
api
Author

Sam Albers

Published

September 2, 2023

Lately I’ve been thinking about two things: getting old and ObservableJS. Getting old is self-explanatory. ObservableJS (ojs) is maybe less so. Normally ojs is used in a JavaScript notebook environment that enables you to create amazing interactive visualizations. This is a great platform and I suggest that if you are really interested in learning about ojs, you create an account there and start practicing. But sometimes, you want to take your interactive visualizations with you. This is where quarto comes in. Quarto is the successor to rmarkdown and is what this blog is written in. Like rmarkdown you can write prose and code in the same document. The biggest improvement with quarto is that you can also (more) easily write code chunks in other language like python or ojs.

This post is about ojs and quarto using some data sourced from the National Hockey League api. So back to talking about being old – I was wondering what happens to NHL players when they get old? In particular what happens to their productivity? I was not able to find any interactive visualizations that satisfactorily provided any ability to explore this question. So then I ended up having to learn about the NHL api and how I could get what I wanted out of it. So this is also a post about wrangling data from the NHL api. Almost accidentally, this really highlights the power of quarto. I can much more readily wrangle data in R. Sure it is possible to do this solely in ojs but for me it is easier in R. But then when I want to make the visualization, I can seamlessly switch to ojs right in the same quarto doc.

Let’s start with how we get the data. We start with a pretty standard suite of tidyverse packages (and httr2).

library(httr2)
library(purrr)
library(dplyr)
library(glue)
library(tidyr)

Next we need to define some helper functions.

convert_time <- function(time_str) {
  # Split the time into minutes and seconds
  time_parts <- strsplit(time_str, ":")[[1]]
  # Convert to integers
  minutes <- as.integer(time_parts[1])
  seconds <- as.integer(time_parts[2])
  # Convert the time to seconds
  minutes * 60 + seconds
}

generate_year_string <- function(start, end) {
  # Extract the starting and ending years
  startYear <- as.numeric(substr(start, 1, 4))
  endYear <- as.numeric(substr(end, 1, 4))
  # Generate the sequence of years
  years <- seq(startYear, endYear)
  # Concatenate each year with the following one to get the year pairs
  paste0(years, years + 1)
}

convert_time turns a time string like 824:13 into the number of seconds.

convert_time("824:13")
[1] 49453

generate_year_string is a helper function to generate the year strings that the NHL api expects.

generate_year_string(19951996, 19981999)
[1] "19951996" "19961997" "19971998" "19981999"

This is a function to help sending messages to the console. Defining it here saves lots of space where I use it later on.

api_message <- function(..., verbose = TRUE) {
  if (verbose) message(...)
  invisible(TRUE)
}

query_endpoint is the core function that actually sends a data request to the NHL api. It takes an endpoint and a list of query parameters and returns a tibble of the results. It is always helpful to isolate the code that does the querying from the code that does the data wrangling.

query_endpoint <- function(endpoint, ...) {
  base_url <- "https://statsapi.web.nhl.com/"
  req <- httr2::request(base_url) %>%
    httr2::req_url_path_append(endpoint) %>%
    httr2::req_url_query(...)

  resp <- httr2::req_perform(req)

  parsed <- httr2::resp_body_json(resp, simplifyVector = TRUE)
  parsed$copyright <- NULL
  dplyr::as_tibble(parsed[[1]])
}

With every request to the NHL api, the following note is returned in the response, specifying exactly who owns all the trademarks and logos. We drop it from the response because it is not really useful for our data wrangling example but I am replicating it here:

Copyright

NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the property of the NHL and its teams. © NHL 2023. All Rights Reserved.

The get_points_by_age is the workhorse function for this post but is also one of those functions that does just what you want it to do but it may not be broadly useful. For example, this function requests data for every player that is currently on an NHL roster. That’s a lot of data and the service may get cranky with you for requesting that much. The code is annotated (another cool feature of quarto) to outline the steps of this function. Click on the step and it will highlight the line of code it is referencing.

get_points_by_age <- function(team, min_games_played, verbose = FALSE) {
  api_message(glue("Getting data for {team}."), verbose = verbose)

1  team_df <- query_endpoint("api/v1/teams") %>%
    filter(teamName == team)

2  team_plus_roster <- query_endpoint(team_df$link, expand = "team.roster")

  ## Drop goalies
  roster <- team_plus_roster[["roster"]][["roster"]][[1]]
3  person_link <- roster[roster$position != "G", ][["person"]]$link
  person_link <- person_link[!is.na(person_link)]

4  roster_with_ages <- map_df(seq_along(person_link), ~ {
    query_endpoint(person_link[.x]) %>%
      select(id, fullName, link, birthDate, currentAge) %>%
      mutate(birthDate = as.Date(birthDate))
  })

  point_per_60_by_season <- map_df(seq_along(roster_with_ages$link), ~ {
    query_url <- glue("{roster_with_ages$link[.x]}/stats")

    season_stats <- map_df(generate_year_string("19951996", "20222023"), \(season) {
5      stats <- query_endpoint(
        query_url,
        stats = "statsSingleSeason",
        season = season
      )

      ## when there are no stats, return an empty tibble
      if (length(stats[["splits"]][[1]]) == 0) {
        api_message(
          glue("No data for {roster_with_ages$fullName[.x]} for the {season} season", verbose = verbose)
        )
        return(tibble())
      }
      api_message(
        glue("Getting data for {roster_with_ages$fullName[.x]} for the {season} season", verbose = verbose)
      )

6      games_played_threshold <- stats[["splits"]][[1]][["stat"]] %>%
        filter(games > min_games_played)

      ## when they are not above the threshold, return an empty tibble
      if (nrow(games_played_threshold) == 0) {
        api_message(glue("{roster_with_ages$fullName[.x]} did not play at least {min_games_played} games in the {season} season", verbose = verbose))
        return(tibble())
      }

      games_played_threshold %>%
        mutate(evenTimeOnIce = convert_time(evenTimeOnIce)) %>%
        mutate(even_strength_points = points - powerPlayPoints) %>%
7        mutate(even_strength_points_per_60 = (even_strength_points / evenTimeOnIce) * 3600) %>%
        select(even_strength_points_per_60, shotPct) %>%
        mutate(link = roster_with_ages$link[.x]) %>%
        mutate(season = season) %>%
        select(link, season, even_strength_points_per_60, shotPct)
    })
    season_stats
  })

  if (length(point_per_60_by_season) == 0) {
    api_message(glue("No players over found for {team}.", verbose = verbose))
    return(tibble())
  } else {
    point_per_60_by_season %>%
      mutate(team = team) %>%
      left_join(roster_with_ages, by = "link") %>%
8      mutate(age_at_start_of_season = currentAge - (2022 - as.numeric(substr(season, 1, 4)))) %>%
      filter(!is.na(fullName)) %>%
      relocate(fullName, birthDate, season, team, .before = even_strength_points_per_60) %>%
9      as_tibble()
  }
}
1
Get the team id from the team name
2
Get the roster for that team
3
Filter out the goalies
4
Query for the age of each player on the roster
5
Get the stats for each player for each season when they actually played
6
Filter out the players that don’t meet the minimum games played threshold
7
Calculate the points per 60 minutes for each player
8
Calculate the age of the player at the start of the season
9
Return a tibble with the results

Use the helper functions to get the data

The way that I’ve written get_points_by_age, you supply it a team name and it will look for stats on those players that are currently on the roster. You also are able to specify a threshold number of games played by individual players. So you would invoke it like this:

get_points_by_age("Penguins", min_games_played = 60, verbose = FALSE)

However, since we are interested in all players in the NHL, we need all teams. So first we can submit a query to the teams endpoint to get all the teams .

teams <- query_endpoint("api/v1/teams")
teams
# A tibble: 32 × 15
      id name                link  venue$name abbreviation teamName locationName
   <int> <chr>               <chr> <chr>      <chr>        <chr>    <chr>       
 1     1 New Jersey Devils   /api… Prudentia… NJD          Devils   New Jersey  
 2     2 New York Islanders  /api… UBS Arena  NYI          Islande… New York    
 3     3 New York Rangers    /api… Madison S… NYR          Rangers  New York    
 4     4 Philadelphia Flyers /api… Wells Far… PHI          Flyers   Philadelphia
 5     5 Pittsburgh Penguins /api… PPG Paint… PIT          Penguins Pittsburgh  
 6     6 Boston Bruins       /api… TD Garden  BOS          Bruins   Boston      
 7     7 Buffalo Sabres      /api… KeyBank C… BUF          Sabres   Buffalo     
 8     8 Montréal Canadiens  /api… Bell Cent… MTL          Canadie… Montréal    
 9     9 Ottawa Senators     /api… Canadian … OTT          Senators Ottawa      
10    10 Toronto Maple Leafs /api… Scotiaban… TOR          Maple L… Toronto     
# ℹ 22 more rows
# ℹ 12 more variables: venue$link <chr>, $city <chr>, $timeZone <df[,3]>,
#   $id <int>, firstYearOfPlay <chr>, division <df[,5]>, conference <df[,3]>,
#   franchise <df[,3]>, shortName <chr>, officialSiteUrl <chr>,
#   franchiseId <int>, active <lgl>

Then we use purrr::map_df to iterate over each team and then bind the results together into a single tibble. This is the data we will be working with.

age_curve_df <- map_df(
  teams$teamName, ~{
    get_points_by_age(.x, min_games_played = 40, verbose = interactive())
  })

The pass off to ojs

Now that we have the data, we can pass it off to JavaScript. We do this by using the ojs_define function. This function is available in R and python and it allows you to make data available to any ojs chunk in your quarto document. The age_curve_df data we gathered in the previous steps will now be available for any ojs chunk as ojs_age_curve_df regardless of where it is in the document.

ojs_define(ojs_age_curve_df = age_curve_df)

Working with Observable JavaScript

The rest of the code in this post is all written in ojs.

We do need import libraries (just like R) that don’t automatically come bundled with quarto. In this example, we are importing the arquero data wrangling library which as far as I can tell, is more or less dplyr for Observable JavaScript. Here is how they describe it:

inspired by the design of dplyr, Arquero provides a fluent API for manipulating column-oriented data frames.

import { aq, op } from '@uwdata/arquero';

Cool! I know I said that I was going to do most of my data wrangling in R but it really does help to be able to do some of it in JavaScript.

The first function that we are using is aq.from which is the equivalent of dplyr::as_tibble. It takes a data frame and converts it to an arquero table. We then call the view method on the table to see what it looks like. We are also need to transpose our data frame because arquero expects the data to be in a row-oriented format.

ageCurve = aq.from(transpose(ojs_age_curve_df))
ageCurve.view()

The question we are interested in looking at was at what age do players start to decline? Our evaluation metric will be even strength points per 60 minutes. Say, we are interested in looking at playing time for players that are 35 and older. We can filter the ageCurve table using syntax that is very similar to dplyr. The only thing to note is what goes inside the aq.escape function. This is because we are using a JavaScript function inside of a JavaScript function. This escapes the inner function so that it is evaluated properly.

ageCurve
  .filter(aq.escape(d => d.currentAge > 35))

Ok but you might asking yourself, why don’t I just do that in dplyr? The beauty of these ojs chunks is that we let the user perform these operations on the fly using interactive tools like sliders, dropdown menus and radio buttons. For that we make use of ObservableJS Inputs. Let’s make two of these – a slider and a dropdown. The slider (via Inputs.range) will define the variable age_cutoff and we can use that value to filter the ageCurve table updating the filteredAgeCurve data as a user interacts with the slider. We then take the filteredAgeCurve to determine unique player names and provide those as the values in the dropdown menu (via Inputs.select). One consequence of that is the values in the dropdown menu are dependent on the slider. Check for yourself. There are many fewer players to highlight at the age cutoff of 35.

It doesn’t get too exciting though until we start to visualize it. I am using the code-fold: true chunk option to hide the code that generates the plot and the inputs so that you can see the plot, the slider and the dropdown more closely together. Have a look at the code comments for a better idea of what is going on.

Code
viewof age_cutoff = Inputs.range([18, 40], {step: 1, value: 30})

// capture the value of the slider and filter the ageCurve table
filteredAgeCurve = ageCurve
  .filter(aq.escape(d => d.currentAge > age_cutoff)) 

// get the unique player names alphabetically
orderedNames = filteredAgeCurve
  .dedupe("fullName")
  .orderby("fullName")
  .array("fullName")

// setup the dropdown
viewof player = Inputs.select(orderedNames, {value: "Sidney Crosby"})

// filter the filteredAgeCurve table to the selected player
singlePlayer = filteredAgeCurve
  .filter(aq.escape(d => d.fullName == player))
Code
plotMarksHelper = function(data, color) {
  return Plot.lineY(data, {
    x: "age_at_start_of_season", 
    y: "even_strength_points_per_60", 
    z: "fullName",
    stroke: color,
    curve: "basis"
  });
}

// setup the plot
Plot.plot({
  y: {label: "Even Strength Points per 60"},
  x: {
    grid: true,
    label: "Age at the start of the season"
  },
  marks: [
    // plot all lines above a certain age
    // that filteredAgeCurve data changes as the slider changes
    plotMarksHelper(filteredAgeCurve, "lightgray"), 
    // highlight the selected player in blue
    plotMarksHelper(singlePlayer, "blue") 
  ]
})

We are using here the Plot library which also comes bundled with ojs in quarto. I find the code to create plots in ojs pretty accessible. Allison Horst wrote a really nice transition guide from ggplot2 to Plot that has proven to be very helpful. Plot draws from the same grammar of graphics that ggplot2 does so it’s spirit should feel familiar. Visual properties, like colour, are mapped to variables in the data

I have not quite found the best way to write nice (read: not smelly) code in ojs but the results are undeniably useful. The plots are beautiful and interactive. Plot provides so many different opportunities for interactivity than other interactive plotting libraries like plotly or ggiraph. Those are great libraries but creating visualization from scratch in ojs feel more natural and flexible.

Conclusion

I would be remiss to just have a pretty plot and not say anything about the data. The slider and the player selector both provide a simple way to explore the data. A couple points to note:

  • The overall trend is that many players start to decline at around 30. However there is some clear selection bias here. Players that are not producing at a high level are not going to be playing at 35.
  • Sidney Crosby is a beast. He is still producing at a consistently high level at 35.
  • The players that you would expect display truly stunning numbers. Nathan Mckinnon is just getting better and better and at 26, there is nothing to suggest that he is slowing down. Connor McDavid entered the league at a stunning pace and has consistency kept it up.
  • Joel Pavelski is a complete of an outlier. He produced his best season at 38.
  • Corey Perry is also an outlier but in the opposite direction. He has been in a strong decline since he was 30. And yet all he does is compete for Cups so clearly he’s still doing something right.

To wrap this up, I think that ojs is a great way to create interactive visualizations. I’ve tried to highlight the handoff from R to ojs and illustrate how you might go about creating a plot.

Photo by Jeremy Bishop on Unsplash