In January, I was lucky enough to attend the 2020 edition of RStudio::conf. Perhaps predictably, the conference and workshops were exceptional and to see all the wonderful things that folks in the R community are capable of was quite inspiring. People are really quite clever. Attending the tidy dev day was such a nice epilogue to the conference because after spending so much time listening to people talk about their code, I was pretty keen to crack open R and have at it myself. Before I lose everything from the conference to memory leaks, I am going to try to catalogue a few things that I learned at the conference by trying to weave them together into a single workflow.
A brief detour about where to get these packages
Several of the packages that I am using here are at development stages and aren’t yet on CRAN. I’m including the installation instructions here but eventually this process should be as easy as the typical
install.packages. Here I am using the nightly build of the
arrow package which can be installed like this:
install.packages('arrow', repos = 'https://dl.bintray.com/ursalabs/arrow-r')
And then for
ggtext, which isn’t on CRAN, we install it (and the dev version of
ggplot2) from GitHub:
Other packages that I am using are loaded here:
library(fs) library(purrr) library(dplyr) library(arrow) library(ggtext) library(ggplot2) library(lubridate) library(here) library(stringr) library(glue)
One of my main goals at the conference was to find out new ways of dealing with very big flat files. I work in an environment where big flat files are sort of our only option. Adding them to a proper database is not currently possible but I was hopeful that maybe the Apache Arrow project might offer up some solutions. I was not disappointed. Neal Richardson with UrsaLabs gave a great presentation on the status of the project with a specific focus on the R package
Here I am mostly parroting what Neal did with his presentation just replacing taxi data with Canadian hydrometric data. Whether we are provisioned data that way or create it ourselves, consider data organized in a hierarchical folder structure. Here at the top level we have Canadian
## C:/_dev/GitHub_repos/samalbers.science/data/rivers-data ## +-- AB ## +-- BC ## +-- MB ## +-- NB ## +-- NL ## +-- NS ## +-- NT ## +-- NU ## +-- ON ## +-- PE ## +-- QC ## +-- SK ## \-- YT
province folder we have
## C:/_dev/GitHub_repos/samalbers.science/data/rivers-data/AB ## +-- 2017 ## \-- 2018
year folder we have
## C:/_dev/GitHub_repos/samalbers.science/data/rivers-data/AB/2017 ## +-- 01 ## +-- 02 ## +-- 03 ## +-- 04 ## +-- 05 ## +-- 06 ## +-- 07 ## +-- 08 ## +-- 09 ## +-- 10 ## +-- 11 ## \-- 12
and finally within that directory you actually have your data file:
## C:/_dev/GitHub_repos/samalbers.science/data/rivers-data/AB/2017/01 ## \-- rivers.parquet
Normally in this situation my approach would be to do some sort of iterative process over each file (mind you still making use of
arrow to read the parquet file):
df_rivs <- list.files(here('data/rivers-data/'), pattern = '*.parquet', recursive = TRUE, full.names = TRUE) %>% map_dfr(read_parquet)
From there we might execute some typical sequence designed to filter our data down to a more manageable size.
df_rivs %>% filter(year(Date) == 2017) %>% filter(Parameter == 'Flow') %>% arrange(Date)
## # A tibble: 456,448 x 5 ## STATION_NUMBER Date Parameter Value Symbol ## <chr> <date> <chr> <dbl> <chr> ## 1 05AA008 2017-01-01 Flow 1.75 B ## 2 05AC003 2017-01-01 Flow 0.954 B ## 3 05AC012 2017-01-01 Flow 0.736 B ## 4 05AC941 2017-01-01 Flow 1.03 <NA> ## 5 05AD007 2017-01-01 Flow 24.6 B ## 6 05AE027 2017-01-01 Flow 4.81 B ## 7 05AG006 2017-01-01 Flow 24.7 B ## 8 05AH005 2017-01-01 Flow NA <NA> ## 9 05AH037 2017-01-01 Flow NA <NA> ## 10 05AH044 2017-01-01 Flow NA <NA> ## # ... with 456,438 more rows
What we learned in Neal’s presentation was the magic of the
open_dataset function and specifically its ability to map hierarchical directory structure to virtual columns in your data. If we read just one parquet file, it is apparent that there aren’t any
## # A tibble: 2,263 x 5 ## STATION_NUMBER Date Parameter Value Symbol ## <chr> <date> <chr> <dbl> <chr> ## 1 05AA008 2017-01-01 Flow 1.75 B ## 2 05AC003 2017-01-01 Flow 0.954 B ## 3 05AC012 2017-01-01 Flow 0.736 B ## 4 05AC941 2017-01-01 Flow 1.03 <NA> ## 5 05AD007 2017-01-01 Flow 24.6 B ## 6 05AE027 2017-01-01 Flow 4.81 B ## 7 05AG006 2017-01-01 Flow 24.7 B ## 8 05AH005 2017-01-01 Flow NA <NA> ## 9 05AH037 2017-01-01 Flow NA <NA> ## 10 05AH044 2017-01-01 Flow NA <NA> ## # ... with 2,253 more rows
Instead, if we assign partitions, using a vector based on the directory structure,
open_dataset can use that information to efficiently subset a larger dataset.
rivs <- open_dataset(here('data/rivers-data/'), partitioning = c('province','year', 'month')) rivs
## Dataset ## STATION_NUMBER: string ## Date: date32[day] ## Parameter: string ## Value: double ## Symbol: string ## province: string ## year: int32 ## month: int32 ## ## See $metadata for additional Schema metadata
Best of all, the
rename dplyr verbs are implemented much like
dbplyr and your query is executed lazily taking advantage of both the directory structure and the parquet files.
river_data <- rivs %>% filter(year == 2017) %>% filter(province %in% c('BC', 'YT', 'AB', 'SK', 'NL')) %>% filter(Parameter == 'Flow') %>% group_by(STATION_NUMBER, province) %>% collect() %>% arrange(Date)
While YMMV depending on the size of your data, using folder structure is a nifty way to only access the data we actually need. The Apache Arrow project, and for R users the
arrow package, are proceeding very nicely. Now that we have efficiently pared down our river flow data, the next exciting thing I want to explore is some really cool developments in the
ggtext package by Claus Wilke provides much improved rendering support for
ggplot2. What feels like should a simple task (e.g. colour some portion of text) is sometimes quite onerous in
ggplot2. Though the package is still in its infancy,
ggtext is breaking trail on making these steps much easier by providing a mini markdown engine directly inside
ggplot2. After audible ohhs and ahhs from the crowd while demoing
ggtext, Claus observed “I can see this fills a need”. Already it provides some support for markdown and html rendering.
So how can we use it to better visualize our river data? Because
ggtext has some minimal html rendering, we can actually include images right inside the ggplot call. My idea was to try and see if I could include provincial flags as axes labels in the plot. This requires steps to:
- get the files
- extract the province name from the file name
- create a new column in our rivers data because the provinces aren’t labelled in the same way
- glue the image names with the html snippets
I won’t go into too much detail but here are the steps:
Get the Data
dl <- file.path(tempdir(), 'flag.zip') download.file('https://www.ederflag.com/images/catalog_images/canadian-province-and-territory-images.zip', dest = dl) unzip(dl, exdir = here('data/flags')) unlink(dl)
flag_paths <- dir_ls(here('data/flags'), glob = '*.jpg') province_full <- path_file(flag_paths) %>% path_ext_remove() %>% str_sub(start = 12) %>% str_remove('-flag') %>% str_remove('-image') %>% str_replace_all('-', ' ') %>% str_to_title()
Crosswalk Province Names
river_data <- river_data %>% left_join(tibble(province_full = province_full, province = c('AB', 'BC', 'MB', 'NB', 'NL', 'NT', 'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT')), by = c('province'))
Design Nicer Plots
RStudio::conf had a thread of design thinking running through the workshops and conference. From the tidyverse’s near obsession with the design of their api to the inclusion of a live episode of Not So Standard deviation as a keynote, thinking about data science from a design perspective was a key theme that emerged for me. One example of this was Will Chase’s wonderful talk on the Glamour of Graphics. Will presented some very thoughtful approaches to creating better visualizations. I am going to
butcher apply some of those approaches to our plot above.
First of our plot is rather uninformative from hydrological perspective. A reasonable goal for this plot would be to aid the user to evaluate the distribution of annual river flow by province. In the above plot, the extreme values are stretching the scale too far out so let’s limit our analysis to rivers that output less that 10,000 m3/s per year.
annual_flow_sub <- annual_flow %>% filter(annual_flow < 10000)
Also the basic point plot doesn’t give us a great way to look at the distribution. For that task, another of Claus’s packages,
ggridges comes in handy.
ggridges is great for visualizing distributions and also forces us to flip the axes creating a more natural information flow (at least for those of us that read left to right).
library(ggridges) annual_flow_sub %>% ggplot(aes(y = province_full, x = annual_flow)) + geom_density_ridges() + scale_y_discrete(name = NULL, labels = img_tags) + theme(axis.text.y = element_markdown(color = 'black', size = 11))
A great line from Will’s presentation pertained to white space:
White space is like garlic; take the amount you think you need, then triple it.
Right then let’s create some more white space by getting rid of the classic ggplot2 grey background. Here we can also tweak the height of the ridges to better show the distributions.
annual_flow_sub %>% ggplot(aes(y = province_full, x = annual_flow)) + geom_density_ridges(scale = 1) + scale_y_discrete(name = NULL, labels = img_tags) + theme_minimal() + theme(axis.text.y = element_markdown(color = 'black', size = 11))
Ok looking a bit better. Another one of Will’s suggestion is to remove grid lines as much as possible. I basically agree and just keep the minor x values.
annual_flow_sub %>% ggplot(aes(y = province_full, x = annual_flow)) + geom_density_ridges(scale = 1) + scale_y_discrete(name = NULL, labels = img_tags) + theme_minimal() + theme(axis.text.y = element_markdown(color = 'black', size = 11), panel.grid.major = element_blank(), panel.grid.minor.y = element_blank())
✔️ Now we need some colour here. As Will stated colour is hard. My goal here is pretty modest. I just want to distinguish between provinces. To do that I am actually going to steal some colour from the flags and manually map those to fill in colour for the ridges. At the same time I am going to add some transparency to the ridges. I am going to deviate a little from Will’s advice here and keep the legend. I often get this way with plots and err on the side of caution. In this case I am thinking that folks won’t recognize the flags and therefore will use the legend. In general though I do like the approach of forcing legends to justify their existence - they need to earn their keep.
flag_cols <- c('#3853a4', '#f3ec18', '#da1a33', '#006b35', '#0054a5') annual_flow_sub %>% ggplot(aes(y = province_full, x = annual_flow, fill = province_full)) + geom_density_ridges(scale = 1, alpha = 0.5) + scale_fill_manual(values = flag_cols) + scale_y_discrete(name = NULL, labels = img_tags) + theme_minimal() + theme(axis.text.y = element_markdown(color = 'black', size = 11), panel.grid.major = element_blank(), panel.grid.minor.y = element_blank(), legend.position = 'bottom', legend.justification='right')
Lastly this plot needs a title, which according to Will’s sage advice is also a great way to remove axes labels - just explain it in the title.
annual_flow_sub %>% ggplot(aes(y = province_full, x = annual_flow, fill = province_full)) + geom_density_ridges(scale = 1, alpha = 0.5) + scale_fill_manual(name = NULL, values = flag_cols) + scale_y_discrete(name = NULL, labels = img_tags) + labs(title = 'Smoothed distribution of annual flow of gauged rivers (m^3^ s^-1^) by province') + theme_minimal() + theme(axis.text.y = element_markdown(color = 'black', size = 11), axis.title.x = element_blank(), panel.grid.major = element_blank(), panel.grid.minor.y = element_blank(), plot.title.position = 'plot', plot.title = element_markdown(size = 15), legend.position = 'bottom', legend.justification='right')
From a visualization perspective this isn’t perfect or even great. Legends are still problematic, I don’t even know if the flags add anything and grid lines feel like an addiction. Still I think this does provide a decent overview of the types of rivers that are gauged in each province. The Yukon gauges bigger rivers while Saskatchewan gauges many smaller rivers. BC and Newfoundland gauges a wide range of rivers. Alberta gauges rivers that reflect both its mountain and prairies landscapes.
Back to RStudio conf
This has been a mini tour through some concept and packages I took in while attending RStudio::conf 2020. I can’t wait to spend more time with these packages as they mature and development. Every time I connect with the R community, I am grateful to be part of it. RStudio itself presents itself with class and respect all while creating a positive and inclusive space. I’m looking forward to future opportunities to connect with all you nerds!