Exploring and Extracting Data in Lists

Attribution

This lesson was created and is copyrighted by Jenny Bryan, available at https://jennybc.github.io/purrr-tutorial/ls00_inspect-explore.html and distributed under the terms of a Creative Commons BY-NC 4.0 License. It has been adapted by Clarke Iakovakis and the adaptation is likewise distributed under a Creative Commons BY-NC 4.0 License.

Binder link to this notebook:

https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=jennybc_lists_lesson.ipynb

Load packages

Load purrr and repurrrsive, which contains recursive list examples.

library(purrr)
library(repurrrsive)

Inspect and explore

List inspection is very important and also fairly miserable. Before you can apply a function to every element of a list, you’d better understand the list!

You need to develop a toolkit for list inspection. Be on the lookout for:

What is the length of the list?
Are the components homogeneous, i.e. do they have the same overall structure, albeit containing different data?
Note the length, names, and types of the constituent objects.
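
The checklist above maps onto a handful of base R calls. Here is a minimal sketch on a small made-up list; `pets` and its fields are hypothetical and are not part of repurrrsive.

```r
# A small, hypothetical list of records (not from repurrrsive).
pets <- list(
  list(name = "Rex", age = 3L, toys = c("ball", "rope")),
  list(name = "Tom", age = 5L, toys = character(0))
)

# How long is the list?
length(pets)

# Are the components homogeneous? Compare names and lengths across elements.
lapply(pets, names)
lengths(pets)

# What are the types of the constituent objects in one element?
vapply(pets[[1]], typeof, character(1))
```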

I have no idea what’s in this list or what its structure is! Please send help.

Understand this is situation normal, especially when your list comes from querying a poorly documented API. This is often true even when your list has been created completely within R. How many of us perfectly understand the structure of a fitted linear model object? You just have to embark on a voyage of discovery and figure out what’s in there. Happy trails.

Indexing, review

Remember, there are 3 ways to pull elements out of a list:

The $ operator. Extracts a single element by name. Name can be unquoted, if syntactic.
```
x <- list(a = "a", b = 2)
x$a
#> [1] "a"
x$b
#> [1] 2
```
[[ a.k.a. double square bracket. Extracts a single element by name or position. Name must be quoted, if provided directly. Name or position can also be stored in a variable.
```
x <- list(a = "a", b = 2)
x[["a"]]
#> [1] "a"
x[[2]]
#> [1] 2
nm <- "a"
x[[nm]]
#> [1] "a"
i <- 2
x[[i]]
#> [1] 2
```

[ a.k.a. single square bracket. Regular vector indexing. For a list input, this always returns a list!

x <- list(a = "a", b = 2)
x["a"]
#> $a
#> [1] "a"
x[c("a", "b")]
#> $a
#> [1] "a"
#> 
#> $b
#> [1] 2
x[c(FALSE, TRUE)]
#> $b
#> [1] 2
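
The practical difference between `[` and `[[` is easiest to see by checking what comes back. A short sketch, reusing the same `x` as above:

```r
x <- list(a = "a", b = 2)

# Single bracket keeps the list wrapper, even for one element.
is.list(x["a"])

# Double bracket and $ return the element itself.
is.list(x[["a"]])
is.character(x$a)

# Arithmetic works on the extracted element ...
x[["b"]] * 2

# ... but not on the one-element list returned by single bracket.
res <- tryCatch(x["b"] * 2, error = function(e) "error")
res
```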

`str()`

str() can help with basic list inspection, although it’s still rather frustrating. Learn to love the max.level and list.len arguments. You can use them to keep the output of str() down to a manageable volume.

Once you begin to suspect or trust that your list is homogeneous, i.e. consists of sub-lists with similar structure, it’s often a good idea to do an in-depth study of a single element. In general, remember you can combine list inspection via str(..., list.len = x, max.level = y) with single [ and double [[ square bracket indexing.
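
As a rough illustration of combining these arguments with indexing, here is a sketch on a small invented list. `nested` and its fields are hypothetical stand-ins for a messier real object.

```r
# A small, hypothetical nested list.
nested <- list(
  list(id = 1L, tags = list("a", "b"), meta = list(ok = TRUE)),
  list(id = 2L, tags = list("c"), meta = list(ok = FALSE)),
  list(id = 3L, tags = list(), meta = list(ok = TRUE))
)

# Top level only: no descent into the sub-lists.
str(nested, max.level = 1)

# Show at most two elements within each level.
str(nested, list.len = 2)

# In-depth look at a single element, reached via [[ indexing.
str(nested[[1]], max.level = 1)
```

Limiting `max.level` trims depth, while `list.len` trims breadth; on large lists you will usually want both.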

The repurrrsive package provides examples of lists. We explore them below, to lay the groundwork for other lessons, and to demonstrate list inspection strategies.

listviewer and RStudio’s Object Explorer

The RStudio IDE (v1.1 and higher) offers an Object Explorer that provides interactive inspection and code generation tools for hierarchical objects, such as lists. You can invoke it via the GUI or in code as View(YOUR_UGLY_LIST).

However, that won’t help you expose list exploration in something like this website. I am using the listviewer package to do this below. It allows you to expose list exploration in a rendered .Rmd document. To replicate this experience locally, call, e.g., listviewer::jsonedit(got_chars, mode = "view").

library(listviewer)

Wes Anderson color palettes

wesanderson is a simple list containing color palettes from the wesanderson package. Each component is a palette, named after a movie, and contains a character vector of colors as hexadecimal triplets.

str(wesanderson)
#> List of 15
#>  $ GrandBudapest : chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
#>  $ Moonrise1     : chr [1:4] "#F3DF6C" "#CEAB07" "#D5D5D3" "#24281A"
#>  $ Royal1        : chr [1:4] "#899DA4" "#C93312" "#FAEFD1" "#DC863B"
#>  $ Moonrise2     : chr [1:4] "#798E87" "#C27D38" "#CCC591" "#29211F"
#>  $ Cavalcanti    : chr [1:5] "#D8B70A" "#02401B" "#A2A475" "#81A88D" ...
#>  $ Royal2        : chr [1:5] "#9A8822" "#F5CDB4" "#F8AFA8" "#FDDDA0" ...
#>  $ GrandBudapest2: chr [1:4] "#E6A0C4" "#C6CDF7" "#D8A499" "#7294D4"
#>  $ Moonrise3     : chr [1:5] "#85D4E3" "#F4B5BD" "#9C964A" "#CDC08C" ...
#>  $ Chevalier     : chr [1:4] "#446455" "#FDD262" "#D3DDDC" "#C7B19C"
#>  $ Zissou        : chr [1:5] "#3B9AB2" "#78B7C5" "#EBCC2A" "#E1AF00" ...
#>  $ FantasticFox  : chr [1:5] "#DD8D29" "#E2D200" "#46ACC8" "#E58601" ...
#>  $ Darjeeling    : chr [1:5] "#FF0000" "#00A08A" "#F2AD00" "#F98400" ...
#>  $ Rushmore      : chr [1:5] "#E1BD6D" "#EABE94" "#0B775E" "#35274A" ...
#>  $ BottleRocket  : chr [1:7] "#A42820" "#5F5647" "#9B110E" "#3F5151" ...
#>  $ Darjeeling2   : chr [1:5] "#ECCBAE" "#046C9A" "#D69C4E" "#ABDDDE" ...

Explore `wesanderson`

You can get a similar experience in RStudio via View(wesanderson).

Game of Thrones POV characters

got_chars is a list with information on the 30 point-of-view characters from the first five books in the A Song of Ice and Fire series by George R. R. Martin. The data were retrieved from An API Of Ice And Fire. Each component corresponds to one character and contains 18 components, which are named atomic vectors of various lengths and types.

str(got_chars, list.len = 3)
#> List of 30
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1022"
#>   ..$ id         : int 1022
#>   ..$ name       : chr "Theon Greyjoy"
#>   .. [list output truncated]
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1052"
#>   ..$ id         : int 1052
#>   ..$ name       : chr "Tyrion Lannister"
#>   .. [list output truncated]
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1074"
#>   ..$ id         : int 1074
#>   ..$ name       : chr "Victarion Greyjoy"
#>   .. [list output truncated]
#>   [list output truncated]
str(got_chars[[1]], list.len = 8)
#> List of 18
#>  $ url        : chr "https://www.anapioficeandfire.com/api/characters/1022"
#>  $ id         : int 1022
#>  $ name       : chr "Theon Greyjoy"
#>  $ gender     : chr "Male"
#>  $ culture    : chr "Ironborn"
#>  $ born       : chr "In 278 AC or 279 AC, at Pyke"
#>  $ died       : chr ""
#>  $ alive      : logi TRUE
#>   [list output truncated]

Explore `got_chars`

You can get a similar experience in RStudio via View(got_chars).

GitHub users and repositories

gh_users is a list with information on 6 GitHub users. gh_repos is a nested list, also of length 6, where each component is another list with information on up to 30 of that user’s repositories. Both were retrieved from the GitHub API.

str(gh_users, max.level = 1)
#> List of 6
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30

Explore `gh_users`

You can get a similar experience in RStudio via View(gh_users).

Explore `gh_repos`

You can get a similar experience in RStudio via View(gh_repos).

Exercises

Read the documentation on str(). What does max.level control? Apply str() to wesanderson and/or got_chars and experiment with max.level = 0, max.level = 1, and max.level = 2. Which will you use in practice with deeply nested lists?
What does the list.len argument of str() control? What is its default value? Call str() on got_chars and then on a single component of got_chars with list.len set to a value much smaller than the default. What range of values do you think you’ll use in real life?
Call str() on got_chars, specifying both max.level and list.len.
Call str() on the first element of got_chars, i.e. the first Game of Thrones character. Use what you’ve learned to pick an appropriate combination of max.level and list.len.

Vectorized and “list-ized” operations

Recall that many operations “just work” in a vectorized fashion in R:

(3:5) ^ 2
#> [1]  9 16 25
sqrt(c(9, 16, 25))
#> [1] 3 4 5

Through the magic of R, the operations “raise to the power of 2” and “take the square root” were applied to each individual element of the numeric vector input. Someone – but not you! – has written a for() loop:

for (i in 1:n) {
  output[[i]] <- f(input[[i]])
}

Automatic vectorization is possible because our input is an atomic vector: the individual atoms are always of length one, always of uniform type.

What if the input is a list? You have to be more intentional to apply a function f() to each element of a list, i.e. to “list-ize” computation. This makes sense because the data structure itself does not guarantee that it makes any sense at all to apply a common function f() to each element of the list. You must guarantee that.
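
To make the loop sketched above concrete for a list input, here is one possible runnable version. The list `input` and the choice of `mean` as `f` are illustrative assumptions.

```r
# A list whose elements are numeric vectors of different lengths.
input <- list(c(1, 2, 3), c(10, 20), 5)

# The function to apply to each element (an arbitrary choice for this sketch).
f <- mean

# Pre-allocate the output list, then fill it element by element.
output <- vector("list", length(input))
for (i in seq_along(input)) {
  output[[i]] <- f(input[[i]])
}

str(output)
```

Note that nothing in the list itself guarantees that `mean()` is meaningful for every element; that is the responsibility the paragraph above describes.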

purrr::map() is a function for applying a function to each element of a list. The closest base R function is lapply(). Here’s how the square root example above would look if the input were in a list.

map(list(9, 16, 25), sqrt)
#> [[1]]
#> [1] 3
#> 
#> [[2]]
#> [1] 4
#> 
#> [[3]]
#> [1] 5
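
For this simple case, map() and lapply() should agree exactly. A quick check, with `nums` as an illustrative name:

```r
library(purrr)

nums <- list(9, 16, 25)

via_map <- map(nums, sqrt)
via_lapply <- lapply(nums, sqrt)

# Both return a plain list with one element per input element.
identical(via_map, via_lapply)
```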

A template for basic map() usage:

map(YOUR_LIST, YOUR_FUNCTION)

Below we explore these useful features of purrr::map() and friends:

Shortcuts for YOUR_FUNCTION when you want to extract list elements by name or position
Simplify and specify the type of output via map_chr(), map_lgl(), etc.

This is where you begin to see the differences between purrr::map() and base::lapply().

Name and position shortcuts

Who are these Game of Thrones characters?

We want the elements with name “name”, so we do this (we restrict to the first few elements purely to conserve space):

map(got_chars[1:4], "name")
#> [[1]]
#> [1] "Theon Greyjoy"
#> 
#> [[2]]
#> [1] "Tyrion Lannister"
#> 
#> [[3]]
#> [1] "Victarion Greyjoy"
#> 
#> [[4]]
#> [1] "Will"

We are exploiting one of purrr’s most useful features: a shortcut to create a function that extracts an element based on its name.

A companion shortcut is used if you provide a positive integer to map(). This creates a function that extracts an element based on position.

The 3rd element of each character’s list is his or her name and we get them like so:

map(got_chars[5:8], 3)
#> [[1]]
#> [1] "Areo Hotah"
#> 
#> [[2]]
#> [1] "Chett"
#> 
#> [[3]]
#> [1] "Cressen"
#> 
#> [[4]]
#> [1] "Arianne Martell"

To recap, here are two shortcuts for making the .f function that map() will apply:

provide “TEXT” to extract the element named “TEXT”
- equivalent to function(x) x[["TEXT"]]
provide i to extract the i-th element
- equivalent to function(x) x[[i]]
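
The stated equivalences can be checked directly. This sketch uses a tiny invented list; `chars` and its contents are hypothetical and are not part of repurrrsive.

```r
library(purrr)

# Hypothetical records standing in for got_chars.
chars <- list(
  list(id = 1L, name = "Ann"),
  list(id = 2L, name = "Bob")
)

# Name shortcut versus the explicit function it stands for.
identical(map(chars, "name"), map(chars, function(x) x[["name"]]))

# Position shortcut versus the explicit function it stands for.
identical(map(chars, 2), map(chars, function(x) x[[2]]))
```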

You will frequently see map() used together with the pipe %>%. These calls apply the same two shortcuts as above, this time to the full got_chars list.

got_chars %>% 
  map("name")
got_chars %>% 
  map(3)

Exercises

Use names() to inspect the names of the list elements associated with a single character. What is the index or position of the playedBy element? Use the character and position shortcuts to extract the playedBy elements for all characters.
What happens if you use the character shortcut with a string that does not appear in the lists’ names?
What happens if you use the position shortcut with a number greater than the length of the lists?
What if these shortcuts did not exist? Write a function that takes a list and a string as input and returns the list element that bears the name in the string. Apply this to got_chars via map(). Do you get the same result as with the shortcut? Reflect on code length and readability.
Write another function that takes a list and an integer as input and returns the list element at that position. Apply this to got_chars via map(). How does this result and process compare with the shortcut?

Type-specific map

map() always returns a list, even if all the elements have the same flavor and are of length one. But in that case, you might prefer a simpler object: an atomic vector.

If you expect map() to return output that can be turned into an atomic vector, it is best to use a type-specific variant of map(). This is more efficient than using map() to get a list and then simplifying the result in a second step. Also, purrr will alert you to any problems, i.e. if the result for one or more elements has the wrong type or length. This is part of the increased rigor about type that purrr offers.

Our current examples are suitable for demonstrating map_chr(), since the requested elements are always character.

map_chr(got_chars[9:12], "name")
#> [1] "Daenerys Targaryen" "Davos Seaworth"     "Arya Stark"        
#> [4] "Arys Oakheart"
map_chr(got_chars[13:16], 3)
#> [1] "Asha Greyjoy"    "Barristan Selmy" "Varamyr"         "Brandon Stark"

Besides map_chr(), there are other variants of map(), with the target type conveyed by the name:

map_lgl(), map_int(), map_dbl()
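
A small sketch of the variants side by side, including what happens when the requested type does not match. The list `chars` is made up for illustration.

```r
library(purrr)

# Hypothetical records; not part of repurrrsive.
chars <- list(
  list(id = 1L, name = "Ann", alive = TRUE),
  list(id = 2L, name = "Bob", alive = FALSE)
)

map_int(chars, "id")
map_chr(chars, "name")
map_lgl(chars, "alive")

# Asking for the wrong type is an error rather than a silent coercion.
res <- tryCatch(map_int(chars, "name"), error = function(e) e)
inherits(res, "error")
```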

Exercises

For each character, the second element is named “id”. This is the character’s id in the API Of Ice And Fire. Use a type-specific form of map() and an extraction shortcut to extract these ids into an integer vector.
Use your list inspection strategies to find the list element that is logical. There is one! Use a type-specific form of map() and an extraction shortcut to extract these values for all characters into a logical vector.

Extract multiple values

What if you want to retrieve multiple elements, such as the character’s name, culture, gender, and birth information? First, recall how we do this with the list for a single character:

got_chars[[3]][c("name", "culture", "gender", "born")]
#> $name
#> [1] "Victarion Greyjoy"
#> 
#> $culture
#> [1] "Ironborn"
#> 
#> $gender
#> [1] "Male"
#> 
#> $born
#> [1] "In 268 AC or before, at Pyke"

We use single square bracket indexing and a character vector to index by name. How will we ram this into the map() framework? To paraphrase Chambers, “everything that happens in R is a function call” and indexing with [ is no exception.

It feels (and maybe looks) weird, but we can map [ just like any other function. Recall map() usage:

map(.x, .f, ...)

The function .f will be [. And we finally get to use ...! This is where we pass the character vector of the names of our desired elements. We inspect the result for two characters.

x <- map(got_chars, `[`, c("name", "culture", "gender", "born"))
str(x[16:17])
#> List of 2
#>  $ :List of 4
#>   ..$ name   : chr "Brandon Stark"
#>   ..$ culture: chr "Northmen"
#>   ..$ gender : chr "Male"
#>   ..$ born   : chr "In 290 AC, at Winterfell"
#>  $ :List of 4
#>   ..$ name   : chr "Brienne of Tarth"
#>   ..$ culture: chr ""
#>   ..$ gender : chr "Female"
#>   ..$ born   : chr "In 280 AC"
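
If mapping a bare `[` looks opaque, it may help to see that it behaves like any other function. A sketch on invented data, where `chars` and `fields` are hypothetical names:

```r
library(purrr)

# Hypothetical records; not part of repurrrsive.
chars <- list(
  list(id = 1L, name = "Ann", born = "somewhere"),
  list(id = 2L, name = "Bob", born = "elsewhere")
)
fields <- c("name", "born")

# `[` called as an ordinary function is the same as bracket indexing.
identical(`[`(chars[[1]], fields), chars[[1]][fields])

# Mapping `[` with an extra argument versus an explicit anonymous function.
identical(
  map(chars, `[`, fields),
  map(chars, function(x) x[fields])
)
```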

Some people find this ugly and might prefer the extract() function from magrittr.

library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
x <- map(got_chars, extract, c("name", "culture", "gender", "born"))
str(x[18:19])
#> List of 2
#>  $ :List of 4
#>   ..$ name   : chr "Catelyn Stark"
#>   ..$ culture: chr "Rivermen"
#>   ..$ gender : chr "Female"
#>   ..$ born   : chr "In 264 AC, at Riverrun"
#>  $ :List of 4
#>   ..$ name   : chr "Cersei Lannister"
#>   ..$ culture: chr "Westerman"
#>   ..$ gender : chr "Female"
#>   ..$ born   : chr "In 266 AC, at Casterly Rock"

Exercises

Use your list inspection skills to determine the position of the elements named “name”, “gender”, “culture”, “born”, and “died”. Map [ or magrittr::extract() over got_chars, requesting these elements by position instead of name.

Data frame output

We just learned how to extract multiple elements per character by mapping [. But, since [ is non-simplifying, each character’s elements are returned in a list. And, as it must, map() itself returns a list. We’ve traded one recursive list for another recursive list, albeit a slightly less complicated one.

How can we “stack up” these results row-wise, i.e. one row per character and variables for “name”, “gender”, etc.? A data frame would be the perfect data structure for this information.

This is what map_dfr() is for.

map_dfr(got_chars, extract, c("name", "culture", "gender", "id", "born", "alive"))

Finally! A data frame! Hallelujah!

Notice how the variables have been automatically type converted. It’s a beautiful thing. Until it’s not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.

library(tibble)
got_chars %>% {
  tibble(
       name = map_chr(., "name"),
    culture = map_chr(., "culture"),
     gender = map_chr(., "gender"),       
         id = map_int(., "id"),
       born = map_chr(., "born"),
      alive = map_lgl(., "alive")
  )
}

Syntax notes: The dot . above is the placeholder for the primary input: got_chars in this case. The curly braces {} surrounding the tibble() call prevent got_chars from being passed in as the first argument of tibble().
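
If the braces-and-dot idiom feels opaque, the same explicit-type approach also works without a pipe. The sketch below swaps in base `data.frame()` for `tibble()` to keep dependencies light, and uses a small invented list `chars` rather than got_chars.

```r
library(purrr)

# Hypothetical records; not part of repurrrsive.
chars <- list(
  list(id = 1L, name = "Ann", alive = TRUE),
  list(id = 2L, name = "Bob", alive = FALSE)
)

# Each column states its expected type via the map_*() variant used.
df <- data.frame(
  name  = map_chr(chars, "name"),
  id    = map_int(chars, "id"),
  alive = map_lgl(chars, "alive"),
  stringsAsFactors = FALSE
)

str(df)
```

Because each column is built by a type-specific map, a surprise in the data surfaces as an error at construction time instead of as a quietly converted column.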

Exercises

Use map_dfr() to create the same data frame as above, but indexing with a vector of positive integers instead of names.
