https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=rorcid.ipynb
Download a script for these files at:
rorcid
rorcid is a package developed by Scott Chamberlain, co-founder of rOpenSci, to serve as an interface to the ORCID API.
You can find more information about the API on the ORCID site.
Credit to Paul Oldham at https://www.pauloldham.net/introduction-to-orcid-with-rorcid/ for inspiring some of the structure and ideas throughout this document. I highly recommend reading it.
Workshop code for this module is at https://raw.githubusercontent.com/ciakovx/ciakovx.github.io/master/rorcid_workshopcode.R
This walkthrough is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
If you haven’t done so already, create an ORCID account at https://orcid.org/signin. If you have an ORCID but can’t remember it, search for your name at https://orcid.org. If you try to sign in with an email address already associated with an ORCID, you’ll be prompted to sign into the existing record. If you try to register with a different address, when you enter your name you’ll be asked to review existing records with that name and verify that none of them belong to you–see more on duplicate ORCID records. Make sure you have verified your email address.
Next, install and load the rorcid package in R. Also install usethis in order to set the API key, as well as tidyverse, which is a collection of packages we’ll use throughout.
install.packages("rorcid")
install.packages("usethis")
install.packages("tidyverse")
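install.packages("anytime")
install.packages("httpuv")
install.packages("janitor")
# listviewer is called later (listviewer::jsonedit) to browse nested lists
install.packages("listviewer")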
library(rorcid)
library(usethis)
library(tidyverse)
library(anytime)
library(lubridate)
library(janitor)
Next, you need to authenticate with an ORCID API Key. According to the ORCID API tutorial, anyone can receive a key to access the public API.
In R Studio, call:
orcid_auth()
You should see a message stating: no ORCID token found; attempting OAuth authentication, and a window will open in your default internet browser. Log in to your ORCID account. You will be asked to give rorcid authorization to access your ORCID record for the purposes of getting your ORCID iD. Click “Authorize.”
If successful, the browser window will state: “Authentication complete. Please close this page and return to R.” Return to R Studio and you should see in your R console the word Bearer, followed by a long string of letters and numbers. These letters and numbers are your API key. At this point, this should be cached locally in your working directory.
Highlight and copy the API key (the letters and numbers only; exclude the word “Bearer” and the space). Now you can use the edit_r_environ() function from the usethis package to store the token in your R environment.
usethis::edit_r_environ()
A new window will open in R Studio. Type ORCID_TOKEN="my-token", replacing my-token with the API key. Then press enter to create a new line, and leave it blank. It will look something like this (below is a fake token):
ORCID_TOKEN="4dsw1e14-7212-4129-9f07-aaf7b88ba88f"
Press Ctrl + S (Mac: Cmd + S) to save the API key to your R environment and close the window. In R Studio, click Session > Restart R. Your token should now be saved to your R environment. You can confirm this by calling orcid_auth(), and it will print the token.
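If you would rather confirm the token was saved without printing it, a minimal sketch in base R is below. It assumes only that the variable is named ORCID_TOKEN, as in this walkthrough; the helper name has_orcid_token() is made up for illustration and is not part of rorcid.

```r
# Hypothetical helper (not part of rorcid): TRUE when ORCID_TOKEN is set
# to a non-empty value in the current R session.
has_orcid_token <- function() {
  nzchar(Sys.getenv("ORCID_TOKEN"))
}

if (has_orcid_token()) {
  message("ORCID_TOKEN is set.")
} else {
  message("ORCID_TOKEN is empty; re-check your .Renviron file and restart R.")
}
```

Sys.getenv() returns an empty string for an unset variable, which is why nzchar() is enough here.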
If this does not work for you, there is another option. Copy your client ID and client secret from https://orcid.org/developer-tools and paste them into the code below (for example, orcid_client_id <- "APP-FDFJKDSLF320SDFF" and orcid_client_secret <- "c8e987sa-0b9c-82ed-91as-1112b24234e"). Then execute the code chunk.
# copy/paste your client ID from https://orcid.org/developer-tools
orcid_client_id <- "APP-UXL71DIF91UFKDA"
# copy/paste your client secret from https://orcid.org/developer-tools
orcid_client_secret <- "c7e221dc-0b9c-48cf-92sq-24446b8490231e"
The following code sends a POST request (from the httr package) to ORCID, which returns an access token.
# POST(), add_headers(), and content() all come from httr
library(httr)
orcid_request <- POST(url = "https://orcid.org/oauth/token",
config = add_headers(`Accept` = "application/json",
`Content-Type` = "application/x-www-form-urlencoded"),
body = list(grant_type = "client_credentials",
scope = "/read-public",
client_id = orcid_client_id,
client_secret = orcid_client_secret),
encode = "form")
Now that we have the response, we can use content() from httr to get the information we want. If you look at the result, you’ll see it includes a variable called access_token.
orcid_response <- content(orcid_request)
print(orcid_response$access_token)
Copy that to the clipboard. Now you can use the edit_r_environ() function from the usethis package to store the token in your R environment.
usethis::edit_r_environ()
A new window will open in R Studio. Type ORCID_TOKEN="my-token", replacing my-token with the access_token you just copied. Then press enter to create a new line, and leave it blank. It will look something like this (below is a fake token):
ORCID_TOKEN="4dsw1e14-7212-4129-9f07-aaf7b88ba88f"
Press Ctrl + S (Mac: Cmd + S) to save the API key to your R environment and close the window. In R Studio, click Session > Restart R. Your token should now be saved to your R environment. You can confirm this by calling orcid_auth(), and it will print the token.
rorcid::orcid_search()
The rorcid::orcid_search() function takes a query and returns a data frame with three columns: first name, last name, and ORCID iD. We use this when we have some data about a person or people, and we want to get their ORCID iDs.
Call ?orcid_search to view the available parameters.
For this example, we will use the fictitious professor Josiah S(tinkney) Carberry, “legendary professor of psychoceramics (the study of cracked pots) since 1929,” at Brown University. Despite being make-believe, Carberry has a profile in ORCID that can be used for test cases such as these.
We start with a simple search by family name with the family_name argument:
carberry <- rorcid::orcid_search(family_name = 'carberry')
carberry
Looking at the data frame in your R environment, you will see it returns 10 observations of three variables. By default, orcid_search returns a limit of 10. We can increase the number of results returned by using the rows argument:
carberry <- rorcid::orcid_search(family_name = 'carberry',
rows = 50)
carberry
We now get 28 results. However, we need not look through that long list to find Josiah; we can add another argument, given_name. Multiple arguments are combined with AND, such that the example below gets passed to ORCID as given-names:josiah AND family-name:carberry.
carberry <- rorcid::orcid_search(given_name = 'josiah',
family_name = 'carberry')
carberry
It looks like there are two people in the public ORCID registry with the first name Josiah and the last name Carberry. We can open the actual ORCID profiles in our browser by using the rorcid::browse() function. We use brackets to look up the first and second items in the orcid variable:
rorcid::browse(carberry$orcid[1])
rorcid::browse(carberry$orcid[2])
When we look at the profiles on the web, we see that the first (0000-0002-1028-6941) is more recent and less complete, and the second (0000-0002-1825-0097) has much more data.
What other fields can we search? Call View(rorcid:::fields) to see a map of arguments.
field | description | path |
---|---|---|
orcid | The ORCID identifier for the researcher or contributor | //orcid-profile/orcid |
given-names | The given names of the researcher or contributor | //orcid-profile/orcid-bio/personal-details/given-names |
family-name | The family name of the researcher or contributor | //orcid-profile/orcid-bio/personal-details/family-name |
past-institution-affiliation-name | The name of any past institution in the researcher or contributors profile | //orcid-profile/orcid-bio/affiliations/affiliation[affiliation-type=“past-institution”]/affiliation-name |
current-primary-institution-affiliation-name | The name of the primary institution of the researcher or contributor | //orcid-profile/orcid-bio/affiliations/affiliation[affiliation-type=“current-primary-institution”]/affiliation-name |
current-institution-affiliation-name | The name of non-primary institutions of the researcher or contributor | //orcid-profile/orcid-bio/affiliations/affiliation[affiliation-type=“current-institution”]/affiliation-name |
credit-name | The name that normally appears on publications by the researcher or contributor | //orcid-profile/orcid-bio/personal-details/credit-name |
other-names | Alternative names that may have appeared on publications by the researcher or contributor | //orcid-profile/orcid-bio/personal-details/other-names |
email | The email address of the researcher or contributor | //orcid-profile/orcid-bio/contact-details/email |
digital-object-ids | DOI of any work in the researcher or contributors profile | //orcid-profile/orcid-activities/orcid-works/orcid-work/work-external-identifiers/work-external-identifier[work-external-identifier-type=“doi”]/work-external-identifier-id |
work-titles | The titles of any work in the researcher or contributors profile | //orcid-profile/orcid-activities/orcid-works/orcid-work/work-title/(title|subtitle) |
grant-numbers | The grant number of any grant associated with the researcher or contributor | //orcid-profile/orcid-activities/orcid-grants/orcid-grant/grant-number |
patent-numbers | The patent numbers of any patent associated with the researcher or contributor | //orcid-profile/orcid-activities/orcid-patents/orcid-patent/patent-number |
keywords | Any keywords associated with the researcher or contributor | //orcid-profile/orcid-bio/keywords/keyword |
text | All the above fields. This is also the default field for Lucene syntax queries. | //orcid-bio |
orcid_search includes the argument affiliation_org, which searches across all of one’s affiliation data (employment, education, invited positions, membership & service). Because this is such a broad search, it has the potential to return false positives if you are using it on its own.
carberry <- rorcid::orcid_search(family_name = 'carberry',
affiliation_org = 'Wesleyan')
carberry
There are also arguments for past institution (past_inst) and current institution (current_inst), as well as institutional identifiers (see below).
We can search by email address. However, according to ORCID, as of February 2017, fewer than 2% of the 3+ million email addresses on ORCID records are public, so this one may not be especially helpful. Since Josiah doesn’t have a public email address, we’ll use mine.
clarke <- rorcid::orcid_search(email = 'clarke.iakovakis@okstate.edu')
clarke
If the person has chosen to keep their email address private, the function will return an empty data frame.
If the individual has added keywords to their ORCID profile, we can search those. Dr. Carberry’s profile includes the keyword “psychoceramics” (the study of cracked pots). Note that this function is currently under development:
carberry <- rorcid::orcid_search(family_name = 'carberry',
keywords = 'psychoceramics')
carberry
If you know the name of a work (i.e. article, book chapter, etc.) or its DOI, you can obtain the associated ORCID iD by using the work_title or digital_object_ids arguments, but only if the authors have added the work to their ORCID profile.
We will search for the article “Building Software Building Community: Lessons from the rOpenSci Project”. Notice how the title is in double quotes, inside of single quotes, and the colon is removed.
ropensci1 <- rorcid::orcid_search(work_title = '"Building Software Building Community Lessons from the rOpenSci Project"')
ropensci1
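The quoting described above (drop the colon, wrap the title in double quotes) can be done in code rather than by hand. A small sketch in base R; quote_work_title() is a hypothetical helper name, not a rorcid function, and it only handles colons.

```r
# Hypothetical helper: remove colons from a title and wrap the result in
# double quotes so the whole title is searched as one phrase.
quote_work_title <- function(title) {
  no_colon <- gsub(":", "", title, fixed = TRUE)
  paste0('"', no_colon, '"')
}

title <- "Building Software Building Community: Lessons from the rOpenSci Project"
quoted <- quote_work_title(title)
quoted
```

The resulting string could then be passed to the work_title argument in place of the hand-typed value.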
This gives us two authors: Edmund Hart and Scott Chamberlain. Notice we get a different result when we look the same article up by DOI:
ropensci2 <- rorcid::orcid_search(digital_object_ids = '"10.5334/jors.bu"')
ropensci2
Using the browse()
function to navigate to the author profiles, we can see this is because Edmund Hart added the article to his ORCID profile, whereas Scott Chamberlain added the dataset (which has a different DOI).
rorcid::browse(ropensci1$orcid[1])
rorcid::browse(ropensci1$orcid[2])
Thus a word of caution when searching by article title or DOI: a significant amount of data in ORCID is manually added or added with incorrect, inconsistent, or incomplete metadata. In other words, the fact that you didn’t get results or got erroneous results may not be due to errors in your queries, but rather errors in the data itself.
ringgold-org-id:
When filling out an ORCID profile, users are encouraged to select their institutions from the drop-down menu, which will ensure it includes the Ringgold ID and any other unique identifiers, such as ISNI and GRID, that ORCID has for that institution. Read the ORCID report, “Organization identifiers: current provider survey” to learn more.
carberry <- rorcid::orcid_search(family_name = 'carberry',
ringgold_org_id = '5468')
carberry
You have to register with Ringgold to search in their registry. GRID is open for searching. Unfortunately, most organizations in ORCID have a Ringgold and not a GRID.
Sometimes different entities on campus will have separate Ringgold IDs; you may consider contacting Ringgold to get the full list of your institution’s identifiers.
We can search by name and email address domain by using an asterisk followed by the domain name:
clarke <- rorcid::orcid_search(family_name = 'iakovakis',
email = '*@okstate.edu')
clarke
orcid_search is a wrapper for another rorcid function, orcid(), which allows for a more advanced range of searching, including Boolean OR operators.
According to help(orcid):
You can use any of the following within the query statement: given-names, family-name, credit-name, other-names, email, grant-number, patent-number, keyword, worktitle, digital-objectids, current-institution, affiliation-name, current-primary-institution, text, past-institution, peer-review-type, peer-review-role, peer-review-group-id, biography, external-id-type-and-value
Note that the current_prim_inst and patent_number parameters have been removed, as ORCID has removed them.
We can combine affiliation names, Ringgold IDs, and email addresses using the OR operator to cover all our bases, in case the person or people we are looking for did not fill in one of those values. This will return all records with the family name Iakovakis that either have a Ringgold of 7618, have an affiliation name of “Oklahoma State,” or have an email address ending in “@okstate.edu.”
clarke <- rorcid::orcid(query = 'family-name:iakovakis AND (ringgold-org-id:7618 OR
email:*@okstate.edu OR
affiliation-org-name:"Oklahoma State")')
clarke
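Long Boolean queries like the one above are easy to mistype, so it can help to assemble them from parts. The sketch below builds the same query string in base R; or_group() is a hypothetical helper introduced for illustration, and the field names are the ones used in the query above.

```r
# Hypothetical helper: join clauses with OR and wrap them in parentheses
or_group <- function(clauses) {
  paste0("(", paste(clauses, collapse = " OR "), ")")
}

affiliation <- or_group(c(
  "ringgold-org-id:7618",
  "email:*@okstate.edu",
  'affiliation-org-name:"Oklahoma State"'
))

# Combine the name clause and the affiliation group with AND
query <- paste("family-name:iakovakis", "AND", affiliation)
query
```

The query string could then be passed to rorcid::orcid(query = query). Keeping each clause as its own vector element makes it simple to add or remove a condition.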
This can also be helpful if you want to cast a very wide net and capture everyone affiliated with your institution who has an ORCID iD. Keep in mind that this searches across all of an individual’s listed affiliations (employment, education, invited positions, membership & service), past and present. So it will return false positives. In other words, one should not use it to get ORCID iDs of all individuals currently at an institution, because it will include those who previously worked there or got their degree from there.
A single call returns at most 200 results; within that limit, the rows argument sets how many come back:
my_osu_orcids <- rorcid::orcid(query = 'ringgold-org-id:7618 OR email:*@okstate.edu OR
affiliation-org-name:"Oklahoma State"',
rows = 25)
my_osu_orcids
If we want to retrieve the complete set of results when there are more than 200, we have to write a small function. First, we wrap our API call in base::attr() and ask for the "found" attribute to see how many results that call finds:
my_osu_orcid_count <- base::attr(rorcid::orcid(query = 'ringgold-org-id:7618 OR
email:*@okstate.edu OR affiliation-org-name:"Oklahoma State"'),
"found")
my_osu_orcid_count
## [1] 2326
There are 2,326 records at the time of writing. Next, we create a numeric vector using seq() that starts at 0 and counts up in steps of 200 toward 2326 (the value assigned to my_osu_orcid_count).
my_pages <- seq(from = 0, to = my_osu_orcid_count, by = 200)
my_pages
## [1] 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200
Finally, we will write a small function using map from the purrr package. In essence, this takes each value from our my_pages vector and passes it into the start argument of the orcid() query. In other words, the first iteration will get the first 200 results, the next will get results 201-400, and so on.
my_osu_orcids <- purrr::map(
my_pages,
function(page) {
print(page)
my_orcids <- rorcid::orcid(query = 'ringgold-org-id:7618 OR
email:*@okstate.edu OR affiliation-org-name:"Oklahoma State"',
rows = 200,
start = page)
return(my_orcids)
})
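To see the paging arithmetic without calling the API, here is a sketch in base R that runs the same loop against a local stand-in. Everything in it is made up for illustration: fake_records plays the role of the registry and fetch_page() plays the role of orcid() with its rows and start arguments.

```r
# Stand-in for the registry: 450 fake records held locally (an assumption)
fake_records <- data.frame(id = seq_len(450))

# Mimics a paged endpoint: skip `start` records, then return up to `rows`
fetch_page <- function(start, rows = 200) {
  first <- start + 1
  if (first > nrow(fake_records)) {
    return(fake_records[0, , drop = FALSE])
  }
  last <- min(start + rows, nrow(fake_records))
  fake_records[first:last, , drop = FALSE]
}

total <- nrow(fake_records)

# Offsets 0, 200, 400; stopping at total - 1 avoids an empty final page
# when the total is an exact multiple of the page size
starts <- seq(from = 0, to = total - 1, by = 200)

pages <- lapply(starts, fetch_page)
all_records <- do.call(rbind, pages)
nrow(all_records)
```

Because start is an offset, the pages do not overlap: the first call returns records 1 to 200, the second 201 to 400, and the third the remaining 50.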
We can then use the map_dfr() function from purrr to pull the data together, coercing it with as_tibble(), and the clean_names() function from janitor, described below, to make the column names easier to handle. We also introduce here the pipe operator, which is also described below.
my_osu_orcids_data <- my_osu_orcids %>%
map_dfr(., as_tibble) %>%
janitor::clean_names()
Of course, these are only the ORCID iDs, with no other data. The next sections will describe other functions in rorcid
to get biographical, employment, and works data from the profiles.
clean_names() and %>%
Often the columns returned from the ORCID API have a complicated combination of punctuation that can make them hard to use. The clean_names() function from the janitor package is optional and used only to simplify the column names of the data. It converts all punctuation to underscores, so the field orcid-identifier.uri becomes orcid_identifier_uri.
The pipe operator %>% takes the output of one statement and makes it the input of the next statement. You can think of it as “then” in natural language. So, in the expression above, we first call up the my_osu_orcids data, then apply map_dfr() with as_tibble() to pull all the data together into a single data frame, and then clean the names with clean_names().
The orcid() function gets the IDs, but no information about the person. For that, you will need to use orcid_person().
Unlike orcid(), orcid_person() does not take a query; it accepts only ORCID iDs in the form XXXX-XXXX-XXXX-XXXX. So we put the ORCID iD itself into its own vector, and then pass that vector on to orcid_person().
carberry_orcid <- "0000-0002-1825-0097"
carberry_person <- rorcid::orcid_person(carberry_orcid)
If you look at the result for carberry_person in the Environment Pane in R Studio, you will see it returned a List of 1. We can view it here on this website using the listviewer package.
listviewer::jsonedit(carberry_person, mode = "view")
Click the drop-down arrow next to the ORCID iD. We see one list here (his ORCID iD), and inside of that list are more lists. And inside those lists are even more lists! We can see the names of the top-level elements of the list by running names(carberry_person[[1]]).
## [1] "last-modified-date" "name" "other-names"
## [4] "biography" "researcher-urls" "emails"
## [7] "addresses" "keywords" "external-identifiers"
## [10] "path"
Click the drop-down arrow next to name to see the elements nested inside it; you can run names(carberry_person[[1]]$name) to list their names. While this is great data, if we want to run some analysis on it, we need to get it into a nice, tidy data frame.
This is not an easy or straightforward process. I provide below one strategy to get some of the relevant data, using map functions from the purrr package and building a tibble (the tidyverse’s more efficient data frame) piece by piece.
carberry_data <- carberry_person %>% {
dplyr::tibble(
created_date = purrr::map_dbl(., purrr::pluck, "name", "created-date", "value", .default=NA_integer_),
given_name = purrr::map_chr(., purrr::pluck, "name", "given-names", "value", .default=NA_character_),
family_name = purrr::map_chr(., purrr::pluck, "name", "family-name", "value", .default=NA_character_),
credit_name = purrr::map_chr(., purrr::pluck, "name", "credit-name", "value", .default=NA_character_),
other_names = purrr::map(., purrr::pluck, "other-names", "other-name", "content", .default=NA_character_),
orcid_identifier_path = purrr::map_chr(., purrr::pluck, "name", "path", .default = NA_character_),
biography = purrr::map_chr(., purrr::pluck, "biography", "content", .default=NA_character_),
researcher_urls = purrr::map(., purrr::pluck, "researcher-urls", "researcher-url", .default=NA_character_),
emails = purrr::map(., purrr::pluck, "emails", "email", "email", .default=NA_character_),
keywords = purrr::map(., purrr::pluck, "keywords", "keyword", "content", .default=NA_character_),
external_ids = purrr::map(., purrr::pluck, "external-identifiers", "external-identifier", .default=NA_character_)
)
}
carberry_data
The created_date variable is pulled using map_dbl() because it is in double format (a numeric data type in R).
The given_name, family_name, credit_name, orcid_identifier_path, and biography variables are pulled using map_chr() because they are all character types.
The other_names, researcher_urls, emails, keywords, and external_ids variables use map() because there may be multiple values (unlike the name variables, for which you can only have one). For example, someone may have multiple other names, or multiple keywords. So this will return a nested list to the tibble; we will discuss below how to unnest it.
Each of these functions includes a .default argument (NA_character_ for the character and list columns) because if the value is NULL (if the ORCID author didn’t input the information), then it will convert that NULL to NA.
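To make the role of the default value concrete, here is a base R sketch of the same idea: walk down a path of names and fall back to NA when a level is missing. It is a stand-in written for illustration, not how purrr::pluck() is implemented; pluck_or() and the two mock records are made up.

```r
# Mock of the nested structure for two records (values are invented);
# the second record has no biography
people <- list(
  "0000-0000-0000-0001" = list(
    name = list(`given-names` = list(value = "Ada")),
    biography = list(content = "Writes programs")
  ),
  "0000-0000-0000-0002" = list(
    name = list(`given-names` = list(value = "Bo")),
    biography = NULL
  )
)

# Walk down `path`; return `default` as soon as a level is missing
pluck_or <- function(x, path, default = NA_character_) {
  for (key in path) {
    if (!is.list(x) || is.null(x[[key]])) {
      return(default)
    }
    x <- x[[key]]
  }
  x
}

# One character value per record, NA where the biography is absent
bios <- vapply(people, pluck_or, character(1), path = c("biography", "content"))
bios
```

Without the default, the second record would yield NULL, and a function that promises one character value per record (like map_chr() or vapply()) would stop with an error.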
View the created date by running carberry_data$created_date and you will see this is a number, not a date:
## 0000-0002-1825-0097
## 1.460758e+12
The dates are in Unix time, which is usually counted as the number of seconds that have elapsed since January 1, 1970; ORCID reports it in milliseconds. We can use the anytime() function from the anytime package, created by Dirk Eddelbuettel, to convert it and return a POSIXct object. You have to divide it by 1000 because it’s in milliseconds. Below we use the mutate() function from dplyr to overwrite the created_date Unix time with the human-readable POSIXct date.
carberry_datesAndTimes <- carberry_data %>%
dplyr::mutate(created_date = anytime::anytime(created_date/1000))
carberry_datesAndTimes$created_date
## 0000-0002-1825-0097
## "2016-04-15 17:00:17 CDT"
That looks much better: April 15th, 2016 at 5:00 PM and 17 seconds Central Daylight Time.
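The same conversion can be checked in base R without the anytime package. The millisecond value below is an illustrative assumption: it is consistent with the rounded 1.460758e+12 printed above, but the exact number stored by ORCID is not shown in this document.

```r
# Illustrative timestamp in milliseconds (assumed; matches the rounded output above)
created_ms <- 1460757617000

# Divide by 1000 to get seconds, then convert from the Unix epoch
created_utc <- as.POSIXct(created_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(created_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
# Same instant shown in US Central time
format(created_utc, "%Y-%m-%d %H:%M:%S", tz = "America/Chicago")

# Date only
created_day <- as.Date(created_utc)
created_day
```

Setting tz explicitly keeps the result the same on every machine; anytime() uses the local time zone, which is why the output above is shown in CDT.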
If you’d prefer to do away with the time altogether (and keep only the month/day/year), you can use anydate() instead of anytime().
carberry_datesOnly <- carberry_data %>%
dplyr::mutate(created_date = anytime::anydate(created_date/1000))
carberry_datesOnly$created_date
## [1] "2016-04-15"
Check out the lubridate package for more you can do with dates. It is installed with tidyverse, but not loaded, so you have to load it with its own call to library() (we did this at the beginning of the session). For example, you may be more interested in year of creation than month. So after you run the conversion with anytime, you can create year variables with mutate():
carberry_years <- carberry_datesOnly %>%
dplyr::mutate(created_year = lubridate::year(created_date))
carberry_years$created_year
## [1] 2016
There are nested lists in this data frame that can be unnested. The other_names and keywords values are character vectors, while the researcher_urls and external_ids values are data frames themselves. We can use the unnest() function from the tidyr package to unnest both types. In other words, this will make each element of the list its own row. For instance, since there are two keywords for Carberry (“psychoceramics” and “ionian philology”), there will now be two rows that are otherwise identical except for the keywords column:
carberry_keywords <- carberry_data %>%
tidyr::unnest(keywords)
carberry_keywords$keywords
## [1] "psychoceramics" "ionian philology"
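To see what unnesting does to the rows, here is a base R sketch that performs the same reshaping by hand on a tiny data frame. The second person and the keyword "pottery" are invented for illustration; this mirrors the idea, not tidyr's implementation.

```r
# Two people; keywords is a list column with 2 and 1 values respectively
people <- data.frame(family_name = c("Carberry", "Example"),
                     stringsAsFactors = FALSE)
people$keywords <- list(c("psychoceramics", "ionian philology"), "pottery")

# Repeat each row once per keyword, then flatten the list column
n_each <- lengths(people$keywords)
long <- people[rep(seq_len(nrow(people)), n_each), "family_name", drop = FALSE]
long$keywords <- unlist(people$keywords)
rownames(long) <- NULL
long
```

The result has three rows: two for Carberry (one per keyword) and one for the second person, with every non-list column duplicated alongside.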
We can see which columns are lists by calling is_list() inside the map_lgl() function (this returns TRUE or FALSE for each column, depending on whether it is a list), and subsetting the names() of carberry_data by those values:
carberry_list_columns <- map_lgl(carberry_data, is_list)
names(carberry_data)[carberry_list_columns]
## [1] "other_names" "researcher_urls" "emails" "keywords"
## [5] "external_ids"
Rather than having 1 observation of 11 variables, the carberry_keywords data frame now has 2 observations of 7 variables. We know why there are two observations (because there are 2 keywords), but why are there fewer variables? Because unnest() has an argument called .drop, which is set to TRUE by default, meaning all additional list columns will be dropped. If you want to keep them, just set it to FALSE. Note, however, that it will not unnest them. (The .drop argument was deprecated in tidyr 1.0.0, which always keeps the other list columns; with a newer tidyr, remove unwanted list columns with select() before unnesting.)
carberry_keywords <- carberry_data %>%
tidyr::unnest(keywords, .drop = FALSE)
carberry_keywords
You can unnest multiple nested columns, but keep in mind that this will multiply the duplicated values in your data frame, because every combination of the unnested values gets its own row. For more on wide and long data, read Hadley Wickham’s paper “Tidy data,” published in the Journal of Statistical Software.
carberry_keywords_otherNames <- carberry_data %>%
tidyr::unnest(keywords, .drop = FALSE) %>%
tidyr::unnest(other_names, .drop = FALSE)
carberry_keywords_otherNames
When we unnest researcher_urls or external_ids, we will see many more columns added. That is because each of these nested lists contains multiple variables:
carberry_researcherURLs <- carberry_data %>%
tidyr::unnest(researcher_urls, .drop = FALSE)
carberry_researcherURLs
Carberry has two URLs: his Wikipedia page and a page about him on the Brown University Library. So a row is created for each of these URLs, and multiple columns are added such as the last modified date, the url value, and so on. You can keep or remove columns you don’t want using select() from the dplyr package.
We will use the write_csv() function from the readr package to write our data to disk. This package was loaded when you called library(tidyverse) at the beginning of the session.
With a typical data frame, you can simply write the carberry_data data frame to a CSV with the following code:
write_csv(carberry_data, "C:/Users/MyUserName/Desktop/carberry_data.csv")
The problem is, due to the nested lists we described above, R will throw an error: "Error in stream_delim_(df, path, ...) : Don't know how to handle vector of type list."
You have a few choices:
1. Use unnest() on one of the nested columns with .drop set to TRUE. This will add rows for all the values in the nested lists, and drop the additional nested lists. The code below writes the file to your desktop, using Sys.getenv("USERNAME") to fill in your user name in place of “MyUserName”.
my_user_name <- Sys.getenv("USERNAME")
carberry_keywords <- carberry_data %>%
  tidyr::unnest(keywords)
write_csv(carberry_keywords, file.path("C:/Users", my_user_name, "Desktop/carberry_data1.csv"))
2. Use select_if() from dplyr and negate() from purrr to drop all lists in the data frame. This is essentially saying, only keep the columns that are not lists. In this example, the number of variables falls to 6, since we have 5 list columns.
carberry_data_short <- carberry_data %>%
  dplyr::select_if(purrr::negate(is_list))
3. Use mutate() from dplyr to coerce the list columns into character vectors.
carberry_data_mutated <- carberry_data %>%
  dplyr::mutate(keywords = as.character(keywords)) %>%
  dplyr::mutate(other_names = as.character(other_names)) %>%
  # emails is also a list column, so it has to be coerced before writing
  dplyr::mutate(emails = as.character(emails)) %>%
  dplyr::mutate(researcher_urls = as.character(map(carberry_data$researcher_urls, purrr::pluck, "url.value", .default=NA_character_))) %>%
  dplyr::mutate(external_ids = as.character(map(carberry_data$external_ids, purrr::pluck, "external-id-url.value", .default=NA_character_)))
write_csv(carberry_data_mutated, "C:/Users/MyUserName/Desktop/carberry_data2.csv")
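One thing to be aware of with the third strategy: as.character() on a list deparses each element, so a cell holding two keywords is written as the text c("psychoceramics", "ionian philology"). If you would rather have a plain delimited string, a possible alternative is sketched below in base R. collapse_list_column() is a hypothetical helper, and it is only meant for list columns whose elements are atomic vectors (such as keywords), not nested data frames.

```r
# A list column with two keywords in the first row and none in the second
keywords <- list(c("psychoceramics", "ionian philology"), character(0))

# Hypothetical helper: one string per row, values joined by `sep`,
# NA where the element is empty or entirely NA
collapse_list_column <- function(col, sep = "; ") {
  vapply(col, function(v) {
    if (length(v) == 0 || all(is.na(v))) {
      NA_character_
    } else {
      paste(v, collapse = sep)
    }
  }, character(1))
}

collapsed <- collapse_list_column(keywords)
collapsed
```

The trade-off is that a delimited string has to be split again if you later want the individual values back, so pick a separator that cannot occur inside a value.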
orcid_person()
orcid_person() is vectorized, so you can pass in multiple ORCID iDs and it will return a list of results for each ID, with each element named by the ORCID iD.
my_orcids <- c("0000-0002-1825-0097", "0000-0002-9260-8456")
my_orcid_person <- rorcid::orcid_person(my_orcids)
listviewer::jsonedit(my_orcid_person, mode = "view")
We see that we are given a list of 2, each containing the person data. We can put this into a data frame using the same code as above.
my_orcid_person_data <- my_orcid_person %>% {
dplyr::tibble(
created_date = purrr::map_dbl(., purrr::pluck, "name", "created-date", "value", .default=NA_integer_),
given_name = purrr::map_chr(., purrr::pluck, "name", "given-names", "value", .default=NA_character_),
family_name = purrr::map_chr(., purrr::pluck, "name", "family-name", "value", .default=NA_character_),
credit_name = purrr::map_chr(., purrr::pluck, "name", "credit-name", "value", .default=NA_character_),
other_names = purrr::map(., purrr::pluck, "other-names", "other-name", "content", .default=NA_character_),
orcid_identifier_path = purrr::map_chr(., purrr::pluck, "name", "path", .default = NA_character_),
biography = purrr::map_chr(., purrr::pluck, "biography", "content", .default=NA_character_),
researcher_urls = purrr::map(., purrr::pluck, "researcher-urls", "researcher-url", .default=NA_character_),
emails = purrr::map(., purrr::pluck, "emails", "email", "email", .default=NA_character_),
keywords = purrr::map(., purrr::pluck, "keywords", "keyword", "content", .default=NA_character_),
external_ids = purrr::map(., purrr::pluck, "external-identifiers", "external-identifier", .default=NA_character_)
)
} %>%
dplyr::mutate(created_date = anytime::anydate(created_date/1000))
my_orcid_person_data
We now have a nice, neat dataframe of both people’s ORCID name data.
When we want data on multiple people and have only their names, we can build a query that uses the given-names: and family-name: fields of the query argument in orcid() in order to get the ORCID iDs:
profs <- tibble("FirstName" = c("Josiah", "Clarke"),
"LastName" = c("Carberry", "Iakovakis"))
orcid_query <- paste0("given-names:",
profs$FirstName,
" AND family-name:",
profs$LastName)
orcid_query
## [1] "given-names:Josiah AND family-name:Carberry"
## [2] "given-names:Clarke AND family-name:Iakovakis"
This returns a vector with two queries formatted for insertion into rorcid::orcid(). We can use purrr::map() to create a loop. What this is saying is, take each element of orcid_query and run a function with it that prints it to the console and runs rorcid::orcid() on it.
This returns a list of two items. We can then wrap as_tibble() in map_dfr() to create a data frame from those list elements.
my_orcids_df <- purrr::map(
orcid_query,
function(x) {
print(x)
orc <- rorcid::orcid(x)
}
) %>%
purrr::map_dfr(., as_tibble) %>%
janitor::clean_names()
## [1] "given-names:Josiah AND family-name:Carberry"
## [1] "given-names:Clarke AND family-name:Iakovakis"
my_orcids_df
First we want to remove the Carberry row that we don’t want (remember that there are two Carberry ORCID accounts, and one doesn’t have much data in it). We can do this using the filter() function from dplyr and the != symbol, which is equivalent to “is not equal to.”
my_orcids_df <- my_orcids_df %>%
dplyr::filter(orcid_identifier_path != "0000-0002-1028-6941")
This is now a data frame of two rows. Next, grab the ORCID iDs and run the same function we ran above in order to get the name data and the IDs into a single data frame.
my_orcids <- my_orcids_df$orcid_identifier_path
my_orcid_person <- rorcid::orcid_person(my_orcids)
my_orcid_person_data <- my_orcid_person %>% {
dplyr::tibble(
created_date = purrr::map_dbl(., purrr::pluck, "name", "created-date", "value", .default=NA_integer_),
given_name = purrr::map_chr(., purrr::pluck, "name", "given-names", "value", .default=NA_character_),
family_name = purrr::map_chr(., purrr::pluck, "name", "family-name", "value", .default=NA_character_),
credit_name = purrr::map_chr(., purrr::pluck, "name", "credit-name", "value", .default=NA_character_),
other_names = purrr::map(., purrr::pluck, "other-names", "other-name", "content", .default=NA_character_),
orcid_identifier_path = purrr::map_chr(., purrr::pluck, "name", "path", .default = NA_character_),
biography = purrr::map_chr(., purrr::pluck, "biography", "content", .default=NA_character_),
researcher_urls = purrr::map(., purrr::pluck, "researcher-urls", "researcher-url", .default=NA_character_),
emails = purrr::map(., purrr::pluck, "emails", "email", "email", .default=NA_character_),
keywords = purrr::map(., purrr::pluck, "keywords", "keyword", "content", .default=NA_character_),
external_ids = purrr::map(., purrr::pluck, "external-identifiers", "external-identifier", .default=NA_character_)
)} %>%
dplyr::mutate(created_date = anytime::anydate(created_date/1000))
my_orcid_person_data
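To see what pluck() and its .default argument are doing in that tibble, here is a small sketch on a made-up nested list. The records in demo_people are invented to mimic the general shape of the data, not copied from a real ORCID response; the second record has no biography at all.

```r
library(purrr)

# Made-up records; the second one deliberately lacks a biography entry.
demo_people <- list(
  a = list(name = list(`given-names` = list(value = "Josiah")),
           biography = list(content = "Example biography")),
  b = list(name = list(`given-names` = list(value = "Clarke")))
)

# pluck() walks down the nested names; .default is returned when any
# step along the path is missing.
given <- map_chr(demo_people, pluck, "name", "given-names", "value",
                 .default = NA_character_)
bio <- map_chr(demo_people, pluck, "biography", "content",
               .default = NA_character_)

given
bio
```

Without .default, a missing element would come back as NULL, and map_chr() would stop with an error because it needs exactly one string per record.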
This is exactly the same result we saw above, but this time we got it from a simple vector of names.
If the names you have are not already separated into first and last name variables, here is a trick to do that:
Create a tibble using the tibble()
function, which dplyr re-exports from the tibble
package. Then, use the extract()
function from tidyr
, along with some regular expressions, to create a first and last name variable:
my_names <- dplyr::tibble("name" = c("Josiah Carberry", "Clarke Iakovakis"))
my_clean_names <- my_names %>%
tidyr::extract(name, c("FirstName", "LastName"), "([^ ]+) (.*)")
my_clean_names
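It is worth knowing how this regular expression behaves on names that are not a simple "First Last" pair. The sketch below uses made-up names: everything before the first space goes to FirstName, everything after it goes to LastName, and a name with no space does not match at all.

```r
library(tibble)
library(tidyr)

# Made-up names: a simple pair, a name with three parts, and a single word.
demo_names <- tibble(name = c("Josiah Carberry", "Mary Ann Smith", "Cher"))

demo_split <- tidyr::extract(
  demo_names, name, c("FirstName", "LastName"), "([^ ]+) (.*)"
)
demo_split
```

Here "Mary Ann Smith" is split into "Mary" and "Ann Smith", and "Cher" becomes NA in both columns, so it is a good idea to check the result by eye before building queries from it.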
Again, we can unnest if we wish, knowing we’ll multiply the number of rows even more now, because we have more values. For instance, if we unnest keywords, we’ll now have 5 rows (2 keywords for Carberry, and 3 keywords for Iakovakis):
my_orcid_person_keywords <- my_orcid_person_data %>%
tidyr::unnest(keywords)
my_orcid_person_keywords
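The row arithmetic is easy to check on a toy example. This sketch uses placeholder keywords (kw1 to kw5) rather than the real ones: two people with 2 and 3 keywords unnest into 5 rows, with the other columns repeated.

```r
library(tibble)
library(tidyr)

# Placeholder data: one row per person, keywords held in a list column.
demo_nested <- tibble(
  family_name = c("Carberry", "Iakovakis"),
  keywords = list(c("kw1", "kw2"), c("kw3", "kw4", "kw5"))
)

# One row per keyword; family_name is repeated as needed.
demo_unnested <- unnest(demo_nested, keywords)
nrow(demo_unnested)
```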
We can write this data to CSV using one of the three strategies outlined above. I’ll use #3 and coerce all list columns to character.
# coerce every list column to character so the data frame can be written to CSV
my_orcid_person_data_mutated <- my_orcid_person_data %>%
dplyr::mutate(keywords = as.character(keywords)) %>%
dplyr::mutate(other_names = as.character(other_names)) %>%
dplyr::mutate(emails = as.character(emails)) %>%
dplyr::mutate(researcher_urls = as.character(purrr::map(researcher_urls, purrr::pluck, "url.value", .default=NA_character_))) %>%
dplyr::mutate(external_ids = as.character(purrr::map(external_ids, purrr::pluck, "external-id-url.value", .default=NA_character_)))
# write the mutated data frame created above
readr::write_csv(my_orcid_person_data_mutated, "C:/Users/MyUserName/Desktop/carberry_data3.csv")
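One thing to be aware of with strategy #3 is what as.character() does to a list: each element is turned into the text of the R expression that would create it. If you would rather have a plain delimited string in the CSV, pasting the values together is an alternative. This is a sketch on placeholder data, and the "; " separator is just an assumption about what would be convenient.

```r
library(tibble)
library(dplyr)
library(purrr)

# Placeholder data with a list column of keywords.
demo_keywords <- tibble(
  family_name = c("Carberry", "Iakovakis"),
  keywords = list(c("kw1", "kw2"), c("kw3", "kw4", "kw5"))
)

demo_flat <- demo_keywords %>%
  mutate(
    # as.character() keeps the R syntax around the values
    keywords_raw = as.character(keywords),
    # paste() with collapse gives one delimited string per row
    keywords_joined = map_chr(keywords, paste, collapse = "; ")
  ) %>%
  select(-keywords)

demo_flat
```

Either column is a plain character vector, so the result can be written with write_csv(); the joined version is usually easier to read in a spreadsheet.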
In addition to biographical data, we can also get employment data with orcid_employments()
.
clarke_employment <- rorcid::orcid_employments(orcid = "0000-0002-9260-8456")
listviewer::jsonedit(clarke_employment, mode = "view")
Again it comes in a series of nested lists, but we’ll just pluck()
what we need and use flatten_dfr()
to flatten the lists into a data frame. We will also use the anydate()
function to go ahead and convert the dates while we’re at it.
clarke_employment_data <- clarke_employment %>%
purrr::map(., purrr::pluck, "affiliation-group", "summaries") %>%
purrr::flatten_dfr() %>%
janitor::clean_names() %>%
dplyr::mutate(employment_summary_end_date = anytime::anydate(employment_summary_end_date/1000),
employment_summary_created_date_value = anytime::anydate(employment_summary_created_date_value/1000),
employment_summary_last_modified_date_value = anytime::anydate(employment_summary_last_modified_date_value/1000))
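The division by 1000 is there because these date values are milliseconds since 1970-01-01, while date converters generally expect seconds. Here is the same conversion sketched in base R with a made-up timestamp and the time zone fixed to UTC, so the result does not depend on your machine's settings (anydate() may use your local time zone, so dates close to midnight could differ by a day).

```r
# A made-up timestamp in epoch milliseconds.
created_ms <- 1500000000000

# Milliseconds -> seconds, then seconds -> date-time in UTC.
created_dt <- as.POSIXct(created_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Drop the time of day.
created_date <- as.Date(created_dt, tz = "UTC")
format(created_date)
```

This prints "2017-07-14". If you forget to divide by 1000, you will get a date tens of thousands of years in the future, which is a quick way to spot the mistake.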
The column names are pretty messy here.
names(clarke_employment_data)
## [1] "employment_summary_put_code"
## [2] "employment_summary_department_name"
## [3] "employment_summary_role_title"
## [4] "employment_summary_end_date"
## [5] "employment_summary_external_ids"
## [6] "employment_summary_display_index"
## [7] "employment_summary_visibility"
## [8] "employment_summary_path"
## [9] "employment_summary_created_date_value"
## [10] "employment_summary_last_modified_date_value"
## [11] "employment_summary_source_source_client_id"
## [12] "employment_summary_source_assertion_origin_orcid"
## [13] "employment_summary_source_assertion_origin_client_id"
## [14] "employment_summary_source_assertion_origin_name"
## [15] "employment_summary_source_source_orcid_uri"
## [16] "employment_summary_source_source_orcid_path"
## [17] "employment_summary_source_source_orcid_host"
## [18] "employment_summary_source_source_name_value"
## [19] "employment_summary_start_date_year_value"
## [20] "employment_summary_start_date_month_value"
## [21] "employment_summary_start_date_day_value"
## [22] "employment_summary_organization_name"
## [23] "employment_summary_organization_address_city"
## [24] "employment_summary_organization_address_region"
## [25] "employment_summary_organization_address_country"
## [26] "employment_summary_organization_disambiguated_organization_disambiguated_organization_identifier"
## [27] "employment_summary_organization_disambiguated_organization_disambiguation_source"
## [28] "employment_summary_url_value"
## [29] "employment_summary_url"
## [30] "employment_summary_end_date_year_value"
## [31] "employment_summary_end_date_month_value"
## [32] "employment_summary_end_date_day_value"
## [33] "employment_summary_start_date_day"
## [34] "employment_summary_end_date_day"
We’ll clean them up a bit by using the str_replace()
function from stringr
. You can think of this as analogous to Find + Replace in word processing. We take the names()
of the data, and replace each of the phrases with nothing (i.e. the set of empty quotes).
names(clarke_employment_data) <- names(clarke_employment_data) %>%
stringr::str_replace(., "employment_summary_", "") %>%
stringr::str_replace(., "source_source_", "") %>%
stringr::str_replace(., "organization_disambiguated_", "")
names(clarke_employment_data)
## [1] "put_code"
## [2] "department_name"
## [3] "role_title"
## [4] "end_date"
## [5] "external_ids"
## [6] "display_index"
## [7] "visibility"
## [8] "path"
## [9] "created_date_value"
## [10] "last_modified_date_value"
## [11] "client_id"
## [12] "source_assertion_origin_orcid"
## [13] "source_assertion_origin_client_id"
## [14] "source_assertion_origin_name"
## [15] "orcid_uri"
## [16] "orcid_path"
## [17] "orcid_host"
## [18] "name_value"
## [19] "start_date_year_value"
## [20] "start_date_month_value"
## [21] "start_date_day_value"
## [22] "organization_name"
## [23] "organization_address_city"
## [24] "organization_address_region"
## [25] "organization_address_country"
## [26] "organization_disambiguated_organization_identifier"
## [27] "organization_disambiguation_source"
## [28] "url_value"
## [29] "url"
## [30] "end_date_year_value"
## [31] "end_date_month_value"
## [32] "end_date_day_value"
## [33] "start_date_day"
## [34] "end_date_day"
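Notice that item 26 still contains one "organization_disambiguated_". That is because str_replace() only replaces the first match in each string. If you wanted every occurrence removed, str_replace_all() would do it, as this small sketch on that one column name shows.

```r
library(stringr)

demo_name <- "employment_summary_organization_disambiguated_organization_disambiguated_organization_identifier"

# str_replace() removes only the first match of each pattern.
first_only <- demo_name %>%
  str_replace("employment_summary_", "") %>%
  str_replace("organization_disambiguated_", "")

# str_replace_all() removes every match of the pattern.
all_matches <- demo_name %>%
  str_replace("employment_summary_", "") %>%
  str_replace_all("organization_disambiguated_", "")

first_only
all_matches
```

The first result is "organization_disambiguated_organization_identifier" (what we see above), and the second is "organization_identifier". Which one you prefer is a matter of taste, as long as the names stay unique.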
If you take a look at the data, you will see that there is no variable indicating whether these are current or past institutions of employment. In fact, the only way to check if the institution is a place of current employment is if the end_date_year_value
is NA
. (Before we cleaned the names, this column was called employment_summary_end_date_year_value.) Keep in mind that start and end dates are not required fields, and we can’t be certain that people are updating their profiles. However, we can get a data frame of only those items meeting this criterion by using the filter()
function from dplyr
:
clarke_employment_data_current <- clarke_employment_data %>%
dplyr::filter(is.na(end_date_year_value))
clarke_employment_data_current
This will remove my previous two institutions, and keep only my current one: Oklahoma State University.
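Because dates are optional, a record with no dates at all also has an NA end year and will pass this filter. The sketch below uses made-up records to show the difference; the stricter variant that also requires a start year is my own suggestion for illustration, not something ORCID or rorcid prescribes.

```r
library(tibble)
library(dplyr)

# Made-up employment records: one ended, one ongoing, one with no dates.
demo_jobs <- tibble(
  organization_name = c("Org A", "Org B", "Org C"),
  start_date_year_value = c("2010", "2018", NA),
  end_date_year_value = c("2017", NA, NA)
)

# Rule used in the text: no end year means "current".
demo_current <- filter(demo_jobs, is.na(end_date_year_value))

# Stricter variant (an assumption): also require a start year, so that
# completely undated records are set aside for manual review.
demo_current_dated <- filter(
  demo_jobs,
  is.na(end_date_year_value),
  !is.na(start_date_year_value)
)

demo_current
demo_current_dated
```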
Because orcid_employments()
is vectorized, we can feed it multiple ORCID iDs and it will return data for the entire lot, if the individuals have added it.
I’ll grab a random assortment of OSU ORCID iDs from the previous section. Recall that these were pulled based on the detection of OSU data in either Ringgold, email, or affiliation names across all fields. We will issue the initial API call with orcid_employments()
, then put it into a data frame with the same function we used above.
my_osu_orcid_ids <- c("0000-0002-6160-9587", "0000-0001-8330-8251", "0000-0003-2863-6724", "0000-0001-6810-5560", "0000-0003-1935-9729", "0000-0002-9088-2312", "0000-0001-9792-7870", "0000-0003-3959-6916", "0000-0002-2621-5320", "0000-0001-9103-3040")
my_osu_employment <- rorcid::orcid_employments(my_osu_orcid_ids)
my_osu_employment_data <- my_osu_employment %>%
purrr::map(., purrr::pluck, "affiliation-group", "summaries") %>%
purrr::flatten_dfr() %>%
janitor::clean_names() %>%
dplyr::mutate(employment_summary_end_date = anytime::anydate(employment_summary_end_date/1000),
employment_summary_created_date_value = anytime::anydate(employment_summary_created_date_value/1000),
employment_summary_last_modified_date_value = anytime::anydate(employment_summary_last_modified_date_value/1000))
my_osu_employment_data
Clean up the names.
names(my_osu_employment_data) <- names(my_osu_employment_data) %>%
stringr::str_replace(., "employment_summary_", "") %>%
stringr::str_replace(., "source_source_", "") %>%
stringr::str_replace(., "organization_disambiguated_", "")
Note that this may have multiple entries per person because it gathered their entire employment history. Now let’s take a look at the unique organizations in this dataset:
my_osu_organizations <- my_osu_employment_data %>%
group_by(organization_name) %>%
count() %>%
arrange(desc(n))
my_osu_organizations
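As a side note, dplyr's count() can do the grouping and the sorting in one step. A minimal sketch on made-up organization names:

```r
library(tibble)
library(dplyr)

# Made-up employment records, one row per record.
demo_orgs <- tibble(
  organization_name = c("Org A", "Org B", "Org A", "Org C", "Org A", "Org B")
)

# count() groups by the column; sort = TRUE orders by n, largest first.
demo_counts <- count(demo_orgs, organization_name, sort = TRUE)
demo_counts
```

One small difference from the group_by() version is that the result is not left grouped, which can save a surprise later if you keep piping it into other functions.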
Out of this set of 10 iDs, only 3 list Oklahoma State University Stillwater in their employment; one lists Oklahoma State University, and another lists Oklahoma State University - Tulsa. The others may have earned a degree from OSU or done some service with OSU.
We can make this a bit more manageable by filtering to include only those institutions that include the word “Oklahoma.”
my_osu_organizations_filtered <- my_osu_organizations %>%
filter(str_detect(organization_name, "Oklahoma"))
my_osu_organizations_filtered
Then, out of those, we can decide which ones we want to keep. For instance, we may not want Oklahoma State University - Tulsa.
my_osu_employment_data_filtered <- my_osu_employment_data %>%
dplyr::filter(organization_name == "Oklahoma State University Stillwater"
| organization_name == "Oklahoma State University")
my_osu_employment_data_filtered
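If the list of names you want to keep grows, chaining | conditions gets unwieldy. An equivalent way to write the same filter is with %in% and a vector of names. This sketch runs on a small made-up table; keep_orgs is just an illustrative variable name.

```r
library(tibble)
library(dplyr)

# Made-up table with the organization name variants discussed above.
demo_employment <- tibble(
  organization_name = c(
    "Oklahoma State University Stillwater",
    "Oklahoma State University",
    "Oklahoma State University - Tulsa",
    "Some Other University"
  )
)

# The names we decided to keep, in one place.
keep_orgs <- c("Oklahoma State University Stillwater",
               "Oklahoma State University")

demo_kept <- filter(demo_employment, organization_name %in% keep_orgs)
demo_kept
```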
Then, out of those, let’s see how many are listed as current employees, as indicated with an NA
in their end_date_year_value
.
my_osu_employment_data_filtered_current <- my_osu_employment_data_filtered %>%
dplyr::filter(is.na(end_date_year_value))
my_osu_employment_data_filtered_current
It looks like the third row was removed because they have 1997 in their employment end date, and therefore are no longer employed by OSU.
Note that this will give you employment records ONLY. In other words, each row represents a single employment record for an individual. The name_value
variable refers specifically to the name of the person or system that wrote the record, NOT the name of the individual. To get that, you must first get all the unique ORCID iDs from the dataset.
The problem is that no distinct value identifies the ORCID iD of the person. The orcid_path
value corresponds to the path of the person who added the employment record (which is usually, but not always, the same person). Therefore you have to extract the ORCID iD from the path variable, put it in its own variable, and use that. We do this using str_sub from the stringr package. While we are at it, we can select and reorder the columns we want to keep.
osu_current_employment_all <- my_osu_employment_data_filtered_current %>%
mutate(orcid_identifier = str_sub(path, 2, 20)) %>%
select(orcid_identifier, organization_name, organization_address_city,
organization_address_region, organization_address_country,
organization_disambiguated_organization_identifier, organization_disambiguation_source, department_name, role_title, url,
display_index, visibility, created_date_value,
start_date_year_value, start_date_month_value, start_date_day_value,
end_date_year_value, end_date_month_value, end_date_day_value)
osu_current_employment_all
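The str_sub(path, 2, 20) call works because the path starts with a slash followed by the 19-character iD. Here is a sketch with made-up path values (the iDs and record numbers are invented, and I am assuming the "/iD/employment/number" layout) that also shows a pattern-based alternative with str_extract(), which does not depend on the iD's position.

```r
library(stringr)

# Made-up paths in the assumed "/<iD>/employment/<number>" layout.
demo_paths <- c(
  "/0000-0000-0000-0001/employment/111",
  "/0000-0000-0000-000X/employment/222"
)

# Position-based: characters 2 through 20 are the 19-character iD.
by_position <- str_sub(demo_paths, 2, 20)

# Pattern-based: four blocks of four, where the last character may be X.
by_pattern <- str_extract(demo_paths, "\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]")

by_position
identical(by_position, by_pattern)
```

Both approaches give the same result here. The pattern version returns NA when a value does not contain something shaped like an iD, which makes problems easier to spot.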
If we want to take the next step to join this with biographical information, we create a new vector osu_unique_orcids that includes only unique()
ORCID iDs from our filtered dataset and remove NA values with na.omit()
.
osu_unique_orcids <- unique(osu_current_employment_all$orcid_identifier) %>%
na.omit(.)
osu_unique_orcids
## [1] "0000-0001-8330-8251" "0000-0001-6810-5560"
Use orcid_person()
as above, and construct our tibble()
:
my_osu_orcid_person <- rorcid::orcid_person(osu_unique_orcids)
my_osu_orcid_person_data <- my_osu_orcid_person %>% {
dplyr::tibble(
created_date = purrr::map_chr(., purrr::pluck, "name", "created-date", "value", .default=NA_character_),
given_name = purrr::map_chr(., purrr::pluck, "name", "given-names", "value", .default=NA_character_),
family_name = purrr::map_chr(., purrr::pluck, "name", "family-name", "value", .default=NA_character_),
orcid_identifier_path = purrr::map_chr(., purrr::pluck, "name", "path", .default = NA_character_))
} %>%
dplyr::mutate(created_date = anytime::anydate(as.double(created_date)/1000))
my_osu_orcid_person_data
Now we can use left_join()
from dplyr
to join this biographical data back to the employment data. The result will include one row for each of an individual’s employment records.
osu_orcid_person_employment_join <- my_osu_orcid_person_data %>%
left_join(osu_current_employment_all, by = c("orcid_identifier_path" = "orcid_identifier"))
osu_orcid_person_employment_join
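To make the shape of the joined result concrete, here is a sketch with two made-up people, one of whom has two employment records. The iDs, names, and organizations are all invented for illustration.

```r
library(tibble)
library(dplyr)

# Made-up biographical data: one row per person.
demo_people <- tibble(
  orcid_identifier_path = c("0000-0000-0000-0001", "0000-0000-0000-0002"),
  family_name = c("Person A", "Person B")
)

# Made-up employment data: the first person has two records.
demo_jobs <- tibble(
  orcid_identifier = c("0000-0000-0000-0001", "0000-0000-0000-0001",
                       "0000-0000-0000-0002"),
  organization_name = c("Org A", "Org B", "Org C")
)

# The key columns have different names, so we map one to the other.
demo_joined <- left_join(
  demo_people, demo_jobs,
  by = c("orcid_identifier_path" = "orcid_identifier")
)
demo_joined
```

The result has three rows: the biographical columns are repeated for each of a person's employment records.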
Then you can write that to a CSV.
rorcid::works()
and rorcid::orcid_works()
There are two functions in rorcid
to get all of the works associated with an ORCID iD: orcid_works()
and works()
. The main difference between these is orcid_works()
returns a list, with each work as a list item, and each external identifier (e.g. ISSN, DOI) also as a list item. On the other hand, works()
returns a nice, neat data frame that can be easily exported to a CSV.
Like orcid_person()
, these functions require an ORCID iD, and do not use the query fields we saw with the orcid()
function.
carberry_orcid <- c("0000-0002-1825-0097")
carberry_works <- rorcid::works(carberry_orcid) %>%
as_tibble() %>%
janitor::clean_names() %>%
dplyr::mutate(created_date_value = anytime::anydate(created_date_value/1000))
carberry_works
Dr. Carberry has seven works. Because ORCID data can be manually entered, the integrity, completeness, and consistency of this data will sometimes vary.
You can see the external_ids_external_id column is actually a nested list, a concept we discussed above. This can be unnested with the tidyr::unnest()
function. Just as a single researcher can have multiple identifiers, a single work may also have multiple identifiers (e.g., DOI, ISSN, EID). If that is the case, when this column is unnested, there will be repeating rows for those items.
carberry_works_ids <- carberry_works %>%
tidyr::unnest(external_ids_external_id) %>%
janitor::clean_names()
carberry_works_ids
In this case, we now have 13 observations of 27 variables rather than 7 observations of 24 variables. The extra rows are there because all but one of the works have two external identifiers. The extra columns are there because the unnest replaced the single nested column with four new ones (that’s why we had to clean the names again). One of these describes the identifier’s relationship to the work: whether it refers to the item itself (SELF
), such as a DOI or a person identifier, or a whole that the item is part of (PART_OF
), such as an ISSN for a journal article.
So we can follow one of the three strategies outlined above if we want to write this to a CSV file: 1) unnest the column (as above), 2) drop the nested lists, or 3) mutate them into character vectors.
rorcid::works()
is not vectorized, meaning that if you have multiple ORCID iDs, you can’t use it. Instead, you have to pass them to the rorcid::orcid_works()
function.
my_orcids <- c("0000-0002-1825-0097", "0000-0002-9260-8456", "0000-0002-2771-9344")
my_works <- rorcid::orcid_works(my_orcids)
listviewer::jsonedit(my_works, mode = "view")
This returns a list of 3 elements, with the works nested in group > work-summary. They can be plucked and flattened into a data frame:
my_works_data <- my_works %>%
purrr::map_dfr(pluck, "works") %>%
janitor::clean_names() %>%
dplyr::mutate(created_date_value = anytime::anydate(created_date_value/1000))
my_works_data
Now you may want to run some analysis using the external identifiers; for instance, you can use the roadoi
package to look at which DOIs are open access.
We run into a problem here when we try to unnest the external IDs:
my_works_externalIDs <- my_works_data %>%
tidyr::unnest(external_ids_external_id)
The error message reads: "Error: Each column must either be a list of vectors or a list of data frames [external_ids_external_id]".
This is because some elements of the list column are empty. We can just filter them out before unnesting:
my_works_externalIDs <- my_works_data %>%
dplyr::filter(!purrr::map_lgl(external_ids_external_id, purrr::is_empty)) %>%
tidyr::unnest(external_ids_external_id)
my_works_externalIDs
If we want to keep them, there’s a workaround: use map_lgl
to first remove (filter()
out) the rows where external_ids_external_id
is empty, then unnest
the ids, then bind back the rows that had no identifiers, dropping their empty external_ids_external_id
column because it no longer exists in the unnested data.
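my_works_externalIDs_keep <- my_works_data %>%
# keep only rows that have at least one external identifier, then unnest them
dplyr::filter(!purrr::map_lgl(external_ids_external_id, purrr::is_empty)) %>%
tidyr::unnest(external_ids_external_id, .drop = TRUE) %>%
# bind back the rows with no identifiers, minus the now-redundant list column
dplyr::bind_rows(my_works_data %>%
dplyr::filter(purrr::map_lgl(external_ids_external_id, purrr::is_empty)) %>%
dplyr::select(-external_ids_external_id))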
my_works_externalIDs_keep
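If you have tidyr 1.0.0 or later, unnest() also has a keep_empty argument that may make the workaround unnecessary. The sketch below runs on made-up works, and it assumes the empty entries are NULL; entries that are empty in some other way (for example, an empty list) may still need the filter-and-bind approach above.

```r
library(tibble)
library(tidyr)

# Made-up works: the first has two identifiers, the second has none (NULL).
demo_works <- tibble(
  title = c("Work 1", "Work 2"),
  ids = list(
    tibble(id_type = c("doi", "issn"),
           id_value = c("10.0000/example", "0000-0000")),
    NULL
  )
)

# keep_empty = TRUE keeps a row for "Work 2", filled with NA identifiers.
demo_kept <- unnest(demo_works, ids, keep_empty = TRUE)
demo_kept
```

The result has three rows: two for the first work and one for the second, with NA in the identifier columns.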
The ORCID API is an excellent tool for analyzing research activity on multiple levels. rorcid
makes gathering and cleaning the data easier. Thanks to both ORCID and Scott Chamberlain for their contributions to the community. Again, read Paul Oldham’s excellent post at https://www.pauloldham.net/introduction-to-orcid-with-rorcid/ for more you can do. I hope this walkthrough helps. If you need to get in touch with me, find my contact info at https://info.library.okstate.edu/clarke-iakovakis.