Licensing

This walkthrough is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/.

Crossref and rcrossref

Crossref

Crossref is a not-for-profit membership organization dedicated to interlinking scholarly metadata, including journals, books, conference proceedings, working papers, technical reports, data sets, authors, funders, and more. The Crossref REST API allows anybody to search and reuse members’ metadata in a variety of ways. Read examples of user stories.

rcrossref

rcrossref is a package developed by Scott Chamberlain, Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram, part of the rOpenSci set of packages. rOpenSci is an incredible organization dedicated to open and reproducible research using shared data and reusable software. I strongly recommend you browse their set of packages at https://ropensci.org/packages/.

rcrossref serves as an interface to the Crossref API.

Getting publications from journals with cr_journals

cr_journals() takes either an ISSN or a general keyword query, and returns metadata for articles published in the journal, including DOI, title, volume, issue, pages, publisher, authors, etc. A full list of publications in Crossref is available on their website.

Getting journal details

Crossref depends entirely on publishers to supply metadata. Some fields are required, while others are optional. You may therefore first want to know what metadata a publisher has submitted to Crossref for a given journal. By calling cr_journals with works = FALSE, you can determine who publishes the journal, the total number of articles for the journal in Crossref, whether abstracts are included, whether the full text of articles is deposited, and whether the publisher supplies author affiliations, author ORCID iDs, article licensing data, funders for the article, article references, and a few other items.

First we will create a new vector plosone_issn with the ISSN for the journal PLoS ONE.

# assign the PLoS ISSN
plosone_issn <- "1932-6203"

We will then run rcrossref::cr_journals(), setting the ISSN equal to the plosone_issn we just created, and print the results.

# get information about the journal
plosone_details <- rcrossref::cr_journals(issn = plosone_issn, works = FALSE)
plosone_details
## $meta
## NULL
## 
## $data
##      title                 publisher      issn last_status_check_time
## 1 PLoS ONE Public Library of Science 1932-6203             2021-07-20
##   deposits_abstracts_current deposits_orcids_current deposits
## 1                       TRUE                    TRUE     TRUE
##   deposits_affiliations_backfile deposits_update_policies_backfile
## 1                          FALSE                              TRUE
##   deposits_similarity_checking_backfile deposits_award_numbers_current
## 1                                  TRUE                           TRUE
##   deposits_resource_links_current deposits_articles
## 1                           FALSE              TRUE
##   deposits_affiliations_current deposits_funders_current
## 1                         FALSE                     TRUE
##   deposits_references_backfile deposits_abstracts_backfile
## 1                         TRUE                        TRUE
##   deposits_licenses_backfile deposits_award_numbers_backfile
## 1                       TRUE                            TRUE
##   deposits_open_references_backfile deposits_open_references_current
## 1                              TRUE                             TRUE
##   deposits_references_current deposits_resource_links_backfile
## 1                        TRUE                            FALSE
##   deposits_orcids_backfile deposits_funders_backfile
## 1                     TRUE                      TRUE
##   deposits_update_policies_current deposits_similarity_checking_current
## 1                             TRUE                                 TRUE
##   deposits_licenses_current affiliations_current similarity_checking_current
## 1                      TRUE                    0                           1
##   funders_backfile licenses_backfile funders_current affiliations_backfile
## 1        0.1731985         0.9924777       0.5762672                     0
##   resource_links_backfile orcids_backfile update_policies_current
## 1                       0       0.1729574                       1
##   open_references_backfile orcids_current similarity_checking_backfile
## 1                        1      0.9389957                    0.9135426
##   references_backfile award_numbers_backfile update_policies_backfile
## 1           0.9135426              0.1471421                0.9999527
##   licenses_current award_numbers_current abstracts_backfile
## 1                1             0.4964669       3.782452e-05
##   resource_links_current abstracts_current open_references_current
## 1                      0          0.284577                       1
##   references_current total_dois current_dois backfile_dois
## 1                  1     253959        42456        211503
## 
## $facets
## NULL

This actually comes back as a list of three items: meta, data, and facets. The good stuff is in data.

We use the pluck() function from the purrr package to pull that data only. We will be using pluck throughout this tutorial; it’s an easy way of indexing deeply and flexibly into lists to extract information.

We don’t have time in this tutorial to discuss list items and purrr. For an excellent in-depth tutorial, see Jenny Bryan’s Introduction to map(): extract elements, also provided on this website.
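As a small offline illustration of what pluck() does, the sketch below builds a stand-in list by hand (mock_response is a hypothetical name, and its contents are made up rather than fetched from Crossref) and extracts the "data" element three equivalent ways.

```r
# A hand-built stand-in for the list returned by cr_journals().
# The values are illustrative only; no API call is made.
mock_response <- list(
  meta = NULL,
  data = data.frame(title = "PLoS ONE", issn = "1932-6203"),
  facets = NULL
)

# three equivalent ways to pull out the "data" element
by_dollar  <- mock_response$data
by_bracket <- mock_response[["data"]]
by_pluck   <- purrr::pluck(mock_response, "data")

# all three return the same one-row data frame
identical(by_dollar, by_bracket)
identical(by_bracket, by_pluck)
```

pluck() becomes more convenient than $ or [[ ]] once it sits at the end of a pipe or needs to reach several levels into a list.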

# get information about the journal and pluck the data
plosone_details <- rcrossref::cr_journals(issn = plosone_issn, works = FALSE) %>%
    purrr::pluck("data")

The purrr::pluck() function is connected to the cr_journals() call with the pipe operator %>%, which we will be using throughout the tutorial. A pipe takes the output of one statement and immediately makes it the input of the next statement, so you don’t have to save every intermediate result to your R environment. You can think of it as “then” in natural language. So the above script first makes the API call with cr_journals(), then applies pluck() to extract only the list element called "data", and assigns the result to plosone_details.
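To see the pipe in isolation, here is a minimal sketch that compares a nested call with the equivalent piped call. It uses a hand-built list (mock_response, a hypothetical stand-in) so that it runs without contacting Crossref.

```r
library(magrittr)  # provides %>%; dplyr and purrr re-export the same operator

# illustrative stand-in for an API response; no API call is made
mock_response <- list(
  meta = NULL,
  data = data.frame(title = "PLoS ONE"),
  facets = NULL
)

# nested call: read from the inside out
nested <- nrow(purrr::pluck(mock_response, "data"))

# piped call: read left to right
# "take the response, then pluck the data, then count the rows"
piped <- mock_response %>%
    purrr::pluck("data") %>%
    nrow()

identical(nested, piped)
```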

We now have a data frame including the details Crossref has on file about PLoS ONE. Scroll to the right to see all the columns.

plosone_details

There are a number of ways to explore this data frame:

# display information about the data frame
str(plosone_details)
## 'data.frame':    1 obs. of  53 variables:
##  $ title                                : chr "PLoS ONE"
##  $ publisher                            : chr "Public Library of Science"
##  $ issn                                 : chr "1932-6203"
##  $ last_status_check_time               : Date, format: "2021-07-20"
##  $ deposits_abstracts_current           : logi TRUE
##  $ deposits_orcids_current              : logi TRUE
##  $ deposits                             : logi TRUE
##  $ deposits_affiliations_backfile       : logi FALSE
##  $ deposits_update_policies_backfile    : logi TRUE
##  $ deposits_similarity_checking_backfile: logi TRUE
##  $ deposits_award_numbers_current       : logi TRUE
##  $ deposits_resource_links_current      : logi FALSE
##  $ deposits_articles                    : logi TRUE
##  $ deposits_affiliations_current        : logi FALSE
##  $ deposits_funders_current             : logi TRUE
##  $ deposits_references_backfile         : logi TRUE
##  $ deposits_abstracts_backfile          : logi TRUE
##  $ deposits_licenses_backfile           : logi TRUE
##  $ deposits_award_numbers_backfile      : logi TRUE
##  $ deposits_open_references_backfile    : logi TRUE
##  $ deposits_open_references_current     : logi TRUE
##  $ deposits_references_current          : logi TRUE
##  $ deposits_resource_links_backfile     : logi FALSE
##  $ deposits_orcids_backfile             : logi TRUE
##  $ deposits_funders_backfile            : logi TRUE
##  $ deposits_update_policies_current     : logi TRUE
##  $ deposits_similarity_checking_current : logi TRUE
##  $ deposits_licenses_current            : logi TRUE
##  $ affiliations_current                 : num 0
##  $ similarity_checking_current          : num 1
##  $ funders_backfile                     : num 0.173
##  $ licenses_backfile                    : num 0.992
##  $ funders_current                      : num 0.576
##  $ affiliations_backfile                : num 0
##  $ resource_links_backfile              : num 0
##  $ orcids_backfile                      : num 0.173
##  $ update_policies_current              : num 1
##  $ open_references_backfile             : num 1
##  $ orcids_current                       : num 0.939
##  $ similarity_checking_backfile         : num 0.914
##  $ references_backfile                  : num 0.914
##  $ award_numbers_backfile               : num 0.147
##  $ update_policies_backfile             : num 1
##  $ licenses_current                     : num 1
##  $ award_numbers_current                : num 0.496
##  $ abstracts_backfile                   : num 3.78e-05
##  $ resource_links_current               : num 0
##  $ abstracts_current                    : num 0.285
##  $ open_references_current              : num 1
##  $ references_current                   : num 1
##  $ total_dois                           : int 253959
##  $ current_dois                         : int 42456
##  $ backfile_dois                        : int 211503

Type ?str into the console to read the description of the str function. You can call str() on an R object to compactly display information about it, including the data type, the number of elements, and a printout of the first few elements.

# dimensions: 1 row, 53 columns
dim(plosone_details)
## [1]  1 53
# number of rows
nrow(plosone_details)
## [1] 1
# number of columns
ncol(plosone_details)
## [1] 53
# column names
names(plosone_details)
##  [1] "title"                                
##  [2] "publisher"                            
##  [3] "issn"                                 
##  [4] "last_status_check_time"               
##  [5] "deposits_abstracts_current"           
##  [6] "deposits_orcids_current"              
##  [7] "deposits"                             
##  [8] "deposits_affiliations_backfile"       
##  [9] "deposits_update_policies_backfile"    
## [10] "deposits_similarity_checking_backfile"
## [11] "deposits_award_numbers_current"       
## [12] "deposits_resource_links_current"      
## [13] "deposits_articles"                    
## [14] "deposits_affiliations_current"        
## [15] "deposits_funders_current"             
## [16] "deposits_references_backfile"         
## [17] "deposits_abstracts_backfile"          
## [18] "deposits_licenses_backfile"           
## [19] "deposits_award_numbers_backfile"      
## [20] "deposits_open_references_backfile"    
## [21] "deposits_open_references_current"     
## [22] "deposits_references_current"          
## [23] "deposits_resource_links_backfile"     
## [24] "deposits_orcids_backfile"             
## [25] "deposits_funders_backfile"            
## [26] "deposits_update_policies_current"     
## [27] "deposits_similarity_checking_current" 
## [28] "deposits_licenses_current"            
## [29] "affiliations_current"                 
## [30] "similarity_checking_current"          
## [31] "funders_backfile"                     
## [32] "licenses_backfile"                    
## [33] "funders_current"                      
## [34] "affiliations_backfile"                
## [35] "resource_links_backfile"              
## [36] "orcids_backfile"                      
## [37] "update_policies_current"              
## [38] "open_references_backfile"             
## [39] "orcids_current"                       
## [40] "similarity_checking_backfile"         
## [41] "references_backfile"                  
## [42] "award_numbers_backfile"               
## [43] "update_policies_backfile"             
## [44] "licenses_current"                     
## [45] "award_numbers_current"                
## [46] "abstracts_backfile"                   
## [47] "resource_links_current"               
## [48] "abstracts_current"                    
## [49] "open_references_current"              
## [50] "references_current"                   
## [51] "total_dois"                           
## [52] "current_dois"                         
## [53] "backfile_dois"

We see this data frame includes one observation of 53 variables. These include the total number of DOIs; whether deposits of abstracts, ORCID iDs, and article references are current; and other information.
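With 53 columns it can help to summarise by column type. The sketch below rebuilds a handful of the columns from the printed output by hand (details is a hypothetical name, so no API call is needed) and lists which logical columns are TRUE. It also converts the numeric columns to percentages, on the assumption that they are proportions between 0 and 1.

```r
# a few columns copied from the printed output above
details <- data.frame(
  deposits_funders_current      = TRUE,
  deposits_affiliations_current = FALSE,
  funders_current               = 0.5762672,
  orcids_current                = 0.9389957,
  abstracts_current             = 0.284577
)

# which logical columns are TRUE?
is_true <- vapply(details, isTRUE, logical(1))
names(details)[is_true]

# numeric columns as rounded percentages
# (assumes these values are proportions between 0 and 1)
is_num <- vapply(details, is.numeric, logical(1))
round(100 * unlist(details[is_num]))
```

The same two vapply() calls should work unchanged on the full plosone_details data frame.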

You can use the $ symbol to work with particular variables. For example, the publisher column:

# print the publisher variable
plosone_details$publisher
## [1] "Public Library of Science"

The total number of DOIs on file:

# print the total number of DOIs
plosone_details$total_dois
## [1] 253959

Whether the data publishers provide on funders of articles they publish is current in Crossref (a TRUE/FALSE value–called “logical” in R):

# is funder data current on deposits?
plosone_details$deposits_funders_current
## [1] TRUE

TRY IT YOURSELF

  1. Assign an ISSN for a well-known journal to a new variable in R. Name it whatever you like. You can use the Scimago Journal Rank to look up the ISSN. If you need a couple examples, try RUSA or Library Hi Tech. Make sure to put the ISSN in quotes to create a character vector.
  2. Look up the journal details using cr_journals. Make sure to pass the argument works = FALSE.
  3. Print the data to your console by typing in the value name.

Does it matter if the ISSN has a hyphen or not? Try both methods.

# assign an ISSN to a value. Call the value what you want (e.g. plosone_issn)
# look up journal details using the cr_journals function and assign it to a new
# value (e.g. plosone_details).  Remember to include a %>% pipe and call
# purrr::pluck('data')
# print the journal details to the console by typing in the value name

Getting journal publications by ISSN

To get metadata for the publications themselves rather than data about the journal, we will again use the plosone_issn value in the issn = argument to cr_journals, but we now set works = TRUE.

# get metadata on articles by setting works = TRUE
plosone_publications <- cr_journals(issn = plosone_issn, works = TRUE, limit = 25) %>%
    pluck("data")

Let’s walk through this step by step:

  • First, we are creating a new value called plosone_publications
  • We are using the assignment operator <- to assign the results of an operation to this new value
  • We are running the function cr_journals(). It is not necessary to add rcrossref:: to the beginning of the function.
  • We pass three arguments to the function:
    • issn = plosone_issn : We defined plosone_issn earlier in the session as ‘1932-6203’. We are reusing that value here to tell the cr_journals() function what journal we want information on
    • works = TRUE : When we earlier specified works = FALSE, we got back information on the journal. When works = TRUE, we will get back article-level metadata
    • limit = 25 : We will get back 25 articles. The default number of articles returned is 20, but you can increase or decrease that with the limit argument. The max limit is 1000, but you can get more using the cursor argument (see below).
  • %>% : The pipe operator tells R to take the results of this function and use them as the input for what follows
  • pluck("data") : This will grab only the contents of the list item “data” and return it to plosone_publications.
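For journals with more than 1000 works, the cursor argument mentioned above can be sketched as follows. get_journal_works is a hypothetical helper, not part of rcrossref, and it assumes that cursor = "*" starts deep paging and that cursor_max caps the number of records, as described in the help page ?cr_journals. The function is only defined here, not called, because a real call sends a heavy query to Crossref.

```r
# Hypothetical helper (not part of rcrossref): fetch up to `max_records`
# works for one journal using deep paging.
get_journal_works <- function(issn, max_records = 2000) {
  res <- rcrossref::cr_journals(
    issn = issn,
    works = TRUE,
    cursor = "*",              # assumed: "*" starts deep paging
    cursor_max = max_records,  # assumed: stop after this many records
    limit = 1000               # records requested per page
  )
  # keep only the data frame of works
  res$data
}

# example call (commented out so that no query is sent by accident):
# plosone_many <- get_journal_works("1932-6203", max_records = 2000)
```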

Let’s explore the data frame:

# print dimensions of this data frame
dim(plosone_publications)
## [1] 25 33

When we run dim() (dimensions) on this result, we now see a different number of rows and columns: 25 rows and 33 columns. This is therefore a different dataset than plosone_details. Let’s call names() to see what the column names are:

# print column names
names(plosone_publications)
##  [1] "container.title"        "created"                "deposited"             
##  [4] "published.online"       "doi"                    "indexed"               
##  [7] "issn"                   "issue"                  "issued"                
## [10] "member"                 "page"                   "prefix"                
## [13] "publisher"              "score"                  "source"                
## [16] "reference.count"        "references.count"       "is.referenced.by.count"
## [19] "title"                  "type"                   "update.policy"         
## [22] "url"                    "volume"                 "language"              
## [25] "short.container.title"  "author"                 "link"                  
## [28] "content_domain"         "license"                "reference"             
## [31] "update_to"              "clinical-trial-number"  "funder"

We view the entire data frame below. Because there are some nested lists within the data, we will use the select() function from the dplyr package to select only a few columns. This will make it easier for us to view here in the Azure Notebook environment. You can also use the select() function to rearrange the columns.
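Here is a minimal offline sketch of select() on a small hand-built data frame with one nested column. The data frame mock_pubs, its titles, DOIs, and authors are all made up for illustration.

```r
# illustrative data only: two made-up articles
mock_pubs <- data.frame(
  title = c("Article A", "Article B"),
  doi = c("10.1234/a", "10.1234/b"),
  volume = c("16", "16"),
  stringsAsFactors = FALSE
)

# add a nested list column: one small data frame of authors per article
mock_pubs$author <- list(
  data.frame(given = "Ana", family = "Lopez"),
  data.frame(given = c("Bo", "Cy"), family = c("Chen", "Diaz"))
)

# keep two columns and put doi first; the nested author column is left out
slim <- dplyr::select(mock_pubs, doi, title)
names(slim)
```

The order in which you name the columns inside select() is the order they appear in the result.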

# print select columns from the data frame
plosone_publications %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Here we are just getting back the last 25 articles that have been indexed in Crossref by PLoS ONE. However, this gives you a taste of how rich the metadata is. We have the dates the article was deposited and published online, the title, DOI, the ISSN, the volume, issue, and page numbers, the number of references, the URL, and for some items, the subjects. The omitted columns include information on licensing, authors, and more. We will deal with those columns further down.

Getting multiple publications by ISSN

You can also pass multiple ISSNs to cr_journals. Here we create two new values, jama_issn and jah_issn. These are the ISSNs for JAMA: The Journal of the American Medical Association and the Journal of American History. We then pass them to cr_journals inside the c() function, which combines them (it’s like CONCATENATE in Excel). We set works to TRUE so we’ll get the publications metadata, and we set the limit to 10, so we’ll get 10 publications per journal.

# assign the JAMA and JAH ISSNs
jama_issn <- "1538-3598"
jah_issn <- "0021-8723"

# get the last 10 publications on deposit from each journal. For multiple
# ISSNs, use c() to combine them
jah_jama_publications <- rcrossref::cr_journals(issn = c(jama_issn, jah_issn), works = T,
    limit = 10) %>%
    purrr::pluck("data")

Here we used c() to combine jama_issn and jah_issn. c() is used to create a vector in R. A vector is a sequence of elements of the same type. In this case, even though the ISSNs are numbers, we created them as character vectors by surrounding them in quotation marks. You can use single or double quotes. Above, when we assigned 5 to y, we created a numeric vector.

Vectors can only contain “homogeneous” data–in other words, all data must be of the same type. The type of a vector determines what kind of analysis you can do on it. For example, you can perform mathematical operations on numeric objects, but not on character objects. You can think of vectors as columns in an Excel spreadsheet: for example, in a name column, you want every value to be a character; in a date column, you want every value to be a date; etc.
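A short base R sketch of these points follows. The ISSNs are the ones used above; limits and mixed are illustrative names.

```r
# ISSNs in quotes form a character vector
issns <- c("1538-3598", "0021-8723")
class(issns)

# numbers without quotes form a numeric vector, which supports arithmetic
limits <- c(10, 25, 50)
class(limits)
sum(limits)

# mixing types does not raise an error: R silently converts everything
# to a single type, here character
mixed <- c(5, "0021-8723")
class(mixed)
```

Because of that silent conversion, it is worth checking class() whenever a calculation on a column fails unexpectedly.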

Going back to our jah_jama_publications object, we have a data frame composed of 20 observations of 29 variables. This is a rich set of metadata for the articles in the given publications. The fields are detailed in the Crossref documentation, including the field name, type, description, and whether or not it’s required. Some of these fields are title, DOI, DOI prefix identifier, ISSN, volume, issue, publisher, abstract (if provided), reference count (if provided–i.e., the number of references in the given article), link (if provided), subject (if provided), and other information. The number of citations to the article is not pulled, but this can be gathered separately with cr_citation_count() (see below).

# print column names
names(jah_jama_publications)
##  [1] "container.title"        "created"                "deposited"             
##  [4] "published.print"        "doi"                    "indexed"               
##  [7] "issn"                   "issue"                  "issued"                
## [10] "member"                 "page"                   "prefix"                
## [13] "publisher"              "score"                  "source"                
## [16] "reference.count"        "references.count"       "is.referenced.by.count"
## [19] "title"                  "type"                   "url"                   
## [22] "volume"                 "author"                 "content_domain"        
## [25] "language"               "short.container.title"  "link"                  
## [28] "subject"                "published.online"
# print data frame with select columns
jah_jama_publications %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Filtering the cr_journals query with the filter argument

You can use the filter argument within cr_journals to specify some parameters as the query is executing. This filter is built into the Crossref API query. See the available filters by calling rcrossref::filter_names(), and their details by calling rcrossref::filter_details(). They are also listed in the API documentation.

award_funder
award_number
content_domain
from_accepted_date
from_online_pub_date
from_posted_date
from_print_pub_date
has_assertion
has_authenticated_orcid
has_crossmark_restriction
has_relation
location
relation_object
relation_object_type
relation_type
until_accepted_date
until_online_pub_date
until_posted_date
until_print_pub_date
has_funder
funder
prefix
member
from_index_date
until_index_date
from_deposit_date
until_deposit_date
from_update_date
until_update_date
from_pub_date
until_pub_date
has_license
license_url
license_version
license_delay
has_full_text
full_text_version
full_text_type
full_text_application
has_references
reference_visibility
has_archive
archive
has_orcid
orcid
issn
isbn
type
directory
doi
updates
is_update
has_update_policy
container_title
category_name
type_name
from_created_date
until_created_date
has_affiliation
assertion_group
assertion
article_number
alternative_id
has_clinical_trial_number
has_abstract
has_content_domain
has_domain_restriction

Filtering by publication date with from_pub_date and until_pub_date

For example, you may only want to pull publications from a given year, or within a date range. Remember to increase the limit or use cursor if you need to. Also notice three things about the filter argument:

  • The filter name (for example from_pub_date) goes to the left of the = sign. Wrap it in backticks (the key next to the 1 on the keyboard) if it is not a valid R name
  • The filter value is in quotes
  • The whole thing is wrapped in c()
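The value passed to filter is an ordinary named character vector, so you can build and inspect it before making any query. The sketch below uses two filter names from the list above. It demonstrates R syntax only and makes no claim about which spellings the API wrapper accepts.

```r
# a date range expressed as a named character vector
pub_filter <- c(from_pub_date = "2019-01-01", until_pub_date = "2019-12-31")
names(pub_filter)

# backticks around a valid R name are optional and change nothing
same <- c(`from_pub_date` = "2019-01-01")
identical(same, pub_filter["from_pub_date"])

# backticks are required only when the name is not a valid R name,
# for example one that contains a hyphen
hyphenated <- c(`from-pub-date` = "2019-01-01")
names(hyphenated)
```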

Here, we will get all articles from the Journal of Librarianship and Scholarly Communication published on or after January 1, 2019:

# assign the JLSC ISSN
jlsc_issn <- "2162-3309"

# get articles published since January 1, 2019
jlsc_publications_2019 <- rcrossref::cr_journals(issn = jlsc_issn, works = T, filter = c(from_pub_date = "2019-01-01")) %>%
    purrr::pluck("data")

# print the dataframe with select column
jlsc_publications_2019 %>%
    dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count,
        type, issn)

Filtering by funder with award.funder

You can also return all articles funded by a specific funder. See the Crossref Funder Registry for a list of funders and their DOIs.

Here, we will combine two filters: award.funder and from_pub_date to return all articles published in PLoS ONE where a) at least one funder is the National Institutes of Health, and b) the article was published on or after March 1, 2020. Note that we set a limit here of 25 because we are doing a teaching activity and we don’t want to send heavy queries. If you were doing this on your own, you would likely want to raise the limit or use the cursor argument.

# assign the PLoS ONE ISSN and the NIH Funder DOI
plosone_issn <- "1932-6203"
nih_funder_doi <- "10.13039/100000002"

# get articles published in PLoS since 3/1 funded by NIH
plosone_publications_nih <- rcrossref::cr_journals(issn = plosone_issn, works = T,
    limit = 25, filter = c(award.funder = nih_funder_doi, from_pub_date = "2020-03-01")) %>%
    purrr::pluck("data")

We will use unnest() from the tidyr package to view the data frame here. This is described below in Unnesting List Columns.

# print the dataframe, first unnesting the funder column
plosone_publications_nih %>%
    tidyr::unnest(funder)

If you scroll all the way to the right, you can see the funder information. Look at the title column and you will notice that some article titles are now duplicated, but with different funders in the name column. This is because a single article may have multiple funders, and a new row is created for each funder, with data including the award number.
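The row duplication is easy to reproduce offline. In this sketch the data frame articles, its titles, and its funder names are all made up. The first article has two funders and the second has one, so unnesting turns two rows into three.

```r
# illustrative data only: a list column holding one small data frame
# of funders per article
articles <- tibble::tibble(
  title = c("Article A", "Article B"),
  funder = list(
    data.frame(name = c("Funder X", "Funder Y"), stringsAsFactors = FALSE),
    data.frame(name = "Funder X", stringsAsFactors = FALSE)
  )
)

# one row per article before unnesting
nrow(articles)

# one row per article-funder pair after unnesting
unnested <- tidyr::unnest(articles, funder)
nrow(unnested)
unnested$title
```

Keep this in mind when counting: after unnesting, count distinct DOIs or titles rather than rows if you want the number of articles.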

Filtering by license with has_license

You may be interested in licensing information for articles; for instance, gathering publications in a given journal that are licensed under Creative Commons. First run cr_journals with works set to FALSE in order to return journal details so you can check if the publisher even sends article licensing information to Crossref–it’s not required. We will use PLOS ONE again as an example.

# assign the PLoS ONE ISSN and get journal details by setting works = FALSE
plosone_issn <- "1932-6203"
plosone_details <- rcrossref::cr_journals(issn = plosone_issn, works = FALSE) %>%
    purrr::pluck("data")

We can check the deposits_licenses_current variable to see if license data on file is current. If it is TRUE, PLoS ONE does send licensing information and it is current.

# is article licensing data on file current?
plosone_details$deposits_licenses_current
## [1] TRUE

We can now rerun the query but set works = TRUE, and set the has_license to TRUE. This will therefore return only articles that have license information. We will set our limit to 25.

# get last 25 articles on file where has_license is TRUE
plosone_license <- rcrossref::cr_journals(issn = plosone_issn, works = T, limit = 25,
    filter = c(has_license = TRUE)) %>%
    pluck("data")
# print the data with select columns
plosone_license %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn, license)

The license data comes in as a nested column. We can unnest it using tidyr::unnest, which we used above with funders and will be discussed more below.

# print the data frame with license unnested. Other list columns are kept; drop
# them first with dplyr::select() if you do not need them.
plosone_license %>%
    tidyr::unnest(license)

This adds four columns all the way to the right:

  • date (Date on which this license begins to take effect)
  • URL (Link to a web page describing this license; in this case, Creative Commons)
  • delay in days (Number of days between the publication date of the work and the start date of this license)
  • content.version, which specifies the version of the article the licensing data pertains to (VOR = Version of Record, AM = Accepted Manuscript, TDM = Text and Data Mining)

Browsing the rows, we see all are CC BY 4.0, which stands to reason given PLOS ONE is an open access publisher and applies the CC BY license to the articles they publish.

Filtering rows and selecting columns with dplyr

You can use the filter() and select() functions from the dplyr package if you want to get subsets of this data after you have made the query. Note that this is a completely different filter than the one used above inside the cr_journals() function. That one was an argument sent with the API call that filtered the results before they were returned. This is a separate function that is part of dplyr to help you filter a data frame in R.

To learn more about the dplyr package, read the “Data Transformation” chapter in R For Data Science.

Above, we retrieved all articles from the Journal of Librarianship & Scholarly Communication published on or after January 1, 2019. Let’s say you want only volume 8, issue 1:

# assign the JLSC ISSN and get all publications after January 1, 2019
jlsc_issn <- "2162-3309"
jlsc_publications_2019 <- rcrossref::cr_journals(issn = jlsc_issn, works = T, limit = 25,
    filter = c(from_pub_date = "2019-01-01")) %>%
    purrr::pluck("data")

# print the data frame with select columns
jlsc_publications_2019 %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)
# use filter from dplyr to get only volume 8, issue 1
jlsc_8_1 <- jlsc_publications_2019 %>%
    dplyr::filter(volume == "8", issue == "1")

# print the data frame with select columns
jlsc_8_1 %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

filter() will go through each row of your existing jlsc_publications_2019 data frame, and keep only those rows with values matching the filters you input. Note: be careful of filtering by ISSN. If a journal has multiple ISSNs they’ll be combined in a single cell with a comma and the filter() will fail, as with JAMA above. In this case it may be wiser to use str_detect(), as described a couple code chunks down.

jah_jama_publications$issn[1]
## [1] "0098-7484,1538-3598"
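The sketch below reproduces the problem on a small hand-built data frame. The data frame mock_pubs and its titles are made up, while the ISSN strings are the ones shown above. An exact comparison finds nothing, whereas str_detect() finds the row whose cell contains the ISSN.

```r
# illustrative data: the first row has two ISSNs combined in one cell
mock_pubs <- data.frame(
  title = c("Article A", "Article B"),
  issn = c("0098-7484,1538-3598", "0021-8723"),
  stringsAsFactors = FALSE
)

# exact match: no cell is equal to "1538-3598", so no rows come back
exact <- dplyr::filter(mock_pubs, issn == "1538-3598")
nrow(exact)

# partial match: fixed() treats the ISSN as literal text, not as a pattern
partial <- dplyr::filter(
  mock_pubs,
  stringr::str_detect(issn, stringr::fixed("1538-3598"))
)
nrow(partial)
```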

We can use filter() to get a single article from within this data frame if we need, either by DOI:

# filter to get 'The Five Laws of OER' article by DOI
jlsc_article <- jlsc_publications_2019 %>%
    dplyr::filter(doi == "10.7710/2162-3309.2299")

# print data frame with select columns
jlsc_article %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Or by title:

# use str_detect to search the title column for articles that include the term
# OER
jlsc_article <- jlsc_publications_2019 %>%
    dplyr::filter(stringr::str_detect(title, "OER"))

# print the data frame with select column
jlsc_article %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Here, we use the str_detect() function from the stringr package, which is loaded as part of the tidyverse, in order to find a single term (OER) in the title.

Remember that these dplyr and stringr functions are searching through our existing data frame jlsc_publications_2019, not issuing new API calls.
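One detail worth knowing is that str_detect() is case-sensitive by default. The sketch below uses three titles: the first is the article named above, and the other two are made up for illustration.

```r
titles <- c(
  "The Five Laws of OER",
  "Open educational resources (oer) in practice",
  "Metadata quality"
)

# default matching is case-sensitive: only the first title matches
upper_only <- stringr::str_detect(titles, "OER")
upper_only

# regex(ignore_case = TRUE) matches "OER", "oer", "Oer", and so on
any_case <- stringr::str_detect(titles, stringr::regex("oer", ignore_case = TRUE))
any_case
```

A short term like this can also match inside longer words when case is ignored, so check the results by eye before relying on them.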

Field queries

There is yet another way of making your query more precise, and that is to use a field query (flq) argument to cr_journals(). This allows you to search in specific bibliographic fields such as author, editor, titles, ISSNs, and author affiliation (not widely available). These are listed in the Crossref documentation and reproduced below. You must provide an ISSN–in other words, you can’t run a field query for authors across all journals.

Field query parameter    Description
query.container-title    Query container-title, a.k.a. publication name
query.author             Query author given and family names
query.editor             Query editor given and family names
query.chair              Query chair given and family names
query.translator         Query translator given and family names
query.contributor        Query author, editor, chair and translator given and family names
query.bibliographic      Query bibliographic information, useful for citation lookup. Includes titles, authors, ISSNs and publication years
query.affiliation        Query contributor affiliations

Field query by title

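Here, we get publications from the Journal of Librarianship and Scholarly Communication that match the term “open access”. The query.bibliographic field searches titles along with authors, ISSNs, and publication years, so a match is usually, but not only, in the title.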

# assign JLSC ISSN and query the bibliographic field for terms mentioning open
# access.
jlsc_issn <- "2162-3309"
jlsc_publications_oa <- rcrossref::cr_journals(issn = jlsc_issn, works = TRUE, limit = 25,
    flq = c(query.bibliographic = "open access")) %>%
    purrr::pluck("data")

# print the data frame with select columns
jlsc_publications_oa %>%
    dplyr::select(title, doi, volume, issue, page, issued, issn, author)

Field query by author, contributor, or editor

The flq argument can also be used for authors, contributors, or editors. Here we search the same journal for authors with the name Salo (looking for all articles written by Dorothea Salo).

# Use the query.author field query to find JLSC articles with author name Salo
jlsc_publications_auth <- rcrossref::cr_journals(issn = jlsc_issn, works = TRUE, limit = 25,
    flq = c(query.author = "salo")) %>%
    purrr::pluck("data")

# print the data frame with select columns
jlsc_publications_auth %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

TRY IT YOURSELF

  1. Assign the ISSN for College & Research Libraries (2150-6701) to a variable
  2. Use the query.author field query (flq) to find all articles written by Lisa Janicke Hinchliffe
# assign the C&RL ISSN 2150-6701
# use the query.author field query to search for articles written by Lisa
# Hinchliffe.

Viewing the JSON file

You can view these files in a JSON viewer using jsonedit() from the listviewer package. Note especially the last few variables. These are nested lists, as a single article can have multiple authors, and each author has a given name, family name, and sequence of authorship.

# assign the PLOS ONE ISSN and get 5 articles
plosone_issn <- "1932-6203"
plosone_publications <- cr_journals(issn = plosone_issn, works = TRUE, limit = 5) %>%
    pluck("data")

listviewer::jsonedit(plosone_publications, mode = "view")

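To save the output as JSON, pass the data frame to write_json() from jsonlite. write_json() converts the data to JSON itself, so there is no need to call toJSON() first; writing an already converted JSON string would store it as an escaped string rather than as structured JSON.

# write the data frame to a JSON file
jsonlite::write_json(plosone_publications, "plosone_publications.json")

If you want the JSON as a string inside R, for example to print or inspect it, use toJSON():

# use the toJSON function to convert the output to a JSON string
plosone_publications_json <- jsonlite::toJSON(plosone_publications)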
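A quick way to confirm that a saved file contains usable JSON is to read it back into R. The sketch below uses a small made-up data frame (toy_publications is hypothetical, not Crossref output) and a temporary file, and assumes only the jsonlite functions write_json() and read_json():

```r
library(jsonlite)

# hypothetical stand-in for a data frame returned by cr_journals()
toy_publications <- data.frame(
    title = c("First example article", "Second example article"),
    doi = c("10.0000/example.1", "10.0000/example.2"),
    stringsAsFactors = FALSE
)

# write the data frame to a temporary JSON file
json_path <- tempfile(fileext = ".json")
jsonlite::write_json(toy_publications, json_path, pretty = TRUE)

# read it back; simplifyVector = TRUE rebuilds a data frame from the array of records
toy_roundtrip <- jsonlite::read_json(json_path, simplifyVector = TRUE)

# compare a column before and after the round trip
identical(toy_publications$doi, toy_roundtrip$doi)
```

Real Crossref output has nested list columns, so the data frame you read back may not be structured exactly like the original; this check only confirms that the file parses and the flat columns survive.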

You can also open the JSON in an online JSON viewer such as Code Beautify. Either open a file or paste the JSON output on the left side. Click Tree Viewer to view the data.

Using cr_works() to get data on articles

cr_works() allows you to search by DOI or a general query in order to return the Crossref metadata.

It is important to note, as Crossref does in the documentation:

Crossref does not use “works” in the FRBR sense of the word. In Crossref parlance, a “work” is just a thing identified by a DOI. In practice, Crossref DOIs are used as citation identifiers. So, in FRBR terms, this means, that a Crossref DOI tends to refer to one expression which might include multiple manifestations. So, for example, the ePub, HTML and PDF version of an article will share a Crossref DOI because the differences between them should not effect the interpretation or crediting of the content. In short, they can be cited interchangeably. The same is true of the “accepted manuscript” and the “version-of-record” of that accepted manuscript.

Searching by DOI

You can pass a DOI directly to cr_works() using the dois argument:

# Get metadata for a single article by DOI
jlsc_ku_oa <- cr_works(dois = "10.7710/2162-3309.1252") %>%
    purrr::pluck("data")

# print the data frame with select columns
jlsc_ku_oa %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

You can also pass more than one DOI. Here we start by assigning our DOIs to a variable my_dois, then pass it to cr_works() in the dois argument:

# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", "10.1016/j.joi.2016.08.002", "10.1371/journal.pone.0020961",
    "10.3389/fpsyg.2018.01487", "10.1038/d41586-018-00104-7", "10.12688/f1000research.8460.2",
    "10.7551/mitpress/9286.001.0001")

# pass the my_dois vector to cr_works()
my_dois_works <- rcrossref::cr_works(dois = my_dois) %>%
    pluck("data")

# print the data frame with select columns
my_dois_works %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Unnesting list columns

Authors, links, licenses, funders, and some other values can appear in nested lists when you call cr_journals or cr_works, because an article can have, and often does have, more than one of each. You can check the data classes on all variables by running typeof() across all columns using the map_chr() function from purrr:

# query to get data on a specific PLOS article
plos_article <- cr_works(dois = "10.1371/journal.pone.0228782") %>%
    purrr::pluck("data")

# print the type of each column (e.g. character, numeric, logical, list)
purrr::map_chr(plos_article, typeof)
##        container.title                created              deposited 
##            "character"            "character"            "character" 
##       published.online                    doi                indexed 
##            "character"            "character"            "character" 
##                   issn                  issue                 issued 
##            "character"            "character"            "character" 
##                 member                   page                 prefix 
##            "character"            "character"            "character" 
##              publisher                  score                 source 
##            "character"            "character"            "character" 
##        reference.count       references.count is.referenced.by.count 
##            "character"            "character"            "character" 
##                subject                  title                   type 
##            "character"            "character"            "character" 
##          update.policy                    url                 volume 
##            "character"            "character"            "character" 
##               language  short.container.title                 author 
##            "character"            "character"                 "list" 
##                 funder                   link         content_domain 
##                 "list"                 "list"                 "list" 
##                license              reference 
##                 "list"                 "list"

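Our plos_article data frame has a nested list for author. We can unnest this column using unnest() from the tidyr package. In tidyr versions before 1.0.0, the .drop = TRUE argument dropped any other list columns; it is deprecated in later versions, so here we first use select() to keep only the non-list columns plus author.

# keep the non-list columns plus author, then unnest author
plos_article %>%
    dplyr::select(!where(is.list), author) %>%
    tidyr::unnest(author)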

We can see this has added 5 rows and 3 new columns: given (first name), family (last name), and sequence (order in which they appeared).

See https://ciakovx.github.io/rcrossref.html#unnesting_list_columns for more detailed strategies in unnesting nested lists in Crossref. For more details, call ?unnest and read the R for Data Science section on Unnesting.
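To see what unnesting does to the shape of the data without calling the API, here is a minimal sketch on a made-up record (toy_article and its author names are hypothetical). It shows one article row turning into one row per author, with the article-level columns repeated:

```r
library(dplyr)
library(tidyr)

# hypothetical article record with a nested author list column
toy_article <- tibble::tibble(
    doi = "10.0000/example.1",
    title = "An example article",
    author = list(tibble::tibble(
        given = c("Ada", "Grace", "Mary"),
        family = c("Lovelace", "Hopper", "Somerville"),
        sequence = c("first", "additional", "additional")
    ))
)

# one row per article before unnesting
nrow(toy_article)

# one row per author after unnesting; doi and title are repeated on every row
toy_article_authors <- toy_article %>%
    tidyr::unnest(author)

toy_article_authors
```

Because the article-level values are duplicated on each author row, counts and sums over an unnested data frame should be done with care.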


TRY IT YOURSELF

  1. Refer back to the type of each column above. Try unnesting one of the other list columns. Scroll all the way to the right. What new columns have appeared? Have any rows been duplicated?

Getting more than 1000 results with the cursor argument to cr_journals

If a query will return more than 1,000 results, we have to use the cursor argument. Here we will look at the journal Philosophical Transactions, the longest running scientific journal in the world. We first run cr_journals with works set to FALSE in order to get the journal details, specifically to find out how many articles from this journal are indexed in Crossref.

philo_issn <- "2053-9223"
philo_details <- rcrossref::cr_journals(philo_issn, works = FALSE) %>%
    pluck("data")
philo_details$total_dois
## [1] 8536

Because there are 8,536 articles, we need to pass the cursor argument to cr_journals. This technique is called “deep paging.” As described by Paul Oldham:

“the CrossRef API also permits deep paging (fetching results from multiple pages). We can retrieve all the results by setting the cursor to the wildcard * and the cursor_max to the total results. Note that the cursor and the cursor_max arguments must appear together for this to work.”

See more in the API documentation.

We set cursor to * and cursor_max to the maximum number of results to return, which we take from the total_dois value retrieved above.

philo_articles <- rcrossref::cr_journals(philo_issn, works = TRUE, cursor = "*",
    cursor_max = philo_details$total_dois) %>%
    pluck("data")

Running general queries on cr_works()

You can also use cr_works() to run a query based on very simple text keywords. For example, you can run oa_works <- rcrossref::cr_works(query = "open+access"). Paul Oldham gives a great example of this, but does make the comment:

“CrossRef is not a text based search engine and the ability to conduct text based searches is presently crude. Furthermore, we can only search a very limited number of fields and this will inevitably result in lower returns than commercial databases such as Web of Science (where abstracts and author keywords are available).”

Unfortunately there is no boolean AND for Crossref queries (see https://github.com/CrossRef/rest-api-doc/issues/135 and https://twitter.com/CrossrefSupport/status/1073601263659610113). However, as discussed above, the Crossref API assigns a score to each item returned giving a measure of the API’s confidence in the match, and if you connect words using + the Crossref API will give items with those terms a higher score.

Specifying field queries to cr_works() with flq

As with cr_journals, you can use flq to pass field queries on to cr_works(), such as author.

Here we search for the book Open Access by Peter Suber by doing a general keyword search for “open access” and an author search for “suber”:

# do a general query for the term open access and a field query to return
# results where the author name includes Suber
suber_oa <- cr_works(query = "open+access", flq = c(query.author = "suber")) %>%
    pluck("data")

# print the data frame with select columns
suber_oa %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

Dr. Suber has written many works that include the term “open access.” We can use the filter() function from dplyr to look only at books, from the type column:

# use filter() from dplyr to filter that result to include only books
suber_oa_books <- suber_oa %>%
    filter(type == "book")

# print the data frame with select columns
suber_oa_books %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

One is the book from MIT Press that we’re looking for; the other is Knowledge Unbound, which is a collection of his writings.

We could be more specific from the outset by adding bibliographic information in query.bibliographic, such as ISBN (or ISSN, if it’s a journal):

# run a different cr_works() query with author set to Suber and his book's ISBN
# passed to query.bibliographic
suber_isbn <- cr_works(flq = c(query.author = "suber", query.bibliographic = "9780262301732")) %>%
    pluck("data")

# print the data frame with select columns
suber_isbn %>%
    dplyr::select(title, doi, issued, url, publisher, type, author)

You can combine the filter argument with flq to return only items of type book published in 2012.
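As a minimal sketch of that combination, the code below builds a filter vector and wraps the call in a helper so that nothing is requested from the API until the helper is called. The helper name get_suber_books_2012 and the variable book_2012_filter are hypothetical; the filter names type, from_pub_date, and until_pub_date are assumed to be among those listed by rcrossref::filter_names(), so check that list if the call is rejected.

```r
# filter names use rcrossref's underscore spelling (see rcrossref::filter_names())
book_2012_filter <- c(type = "book", from_pub_date = "2012-01-01",
    until_pub_date = "2012-12-31")

# hypothetical helper: wraps the API call so nothing is requested until it is called
get_suber_books_2012 <- function() {
    result <- rcrossref::cr_works(query = "open+access",
        flq = c(query.author = "suber"),
        filter = book_2012_filter)
    # return only the data frame of works
    result[["data"]]
}
```

Calling suber_books_2012 <- get_suber_books_2012() would then run the query. Putting the restriction in filter means the narrowing happens on the Crossref side, rather than in R after the results have come back.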

Getting formatted references in a text file

We can use the cr_cn() function from the rcrossref package to get the citations to those articles in text form in the style you specify. We’ll use Chicago style here. The cr_cn() function returns each citation as a list element. We can use the map_chr() and pluck() functions from purrr to collect them into a character vector instead.

# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", "10.1016/j.joi.2016.08.002", "10.1371/journal.pone.0020961",
    "10.3389/fpsyg.2018.01487", "10.1038/d41586-018-00104-7", "10.12688/f1000research.8460.2",
    "10.7551/mitpress/9286.001.0001")

# Use cr_cn to get back citations formatted in Chicago for those DOIs
my_citations <- rcrossref::cr_cn(my_dois, format = "text", style = "chicago-note-bibliography") %>%
    purrr::map_chr(., purrr::pluck, 1)

# print the formatted citations
my_citations
## [1] "Frosio, Giancarlo F. “Open Access Publishing: A Literature Review.” SSRN Electronic Journal (2014). doi:10.2139/ssrn.2697412."                                                                                                                                                              
## [2] "Laakso, Mikael, and Bo-Christer Björk. “Hybrid Open access—A Longitudinal Study.” Journal of Informetrics 10, no. 4 (November 2016): 919–932. doi:10.1016/j.joi.2016.08.002."                                                                                                               
## [3] "Laakso, Mikael, Patrik Welling, Helena Bukvova, Linus Nyman, Bo-Christer Björk, and Turid Hedlund. “The Development of Open Access Journal Publishing from 1993 to 2009.” Edited by Marcelo Hermes-Lima. PLoS ONE 6, no. 6 (June 13, 2011): e20961. doi:10.1371/journal.pone.0020961."      
## [4] "Paulus, Frieder M., Nicole Cruz, and Sören Krach. “The Impact Factor Fallacy.” Frontiers in Psychology 9 (August 20, 2018). doi:10.3389/fpsyg.2018.01487."                                                                                                                                  
## [5] "Shotton, David. “Funders Should Mandate Open Citations.” Nature 553, no. 7687 (January 2018): 129–129. doi:10.1038/d41586-018-00104-7."                                                                                                                                                     
## [6] "Tennant, Jonathan P., François Waldner, Damien C. Jacques, Paola Masuzzo, Lauren B. Collister, and Chris. H. J. Hartgerink. “The Academic, Economic and Societal Impacts of Open Access: An Evidence-Based Review.” F1000Research 5 (June 9, 2016): 632. doi:10.12688/f1000research.8460.2."
## [7] "Suber, Peter. “Open Access” (2012). doi:10.7551/mitpress/9286.001.0001."

Beautifully formatted citations from simply a list of DOIs! You can then write this to a text file using writeLines.

# write the formatted citations to a text file
writeLines(my_citations, "my_citations_text.txt")

The above is helpful if you need to paste the references somewhere, and there are loads of other citation styles included in rcrossref–view them by calling rcrossref::get_styles() and it will print a vector of these styles to your console. I’ll just print the first 15 below:

# look at the first 15 styles Crossref offers
rcrossref::get_styles()[1:15]
##  [1] "academy-of-management-review"                                
##  [2] "accident-analysis-and-prevention"                            
##  [3] "aci-materials-journal"                                       
##  [4] "acm-sig-proceedings-long-author-list"                        
##  [5] "acm-sig-proceedings"                                         
##  [6] "acm-sigchi-proceedings-extended-abstract-format"             
##  [7] "acm-sigchi-proceedings"                                      
##  [8] "acm-siggraph"                                                
##  [9] "acme-an-international-journal-for-critical-geographies"      
## [10] "acta-amazonica"                                              
## [11] "acta-anaesthesiologica-scandinavica"                         
## [12] "acta-anaesthesiologica-taiwanica"                            
## [13] "acta-botanica-croatica"                                      
## [14] "acta-chiropterologica"                                       
## [15] "acta-chirurgiae-orthopaedicae-et-traumatologiae-cechoslovaca"

Getting formatted references in a BibTeX or RIS file

In addition to a text file, you can also write it to BibTeX or RIS:

# Use cr_cn() to get BibTeX files for my DOIs
my_citations_bibtex <- rcrossref::cr_cn(my_dois, format = "bibtex") %>%
    purrr::map_chr(., purrr::pluck, 1)

Write it to a .bib file using writeLines():

# write to bibtex file
writeLines(my_citations_bibtex, "my_citations_bibtex.bib")
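
The RIS case follows the same pattern with format = "ris". The sketch below wraps it in a hypothetical helper, write_ris_citations, so that no API call is made until you run it. It assumes cr_cn() returns either a list or a character vector of RIS records, and flattens the result either way:

```r
# hypothetical helper: fetch RIS records for a vector of DOIs and save them to a file
write_ris_citations <- function(dois, path) {
    ris_records <- rcrossref::cr_cn(dois, format = "ris")
    # cr_cn() may return a list or a character vector; flatten to character
    ris_records <- as.character(unlist(ris_records))
    writeLines(ris_records, path)
    # return the records invisibly so they can be inspected if needed
    invisible(ris_records)
}
```

For example, write_ris_citations(my_dois, "my_citations_ris.ris") would write one file that a reference manager such as Zotero can import.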

Getting works from a typed citation in a Word document/text file

This can be helpful if you have a bibliography in a Word document or text file that you want to get into a reference management tool like Zotero. For instance, you may have written the citations in APA style and need to change to Chicago, but don’t want to rekey it all out. Or perhaps you jotted down your citations hastily and left out volume, issue, or page numbers, and you need a nice, fully-formatted citation.

If each citation is on its own line in your document’s bibliography, then you can probably paste the whole bibliography into an Excel spreadsheet. If it goes as planned, each citation will be in its own cell. You can then save it to a CSV file, which can then be read into R.

# read in a CSV file of citations
my_references <- read.csv("data/references.txt", stringsAsFactors = FALSE)

# print the file
my_references

As you can see, these are just raw citations, not divided into variables by their metadata elements (that is, with title in one column, author in another, etc.). But, we can now run a query to get precisely that from Crossref using cr_works. Because cr_works is not vectorized, we will need to build a loop using map() from the purrr package.

Don’t mind the technical details. The code takes each row and looks it up in the Crossref search engine, which is the equivalent of copying and pasting the whole reference into the Crossref search box. The loop will print() the citation before searching for it so we can keep track of where it is. We set the limit to 5 because if Crossref didn’t find it in the first 5 results, it’s not likely to be there at all.

# loop through the references column, using cr_works() to look the item up and
# return the top 5 hits
my_references_works_list <- purrr::map(my_references$reference, function(x) {
    print(x)
    my_works <- rcrossref::cr_works(query = x, limit = 5) %>%
        purrr::pluck("data")
})
## [1] "Frosio, G. F. (2014). Open Access Publishing: A Literature Review. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2697412"
## [1] "Laakso, M., & Bjork, B.-C. (2016). Hybrid open access: A longitudinal study. Journal of Informetrics, 10(4), 919-932. https://doi.org/10.1016/j.joi.2016.08.002"
## [1] "Laakso, M., Welling, P., Bukvova, H., Nyman, L., Bjork, B.-C., & Hedlund, T. (2011). The Development of Open Access Journal Publishing from 1993 to 2009. PLoS ONE, 6(6), e20961. https://doi.org/10.1371/journal.pone.0020961"
## [1] "Paulus, F. M., Cruz, N., & Krach, S. (2018). The Impact Factor Fallacy. Frontiers in Psychology, 9. https://doi.org/10.3389/fpsyg.2018.01487"
## [1] "Science, Digital; Hook, Daniel; Hahnel, Mark; Calvert, Ian (2019): The Ascent of Open Access. figshare. Journal contribution. https://doi.org/10.6084/m9.figshare.7618751"
## [1] "Shotton, D. (2018). Funders should mandate open citations. Nature, 553(7687), 129-129. https://doi.org/10.1038/d41586-018-00104-7"
## [1] "Suber, P. (2012). Open access. Cambridge, Mass: MIT Press."
## [1] "Tennant, J. P., Waldner, F., Jacques, D. C., Masuzzo, P., Collister, L. B., & Hartgerink, C. H. J. (2016). The academic, economic and societal impacts of Open Access: an evidence-based review. F1000Research, 5, 632. https://doi.org/10.12688/f1000research.8460.2"

The Crossref API assigns a score to each item returned within each query, giving a measure of the API’s confidence in the match. The item with the highest score is returned first in the datasets. We can return the first result in each item in the my_references_works_list by using map_dfr(), which is like map() except it returns the results into a data frame rather than a list:

# for each reference looked up, get back the first result
my_references_works_df <- my_references_works_list %>%
    purrr::map_dfr(., function(x) {
        x[1, ]
    })

# print the data frame with select columns
my_references_works_df %>%
    dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count,
        type, issn)

We can print just the titles to quickly see how well they match with the titles of the works we requested:

# print the title column
my_references_works_df$title
## [1] "Open Access Publishing: A Literature Review"                                                                                                                                       
## [2] "Hybrid open access—A longitudinal study"                                                                                                                                           
## [3] "The Development of Open Access Journal Publishing from 1993 to 2009"                                                                                                               
## [4] "The Impact Factor Fallacy"                                                                                                                                                         
## [5] "Table S1: Digital copies of visualizations, alignments, and phylogenetic trees deposited at Figshare (also accessible under Figshare Collection doi:10.6084/m9.figshare.c.3521787)"
## [6] "Funders should mandate open citations"                                                                                                                                             
## [7] "Open Access by Peter Suber. MIT Press, Cambridge, MA, U.S.A., 2012. 230 pp. ISBN: 978-0- 262-51763-8"                                                                              
## [8] "The academic, economic and societal impacts of Open Access: an evidence-based review"

Not bad! Not bad! Looks like we got 6 out of 8, with problems on number 5 and 7.
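
With a longer bibliography, checking titles by eye gets tedious. One rough way to flag suspicious rows is to test whether each returned title occurs inside the citation that was searched for. This is only a sketch in base R on made-up inputs (requested, returned_title, and the second title are hypothetical), and an exact-substring test is a deliberately crude assumption: punctuation differences will raise false alarms, so FALSE means "check by hand", not "wrong".

```r
# hypothetical inputs: the citation strings we searched for and the top title returned
requested <- c(
    "Paulus, F. M., Cruz, N., & Krach, S. (2018). The Impact Factor Fallacy. Frontiers in Psychology, 9.",
    "Suber, P. (2012). Open access. Cambridge, Mass: MIT Press."
)
returned_title <- c(
    "The Impact Factor Fallacy",
    "A completely unrelated dataset title"
)

# TRUE when the returned title occurs inside the requested citation, ignoring case
title_found <- mapply(function(title, citation) {
    grepl(tolower(title), tolower(citation), fixed = TRUE)
}, returned_title, requested, USE.NAMES = FALSE)

# rows flagged FALSE are the ones worth checking by hand
data.frame(returned_title, title_found)
```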

Let’s deal with 5 first. This was the lookup for “The Ascent of Open Access”, a report by Digital Science posted to figshare, and the correct record didn’t come back. Even though this report does have a DOI (https://doi.org/10.6084/m9.figshare.7618751.v2) assigned via figshare, the cr_works() function searches only Crossref DOIs, and figshare DOIs are registered with DataCite. We should still check whether it came back in any of the 5 items we pulled. We do this by calling pluck() on the titles of the fifth item in the list:

my_references_works_list %>%
    purrr::pluck(5, "title")
## [1] "Table S1: Digital copies of visualizations, alignments, and phylogenetic trees deposited at Figshare (also accessible under Figshare Collection doi:10.6084/m9.figshare.c.3521787)"                                          
## [2] "Figure 4 from: Miko I, Ernst A, Deans A (2013) Morphology and function of the ovipositor mechanism in Ceraphronoidea (Hymenoptera, Apocrita). Journal of Hymenoptera Research 33: 25-61. https://doi.org/10.3897/jhr.33.5204"
## [3] "Figure 1 from: Miko I, Ernst A, Deans A (2013) Morphology and function of the ovipositor mechanism in Ceraphronoidea (Hymenoptera, Apocrita). Journal of Hymenoptera Research 33: 25-61. https://doi.org/10.3897/jhr.33.5204"
## [4] "Figure 6 from: Miko I, Popovici O, Seltmann K, Deans A (2014) The maxillo-labial complex of Sparasion (Hymenoptera, Platygastroidea). Journal of Hymenoptera Research 37: 77-111. https://doi.org/10.3897/jhr.37.5206"       
## [5] "Figure 3 from: Miko I, Ernst A, Deans A (2013) Morphology and function of the ovipositor mechanism in Ceraphronoidea (Hymenoptera, Apocrita). Journal of Hymenoptera Research 33: 25-61. https://doi.org/10.3897/jhr.33.5204"

Nope, unfortunately none of these are “The Ascent of Open Access”, so we’re out of luck. We can just throw this row out entirely using slice() from dplyr. We’ll overwrite our existing my_references_works_df because we have no future use for it in this R session.

my_references_works_df <- my_references_works_df %>%
    dplyr::slice(-5)

For row 7, it’s giving us the full citation for Peter Suber’s book when we asked for the title only, so something is fishy.

When we look at it more closely (we can call View(my_references_works_df)), we see the author of this item is not Peter Suber, but Rob Harle, and, checking the type column, it’s a journal article, not a book. This is a book review published in the journal Leonardo, not Peter Suber’s book. So let’s go back to my_references_works_list and pull data from all 5 items that came back with the API call and see if Suber’s book is in there somewhere:

suber <- my_references_works_list %>%
    purrr::pluck(7)
suber

It looks like it is the second item: the author is Peter Suber, the publisher is MIT Press, the type is book, and the ISBN is “9780262301732”.

We do the following to correct it:

  • use filter() with the isbn to assign the correct row from suber to a variable called suber_correct
  • remove the incorrect row with slice() (double checking that it is now the 6th row, since we already removed row 5)
  • use bind_rows() to add the correct one to our my_references_works_df data frame. We can just overwrite the existing my_references_works_df again
suber_correct <- suber %>%
    dplyr::filter(isbn == "9780262301732")
my_references_works_df <- my_references_works_df %>%
    dplyr::slice(-6) %>%
    bind_rows(suber_correct)

Writing publications to disk

We will use the write_csv() function from the readr package to write our data to disk. This package was loaded when you called library(tidyverse) at the beginning of the session.

First we’re going to create a vector that holds the file path to the folder where you want to save the file:

my_filepath <- "C:/EXAMPLE/crossref/data"

Unfortunately, you cannot simply write the jah_jama_publications data frame to a CSV, due to the nested lists. It will throw an error: "Error in stream_delim_(df, path, ...) : Don't know how to handle vector of type list."

You have a few choices here:

  1. You can unnest one of the columns and drop the remaining list columns. This will add a row for each value in the nested list. In tidyr versions before 1.0.0 this was done with .drop = TRUE, which is now deprecated, so here we remove link with select() before unnesting.
jah_jama_publications_auth <- jah_jama_publications %>%
    dplyr::filter(!purrr::map_lgl(author, is.null)) %>%
    dplyr::select(-link) %>%
    tidyr::unnest(author) %>%
    dplyr::bind_rows(jah_jama_publications %>%
        dplyr::filter(purrr::map_lgl(author, is.null)) %>%
        dplyr::select(-author, -link))
readr::write_csv(jah_jama_publications_auth, file.path(my_filepath, "jah_jama_publications_auth.csv"))
  2. You can drop the nested lists altogether using a combination of select_if() from dplyr and negate() from purrr to drop all lists in the data frame. This only works if you don’t need author or link.
jah_jama_publications_short <- jah_jama_publications %>%
    dplyr::select_if(purrr::negate(is.list))
readr::write_csv(jah_jama_publications_short, file.path(my_filepath, "jah_jama_publications_short.csv"))

We go from 24 to 22 variables because it dropped author and link.

  3. You can use mutate() from dplyr to coerce the list columns into character vectors:
jah_jama_publications_mutated <- jah_jama_publications %>%
    dplyr::mutate(author = as.character(author)) %>%
    dplyr::mutate(link = as.character(link))
readr::write_csv(jah_jama_publications_mutated, file.path(my_filepath, "jah_jama_publications_mutated.csv"))
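
A variation on the third option is to collapse each nested author table into a single readable string before writing, which tends to be easier to work with in a spreadsheet than the output of as.character(). This is a minimal sketch on made-up data (toy_publications and its author names are hypothetical); real Crossref records can have missing authors or missing given names, which this sketch does not handle.

```r
library(dplyr)
library(purrr)
library(readr)

# hypothetical data frame with a nested author list column
toy_publications <- tibble::tibble(
    doi = c("10.0000/example.1", "10.0000/example.2"),
    author = list(
        tibble::tibble(given = c("Ada", "Grace"), family = c("Lovelace", "Hopper")),
        tibble::tibble(given = "Mary", family = "Somerville")
    )
)

# collapse each nested author table into one "Given Family; Given Family" string
toy_publications_flat <- toy_publications %>%
    dplyr::mutate(author = purrr::map_chr(author, function(a) {
        paste(a$given, a$family, collapse = "; ")
    }))

# the author column is now character, so write_csv() accepts it
csv_path <- tempfile(fileext = ".csv")
readr::write_csv(toy_publications_flat, csv_path)

toy_publications_flat$author
```

This keeps one row per article, unlike unnesting, at the cost of packing several authors into one cell.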

Using roadoi to check for open access

roadoi was developed by Najko Jahn, with reviews from Tuija Sonkkila and Ross Mounce. It interfaces with Unpaywall (which used to be called oaDOI), an important tool developed by ImpactStory (Heather Piwowar and Jason Priem) for locating open access versions of scholarship–read more in this Nature article. See here for the roadoi documentation.

This incredible Introduction to roadoi by Najko Jahn provides much of what you need to know to use the tool, as well as an interesting use case. Also see his recently published article Open Access Evidence in Unpaywall, running deep analysis on Unpaywall data.

First install the package and load it.

# install the roadoi package
install.packages("roadoi")
# load the roadoi package
library(roadoi)

Setting up roadoi

Your API calls to Unpaywall must include a valid email address where you can be reached in order to keep the service open and free for everyone.
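
If you would rather not type your address into every call, one option is to keep it in an environment variable and read it back when needed. The sketch below is an illustration under assumptions: the variable name roadoi_email is a choice made here, "name@example.com" is a placeholder, and fetch_oa is a hypothetical helper that contacts Unpaywall only when it is called.

```r
# store the address once per session; "name@example.com" is a placeholder
Sys.setenv(roadoi_email = "name@example.com")

# read it back wherever an email argument is needed
my_email <- Sys.getenv("roadoi_email")

# hypothetical helper: only contacts Unpaywall when it is called
fetch_oa <- function(dois) {
    roadoi::oadoi_fetch(dois = dois, email = my_email)
}
```

Keeping the address out of the script itself also makes it easier to share your code without sharing your email.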

Checking OA status with oadoi_fetch

We then create a DOI vector and use the oadoi_fetch() function from roadoi.

Be sure to replace the email below with your own:

# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", "10.1016/j.joi.2016.08.002", "10.1371/journal.pone.0020961",
    "10.3389/fpsyg.2018.01487", "10.1038/d41586-018-00104-7", "10.12688/f1000research.8460.2",
    "10.7551/mitpress/9286.001.0001")

# use oadoi_fetch() to get Unpaywall data on those DOIs
my_dois_oa <- roadoi::oadoi_fetch(dois = my_dois, email = "clarke.iakovakis@okstate.edu")

Look at the column names.

# print column names
names(my_dois_oa)
##  [1] "doi"                    "best_oa_location"       "oa_locations"          
##  [4] "oa_locations_embargoed" "data_standard"          "is_oa"                 
##  [7] "is_paratext"            "genre"                  "oa_status"             
## [10] "has_repository_copy"    "journal_is_oa"          "journal_is_in_doaj"    
## [13] "journal_issns"          "journal_issn_l"         "journal_name"          
## [16] "publisher"              "published_date"         "year"                  
## [19] "title"                  "updated_resource"       "authors"

The returned variables are described on the Unpaywall Data Format page.

We can see that Unpaywall could not find OA versions for two of the seven of these, so we will filter them out:

# use filter() to overwrite the data frame and keep only items that are
# available OA
my_dois_oa <- my_dois_oa %>%
    dplyr::filter(is_oa == TRUE)

As above, it is easier to use unnest() to more closely view one of the variables:

# print the data frame with best open access location unnested
my_dois_oa %>%
    tidyr::unnest(best_oa_location, names_repair = "universal")

Next steps

There are several other excellent R packages that interface with publication metadata that can be used in conjunction with this package. Examples:

  • bibliometrix “is an open-source tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis.” See more information at https://bibliometrix.org/.
  • rromeo is a wrapper for the SHERPA-RoMEO API. You can retrieve a set of publications metadata from rcrossref, then use the ISSN to look up the policies of the journal regarding the archival of preprints, postprints, and publisher versions. https://cran.r-project.org/web/packages/rromeo/rromeo.pdf
  • crminer “includes functions for getting getting links to full text of articles, fetching full text articles from those links or Digital Object Identifiers (‘DOIs’), and text extraction from ‘PDFs’.” https://cran.r-project.org/web/packages/crminer/crminer.pdf