Advanced R Package Development
iris, mtcars, airquality - you have been using package data all along.
Beyond bundling examples, packages can:
Today we cover the first, second, and fourth of these in fishr.
| Location | Created by | Exported? | Use when |
|---|---|---|---|
| data/*.rda | use_data() | Yes | Example data, datasets users work with directly |
| R/sysdata.rda | use_data(..., internal = TRUE) | No | Lookup tables, constants used by functions |
| inst/extdata/ | Manual copy | No | Non-R formats (CSV, GeoJSON, shapefiles) |
| API function | httr2 | Yes | Live data that changes over time |
Keep reproducible scripts in data-raw/ that generate your data objects. These scripts are the ground truth for how your data was created.
usethis::use_data_raw("horsefly_sample") creates data-raw/horsefly_sample.R and adds data-raw/ to .Rbuildignore - the directory is excluded from the built package, but you should still commit the scripts inside it.
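As a rough illustration, a data-raw script for this dataset might look like the sketch below. The column names follow the documented format of horsefly_sample; every value is invented here, and the guard around use_data() is an assumption added so the script is harmless when run outside a package directory.

```r
# data-raw/horsefly_sample.R (sketch; all values are invented for illustration)
horsefly_sample <- data.frame(
  set_number = 1:6,
  catch = c(12L, 7L, 0L, 15L, 9L, 4L),
  effort = c(2L, 1L, 1L, 3L, 2L, 1L)
)

# Sanity checks before saving: integer counts and strictly positive effort
stopifnot(
  is.integer(horsefly_sample$catch),
  all(horsefly_sample$effort > 0L)
)

# Write data/horsefly_sample.rda only when run from a package root
# with usethis installed
if (file.exists("DESCRIPTION") && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data(horsefly_sample, overwrite = TRUE)
}
```

Re-running the script after any change regenerates the .rda file, so the saved object never drifts from the code that produced it.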
Exported datasets need documentation just like functions. Create R/data.R:
#' Rainbow Trout gill net survey data from Horsefly Lake
#'
#' Catch and effort from a fictional gill net survey of Rainbow Trout
#' (*Oncorhynchus mykiss*) at Horsefly Lake, British Columbia.
#'
#' @format A data frame with columns:
#' - `set_number` Integer. Net set identifier.
#' - `catch` Integer. Number of fish caught per set.
#' - `effort` Integer. Number of net-nights per set.
#' @source Fictional data for illustration purposes.
#' @examples
#' cpue(horsefly_sample)
"horsefly_sample"

fetch_kluane_data() is not a dataset - it is a function that returns data.
It lives in R/, is documented with roxygen2, and is tested like any other function.

usethis::use_package("httr2") adds httr2 to Imports in DESCRIPTION.
The Kluane Lake Trout dataset is published via CKAN:
Endpoint: https://open.canada.ca/data/en/api/3/action/datastore_search
Resource ID: af1e5730-34bd-4314-831c-1d940d99e1a7

The response is JSON with a result.records array where each element is one row.
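To make the response shape concrete, here is a small offline sketch that parses a hand-written JSON string with jsonlite. The payload is invented and heavily trimmed (it is not real Kluane data, and a real response carries more fields); it only mimics the result.records nesting described above.

```r
library(jsonlite)

# Invented payload mimicking the datastore_search shape (not real data)
payload <- '{
  "success": true,
  "result": {
    "records": [
      {"Lake": "Kluane", "Year": "2013", "Species": "Lake Trout"},
      {"Lake": "Kluane", "Year": "2014", "Species": "Lake Trout"}
    ]
  }
}'

# simplifyVector = TRUE turns the array of records into a data frame,
# one row per JSON object
parsed <- jsonlite::fromJSON(payload, simplifyVector = TRUE)
records <- parsed$result$records

str(records)
```

Values that arrive as strings, such as Year in this sketch, still need explicit conversion with as.integer() or as.numeric().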
fetch_kluane_data <- function(limit = 2000) {
  req <- httr2::request(
    "https://open.canada.ca/data/en/api/3/action/datastore_search"
  )
  req <- httr2::req_url_query(
    req,
    resource_id = "af1e5730-34bd-4314-831c-1d940d99e1a7",
    limit = limit
  )
  resp <- httr2::req_perform(req)
  httr2::resp_check_status(resp)
  records <- httr2::resp_body_json(resp, simplifyVector = TRUE)$result$records
  records <- records[-1, ] # first row is a bilingual EN/FR header, not data
  data.frame(
    lake = records$Lake,
    year = as.integer(records$Year),
    set_number = as.integer(records[["Set Number"]]),
    species = records$Species,
    fork_length_mm = as.numeric(records[["Fork Length (millimetres)"]]),
    weight_g = as.numeric(records[["Weight (grams)"]]),
    stringsAsFactors = FALSE
  )
}

- request() - create a request object pointing at the base URL
- req_url_query() - append query parameters (?resource_id=...&limit=...)
- req_perform() - execute the HTTP GET and return a response object
- resp_check_status() - raise an informative error on 4xx/5xx responses
- resp_body_json(..., simplifyVector = TRUE) - parse JSON and coerce the array of records directly to a data frame

cpue() is a generic with a data.frame method - passing a data frame with catch and effort columns dispatches automatically. Real-world data flows through the same function without any changes to it.
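The dispatch described here can be sketched with a stripped-down generic. The definitions below are hypothetical stand-ins, not fishr's actual code: the real cpue.numeric() takes more arguments (gear_type, method, verbose), and the argument name x is an assumption.

```r
# Hypothetical, simplified stand-ins for fishr's cpue() generic and methods
cpue <- function(x, ...) UseMethod("cpue")

# Numeric method: catch per unit effort as a plain ratio
cpue.numeric <- function(x, effort, ...) {
  x / effort
}

# Data frame method: pull out the two columns and hand off to the numeric method
cpue.data.frame <- function(x, ...) {
  stopifnot(all(c("catch", "effort") %in% names(x)))
  cpue(x$catch, effort = x$effort, ...)
}

# Invented survey rows for illustration
surveys <- data.frame(catch = c(10, 4), effort = c(2, 2))
result <- cpue(surveys)
print(result)
```

The data frame method does no arithmetic of its own, so any change to the numeric method applies to both entry points.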
What might be a better way to go from fetch_kluane_data to cpue?
Testing functions that make real HTTP requests requires care - tests should not depend on an external server being available.
The vcr and httptest2 packages solve this by recording real HTTP interactions and replaying them during tests, keeping your test suite fast and self-contained.
Some data belongs to the package but should not be visible to users.
Good candidates: lookup tables, constants, calibration coefficients.
Internal data powers package behaviour without appearing in the user’s namespace.
It lives in R/sysdata.rda, not data/.
# data-raw/gear_types.R
gear_types <- data.frame(
  gear_type = c("nordic_gillnet", "sinking_longline", "fyke_net", "electrofishing"),
  gear_factor = c(1.0, 0.72, 1.35, 0.45),
  description = c(
    "Nordic multi-mesh gillnet (standard reference gear)",
    "Sinking baited longline",
    "Passive fyke net trap",
    "Electrofishing unit"
  ),
  stringsAsFactors = FALSE
)
usethis::use_data(gear_types, internal = TRUE, overwrite = TRUE)

Replace the numeric gear_factor argument with a named gear_type:
cpue.numeric <- function(
  catch,
  effort,
  gear_type = "nordic_gillnet",
  method = c("ratio", "log"),
  verbose = getOption("fishr.verbose", FALSE),
  ...
) {
  if (!gear_type %in% gear_types$gear_type) {
    stop(
      "`gear_type` must be one of: ",
      paste(gear_types$gear_type, collapse = ", "),
      call. = FALSE
    )
  }
  gear_factor <- gear_types$gear_factor[gear_types$gear_type == gear_type]
  # ...
}

The error message is generated from the table at runtime:
Add a gear type to the table, re-run the script, and it appears in the error automatically - no manual string editing required.
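A self-contained sketch of this behaviour, outside the package, might look as follows. check_gear_type() is a hypothetical helper that mirrors only the validation step of cpue.numeric(); the rejected value "dip_net" and the appended "example_gear" row, including its factor of 1, are placeholders rather than real calibration values.

```r
# Same lookup table as in data-raw/gear_types.R (descriptions omitted for brevity)
gear_types <- data.frame(
  gear_type = c("nordic_gillnet", "sinking_longline", "fyke_net", "electrofishing"),
  gear_factor = c(1.0, 0.72, 1.35, 0.45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mirrors the validation step inside cpue.numeric()
check_gear_type <- function(gear_type, table = gear_types) {
  if (!gear_type %in% table$gear_type) {
    stop(
      "`gear_type` must be one of: ",
      paste(table$gear_type, collapse = ", "),
      call. = FALSE
    )
  }
  table$gear_factor[table$gear_type == gear_type]
}

# Capture the error message instead of halting the script
msg_before <- tryCatch(check_gear_type("dip_net"), error = conditionMessage)

# Append a made-up row; the factor 1 is a placeholder, not a calibrated value
extended <- rbind(
  gear_types,
  data.frame(gear_type = "example_gear", gear_factor = 1, stringsAsFactors = FALSE)
)
msg_after <- tryCatch(
  check_gear_type("dip_net", table = extended),
  error = conditionMessage
)

print(msg_before)
print(msg_after)
```

The second message lists the appended gear type even though the stop() call was never edited, because the string is assembled from the table each time.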
Goal: extend the gear type table, verify the error updates automatically, fix the tests.
Open data-raw/gear_types.R and add a row for "trap_net" with a gear factor of 1.1
Re-run the script and reload:
"trap_net":
Run check() - changing gear_factor to gear_type broke existing tests. Fix them.