See if you can add another argument to to_z() to allow the user to specify if NA values should be removed or not. Give it a default value.
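One possible answer, as a minimal sketch. It assumes to_z() standardises a vector by subtracting the mean and dividing by the standard deviation; the argument name na.rm and its default of TRUE are choices made here for illustration.

```r
# Assumed definition of to_z(): (x - mean) / sd
# na.rm is passed straight through to mean() and sd()
to_z <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

to_z(c(1, 2, 3, NA))                 # NA is ignored when computing mean and sd
to_z(c(1, 2, 3, NA), na.rm = FALSE)  # every value becomes NA
```

Defaulting to TRUE means a single missing value does not turn the whole result into NA, while the user can still ask for the stricter behaviour.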
Your turn
Create a function called percent_diff that calculates the percent difference between two (vectors of) values:
\[\%\,\text{diff} = \frac{|a - b|}{(a + b)/2} \times 100\]
Hint: you can use the abs() function to calculate the absolute value.
Your turn
Create a function called is_big() that takes a numeric vector and outputs a logical vector indicating whether each value is larger than a specified quantile threshold.
The function should have two arguments: the numeric vector and the quantile threshold (default to 0.75).
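One way this could look, as a sketch rather than the only answer. The argument names x and threshold are illustrative, and removing NA values before computing the quantile is an assumption.

```r
# Flag values above a chosen quantile of the vector itself
is_big <- function(x, threshold = 0.75) {
  # quantile() returns a named value; unname() keeps the output tidy
  cutoff <- unname(quantile(x, probs = threshold, na.rm = TRUE))
  x > cutoff
}

is_big(c(1, 2, 3, 4, 100))        # uses the default 0.75 quantile
is_big(c(1, 2, 3, 4, 100), 0.5)   # anything above the median
```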
Types of function
vector functions: one or more vectors as input, one vector as output
  i. ✔️ output same length as input
  ii. ➡️ summary functions: input is vector, output is a single value
data frame functions: df as input and df as output
Summary functions
input is vector
output is a single value
could be used in summarise()
Example
Write a function to compute the standard error of a sample.
Note: sum(TRUE) = 1 and sum(FALSE) = 0. Thus, sum(!is.na(x)) gives you the number of TRUE values (i.e., the number of non-NA values) and is a bit shorter than length(x[!is.na(x)]).
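A minimal sketch of such a function, using the standard error defined as the standard deviation divided by the square root of the number of non-NA values. The name se is an illustrative choice.

```r
# Standard error of a sample, ignoring missing values
se <- function(x) {
  n <- sum(!is.na(x))          # number of non-NA values
  sd(x, na.rm = TRUE) / sqrt(n)
}

se(c(1, 2, 3, NA))
```

Because it returns a single value, a function like this can be used inside summarise().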
Error in `summarise()`:
ℹ In argument: `mean = mean(column, na.rm = TRUE)`.
Caused by error:
! object 'bill_length_mm' not found
😕
Tidy evaluation
tidyverse functions like dplyr::summarise() use “tidy evaluation” so you can refer to the names of variables inside dataframes. For example, you can use:
Either
penguins |> summarise(mean = mean(bill_depth_mm))
Or
summarise(penguins, mean = mean(bill_depth_mm))
Tidy evaluation
This is instead of having to use the full dataframe name with $, e.g.
summarise(penguins, mean = mean(penguins$bill_depth_mm))
This is known as data-masking: the dataframe environment masks the user environment by giving priority to the dataframe.
Because of data-masking, summarise() in my_summary() is looking for a column literally called column in the dataframe that has been passed in. It is not looking inside the variable column for the name of the column you want to give it.
{{ column }} tells summarise() to look inside the column variable to get the column name
style with spaces: {{ column }} rather than {{column}}
.groups = "drop" to avoid message and leave the data in an ungrouped state
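The notes above can be put together as follows. This is a sketch: the function body is an assumption chosen to produce mean, n, sd and se columns, and the toy data is made up for illustration (it is not the penguins dataset).

```r
library(dplyr)

# Assumed body for my_summary(); {{ column }} embraces the argument
my_summary <- function(df, column) {
  df |>
    summarise(
      mean = mean({{ column }}, na.rm = TRUE),
      n = sum(!is.na({{ column }})),
      sd = sd({{ column }}, na.rm = TRUE),
      se = sd / sqrt(n),      # uses the sd and n columns created just above
      .groups = "drop"
    )
}

# Toy data, made up for illustration
toy <- tibble(bill = c(40, 42, 44, NA))
my_summary(toy, bill)
```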
Use function
my_summary(penguins, bill_length_mm)
# A tibble: 1 × 4
mean n sd se
<dbl> <int> <dbl> <dbl>
1 43.9 342 5.46 0.295
🎉
When to embrace?
When tidy evaluation is used
Your turn
Write a new summary function which calculates the median, maximum and minimum values of a variable in a dataset. Incorporate an argument to allow the summary to be performed grouped by another variable.
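One possible answer, as a sketch. The argument names summary_var and group_var are illustrative, and the toy data is made up.

```r
library(dplyr)

# Both the summarised variable and the grouping variable are embraced
my_summary <- function(df, summary_var, group_var) {
  df |>
    group_by({{ group_var }}) |>
    summarise(
      median = median({{ summary_var }}, na.rm = TRUE),
      minimum = min({{ summary_var }}, na.rm = TRUE),
      maximum = max({{ summary_var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy data, made up for illustration
toy <- tibble(site = c("a", "a", "b", "b"), len = c(1, 3, 10, NA))
my_summary(toy, len, site)
```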
Your turn
Try it out
my_summary(penguins, bill_length_mm, species)
# A tibble: 3 × 4
species median minimum maximum
<fct> <dbl> <dbl> <dbl>
1 Adelie 38.8 32.1 46
2 Chinstrap 49.6 40.9 58
3 Gentoo 47.3 40.9 59.6
Your turn
Improvement: Have a default of NULL for the grouping variable. Why?
Your turn
Try it out
my_summary(penguins, bill_length_mm)
# A tibble: 1 × 3
median minimum maximum
<dbl> <dbl> <dbl>
1 44.4 32.1 59.6
# A tibble: 5 × 5
species island median minimum maximum
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Biscoe 38.7 34.5 45.6
2 Adelie Dream 38.6 32.1 44.1
3 Adelie Torgersen 38.9 33.5 46
4 Chinstrap Dream 49.6 40.9 58
5 Gentoo Biscoe 47.3 40.9 59.6
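A sketch of how optional grouping by one or several variables might be written. It assumes dplyr 1.1.0 or later for pick(); the argument names and the toy data are illustrative. With a default of NULL nothing is selected for grouping, so the same function works with or without a grouping variable.

```r
library(dplyr)

# group_var = NULL means "no grouping" unless the caller supplies one
my_summary <- function(df, summary_var, group_var = NULL) {
  df |>
    group_by(pick({{ group_var }})) |>   # pick() allows c(var1, var2)
    summarise(
      median = median({{ summary_var }}, na.rm = TRUE),
      minimum = min({{ summary_var }}, na.rm = TRUE),
      maximum = max({{ summary_var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy data, made up for illustration
toy <- tibble(
  site = c("a", "a", "b", "b"),
  year = c(1, 2, 1, 1),
  len = c(1, 3, 10, NA)
)

my_summary(toy, len)                  # no grouping: one row
my_summary(toy, len, site)            # one grouping variable
my_summary(toy, len, c(site, year))   # two grouping variables
```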
Functions should be “pure”
Should not depend on external state (e.g., global variables)
Should not have side effects (e.g., modifying global variables, printing to console, plotting, etc.)
Given the same inputs, should always return the same outputs
Pure
add_one <- function(x) {
  x + 1
}
Impure
y <- 10
add_y <- function(x) {
  x + y
}
add_to_global <- function(x) {
  y <<- x + 1
}
Functions - data validation and error handling
It’s good practice to include data validation and error handling in your functions to ensure they behave as expected when given incorrect or unexpected inputs.
Use if statements to check that inputs meet certain criteria or if specific conditions are met.
Use stop() to throw an Error if issues are serious enough that the function should not proceed.
Use warning() to issue a Warning for non-critical issues.
Errors
if (some_condition_not_met) {
  stop("Descriptive error message.")
  # Function exits and does not return a result
}
Warnings
if (some_non_critical_condition) {
  warning("Descriptive warning message.")
  # Function continues executing and returns a result
}
Warnings are often used later in the execution if the result to be returned is cause for concern (e.g., the result is all NA). It is a signal to the user to check the results carefully.
Example
If a function expects two numeric vectors of the same length, you should check and stop the function if they are not.
percent_diff <- function(x, y) {
  if (length(x) != length(y)) {
    stop("Input vectors must be of the same length.")
  }
  (abs(x - y) / ((x + y) / 2)) * 100
}
a <- c(1, 2, 3)
b <- c(4, 5)
percent_diff(a, b)
Error in percent_diff(a, b): Input vectors must be of the same length.
Your turn
Add a check to my_summary() to ensure that the summary_var is numeric. If not, throw an error.
Add a warning to to_z() if the input vector has fewer than 3 non-NA values, since the standard deviation may not be meaningful in that case.
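Possible answers to both tasks, as a sketch. The definitions of to_z() and my_summary() used here are simplified assumptions, the message wording is illustrative, and the toy data is made up.

```r
library(dplyr)

# Warning: the function still returns a result
to_z <- function(x, na.rm = TRUE) {
  if (sum(!is.na(x)) < 3) {
    warning("Fewer than 3 non-NA values: the standard deviation may not be meaningful.")
  }
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Error: the function stops before doing any summarising
my_summary <- function(df, summary_var) {
  # pull() extracts the embraced column as a plain vector so it can be checked
  values <- pull(df, {{ summary_var }})
  if (!is.numeric(values)) {
    stop("summary_var must be a numeric column.")
  }
  df |>
    summarise(
      mean = mean({{ summary_var }}, na.rm = TRUE),
      n = sum(!is.na({{ summary_var }}))
    )
}

# Toy data, made up for illustration
toy <- tibble(site = c("a", "a", "b"), len = c(1, 3, 10))
to_z(toy$len)          # no warning: three non-NA values
my_summary(toy, len)   # works: len is numeric
# my_summary(toy, site) would stop with the error message above
```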
Practical use of functions in a project
Functions can be stored in separate R scripts and sourced into your analysis scripts
This keeps your analysis scripts cleaner and more focused on the analysis logic
A common convention is to keep your functions in a dedicated folder called R/
Then at the top of your scripts you can use source("R/your_function.R") to load them
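If the R/ folder holds several files, they can all be loaded in one go. This is a sketch that assumes the analysis script is run from the project root, so that the relative path "R" resolves.

```r
# Find every .R file in the R/ folder (an empty vector if there are none)
function_files <- list.files("R", pattern = "\\.R$", full.names = TRUE)

# Source each file so its functions are available to the analysis script
for (f in function_files) {
  source(f)
}
```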
Documenting functions
Use roxygen2 style comments to document your functions
In Positron - put your cursor on the name of the function and click the 💡 and choose “Generate a Roxygen Template”
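As an illustration, a filled-in roxygen2 comment block for percent_diff() might look like the following. The wording of each field is an assumption; the generated template only provides the skeleton to complete.

```r
#' Percent difference between two vectors
#'
#' @param x A numeric vector.
#' @param y A numeric vector of the same length as `x`.
#'
#' @return A numeric vector of percent differences.
#'
#' @examples
#' percent_diff(c(1, 2, 3), c(4, 5, 6))
percent_diff <- function(x, y) {
  if (length(x) != length(y)) {
    stop("Input vectors must be of the same length.")
  }
  (abs(x - y) / ((x + y) / 2)) * 100
}
```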
Summary ☕
Writing functions can make you more efficient and make your code more readable. This can be just for your benefit.
Vector functions take one or more vectors as input; their output can be a vector (useful in mutate() and filter()) or a single value (useful in summarise()).
Dataframe functions take a dataframe as input and give a dataframe as output
Give arguments a default where possible
We use {{ var }} embracing to manage data masking
We use pick({{ vars }}) to select more than one variable
Include data validation and error handling in your functions