This cheat sheet outlines key R commands and options for statistical analysis, as taught in PHCM9795 Foundations of Biostatistics.
Generative AI statement
This work was developed with the assistance of NotebookLM.
1. RStudio Setup & Basics
- Install R: Download from https://cran.r-project.org/. [Section 1.12]
- Install RStudio: Download from https://posit.co/download/rstudio-desktop/. You must install R before installing RStudio. [Section 1.12]
- RStudio Preferences: Set RStudio not to preserve the workspace between sessions by deselecting “Restore .RData into workspace at startup” and choosing “Save workspace to .RData on exit: Never” (Mac: Edit > Settings, Windows: Tools > Global Options). [Section 1.12]
- Create a Project to simplify file navigation: File > New Project… > New Directory > New Project, then provide a directory name (e.g., PHCM9795) and browse for location. [Section 1.12]
- Start a new R Script: File > New File > R Script (or Ctrl+Shift+N/Command+Shift+N). Write commands here to save your work. [Section 1.12]
- Run a Script: Code > Run Region > Run All, or select all the code and press Ctrl+Enter (Windows) or Command+Enter (Mac). [Section 1.12]
- Comments: Use # to add comments, which R ignores. [Section 1.12]
- Object Assignment: Use <- (e.g., x <- 42) to create objects. [Section 1.12]
- Vectors: Best thought of as columns of data. Combine values of the same type using the combine function: c() (e.g., age <- c(20, 25, 23, 29, 21, 27)). [Section 1.12]
- Data Frames: Essentially a spreadsheet of columns. [Section 1.12]
- Convert a single column to a data frame: Some packages (e.g. jmv) can only work with data frames, not single columns. Convert a single column to a data frame using as.data.frame() (e.g., age_df <- as.data.frame(age)). [Section 3.21]
- Case Sensitivity: R is case-sensitive (e.g., age, Age, AGE are all different names). [Section 1.12]
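The basics above can be tried in a new R script. This is a minimal sketch that reuses the example values from the list; nothing in it needs an extra package.

```r
# A comment: R ignores everything after the # on a line
x <- 42                              # assign the value 42 to the object x

# Combine values of the same type into a vector with c()
age <- c(20, 25, 23, 29, 21, 27)

# Convert the single column to a data frame with one column named "age"
age_df <- as.data.frame(age)
str(age_df)

# R is case-sensitive: the object age exists, but Age has not been created
exists("age")
exists("Age")
```

Running str() on an object is a quick way to check whether it is a vector or a data frame before passing it to a package function.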
2. Packages
- Install a package: Use install.packages("package_name") (e.g., install.packages("jmv")) or Tools > Install Packages. Packages only need to be installed once for your R installation. [Section 1.12]
- Load a package: Use library(package_name) (e.g., library(jmv)). Must be done in each R session before using the package. [Section 1.12]
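One way to keep the install-once, load-every-session distinction straight is to check whether a package is already installed before loading it. The sketch below uses requireNamespace() from base R for that check; the message text is only an illustration, and the script deliberately does not install anything itself.

```r
# Check whether jmv is installed, without loading it or raising an error
jmv_installed <- requireNamespace("jmv", quietly = TRUE)

if (jmv_installed) {
  library(jmv)   # load the package for this R session
} else {
  # Installation is a one-off step, so it is left for the user to run manually
  message('jmv is not installed. Run install.packages("jmv") once, then library(jmv).')
}
```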
3. Data Handling & Manipulation
- Reading in data:
  - R data file (.rds): readRDS("file_path.rds") (e.g., pbc <- readRDS("data/activities/mod01_pbc.rds")). [Section 1.13]
  - Comma Separated Values (.csv): read.csv("file_path.csv") (e.g., sample <- read.csv("data/examples/mod02_weight_1000.csv")). [Section 2.20]
  - Excel file (.xlsx): read_excel("file_path.xlsx") from the readxl package (e.g., survey <- read_excel("data/examples/mod03_health_survey.xlsx")). [Section 3.18]
- View data:
  - summary(dataframe): Quick overview of all variables. [Section 1.13]
  - skim(dataframe) (from the skimr package): Provides detailed summaries and rudimentary histograms. [Section 1.13]
  - head(dataframe): Shows the first 6 rows. [Section 3.19]
  - tail(dataframe): Shows the last 6 rows. [Section 3.19]
  - subset(dataframe, condition): Lists observations meeting a condition (e.g., subset(sample, weight>200)). [Section 1.14]
- Modify data:
  - Create New Variable: dataframe$new_variable <- formula (e.g., survey$bmi <- survey$weight / (survey$height^2)). [Section 3.19]
  - Assign Meaningful Names: Recommended for easy-to-read output. The simplest way is to create a new column: dataframe$newvariablename <- dataframe$oldvariable. Spaces and punctuation can be incorporated by surrounding the new variable name in quotation marks: dataframe$"new variable name (units)" <- dataframe$oldvariable.
  - Convert to Factor (Categorical Variable): dataframe$variable <- factor(dataframe$variable, levels=c(values), labels=c("labels")) (e.g., pbc$sex <- factor(pbc$sex, levels = c(1, 2), labels = c("Male", "Female"))). [Module 2, R notes]
  - Re-order Factor Levels: relevel(dataframe$factor_variable, ref="first_level") (e.g., drug$group <- relevel(drug$group, ref="Active")). Essential for correct calculation of measures of effect like RR or OR. [Section 6.13]
  - Recode Continuous to Categorical: cut(dataframe$continuous_var, breaks=c(limits), labels=c("labels")) (e.g., pbc$agegroup <- cut(pbc$age, breaks = c(0, 30, 50, 70, 100), right = FALSE, labels = c("Less than 30", "30 to less than 50", "50 to less than 70", "70 or more"))). [Section 2.21]
  - Set value to missing (NA): dataframe$variable = ifelse(condition, NA, dataframe$variable) (e.g., sample$weight_clean = ifelse(sample$weight==700.2, NA, sample$weight)). [Section 1.14]
  - Log transformation: dataframe$new_var <- log(dataframe$old_var) (e.g., hospital$ln_los <- log(hospital$los+1)). [Section 9.12]
  - Back-transform from log: exp(log_transformed_value) (e.g., exp(3.407232)). [Section 9.12]
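The reading, viewing and modifying steps above can be chained together as in the sketch below. The data frame is invented purely for illustration (it is not one of the course datasets), and the .csv file is written to a temporary location only so that read.csv() has something to read.

```r
# Hypothetical data, invented for illustration only
survey <- data.frame(
  id     = 1:6,
  sex    = c(1, 2, 2, 1, 2, 1),
  age    = c(24, 35, 52, 68, 71, 45),
  weight = c(70.5, 82.0, 700.2, 64.3, 90.1, 77.8),  # 700.2 is an implausible entry
  height = c(1.75, 1.68, 1.80, 1.60, 1.72, 1.78)
)

# Write to a temporary .csv file, then read it back in with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)
survey <- read.csv(csv_path)

# View the data
summary(survey)
head(survey)
subset(survey, weight > 200)          # list observations with weight above 200

# Set the implausible weight to missing
survey$weight_clean <- ifelse(survey$weight == 700.2, NA, survey$weight)

# Create a new variable from existing columns
survey$bmi <- survey$weight_clean / (survey$height^2)

# Convert a numeric code to a factor, then make "Female" the first level
survey$sex <- factor(survey$sex, levels = c(1, 2), labels = c("Male", "Female"))
survey$sex <- relevel(survey$sex, ref = "Female")

# Recode a continuous variable into categories (intervals closed on the left)
survey$agegroup <- cut(survey$age, breaks = c(0, 30, 50, 70, 100), right = FALSE,
                       labels = c("Less than 30", "30 to less than 50",
                                  "50 to less than 70", "70 or more"))
table(survey$agegroup)

# Log-transform, then back-transform to recover the original values
survey$ln_weight <- log(survey$weight_clean)
exp(survey$ln_weight)
```

Creating weight_clean as a new column, rather than overwriting weight, keeps the original values available for checking.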
4. Descriptive Statistics (Numerical)
- Continuous data:
  - summary(vector): Basic summary (Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum). [Section 1.13]
  - descriptives(data=dataframe, vars=c(var1, var2), pc=TRUE, ci=TRUE, splitBy=group_var) (from jmv): Comprehensive descriptive statistics, including percentiles and a confidence interval for the mean; can split by a categorical variable. [Section 1.13; Section 3.20]
- Categorical data:
  - One-way Frequency Table: descriptives(data=dataframe, vars=variable, freq=TRUE) (from jmv). [Module 2, R notes]
  - Two-way Frequency Table (Cross-tabulation): contTables(data=dataframe, rows=row_var, cols=col_var) (from jmv). Use pcCol=TRUE for column percents, and pcRow=TRUE for row percents. Use counts=n for summarised data. [Module 2, R notes; Section 7.11]
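For comparison with the jmv functions listed above, the sketch below produces similar numerical summaries using base R only. It uses the built-in mtcars dataset rather than course data, and table() and prop.table() stand in for the jmv frequency tables; the layout of the output will differ from jmv.

```r
# Built-in dataset, used here only as a stand-in for course data
cars <- mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Continuous data: basic summary, overall and split by a categorical variable
summary(cars$mpg)
tapply(cars$mpg, cars$am, summary)

# Categorical data: one-way frequency table
table(cars$cyl)

# Two-way frequency table (rows = cylinders, columns = transmission)
tab <- table(cars$cyl, cars$am)
tab

# Column and row proportions (multiply by 100 for percents)
col_props <- prop.table(tab, margin = 2)
row_props <- prop.table(tab, margin = 1)
round(100 * col_props, 1)
round(100 * row_props, 1)
```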
5. Descriptive Statistics (Graphical)
- Continuous data:
  - Density plot: plot(density(dataframe$variable, na.rm=TRUE), xlab="Label", main="Title"). Can also use the descriptives() function in jmv, with the dens=TRUE option (e.g. descriptives(data=dataframe, vars=varname, dens=TRUE)). [Section 1.13]
  - Box plot: boxplot(dataframe$variable, xlab="Label", main="Title"). Can also use the descriptives() function in jmv, with the box=TRUE option. [Section 1.13]
  - Scatter plot: plot(x=dataframe$x_var, y=dataframe$y_var, xlab="X Label", ylab="Y Label"). Add a regression line: abline(lm(dataframe$y_var ~ dataframe$x_var)). Can also use scat(data=dataframe, x="X_var", y="Y_var", line="linear") from the scatr package. [Section 8.13]
- Categorical data (Bar Charts):
  - One Categorical Variable: plot(dataframe$categorical_var, main="Title", ylab="Number of participants"). The variable must be a factor. [Section 2.18]
  - Two Categorical Variables (from the jmv package):
    - Clustered (side-by-side): contTables(data=dataframe, rows=row_var, cols=col_var, barplot=TRUE, xaxis="xcols"). [Section 2.19]
    - Stacked: contTables(data=dataframe, rows=row_var, cols=col_var, barplot=TRUE, xaxis="xcols", bartype="stack"). [Section 2.19]
    - Stacked relative frequency: contTables(data=dataframe, rows=row_var, cols=col_var, barplot=TRUE, bartype="stack", xaxis="xcols", yaxis="ypc", yaxisPc="column_pc"). [Section 2.19]
  - Two Categorical Variables (from the surveymv package): surveyPlot(data=dataframe, vars="var_to_plot", group="grouping_var", type="stacked"/"grouped", freq="perc"/"count"). [Section 2.19]
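The base R plotting commands above can be tried on the built-in mtcars dataset, as in this sketch. For two categorical variables it uses barplot() from base R as a stand-in for the jmv and surveymv charts, so the appearance will differ from those packages.

```r
# Built-in dataset, used here only as a stand-in for course data
cars <- mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Density plot and box plot of a continuous variable
plot(density(cars$mpg, na.rm = TRUE), xlab = "Miles per gallon", main = "Density of mpg")
boxplot(cars$mpg, xlab = "All cars", main = "Box plot of mpg")

# Scatter plot with a fitted regression line
plot(x = cars$wt, y = cars$mpg, xlab = "Weight (1000 lb)", ylab = "Miles per gallon")
fit <- lm(mpg ~ wt, data = cars)
abline(fit)

# Bar chart of one categorical variable (works because cyl is a factor)
plot(cars$cyl, main = "Cylinders", ylab = "Number of cars")

# Two categorical variables: rows = transmission, columns = cylinders
counts <- table(cars$am, cars$cyl)

# Clustered (side-by-side) bars
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Number of cars")

# Stacked bars
barplot(counts, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Number of cars")

# Stacked relative frequency: each column rescaled to sum to 100%
barplot(100 * prop.table(counts, margin = 2), legend.text = TRUE,
        xlab = "Cylinders", ylab = "Percent of cars")
```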
6. Probability Distributions
While R has functions to calculate probabilities, external applets may be easier.
- Binomial probabilities: https://homepage.stat.uiowa.edu/~mbognar/applets/bin.html
- Binomial distribution in R [Section 2.22]:
  - Probability of exactly k successes: dbinom(x=k, size=n, prob=p).
  - Probability of k or fewer successes: pbinom(q=k, size=n, prob=p).
  - Probability of more than k successes: pbinom(q=k, size=n, prob=p, lower.tail=FALSE).
- Normal probabilities: https://homepage.stat.uiowa.edu/~mbognar/applets/normal.html
- Normal distribution in R [Section 3.22]:
  - Probability of q or less: pnorm(q, mean, sd).
  - Probability of more than q: pnorm(q, mean, sd, lower.tail=FALSE).
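A short worked example of the probability functions above, using values chosen only for illustration: 10 trials with success probability 0.5, and a standard Normal distribution. Note that lower.tail=FALSE gives the probability of strictly more than k, so "k or more" needs q = k - 1.

```r
# Binomial with n = 10 trials and success probability p = 0.5
p_exactly_5   <- dbinom(x = 5, size = 10, prob = 0.5)                      # P(X = 5)
p_2_or_fewer  <- pbinom(q = 2, size = 10, prob = 0.5)                      # P(X <= 2)
p_more_than_2 <- pbinom(q = 2, size = 10, prob = 0.5, lower.tail = FALSE)  # P(X > 2)

# "3 or more" is the same event as "more than 2"
p_3_or_more <- pbinom(q = 3 - 1, size = 10, prob = 0.5, lower.tail = FALSE)

p_exactly_5      # 0.2460938
p_2_or_fewer     # 0.0546875
p_more_than_2    # 0.9453125

# Standard Normal (mean 0, sd 1)
p_below <- pnorm(1.96, mean = 0, sd = 1)                      # about 0.975
p_above <- pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE)  # about 0.025
```

The lower-tail and upper-tail probabilities for the same cut-off always sum to 1, which is a useful check.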
7. Confidence Intervals
- Proportions: BinomCI(x=k, n=n, method='wilson') (from the DescTools package). [Section 6.11]
- Mean (individual data): descriptives(data=dataframe, vars=variable, ci=TRUE) (from jmv). [Section 3.23]
- Mean (summarised data): Use the custom ci_mean function provided in the notes, e.g. ci_mean(n=242, mean=128.4, sd=19.56, width=0.95). [Section 3.24]
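The ci_mean function itself is defined in the course notes and is not reproduced here. As a rough sketch of the kind of calculation involved, the hypothetical helper below (ci_mean_sketch, a name invented for this example) computes a t-based confidence interval from summary statistics; it assumes the usual formula of mean plus or minus a t critical value times the standard error, and may differ from the notes' implementation. The proportion example uses made-up counts and relies on prop.test() without continuity correction, which returns the Wilson score interval in base R.

```r
# Hypothetical helper: t-based confidence interval for a mean from summary data
ci_mean_sketch <- function(n, mean, sd, width = 0.95) {
  se     <- sd / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - width) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean - t_crit * se, upper = mean + t_crit * se)
}

# Same inputs as the ci_mean example above; roughly 125.9 to 130.9
ci <- ci_mean_sketch(n = 242, mean = 128.4, sd = 19.56, width = 0.95)
ci

# Wilson interval for a proportion in base R (made-up counts: 20 out of 100)
wilson <- prop.test(x = 20, n = 100, correct = FALSE)$conf.int
wilson
```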
8. Hypothesis Testing
- One-Sample T-test: t.test(dataframe$variable, mu=hypothesised_mean). [Section 4.12]
- Independent Samples T-test: ttestIS(data=dataframe, vars=continuous_var, group=group_var, meanDiff=TRUE, ci=TRUE, welchs=TRUE) (from jmv). [Section 5.9]
  - Alternative base R: t.test(continuous_var ~ group_var, data=dataframe, var.equal=FALSE). [Section 5.9]
- Paired T-test: ttestPS(data=dataframe, pairs=list(list(i1="var1", i2="var2")), meanDiff=TRUE, ci=TRUE) (from jmv). [Section 5.11]
  - Alternative base R: t.test(dataframe$var1, dataframe$var2, paired=TRUE). [Section 5.11]
- One-Sample Proportion (Binomial Test): binom.test(x=successes, n=trials, p=hypothesised_proportion). Also prop.test() for z-test. [Section 6.12]
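The base R tests above can be tried on the built-in sleep dataset (extra hours of sleep for 10 people under two drugs), as sketched below. The dataset is analysed both as independent groups and as paired data purely to show the syntax; the counts in the proportion tests (7 successes in 20 trials) are made up.

```r
# One-sample t-test: is the mean of extra sleep different from 0?
one_sample <- t.test(sleep$extra, mu = 0)

# Independent samples t-test (Welch, unequal variances) using the formula interface
welch <- t.test(extra ~ group, data = sleep, var.equal = FALSE)

# Paired t-test: the same 10 people measured under drug 1 and drug 2
drug1  <- sleep$extra[sleep$group == 1]
drug2  <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# One-sample proportion: exact binomial test, then the z-test version
exact_test <- binom.test(x = 7, n = 20, p = 0.5)
z_test     <- prop.test(x = 7, n = 20, p = 0.5, correct = FALSE)

# Each result stores its pieces by name, e.g. the p-value and confidence interval
welch$p.value
paired$conf.int
exact_test$p.value
```

Printing any of the result objects (for example, typing welch on its own line) shows the full test output.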
Feedback? Contact me at phcm9795.biostatistics@unsw.edu.au