2  Minimum Background Check into R

2.1 What is R and why are we using it for CGM data?

R is a free, open‑source environment built for the statistical analysis of data. In this course, you will use R to analyse continuous glucose monitoring (CGM) data—things like calculating time in range, visualizing glucose trends, and generating summary reports. R is ideal for this because:

  • It handles time‑stamped sensor data cleanly.

  • It has ready-made packages for data manipulation and plotting (e.g., dplyr, ggplot2), plus CGM-specific packages for diabetes metrics.

  • It makes reproducible work easy: you can save your analysis steps and repeat them for any patient.

  • You do not need to be a programmer. The course will give you step‑by‑step code and explain every piece.

2.2 Learning Objectives

  • Understand R objects and their classes.
  • Work with dates and times.
  • Use logical operators, conditionals, and loops (hopefully).
  • Manipulate data frames and merge datasets.

2.3 Objects and Classes in R

2.4 A different way of thinking

R does not work like point‑and‑click software. It is program‑oriented: you give commands (code) and R executes them step by step.

Hopefully, with the rise of natural language programming (AI), the whole analysis landscape will change, but in the meantime, learning some core concepts will make you fluent.


2.5 What are objects?

Think of an object as a container that holds a piece of information.
In R, everything you work with is an object:

  • A single glucose value (e.g., 120).
  • A list of all glucose readings from a patient.
  • A whole data table with time stamps and glucose values.
  • A plot you create.

Each object has a class: a label that tells R what kind of information it contains and what you can do with it.


2.6 Common classes for CGM data

Class                What it stores                              CGM example                                      Typical use / functions
numeric              Numbers                                     Glucose value (120, 95, 185)                     mean(), summary()
integer              Whole numbers                               Count of hypoglycemic events                     min(), max()
character            Text                                        Sensor serial number                             table()
factor               Categories                                  “Day” / “Night”                                  levels()
POSIXct / Date       Dates and times                             Time of glucose reading                          difftime()
data.frame / tibble  Tables (rows and columns)                   Entire CGM dataset                               subset()
list                 A container that can hold multiple objects  A summary report with numbers, text, and a plot  Access with $ or [ ]

Date vs. POSIXct
- Date: stores only the calendar date (e.g., "2025-03-20"). Use for daily summaries.
- POSIXct: stores date and time (e.g., "2025-03-20 14:30:00"). Essential for tracing glucose over hours and calculating time‑in‑range.


2.7 Create three simple objects

object1 <- 3 + 2     # "<-" is the assignment operator: it stores the result in object1
object1
[1] 5
object2 <- "woman"   # a character (text) object
object3 <- TRUE      # a logical object

With the class() function we can check the class of each object.

class(object1)
[1] "numeric"
class(object2)
[1] "character"
class(object3)
[1] "logical"
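
The same check works for the other classes listed in the table of common classes. The sketch below is illustrative only: the object names and values are made up for this example and do not come from a real dataset.

```r
# Illustrative objects; names and values are invented for this sketch
n_hypo        <- 3L                                  # integer (note the L suffix)
period        <- factor(c("Day", "Night", "Day"))    # factor (categories)
reading_day   <- as.Date("2025-03-20")               # Date
reading_time  <- as.POSIXct("2025-03-20 14:30:00", tz = "UTC")  # POSIXct

class(n_hypo)         # "integer"
class(period)         # "factor"
levels(period)        # "Day" "Night": the categories stored in the factor
class(reading_day)    # "Date"
class(reading_time)   # "POSIXct" "POSIXt": date-time objects report two classes
```

Note that a number typed without the L suffix (such as 3) is stored as numeric, not integer.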

2.8 Why class matters: functions and errors

The class of an object determines which functions work with it.
If you use the wrong class, R will either give you an error or — even worse — give you a meaningless result without any warning.

2.8.1 CGM examples

  • Glucose values are usually numeric. You can calculate mean(), median(), min(), max() — all useful for metrics.

    • If they were stored as character (text), mean() would return NA with a warning instead of the average.
  • Time stamps (e.g., "2025-03-20 14:30:00") are usually POSIXct. You can calculate time differences (difftime()), extract the hour (lubridate::hour()), or plot glucose over time correctly.

    • If stored as plain character, a plot would treat time as unrelated text labels, and you could not compute durations.
  • Sensor ID or patient ID is usually character (text) or factor (category). You can group by it (group_by(sensor_id)) or count how many readings each sensor has.

    • You cannot calculate the mean() of a patient ID — that would be meaningless, and R will either error or produce nonsense.

Always check your class
If a function does not behave as expected, use class() on the object. Many “strange” results or errors in R come from mismatched classes.
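
To make the point concrete, here is a small sketch with the same three glucose values stored once as numbers and once as text. The vector names are hypothetical.

```r
glucose_num  <- c(110, 145, 98)          # stored correctly as numeric
glucose_text <- c("110", "145", "98")    # the same values stored as text

mean(glucose_num)     # 117.6667
mean(glucose_text)    # NA, with a warning that the argument is not numeric

# A silent problem: text is compared character by character, not by value
max(glucose_num)      # 145
max(glucose_text)     # "98", because the character "9" sorts after "1"

class(glucose_text)   # "character" reveals the cause
```

The second case is the more dangerous one: max() on text returns an answer that looks plausible, and R gives no warning.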

2.8.2 Fixing class mismatches

When you import CGM data from a CSV file, R sometimes guesses the class incorrectly. You can convert between classes using functions like:

  • as.numeric(): to turn text that looks like numbers into actual numbers.

  • as.POSIXct(): to convert text dates into date‑time objects.

  • as.factor(): to turn text into categories.

You will practice these conversions in the course — they are essential for preparing CGM data for analysis.
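
As a first taste, the sketch below converts three text vectors into the classes needed for analysis. The vector names are hypothetical, and the time zone is fixed to UTC only so that the example behaves the same on every computer.

```r
# Values as they might arrive from a CSV import: everything is text
glucose_text <- c("110", "145", "130")
time_text    <- c("2025-03-20 08:00:00", "2025-03-20 08:15:00", "2025-03-20 08:30:00")
period_text  <- c("Day", "Night", "Day")

glucose      <- as.numeric(glucose_text)
reading_time <- as.POSIXct(time_text, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
period       <- as.factor(period_text)

class(glucose)        # "numeric"
class(reading_time)   # "POSIXct" "POSIXt"
class(period)         # "factor"
mean(glucose)         # 128.3333, now a valid calculation

# Text that is not a number becomes NA, with a warning
as.numeric("Low")     # NA
```

The last line matters in practice: if a glucose column contains a text flag (the "Low" here is a made-up example), as.numeric() turns it into NA, so it is worth counting the NA values after converting.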

Key takeaway: Objects are the pieces of data you work with, while classes tell R what kind of data each object holds and what actions are allowed.

2.9 Date and time classes in R

When working with CGM data, you will encounter two main date/time classes in R:

  • Date – stores only the calendar date (e.g., 2025-03-20).
    Use when: you need daily summaries (e.g., average glucose per day).

  • POSIXct – stores both date and time (e.g., 2025-03-20 14:30:00).
    Use when: you work with sensor traces, need to calculate time between readings, or want to see glucose patterns by hour.

In this course, the time column of your CGM data will almost always be stored as POSIXct, because the time of each glucose reading matters for metrics like time in range and hypoglycemia duration, and for plotting ambulatory glucose profiles (AGP).

Always check your time column
After importing your data, use class(cgm_data$time) to verify it is POSIXct. If it is not, you can convert it with as.POSIXct() — we will practice this together.

date1 <- as.Date("2025-01-24")
date2 <- as.Date("2025-06-15")
class(date1)
[1] "Date"
difftime(date2, date1, units = "days")
Time difference of 142 days
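date_hour1 <- as.POSIXct("2025-01-24 15:30:00")
date_hour2 <- as.POSIXct("01-06-2025 18:30:00", format = "%d-%m-%Y %H:%M:%OS")
difftime(date_hour2, date_hour1, units = "hours")
Time difference of 3074 hours
# Note: without a tz argument, as.POSIXct() uses your computer's time zone.
# 3074 hours reflects a zone where clocks move forward one hour between these dates;
# with tz = "UTC" in both calls, the same code returns 3075 hours.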
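
Two other operations come up often with POSIXct time stamps: reducing them to a calendar date for daily summaries, and extracting the hour of day. A minimal sketch using base R only, with the time zone fixed to UTC (an assumption made here so the result does not depend on your computer's settings):

```r
reading_time <- as.POSIXct(c("2025-03-20 23:45:00", "2025-03-21 00:00:00"),
                           tz = "UTC")

as.Date(reading_time)                     # "2025-03-20" "2025-03-21"
as.integer(format(reading_time, "%H"))    # 23 0 (hour of day)

# Minutes between two consecutive readings
difftime(reading_time[2], reading_time[1], units = "mins")   # 15 mins
```

The two readings are only 15 minutes apart but fall on different calendar days, which is exactly the kind of detail that a Date column alone would hide.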

2.10 Logical Operators

Logical operators are questions you ask the data. In R, the answer to these questions is always a binary “Nursing Assessment”: YES (TRUE) or NO (FALSE).

Operator  Meaning                  Clinical Example
>         Greater than             Is the Temperature > 38.0°C?
<         Less than                Is the Glucose < 70 mg/dL?
==        Exactly equal to         Is the Patient ID == "12345"?
!=        Not equal to             Is the Heart Rhythm != "Sinus"?
&         AND (Both must be true)  Is BP low & is Heart Rate high?
|         OR (Either can be true)  Is the patient in pain | having a fever?
The Double Equal: Notice that we use == to ask a question (Is this equal to that?). In R, a single = assigns a value (like writing a note in a chart), just as <- does. For example, bp = 120 sets bp to 120, whereas bp == 120 asks: “Is bp equal to 120?”
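
Applied to glucose values, these operators look as follows. This is a minimal sketch: the readings are invented, and the 70 to 180 mg/dL range is used here as an assumed target range.

```r
glucose <- c(110, 145, 65, 190, 120)   # invented readings in mg/dL

glucose < 70                       # FALSE FALSE  TRUE FALSE FALSE
glucose >= 70 & glucose <= 180     #  TRUE  TRUE FALSE FALSE  TRUE
glucose < 70 | glucose > 180       # FALSE FALSE  TRUE  TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so logical answers can be summarized
n_low       <- sum(glucose < 70)                            # 1 reading below 70
pct_inrange <- mean(glucose >= 70 & glucose <= 180) * 100   # 60 percent in range
n_low
pct_inrange
```

R asks the question once for every value in the vector, so a single line of code answers it for all readings at the same time.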

2.11 Loops

In the context of CGM analysis, a for loop is like an automated processing line. Instead of manually opening 250 patient files one by one, the loop handles the entire “queue” (folder) for you. This gives you:

  • Efficiency: It processes all 250 files in a single run, with no manual opening, copying, or saving.

  • Safety: It applies the exact same cleaning rules to every file, eliminating “fatigue errors” or manual typos.

  • Consistency: Every file in your /processed_cgm folder ends up in the same standard format, ready for analysis.

Here is a simple example. The for loop iterates i (or whatever variable name you choose) over a sequence — in this case, from 1 to 5. Each time through the loop, R executes the code inside the curly braces { }, and i takes the next value in the sequence.

  • First iteration: i = 1
  • Second iteration: i = 2
  • … up to i = 5

This pattern of “do something for each item in a set” is the foundation for automating repetitive tasks like processing multiple patient files.

for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Clinical Analogy: A for loop is like an Automated Medication Dispensing System. You program the logic once, and it follows that exact protocol for every single patient in the system, every single time.
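
The sketch below shows the same pattern applied to files. Because no real exports are available here, it first writes three tiny example files to a temporary folder; the folder name, file names, and glucose values are all invented for illustration.

```r
# Step 0: create three small example files (stand-ins for real sensor exports)
cgm_folder <- file.path(tempdir(), "cgm_demo")
dir.create(cgm_folder, showWarnings = FALSE)
write.csv(data.frame(glucose = c(110, 145, 130)),
          file.path(cgm_folder, "PT001.csv"), row.names = FALSE)
write.csv(data.frame(glucose = c(98, 120)),
          file.path(cgm_folder, "PT002.csv"), row.names = FALSE)
write.csv(data.frame(glucose = c(200, 180, 160)),
          file.path(cgm_folder, "PT003.csv"), row.names = FALSE)

# Step 1: list every CSV file in the folder
files <- list.files(cgm_folder, pattern = "\\.csv$", full.names = TRUE)

# Step 2: prepare an empty vector with one slot per file
mean_glucose <- numeric(length(files))

# Step 3: apply the same steps to each file in turn
for (i in seq_along(files)) {
  patient_data    <- read.csv(files[i])
  mean_glucose[i] <- mean(patient_data$glucose)
}

names(mean_glucose) <- basename(files)
mean_glucose   # PT001.csv 128.3333, PT002.csv 109, PT003.csv 180
```

The loop body is written once, and it would run unchanged whether the folder holds 3 files or 250.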

2.12 Data Structures

We did not come this far to deal with a single value — we need more complex objects to store and manage real CGM data. R provides several data structures, each suited for different kinds of information.

Data structure  What it holds                                                     CGM example
Vector          A sequence of values of the same type                             A single patient’s glucose readings: c(110, 145, 130, 98)
Data frame      A table with rows and columns (different types allowed)           The complete dataset for one patient: time stamps, glucose, sensor ID
List            A container that can hold any mix of objects                      A report containing numbers, text, a data frame, and a plot
Matrix          A two-dimensional structure where all elements are the same type  Rare in CGM analysis; sometimes used for time-glucose matrices
Array           Multi-dimensional version of a matrix                             Not commonly used in basic CGM workflows

In this course, you will work most often with:

  • Vectors — for single columns of data (e.g., glucose values, timestamps).
  • Data frames — for entire patient datasets.
  • Lists — for storing multiple outputs (e.g., a summary with text and graphics).

Now that you understand objects and classes, these structures will feel like natural containers for your CGM data.
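
Vectors and data frames each get their own section in this chapter, so here is a brief sketch of a list. The element names (patient, n_readings, mean_glucose, readings) are hypothetical choices for this example.

```r
glucose <- c(110, 145, 130, 98, 120)

# A list can mix text, single numbers, and whole vectors
report <- list(
  patient      = "PT001",
  n_readings   = length(glucose),
  mean_glucose = mean(glucose),
  readings     = glucose
)

report$patient            # "PT001" (access by name with $)
report[["mean_glucose"]]  # 120.6   (same idea with double brackets)
report$readings[2]        # 145     (second value of the vector stored inside)
str(report)               # compact overview of everything in the list
```

A list keeps each element in its own class, which is why it suits summary reports that combine different kinds of output.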

2.12.1 Vectors

A vector is a one‑dimensional collection of elements that are all the same type. In R, vectors are the building blocks for most data structures — a data frame is essentially a collection of vectors of equal length.

CGM examples of vectors

  • A vector of glucose values: 110, 145, 130, 98, 120.
  • A vector of time stamps: "08:00", "08:15", "08:30", "08:45", "09:00".
  • A vector of patient IDs: "PT001", "PT001", "PT001", "PT002", "PT002".

Once you have a vector, you can:

  • Calculate summaries: mean, min, max.
  • Access individual elements with [ ] (e.g., the first value, the third value).
  • Subset based on conditions (e.g., all glucose values above 130).

Why vectors matter for CGM analysis

Every column in your CGM dataset is a vector. Understanding vectors helps you manipulate single columns efficiently before combining them into a full data frame.

# Create a vector of glucose values
glucose <- c(110, 145, 130, 98, 120)

# Access elements
glucose[1]   # first value
[1] 110
glucose[3]   # third value
[1] 130
# Subset based on condition
glucose[glucose > 130]
[1] 145

Before long, you will be computing summaries like these on actual patient data (patience).

# Descriptive statistics
mean(glucose)
[1] 120.6
min(glucose)
[1] 98
max(glucose)
[1] 145

2.13 Data frames

A data.frame is a table‑like structure where columns can have different types. This is the workhorse of CGM analysis — it allows you to store time stamps (as POSIXct), glucose values (as numeric), and patient identifiers (as character or factor) all in one object.

CGM example of a data frame

time                  glucose  sensor_id
2025-03-20 08:00:00   110      S01
2025-03-20 08:15:00   145      S01
2025-03-20 08:30:00   130      S01
2025-03-20 08:45:00   98       S02
2025-03-20 09:00:00   120      S02

With a data frame, you can:

  • Keep all related data together (time, glucose, sensor ID).

  • Filter rows (e.g., only readings above 180 mg/dL).

  • Group by columns (e.g., calculate daily averages per sensor).

  • Create plots that combine multiple columns (e.g., glucose over time).

Proper data analysis starts with how you organize your files before even opening R. Please read: Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2-10. https://doi.org/10.1080/00031305.2017.1375989

Soon you will be building data frames from real sensor exports and using them to generate clinical insights.

# Create a data frame from vectors
cgm_data <- data.frame(
  time = as.POSIXct(c("2025-03-20 08:00:00", "2025-03-20 08:15:00",
                      "2025-03-20 08:30:00", "2025-03-20 08:45:00",
                      "2025-03-20 09:00:00")),
  glucose = c(110, 145, 130, 98, 120),
  sensor_id = c("S01", "S01", "S01", "S02", "S02")
)

# View the data frame
head(cgm_data)
                 time glucose sensor_id
1 2025-03-20 08:00:00     110       S01
2 2025-03-20 08:15:00     145       S01
3 2025-03-20 08:30:00     130       S01
4 2025-03-20 08:45:00      98       S02
5 2025-03-20 09:00:00     120       S02
dim(cgm_data)   # number of rows (readings) and columns (variables)
[1] 5 3
names(cgm_data) # variable names
[1] "time"      "glucose"   "sensor_id"
str(cgm_data)   # variable classes
'data.frame':   5 obs. of  3 variables:
 $ time     : POSIXct, format: "2025-03-20 08:00:00" "2025-03-20 08:15:00" ...
 $ glucose  : num  110 145 130 98 120
 $ sensor_id: chr  "S01" "S01" "S01" "S02" ...

Why data frames matter for CGM analysis
A data frame is the natural format for a patient’s CGM records. It keeps everything organized, and almost every analysis, from time-in-range calculations to AGP plots, starts with a data frame.
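
Two of the operations listed earlier, adding a derived column and summarizing by group, can be sketched in base R as follows. The example rebuilds the small data frame so that it runs on its own; the column name above_140 and the 140 mg/dL cutoff are arbitrary choices for illustration, and the time zone is fixed to UTC as an assumption.

```r
cgm_data <- data.frame(
  time = as.POSIXct(c("2025-03-20 08:00:00", "2025-03-20 08:15:00",
                      "2025-03-20 08:30:00", "2025-03-20 08:45:00",
                      "2025-03-20 09:00:00"), tz = "UTC"),
  glucose   = c(110, 145, 130, 98, 120),
  sensor_id = c("S01", "S01", "S01", "S02", "S02")
)

# Add a new column computed from an existing one
cgm_data$above_140 <- cgm_data$glucose > 140

# Mean glucose per sensor: "glucose grouped by sensor_id"
per_sensor <- aggregate(glucose ~ sensor_id, data = cgm_data, FUN = mean)
per_sensor   # S01: 128.3333, S02: 109
```

aggregate() is part of base R; the dplyr functions group_by() and summarise() do the same job with a different syntax.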

2.13.1 Selecting Rows (patients) and Columns (variables)

Once your CGM data is in a data frame, you often need to extract specific parts:

  • Rows correspond to individual glucose readings (or patients, if you have multiple patients stacked).
  • Columns correspond to variables (e.g., time, glucose, sensor ID).

R provides several ways to select exactly what you need. The most common are:

  • data[rows, columns] – using square brackets; leaving rows or columns blank means “all”.
  • data$column_name – extracts a single column as a vector.
  • subset() – a more readable way to filter rows based on conditions.

CGM examples

  • Select all rows for a specific patient: all_data[all_data$patient_id == "PT001", ]
  • Select only the glucose column: all_data$glucose
  • Select the first 10 rows and only the time and glucose columns: all_data[1:10, c("time", "glucose")]
  • Filter to hyperglycemia events (glucose > 180): subset(all_data, glucose > 180)

You will use these selections constantly: whether you are isolating a single patient, focusing on nocturnal readings, or extracting glucose values for a summary statistic.

Why selection matters
Real‑world CGM datasets often contain multiple patients, days, or sensor types. Being able to select the exact rows and columns you need is the first step toward any meaningful analysis.

# Select rows where sensor_id is "S01"
cgm_data[cgm_data$sensor_id == "S01", ]
                 time glucose sensor_id
1 2025-03-20 08:00:00     110       S01
2 2025-03-20 08:15:00     145       S01
3 2025-03-20 08:30:00     130       S01
# Select the glucose column as a vector
cgm_data$glucose
[1] 110 145 130  98 120
# Select the first 3 rows and only the time and glucose columns
cgm_data[1:3, c("time", "glucose")]
                 time glucose
1 2025-03-20 08:00:00     110
2 2025-03-20 08:15:00     145
3 2025-03-20 08:30:00     130
# Use subset() to get readings above 140 mg/dL
# (a lower cutoff than the usual > 180, so that this small example returns a row)
subset(cgm_data, glucose > 140)
                 time glucose sensor_id
2 2025-03-20 08:15:00     145       S01
# Combine: rows for S01 and columns time and glucose
cgm_data[cgm_data$sensor_id == "S01", c("time", "glucose")]
                 time glucose
1 2025-03-20 08:00:00     110
2 2025-03-20 08:15:00     145
3 2025-03-20 08:30:00     130
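
Selecting by time of day follows the same logic. The sketch below picks out nocturnal readings from a small invented data frame; defining "night" as 00:00 to 05:59 is an assumption made for this example, as is fixing the time zone to UTC.

```r
cgm_night <- data.frame(
  time = as.POSIXct(c("2025-03-20 22:00:00", "2025-03-21 02:30:00",
                      "2025-03-21 05:45:00", "2025-03-21 09:00:00"), tz = "UTC"),
  glucose = c(150, 68, 95, 130)
)

# Extract the hour of day (0 to 23) into its own column
cgm_night$hour <- as.integer(format(cgm_night$time, "%H"))

# Rows where the hour is before 06:00
night <- cgm_night[cgm_night$hour < 6, ]
night                # the 02:30 and 05:45 readings

# Combine two conditions with &: nocturnal readings below 70 mg/dL
night_low <- subset(cgm_night, hour < 6 & glucose < 70)
night_low            # only the 02:30 reading (68 mg/dL)
```

Inside subset() the column names can be used directly, without the cgm_night$ prefix.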

In real‑world CGM analysis, your data often lives in multiple tables:

  • One table with glucose readings (time, glucose, sensor ID)
  • Another table with patient information (sensor ID, age, diabetes type, target ranges)
  • A third table with event logs (meal times, insulin doses, exercise)

To bring everything together for analysis, you need to merge data frames. This is like combining two spreadsheets using a common column — in CGM work, usually the sensor_id or patient_id.

Common types of merges

  • Inner join: keeps only rows that exist in both tables.
  • Left join: keeps all rows from the first table, adding matching information from the second (if available).
  • Full join: keeps all rows from both tables.

For most CGM workflows, you will use left joins to attach patient metadata to glucose readings, ensuring no sensor data is lost.
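
The difference between the three join types is easiest to see when the tables do not match perfectly. In the sketch below (invented data), sensor S03 has readings but no patient record, and sensor S02 has a patient record but no readings.

```r
readings <- data.frame(
  sensor_id = c("S01", "S01", "S03"),
  glucose   = c(110, 145, 98)
)
patients <- data.frame(
  sensor_id = c("S01", "S02"),
  age       = c(34, 28)
)

# Inner join: only sensors present in both tables (S01)
inner <- merge(readings, patients, by = "sensor_id")

# Left join: every reading is kept; S03 gets NA for age
left <- merge(readings, patients, by = "sensor_id", all.x = TRUE)

# Full join: everything is kept; S02 gets NA for glucose
full <- merge(readings, patients, by = "sensor_id", all = TRUE)

nrow(inner)   # 2
nrow(left)    # 3
nrow(full)    # 4
```

Comparing the row counts before and after a merge is a quick way to notice readings that were dropped or patients that did not match.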

CGM example

You have:

  • glucose_data – columns: time, glucose, sensor_id
  • patient_info – columns: sensor_id, age, diabetes_type, target_low, target_high

Merging these allows you to:

  • Calculate time in range using each patient’s personal targets
  • Compare glucose patterns by age or diabetes type
  • Create patient-specific reports

Before long, you will be merging sensor exports with clinic databases to produce personalized CGM summaries.

Why merging matters
CGM devices export time‑stamped glucose data, but clinical context (patient demographics, insulin regimens) often lives elsewhere. Merging bridges that gap and turns raw numbers into actionable insights.


# Create example data frames
glucose_data <- data.frame(
  time = as.POSIXct(c("2025-03-20 08:00:00", "2025-03-20 08:15:00",
                      "2025-03-20 08:30:00", "2025-03-20 08:45:00")),
  glucose = c(110, 145, 130, 98),
  sensor_id = c("S01", "S01", "S02", "S02")
)
glucose_data
                 time glucose sensor_id
1 2025-03-20 08:00:00     110       S01
2 2025-03-20 08:15:00     145       S01
3 2025-03-20 08:30:00     130       S02
4 2025-03-20 08:45:00      98       S02
patient_info <- data.frame(
  sensor_id = c("S01", "S02"),
  age = c(34, 28),
  diabetes_type = c("Type 1", "Type 2"),
  target_low = c(70, 80),
  target_high = c(180, 200)
)
patient_info
  sensor_id age diabetes_type target_low target_high
1       S01  34        Type 1         70         180
2       S02  28        Type 2         80         200
# Left join: keep all glucose readings, add patient info where available
merged_data <- merge(glucose_data, patient_info, 
                     by = "sensor_id", all.x = TRUE)

# View merged data
merged_data
  sensor_id                time glucose age diabetes_type target_low
1       S01 2025-03-20 08:00:00     110  34        Type 1         70
2       S01 2025-03-20 08:15:00     145  34        Type 1         70
3       S02 2025-03-20 08:30:00     130  28        Type 2         80
4       S02 2025-03-20 08:45:00      98  28        Type 2         80
  target_high
1         180
2         180
3         200
4         200
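
Once the targets sit next to each reading, a per-patient time in range takes only a few lines. This is a simplified sketch with invented values: it uses the share of readings inside the target range as a stand-in for time in range, which assumes that readings are evenly spaced.

```r
glucose_data <- data.frame(
  glucose   = c(110, 145, 130, 98, 75, 210),
  sensor_id = c("S01", "S01", "S01", "S02", "S02", "S02")
)
patient_info <- data.frame(
  sensor_id   = c("S01", "S02"),
  target_low  = c(70, 80),
  target_high = c(180, 200)
)

merged <- merge(glucose_data, patient_info, by = "sensor_id", all.x = TRUE)

# TRUE when a reading lies inside that patient's own target range
merged$in_range <- merged$glucose >= merged$target_low &
                   merged$glucose <= merged$target_high

# Percentage of in-range readings per sensor
tir <- aggregate(in_range ~ sensor_id, data = merged,
                 FUN = function(x) 100 * mean(x))
tir   # S01: 100, S02: 33.33333
```

The reading of 75 mg/dL counts as out of range for S02 (target 80 to 200) although it would be in range for S01 (target 70 to 180), which is why the merge has to come first.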

2.13.1.1 Row Binding (rbind)

Sometimes you need to stack data frames on top of each other, for example, combining CGM data from multiple patients into one large table, or appending a new day of readings to an existing dataset.

Row binding does exactly that. The function rbind() takes two or more data frames with the same columns and stacks their rows together.

CGM examples

  • You have separate CSV files for each patient. After importing them one by one, you use rbind() to combine them into a single data frame for analysis across your entire clinic.

  • A patient wears a new sensor. You want to add the new readings to their existing data frame.

  • You download data for the same patient from two different months and want a continuous timeline.

Important: For rbind() to work, the data frames must have the same column names. rbind() matches columns by name, so their order may differ, but a missing or extra column causes an error. If the names differ, rename or select the columns first.

Soon you will be combining hundreds of patient files into one master data frame.

Why row binding matters
Real CGM datasets are often split across multiple files or time periods. rbind() lets you bring everything together so you can analyze the full picture without manually copying and pasting in Excel.


# Create two data frames for the same patient on different days
day1 <- data.frame(
  time = as.POSIXct(c("2025-03-20 08:00:00", "2025-03-20 08:15:00")),
  glucose = c(110, 145),
  sensor_id = "S01"
)
day1
                 time glucose sensor_id
1 2025-03-20 08:00:00     110       S01
2 2025-03-20 08:15:00     145       S01
day2 <- data.frame(
  time = as.POSIXct(c("2025-03-21 08:00:00", "2025-03-21 08:15:00")),
  glucose = c(130, 98),
  sensor_id = "S01"
)
day2
                 time glucose sensor_id
1 2025-03-21 08:00:00     130       S01
2 2025-03-21 08:15:00      98       S01
# Stack them
all_readings <- rbind(day1, day2)

# View combined data
all_readings
                 time glucose sensor_id
1 2025-03-20 08:00:00     110       S01
2 2025-03-20 08:15:00     145       S01
3 2025-03-21 08:00:00     130       S01
4 2025-03-21 08:15:00      98       S01
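
With more than two data frames, a common base R pattern is to collect them in a list and bind them in one step with do.call(). The sketch below uses three invented patients; the second data frame has its columns in a different order on purpose, to show that rbind() matches data frame columns by name.

```r
pt1 <- data.frame(glucose = c(110, 145), patient_id = "PT001")
pt2 <- data.frame(patient_id = "PT002", glucose = c(130, 98))  # different column order
pt3 <- data.frame(glucose = 120, patient_id = "PT003")

# Collect the data frames in a list, then stack them all at once
patient_list <- list(pt1, pt2, pt3)
all_patients <- do.call(rbind, patient_list)

all_patients                     # 5 rows, columns: glucose, patient_id
table(all_patients$patient_id)   # readings per patient: 2, 2, 1
```

do.call(rbind, patient_list) is equivalent to typing rbind(pt1, pt2, pt3), but it works for a list of any length, such as one filled by a for loop over patient files.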

2.14 Summary

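We covered the basics of R objects and classes, dates and times, logical operators, loops, and the main data structures (vectors, data frames, and lists), along with selecting, merging, and row binding data frames. These skills will be essential for handling CGM data.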