Formulas and Model Families: Categorical variables | Saylor Academy

Categorical variables

Generating a function from a formula is straight forward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like y ~ sex, where sex could either be male or female. It doesn't make sense to convert that to a formula like y = x_0 + x_1 * sex because sex isn't a number - you can't multiply it! Instead what R does is convert it to y = x_0 + x_1 * sex_male where sex_male is one if sex is male and zero otherwise:

df <- tribble(
~ sex, ~ response,
"male", 1,
"female", 2,
"male", 1
)
model_matrix(df, response ~ sex)
#> # A tibble: 3 x 2
#>   `(Intercept)` sexmale
#>           <dbl>   <dbl>
#> 1             1       1
#> 2             1       0
#> 3             1       1

You might wonder why R also doesn't create a sexfemale column. The problem is that would create a column that is perfectly predictable based on the other columns (i.e. sexfemale = 1 - sexmale). Unfortunately the exact details of why this is a problem is beyond the scope of this book, but basically it creates a model family that is too flexible, and will have infinitely many models that are equally close to the data.

Fortunately, however, if you focus on visualising predictions you don't need to worry about the exact parameterisation. Let's look at some data and models to make that concrete. Here's the sim2 dataset from modelr:

ggplot(sim2) + 
geom_point(aes(x, y))

We can fit a model to it, and generate predictions:

mod2 <- lm(y ~ x, data = sim2)

grid <- sim2 %>% 
data_grid(x) %>% 
add_predictions(mod2)
grid
#> # A tibble: 4 x 2
#>   x      pred
#>   <chr> <dbl>
#> 1 a      1.15
#> 2 b      8.12
#> 3 c      6.13
#> 4 d      1.91

Effectively, a model with a categorical x will predict the mean value for each category. (Why? Because the mean minimises the root-mean-squared distance.) That's easy to see if we overlay the predictions on top of the original data:

ggplot(sim2, aes(x)) + 
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4)

You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident so it's good to recognise this error message:

tibble(x = "e") %>% 
add_predictions(mod2)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor x has new level e

Course Introduction

Course Syllabus

Unit 1: Introduction to R and RStudio

1.1: R and Coding Environments

Overview of R

Introduction to R and RStudio

1.2: Installing and Setting Up R and RStudio

Installing R and RStudio

Setting up RStudio

Updating Software

1.3: Command Line and Script

Using R as a Calculator

Practice: Calculator

1.4: Functions and Packages

Functions

Practice: Functions

Packages

Updating R and Its Packages

Practice: Functions and Packages

1.5: Management of Code and Other Files

R Projects and Files in a Project

Practice: R Projects

Best Practices for Writing R Code

Unit 1 Assessment

Unit 1 Assessment

Unit 2: Basic Object Types and Operations in R

2.1: Data Types

Basic Data Types and Data Structures in R

Practice: Data Types

Strings

Practice: Strings

Factors

Practice: Factors

2.2: Vectors

Vectors and Simple Manipulations

Vectors and Type Coercion

Practice: Vectors

2.3: Arrays and Matrices

What is the Difference Between Arrays and Matrices?

Arrays in R

Matrices in R

Practice: Arrays and Matrices

2.4: Lists and Data Frames

Lists and Data Frames

Practice: Base-R Lists and Data Frames

The Tibble Format

Practice: Tibbles

The data.table Format

Practice: Data Tables

Unit 2 Assessment

Unit 2 Assessment

Unit 3: Data Import and Export

3.1: Data Input via Keyboard or Number Generation

Entering Data

Data Sets in Base R

Practice: Built-in Datasets

Pseudo-Random Number Generation

Practice: Random Number Generation

Reproducible Simulations

3.2: Loading External Files

Data Loading and Viewing

Base R: Reading Plain-Text Files

Tidyverse: Reading Plain-Text Files

Practice: read_csv

Parsing a Vector

Practice: Parsing a Vector

Parsing a File

Using the readxl Package to Read Excel Files

Loading Files From Other Programs

3.3: Data Export and Reusing R Data

Saving and Reloading Data in R Format

Practice: Export and Reuse

Base R: Writing to a CSV File

Tidyverse: Writing to a CSV File

Practice: Export to a CSV File

Practice: Data Manipulation in a Project

Unit 3 Assessment

Unit 3 Assessment

Unit 4: Data Visualization

4.1: Base-R and ggplot2 Graphics