Factors: Modifying factor order | Saylor Academy

Modifying factor order

It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

library(tidyverse)

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

.f, the factor whose levels you want to modify.
.x, a numeric vector that you want to use to reorder the levels.
Optionally, .fun, a function that's used to summarise x when there are multiple values of x for each value of f. The default value is median.

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and people in the Hinduism and Other Eastern categories watch much less.
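To see what fct_reorder() does to the levels themselves, here is a minimal sketch on toy data. The vectors f and x are made up for illustration and are not part of gss_cat; the only assumption is that forcats is installed.

```r
library(forcats)

# Hypothetical toy data: three levels, two values of x per level
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 20, 3, 4)

# Default summary is the median: c (3.5) < a (6) < b (10.5)
by_median <- fct_reorder(f, x)
levels(by_median)
#> [1] "c" "a" "b"

# Summarise with the minimum instead: b (1) < c (3) < a (5)
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)
#> [1] "b" "c" "a"
```

Only the order of the levels changes; the values stored in the factor stay the same, which is why the points in the plot keep their positions on the tvhours axis.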

As you start making more complicated transformations, I'd recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

What if we create a similar plot looking at how average age varies across reported income level?

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Here, arbitrarily reordering the levels isn't a good idea! That's because rincome already has a principled order that we shouldn't mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.
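One way to decide whether a factor already has a principled order is to look at its levels before reordering anything. The sketch below assumes only that forcats (which ships gss_cat) is installed; income_levels is a name introduced here for illustration.

```r
library(forcats)

# List the income levels in the order they are stored in the factor
income_levels <- levels(gss_cat$rincome)
income_levels

# Check whether R itself treats the factor as ordered,
# or whether the order is only a convention in the level vector
is.ordered(gss_cat$rincome)
```

If the printed levels run through the income brackets in sequence, that sequence carries meaning and is worth preserving in the plot.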

However, it does make sense to pull "Not applicable" to the front with the other special levels. You can do this with fct_relevel(). It takes a factor, .f, and then any number of levels that you want to move to the front of the line.

ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

Why do you think the average age for "Not applicable" is so high?
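The effect of fct_relevel() on the level order can be sketched on a small factor. The income factor below is hypothetical and only loosely modelled on rincome; the after argument is part of fct_relevel() and controls where the named levels are placed.

```r
library(forcats)

# Hypothetical income factor with a special level stored last
income <- factor(
  c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "Not applicable"),
  levels = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "Not applicable")
)

# Move "Not applicable" to the front; the other levels keep their relative order
front <- fct_relevel(income, "Not applicable")
levels(front)
#> [1] "Not applicable" "Lt $1000"       "$1000 to 2999"  "$3000 to 3999"

# after = Inf moves the named level to the end instead
back <- fct_relevel(income, "Lt $1000", after = Inf)
levels(back)
#> [1] "$1000 to 2999"  "$3000 to 3999"  "Not applicable" "Lt $1000"
```

Unlike fct_reorder(), no numeric variable is involved: you name the levels to move, and everything else stays where it was.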

Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
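As a rough illustration of how fct_reorder2() picks the order, consider three made-up series observed at x = 1, 2, 3. The data frame df and its columns are hypothetical; the sketch relies on the default behaviour of fct_reorder2(), which ranks levels by the y value at the largest x, highest first.

```r
library(forcats)

# Hypothetical data: three series (a, b, c), each observed at x = 1, 2, 3
df <- data.frame(
  x = rep(1:3, times = 3),
  g = factor(rep(c("a", "b", "c"), each = 3)),
  y = c(1, 2, 3,    # series a ends at 3
        9, 8, 1,    # series b ends at 1
        5, 5, 5)    # series c ends at 5
)

# Levels are sorted by y at the largest x, in descending order: c (5), a (3), b (1)
reordered <- fct_reorder2(df$g, df$x, df$y)
levels(reordered)
#> [1] "c" "a" "b"
```

Because the legend lists levels in this order, its entries appear top to bottom in the same order as the line endings at the right edge of the plot.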

Finally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. Combine it with fct_rev() if you want the bars in increasing frequency instead.

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
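A small sketch on a made-up factor shows the two steps separately. The vector status is hypothetical; fct_infreq() puts the most common level first and fct_rev() reverses whatever order it is given.

```r
library(forcats)

# Hypothetical factor: "c" appears 3 times, "a" twice, "b" once
status <- factor(c("b", "a", "a", "c", "c", "c"))
levels(status)
#> [1] "a" "b" "c"

# Most frequent level first
most_first <- fct_infreq(status)
levels(most_first)
#> [1] "c" "a" "b"

# Reverse to get the least frequent level first
least_first <- fct_rev(most_first)
levels(least_first)
#> [1] "b" "a" "c"
```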
