Factors are the way categorical variables are stored in R. For example,
treatment levels in ANOVA (analysis of variance) are considered factors;
months or quarters of the year can be represented as factors for
modeling seasonality. You should learn how to create factors and how to
rename and reorder factor levels, both for convenience and for correct
analysis (for example, the control treatment usually should be the first
level of a factor because, by default, other levels are compared to the
first one in linear models).
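As a minimal sketch of the point about the control treatment, the base R snippet below creates a factor and moves a reference level to the front. The treatment labels ("placebo", "low", "high") are invented for illustration and do not come from any dataset used in this section.

```r
# Hypothetical treatment labels, invented for illustration
treatment <- c("low", "placebo", "high", "placebo", "low", "high")

# By default, factor() sorts the levels alphabetically,
# so "placebo" ends up last rather than first
f_default <- factor(treatment)
levels(f_default)
#> [1] "high"    "low"     "placebo"

# Option 1: state the level order explicitly when creating the factor
f_explicit <- factor(treatment, levels = c("placebo", "low", "high"))

# Option 2: move one reference level to the front of an existing factor
f_relevel <- relevel(f_default, ref = "placebo")
levels(f_relevel)
#> [1] "placebo" "high"    "low"
```

With either version, a model such as lm() would treat "placebo" as the baseline that the other levels are compared against.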
Modifying factor order
It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:
- .f, the factor whose levels you want to modify.
- .x, a numeric vector that you want to use to reorder the levels.
- Optionally, .fun, a function that's used if there are multiple values of .x for each value of .f. The default value is median.
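To make the role of the summary function concrete, here is a small sketch on toy data (the group and score vectors are invented for illustration). It assumes the forcats package is installed and shows how the resulting level order changes when the default median is swapped for min.

```r
library(forcats)

# Toy data, invented for illustration: two scores per group
group <- factor(c("a", "a", "b", "b", "c", "c"))
score <- c(10, 30, 1, 3, 5, 100)

# Default summary (median): a = 20, b = 2, c = 52.5
by_median <- fct_reorder(group, score)
levels(by_median)
#> [1] "b" "a" "c"

# Using min instead: a = 10, b = 1, c = 5
by_min <- fct_reorder(group, score, .fun = min)
levels(by_min)
#> [1] "b" "c" "a"
```

In relig_summary there is only one row per religion, so the choice of summary function makes no difference there; it matters when the data has several values per level.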
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()

Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
As you start making more complicated transformations, I'd recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
What if we create a similar plot looking at how average age varies across reported income level?
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Here, arbitrarily reordering the levels isn't a good idea! That's because rincome already has a principled order that we shouldn't mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.
However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use fct_relevel() for this. It takes a factor, .f, and then any number of levels that you want to move to the front of the line.
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) + geom_point()
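The effect of fct_relevel() on the levels themselves can be checked on a small factor. The sketch below assumes forcats is installed and uses an invented income factor, not the real rincome levels. It also shows the after argument, which places the moved level somewhere other than the front.

```r
library(forcats)

# Hypothetical income bands, invented for illustration
income <- factor(
  c("low", "high", "Not applicable", "medium"),
  levels = c("low", "medium", "high", "Not applicable")
)

# Move "Not applicable" to the front; the other levels keep their order
front <- fct_relevel(income, "Not applicable")
levels(front)
#> [1] "Not applicable" "low"            "medium"         "high"

# after = Inf moves the named level to the end instead
back <- fct_relevel(income, "low", after = Inf)
levels(back)
#> [1] "medium"         "high"           "Not applicable" "low"
```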

Why do you think the average age for "Not applicable" is so high?
Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
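A tiny example may help show what "the y values associated with the largest x values" means in practice. The series, x, and y vectors below are invented for illustration, and the sketch assumes forcats is installed with the default settings of fct_reorder2().

```r
library(forcats)

# Toy data, invented for illustration: three series observed at x = 1 and x = 2
series <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 1, 9, 3, 4)

# At the largest x (x = 2) the y values are a = 1, b = 9, c = 4.
# By default the levels are put in descending order of those values,
# which matches the top-to-bottom order of the line ends on a plot.
reordered <- fct_reorder2(series, x, y)
levels(reordered)
#> [1] "b" "c" "a"
```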
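Finally, for bar plots, you can use fct_infreq() to order levels in
decreasing frequency: this is the simplest type of reordering because it
doesn't need any extra variables. Combine it with fct_rev() if you want
the levels in increasing frequency, so that the tallest bars appear on
the right of the plot rather than the left.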
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()
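The level order produced by each step can be inspected directly. This is a minimal sketch on an invented factor (not gss_cat), assuming forcats is installed.

```r
library(forcats)

# Toy factor, invented for illustration: "c" appears 3 times, "a" twice, "b" once
f <- factor(c("b", "a", "a", "c", "c", "c"))
levels(f)
#> [1] "a" "b" "c"

# fct_infreq() puts the most common level first
most_first <- fct_infreq(f)
levels(most_first)
#> [1] "c" "a" "b"

# fct_rev() reverses that order, so the most common level comes last
most_last <- fct_rev(most_first)
levels(most_last)
#> [1] "b" "a" "c"
```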
