Factors are the way categorical variables are stored in R. For example,
treatment levels in ANOVA (analysis of variance) are considered factors;
months or quarters of the year can be represented as factors for
modeling seasonality. You should learn how to create factors, rename and
reorder factor levels for convenience, and correct analysis (for
example, the control treatment usually should be the first level of a
factor because, by default, other levels are compared to the first one
in linear models).
Modifying factor levels
More powerful than changing the orders of the levels is changing
their values. This allows you to clarify labels for publication, and
collapse levels for high-level displays. The most general and powerful
tool is fct_recode()
. It allows you to recode, or change, the value of each level. For example, take the gss_cat$partyid
:
gss_cat %>% count(partyid) #> # A tibble: 10 x 2 #> partyid n #> <fct> <int> #> 1 No answer 154 #> 2 Don't know 1 #> 3 Other party 393 #> 4 Strong republican 2314 #> 5 Not str republican 3032 #> 6 Ind,near rep 1791 #> # … with 4 more rows
The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.
gss_cat %>% mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat" )) %>% count(partyid) #> # A tibble: 10 x 2 #> partyid n #> <fct> <int> #> 1 No answer 154 #> 2 Don't know 1 #> 3 Other party 393 #> 4 Republican, strong 2314 #> 5 Republican, weak 3032 #> 6 Independent, near rep 1791 #> # … with 4 more rows
fct_recode()
will leave levels that aren't explicitly mentioned as is, and will warn
you if you accidentally refer to a level that doesn't exist.
To combine groups, you can assign multiple old levels to the same new level:
gss_cat %>% mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat", "Other" = "No answer", "Other" = "Don't know", "Other" = "Other party" )) %>% count(partyid) #> # A tibble: 8 x 2 #> partyid n #> <fct> <int> #> 1 Other 548 #> 2 Republican, strong 2314 #> 3 Republican, weak 3032 #> 4 Independent, near rep 1791 #> 5 Independent 4119 #> 6 Independent, near dem 2499 #> # … with 2 more rows
You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
If you want to collapse a lot of levels, fct_collapse()
is a useful variant of fct_recode()
. For each new variable, you can provide a vector of old levels:
gss_cat %>% mutate(partyid = fct_collapse(partyid, other = c("No answer", "Don't know", "Other party"), rep = c("Strong republican", "Not str republican"), ind = c("Ind,near rep", "Independent", "Ind,near dem"), dem = c("Not str democrat", "Strong democrat") )) %>% count(partyid) #> # A tibble: 4 x 2 #> partyid n #> <fct> <int> #> 1 other 548 #> 2 rep 5346 #> 3 ind 8409 #> 4 dem 7180
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of fct_lump()
:
gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig) #> # A tibble: 2 x 2 #> relig n #> <fct> <int> #> 1 Protestant 10846 #> 2 Other 10637
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over collapsed.
Instead, we can use the n
parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>% mutate(relig = fct_lump(relig, n = 10)) %>% count(relig, sort = TRUE) %>% print(n = Inf) #> # A tibble: 10 x 2 #> relig n #> <fct> <int> #> 1 Protestant 10846 #> 2 Catholic 5124 #> 3 None 3523 #> 4 Christian 689 #> 5 Other 458 #> 6 Jewish 388 #> 7 Buddhism 147 #> 8 Inter-nondenominational 109 #> 9 Moslem/islam 104 #> 10 Orthodox-christian 95