Factors
Modifying factor levels
More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take gss_cat$partyid:
gss_cat %>%
  count(partyid)
#> # A tibble: 10 x 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # … with 4 more rows
The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 x 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # … with 4 more rows
fct_recode() will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
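Both behaviours are easy to check on a small factor. The sketch below uses a made-up vector x (not part of gss_cat) and assumes forcats is installed. The warning is captured with tryCatch() so that its message can be inspected rather than just printed.

```r
library(forcats)

# Hypothetical toy factor, used only for illustration
x <- factor(c("a", "b", "c", "a"))

# Only "a" is mentioned, so "b" and "c" are left as is
y <- fct_recode(x, "alpha" = "a")
levels(y)

# "z" is not a level of x, so fct_recode() signals a warning;
# tryCatch() captures the warning message instead of printing it
msg <- tryCatch(
  fct_recode(x, "zeta" = "z"),
  warning = function(w) conditionMessage(w)
)
msg
```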
To combine groups, you can assign multiple old levels to the same new level:
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 x 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 Other                   548
#> 2 Republican, strong     2314
#> 3 Republican, weak       3032
#> 4 Independent, near rep  1791
#> 5 Independent            4119
#> 6 Independent, near dem  2499
#> # … with 2 more rows
You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
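One way to double-check a grouping is to cross-tabulate the original levels against the recoded ones, so you can see exactly which old levels ended up in each new level. This is only a sketch on a made-up factor (the names old, new, and check are illustrative), using base R's table() alongside fct_recode().

```r
library(forcats)

# Hypothetical toy factor, used only for illustration
old <- factor(c("No answer", "Don't know", "Strong democrat", "No answer"))

# Two old levels are assigned to the same new level
new <- fct_recode(old,
  "Other" = "No answer",
  "Other" = "Don't know"
)

# Rows are the original levels, columns the recoded ones;
# each row should have counts in exactly one column
check <- table(old, new)
check
```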
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
#> # A tibble: 4 x 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
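To see how the two functions relate, the sketch below applies the same mapping both ways to a made-up factor (x and the level name vowel are illustrative, not from gss_cat). On this input the two calls should produce the same values; fct_collapse() simply saves you from repeating the new level name for every old level.

```r
library(forcats)

# Hypothetical toy factor, used only for illustration
x <- factor(c("a", "b", "e", "c", "a"))

# One vector of old levels per new level
collapsed <- fct_collapse(x, vowel = c("a", "e"))

# The same mapping written out one pair at a time
recoded <- fct_recode(x, "vowel" = "a", "vowel" = "e")

levels(collapsed)
identical(as.character(collapsed), as.character(recoded))
```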
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of fct_lump():
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 x 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Other      10637
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed.
Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
#> # A tibble: 10 x 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Other                     458
#>  6 Jewish                    388
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95
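The difference between the default and an explicit n is easier to follow on a tiny factor. The sketch below is built on a made-up vector x with counts chosen for illustration. With these counts I would expect the default to keep the two largest levels and n = 1 to keep only the largest, but it is worth confirming against your installed forcats version. Recent forcats releases also provide more specific variants such as fct_lump_n(); the sketch sticks to fct_lump() as used in the text.

```r
library(forcats)

# Hypothetical toy factor: counts are a = 5, b = 4, c = 2, d = 1
x <- factor(rep(c("a", "b", "c", "d"), times = c(5, 4, 2, 1)))

# Default: lump the smallest levels for as long as "Other"
# stays the smallest group
default_lump <- fct_lump(x)
table(default_lump)

# n = 1: keep only the single most common level, lump the rest
top_one <- fct_lump(x, n = 1)
table(top_one)
```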