
Efficiency of factor vs. characters - object size

I came across something odd. I always thought that storing data as a factor variable, where possible and meaningful, would result in better storage efficiency.

But when I look at this:

object.size(c( "A", "B", "B", "0", "A", "AB", "0")) # 336 bytes
gr <- factor(c( "A", "B", "B", "0", "A", "AB", "0"))
object.size(gr) # 720 bytes

It seems the factor requires more storage than the character vector. So is what I read about storage efficiency wrong?

And is there an example that makes the advantage of using factors visible to beginners?


  • Roughly speaking, a factor is an integer vector with a levels attribute (a character vector) listing the category names and a class attribute (another character vector) telling R that it's a factor.

    A short factor tends to require more memory than a character vector of the same length, because the cost of storing the factor's attributes more than offsets the saving due to storing integers instead of strings. Here is an extreme example illustrating this point:

    x <- c("a", "b")
    f <- factor(x)
    class(f)
    # [1] "factor"
    unclass(f)
    # [1] 1 2
    # attr(,"levels")
    # [1] "a" "b"

    Storing f requires storing both the integer vector c(1L, 2L) and the character vector c("a", "b"). In this case, the integer vector is completely redundant, because c("a", "b") encodes all of the information we needed in the first place.

    object.size(f)
    # 568 bytes
    object.size(x)
    # 176 bytes
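
    To make the "integer codes plus levels" structure concrete, here is a minimal sketch that takes a factor apart and rebuilds the original character vector from its pieces. The variable names (codes, lv, rebuilt) are only illustrative.

    ```r
    x <- c("a", "b", "b", "a")
    f <- factor(x)

    codes <- as.integer(f)   # the underlying integer codes
    lv    <- levels(f)       # the lookup table of category names

    # Indexing the levels by the codes recovers the original strings
    rebuilt <- lv[codes]

    typeof(f)                # storage type of the factor's data
    identical(rebuilt, x)    # compare the rebuilt vector with the input
    ```

    The codes are all a factor stores per element; the strings live once in the levels attribute.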

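    Factors become the more efficient representation when each level is repeated many times. On a 64-bit build, each element of a factor costs a 4-byte integer code, whereas each element of a character vector costs an 8-byte pointer to a string, so with enough elements the per-element saving outweighs the fixed cost of the attributes.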

    g <- gl(2L, 1e06L, labels = c("a", "b"))
    y <- as.character(g)
    object.size(g)
    # 8000560 bytes
    object.size(y)
    # 16000160 bytes
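
    For beginners, it may help to watch the two sizes change as the vector grows. The following is a rough sketch, not a benchmark: compare_sizes is a hypothetical helper written for this illustration, and the exact byte counts depend on the R version and platform, so none are quoted here.

    ```r
    # Report the size in bytes of a two-level character vector of length n
    # and of the same data stored as a factor.
    compare_sizes <- function(n) {
      x <- rep(c("a", "b"), length.out = n)  # character vector, 2 distinct values
      f <- factor(x)                         # same data as a factor
      c(n         = n,
        character = as.numeric(object.size(x)),
        factor    = as.numeric(object.size(f)))
    }

    # One row per length, with columns n, character, factor
    sizes <- t(sapply(c(2, 10, 100, 1000, 1e5), compare_sizes))
    print(sizes)
    ```

    For the shortest vector the factor should come out larger, and for the longest one it should be roughly half the size of the character vector, in line with the figures above.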

    Some things to keep in mind:

    So, there are many good reasons to prefer factors, even if they are short.