rr-factor

Factor from numeric vector drops every 100,000th element from its levels


Consider a vector of type numeric with over 100,000 elements. In the example below, it's simply the range 1:500001.

n <- 500001
arr <- as.numeric(1:n)

The following sequence of factor calls causes odd behaviour:

First call factor with the levels argument specified as the exact same range that arr was defined with. Predictably, the resulting variable has exactly n levels:

> tmp <- factor(arr, levels=1:n)
> nlevels(tmp)
[1] 500001

Now call factor again on the result from before. The outcome is that the new value, tmp2, is missing some values from its levels:

> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 499996 

Checking which items are missing, we find it's every 100,000th element (in this case, each element's value equals its index):

> which(!levels(tmp) %in% levels(tmp2))
[1] 100000 200000 300000 400000 500000 

Decreasing n below 100,000 eliminates this unexpected behaviour. However, it occurs for any n >= 100,000.

> n <- 99999
> arr <- as.numeric(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 99999
> which(!levels(tmp) %in% levels(tmp2))
integer(0)

This also does not happen when the arr vector has a type other than numeric:

> n <- 500001
> arr <- as.integer(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 500001

Finally, the problem does not occur when the levels argument is left unspecified in the first call to factor().

What could be causing this behaviour? Tested in R 4.3.2.


Solution

  • Building on ThomasIsCoding's answer, this is caused by the scientific notation rule, which applies to real (double) numbers but not to integers.

    For example, in the console...

    > options(scipen = 0)  # default: use scientific notation when it is shorter than fixed notation
    > 500000L              # integer: printed in fixed notation
    [1] 500000
    > 500000               # numeric: printed in the shorter scientific notation
    [1] 5e+05
    

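    factor() converts x to character before matching it against the levels. Each numeric multiple of 100000 therefore becomes a string such as "1e+05", while the corresponding integer level becomes "100000". The two strings do not match, so the first call stores NA for that element, and the second call to factor() then drops the now-unused level.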
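
    The mismatch can be reproduced on three values rather than 500001. This is only a small sketch, assuming a default session (scipen = 0); the variable names x and f are made up for illustration.

    ```r
    # Character forms of the same value, depending on its type
    as.character(100000)     # double:  "1e+05"
    as.character(100000L)    # integer: "100000"

    x <- c(99999, 100000, 100001)           # doubles, as in the question
    f <- factor(x, levels = 99999:100001)   # integer levels

    is.na(f)             # FALSE  TRUE FALSE: 100000 found no matching level
    nlevels(f)           # 3: the level "100000" exists but is unused
    nlevels(factor(f))   # 2: re-factoring drops the unused level
    ```

    So the first call already loses the value silently; the second call only makes the loss visible in nlevels().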

    The problem can be solved by increasing scipen.

    I thought scipen was primarily to control displayed values, so it is odd that it is being used for factor levels.
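
    If changing a global option feels too heavy-handed, one possible alternative (not part of the answer above, just a sketch) is to convert the data to integer before calling factor(), so that both x and the levels get the same character form. This assumes every value is a whole number that fits in R's integer range.

    ```r
    n <- 500001
    arr <- as.numeric(1:n)

    # Assumption: all values are whole numbers within the integer range,
    # so as.integer() loses no information.
    tmp <- factor(as.integer(arr), levels = 1:n)
    tmp2 <- factor(tmp)

    anyNA(tmp)                      # FALSE: every value matched a level
    nlevels(tmp2) == nlevels(tmp)   # TRUE: no levels were dropped
    ```

    This mirrors the integer case shown in the question, where the problem does not appear.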