Consider a vector of type numeric with over 100,000 elements. In the example below, it's simply the range 1:500001.
n <- 500001
arr <- as.numeric(1:n)
The following sequence of factor calls causes odd behaviour:
First, call factor with the levels argument specified as the exact same range that arr was defined with. Predictably, the resulting variable has exactly n levels:
> tmp <- factor(arr, levels=1:n)
> nlevels(tmp)
[1] 500001
Now call factor again on the result from before. The new value, tmp2, is missing some values from its levels:
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 499996
Checking which items are missing, we find it's every 100,000th element (each of which, in this case, has a value equal to its index):
> which(!levels(tmp) %in% levels(tmp2))
[1] 100000 200000 300000 400000 500000
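As a further diagnostic, here is a minimal sketch that looks at the first factor rather than the second. It assumes the default scipen of 0 (set explicitly below), and the names na_pos and unused_levels are only illustrative. It checks whether any elements of tmp failed to match a level, and which levels end up with no elements at all.

```r
# Assumption: default scipen; set explicitly so the sketch is reproducible
old <- options(scipen = 0)

n <- 500001
arr <- as.numeric(1:n)
tmp <- factor(arr, levels = 1:n)

# Positions that did not match any level in the first factor() call
na_pos <- which(is.na(tmp))
print(na_pos)

# Levels that no element uses (tabulate() ignores NA codes)
counts <- tabulate(as.integer(tmp), nbins = nlevels(tmp))
unused_levels <- levels(tmp)[counts == 0]
print(unused_levels)

options(old)  # restore the previous setting
```

If the same positions show up as NA in tmp, the second factor() call is only dropping levels that were already unused after the first call.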
Decreasing n to below 100,000 eliminates this unexpected behaviour. However, it occurs for any n >= 100,000.
> n <- 99999
> arr <- as.integer(1:n)
> tmp <- factor(arr)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 99999
> which(!levels(tmp) %in% levels(tmp2))
integer(0)
This also does not happen when the arr vector has a type other than numeric:
> n <- 500001
> arr <- as.integer(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 500001
Finally, the problem does not occur when the levels argument is left unspecified in the first call to factor().
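For completeness, a short sketch of that last case, again assuming the default scipen of 0 (set explicitly here). The variable names mirror the ones used above, and has_sci_label is just an illustrative name.

```r
# Assumption: default scipen; set explicitly so the sketch is reproducible
old <- options(scipen = 0)

n <- 500001
arr <- as.numeric(1:n)

# No levels argument: factor() derives the levels from arr itself
tmp <- factor(arr)
tmp2 <- factor(tmp)

print(nlevels(tmp))
print(nlevels(tmp2))

# Check how the multiples of 100000 are labelled in this case
has_sci_label <- "5e+05" %in% levels(tmp)
print(has_sci_label)

options(old)  # restore the previous setting
```

Here both the values and the levels come from the same numeric vector, so they are converted to character in the same way and no level goes unused.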
What could be causing this behaviour? Tested in R 4.3.2.
Building on ThomasIsCoding's answer, it is due to the scientific notation rule applying to real numbers, but not applying to integers...
For example, in the console...
options(scipen = 0) #uses scientific notation if fewer characters than normal
500000L
[1] 500000 #integer displayed in normal notation
500000
[1] 5e+05 #numeric displayed in shorter scientific notation
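So the character representations of the numeric values fail to match the factor levels at each multiple of 100000: the numeric value becomes "5e+05", while the level built from the integer range is "500000". Those elements end up as NA in tmp, and the second factor() call then drops the now unused levels.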
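The mismatch can be seen directly with as.character() and match(), which is roughly what factor() does internally when matching values against levels. This is a small sketch under the default scipen of 0; the variable names are only illustrative.

```r
# Assumption: default scipen; set explicitly so the sketch is reproducible
old <- options(scipen = 0)

int_chr <- as.character(500000L)  # integer converted to character
dbl_chr <- as.character(500000)   # double converted to character
print(int_chr)
print(dbl_chr)

# Levels built from an integer range, as in factor(arr, levels = 1:n)
lev <- as.character(1:500001)

# Look up each character form among the levels
pos_int <- match(int_chr, lev)
pos_dbl <- match(dbl_chr, lev)
print(pos_int)
print(pos_dbl)

options(old)  # restore the previous setting
```

The integer form is found at position 500000, while the double form is not found at all, which is why that element has no level in the first factor.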
The problem can be solved by increasing scipen.
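A minimal sketch of that workaround, using the same example as in the question. The value 999 is an arbitrary large penalty chosen for illustration, not a recommended setting, and the previous option is restored afterwards.

```r
# Assumption: any sufficiently large scipen works; 999 is an arbitrary choice
old <- options(scipen = 999)

n <- 500001
arr <- as.numeric(1:n)
tmp <- factor(arr, levels = 1:n)
tmp2 <- factor(tmp)

print(nlevels(tmp2))

# Levels of tmp that are absent from tmp2
missing_idx <- which(!levels(tmp) %in% levels(tmp2))
print(missing_idx)

options(old)  # restore the previous setting
```

Since scipen is a global option, setting it only around the affected calls and restoring it afterwards keeps the change from leaking into other output. Converting arr to integer first, as in the question, avoids relying on the option at all.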
I thought scipen was primarily to control displayed values, so it is odd that it is being used for factor levels.