I would like to generate unique IDs for rows in my database. I will be adding entries to this database on an ongoing basis so I'll need to generate new IDs in tandem. While my database is relatively small and the chance of duplicating random IDs is minuscule, I still want to build in a programmatic fail-safe to ensure that I never generate an ID that has already been used in the past.
For starters, here are some sample data that I can use to start an example database:
library(tidyverse)
library(ids)
library(babynames)
database <- data.frame(rid = random_id(5, 5), first_name = sample(babynames$name, 5))
print(database)
rid first_name
1 07282b1da2 Sarit
2 3c2afbb0c3 Aly
3 f1414cd5bf Maedean
4 9a311a145e Teriana
5 688557399a Dreyton
And here is some sample data that I can use to represent new data that will be appended to the existing database:
new_data <- data.frame(first_name = sample(babynames$name, 5))
print(new_data)
first_name
1 Hamzeh
2 Mahmoud
3 Matelyn
4 Camila
5 Renae
Now, what I want is to bind a new column of randomly generated IDs using the random_id function while simultaneously checking that the newly generated IDs don't match any existing IDs in the database object. If the generator created an identical ID, then ideally it would generate a replacement until a truly unique ID is created.
Any help would be much appreciated!
UPDATE
I've thought of a possibility that helps but is still limited. I could generate new IDs and then use a for() loop to test whether any of the newly generated IDs are present in the existing database. If so, I would regenerate that ID. For example...
new_data$rid <- random_id(nrow(new_data), 5)
for (i in seq_len(nrow(new_data))) {
  if (new_data$rid[i] %in% database$rid) {
    new_data$rid[i] <- random_id(1, 5)
  }
}
The problem with this approach is that the replacement ID is itself never re-checked, so I would need an endless stream of nested if statements to keep testing each newly generated value against the original database. I need a process that keeps testing until a value not found in the original database is generated.
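What I'm imagining is something like the following while() loop, sketched here with hard-coded copies of my example data so the snippet runs on its own. Each pass regenerates only the colliding IDs (including any that collide with each other) and re-checks until none remain:

```r
library(ids)

# Hard-coded stand-ins for the example objects above
database <- data.frame(
  rid = random_id(5, 5),
  first_name = c("Sarit", "Aly", "Maedean", "Teriana", "Dreyton")
)
new_data <- data.frame(
  first_name = c("Hamzeh", "Mahmoud", "Matelyn", "Camila", "Renae")
)

new_data$rid <- random_id(nrow(new_data), 5)

# TRUE wherever a new ID clashes with the database or repeats within new_data
bad <- new_data$rid %in% database$rid | duplicated(new_data$rid)
while (any(bad)) {
  # regenerate only the colliding IDs, then re-check
  new_data$rid[bad] <- random_id(sum(bad), 5)
  bad <- new_data$rid %in% database$rid | duplicated(new_data$rid)
}
```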
Use of ids::uuid() would likely preclude having to check for duplicate ID values at all. In fact, if you were to generate 10 trillion UUIDs, there would be something on the order of a 0.00000006 chance of two of them being the same, per What is a UUID?
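For example, uuid() from the same package returns version-4 UUIDs directly, one per element:

```r
library(ids)

# Each version-4 UUID carries 122 random bits, so collisions are
# negligible for any realistic table size
uuid(3)
```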
Here is a base function that will quickly check for duplicate values without needing to do any iteration:
anyDuplicated(1:4)
[1] 0
anyDuplicated(c(1:4,1))
[1] 5
The first result above shows there are no duplicate values; the second shows that element 5 is a duplicate, since 1 appears twice. Below is how to check without iterating. For demonstration, new_data$rid is first set to a copy of database$rid, so all five values are duplicates. The repeat loop runs until every rid is unique, but note that it presumes the existing database$rid values are themselves already unique.
library(ids)
set.seed(7)
new_data$rid <- database$rid  # force collisions for demonstration
repeat {
  # anyDuplicated() returns 0 if all values are unique, otherwise the
  # index of the first duplicate in the combined vector
  duplicates <- anyDuplicated(c(database$rid, new_data$rid))
  if (duplicates == 0L) {
    break
  }
  # map the index in the combined vector back to the row of new_data
  new_data$rid[duplicates - nrow(database)] <- random_id(1, 5)
}
All new_data$rid values have been replaced with unique values.
rbind(database, new_data)
rid first_name
1 07282b1da2 Sarit
2 3c2afbb0c3 Aly
3 f1414cd5bf Maedean
4 9a311a145e Teriana
5 688557399a Dreyton
6 52f494c714 Hamzeh
7 ac4f522860 Mahmoud
8 ffe74d535b Matelyn
9 e3dccc4a8e Camila
10 e0839a0d34 Renae
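The repeat loop above can be wrapped in a small helper so the check runs on every append. This is a sketch (add_unique_ids is a name I made up, not part of the ids package), and it carries the same caveat that the existing IDs must themselves be unique:

```r
library(ids)

# Assign random IDs to new rows, regenerating any that collide with
# existing_ids or with each other, then return the rows ready to rbind()
add_unique_ids <- function(new_rows, existing_ids, bytes = 5) {
  new_rows$rid <- random_id(nrow(new_rows), bytes)
  repeat {
    dup <- anyDuplicated(c(existing_ids, new_rows$rid))
    if (dup == 0L) break
    new_rows$rid[dup - length(existing_ids)] <- random_id(1, bytes)
  }
  new_rows
}

# Usage with the question's objects:
# database <- rbind(database, add_unique_ids(new_data, database$rid))
```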