I want to read a .dbf file in R that contains "deleted" rows. However, I don´t know how to avoid reading all records. I am using the package "foreign" and the function read.dbf
.
According to the .dbf file description, each record begins with a 1-byte "deletion" flag. The byte's value is a space (0x20), if the record is active, or an asterisk (0x2A), if the record is deleted.
How can I extract this information using R? E.g. for a small sample of the iris
data set saved as a .dbf file:
library(foreign)
dbf.file <- 'iris.dbf'
write.dbf(iris[1:5, ], file=dbf.file)
We can use the readBin()
function to read the .dbf file as binary data.
# read binary data
bin <- readBin(dbf.file, what='raw', n=file.size(dbf.file))
Then, based on the .dbf format description, we can read information necessary to navigate ourselves to the first byte of each record. I use a custom function to convert the appropriate bytes from the .dbf header into an unsigned integer.
bin_to_int <- function(bin) {
if (.Platform$endian == 'big') {
bin <- rev(bin)
}
sum(2^(which(as.logical(rawToBits(bin))) - 1))
}
# number of records
n <- bin_to_int(bin[5:8])
# numer of bytes in the header
header.len <- bin_to_int(bin[9:10])
# numer of bytes in the record
record.len <- bin_to_int(bin[11:12])
With these, it is possible to compute what are the records' first bytes and see if they mark the record as deleted or not.
# first bytes of records
record.starts <- header.len + 1 + (seq_len(n)-1) * record.len
is.deleted <- bin[record.starts] == 0x2A
is.deleted
# [1] FALSE FALSE FALSE FALSE FALSE
Indeed, none of the records were marked as deleted so we can at least check if the bytes hold the expected value of 0x20
:
bin[record.starts]
# [1] 20 20 20 20 20
On a side note, from documentation it is not clear how read.dbf()
treats the deleted records so chances are it ignores them and you won't have to deal with this issue at all. It would be interesting to know this so please let us know in the comments.