I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:
And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:
DATE | CLIENT NAME | TEST | MEASUREMENT | TESTER |
---|---|---|---|---|
01-01-2023 | JOHN SMITH | METABOLYTE A | 0.01 | PhD. IMA GU |
01-01-2023 | JOHN SMITH | METABOLYTE B | 10 | PhD. IMA GU |
01-01-2023 | JOHN SMITH | PCR | NEGATIVE | PhD. IMA GU |
01-01-2023 | JOHN SMITH | MUTATION | +++ | PhD. IMA GU |
01-01-2023 | ALBUS DUMBLE | PREGNANT | NEGATIVE | TECH. GUIVER |
01-01-2023 | ALBUS DUMBLE | GLUCOSE | 121 | TECH. GUIVER |
02-01-2023 | MAYDAY JUNE | METABOLYTE A | 0.01 | PhD. IMA GU |
02-01-2023 | JOHN SMITH | METABOLYTE A | 0.01 | TECH. GUIVER |
02-01-2023 | JOHN SMITH | METABOLYTE B | 10 | TECH. GUIVER |
02-01-2023 | JOHN SMITH | PCR | NEGATIVE | TECH. GUIVER |
02-01-2023 | JOHN SMITH | MUTATION | +++ | TECH. GUIVER |
So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer()
. Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?
My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.
a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
colNames = FALSE
)
Now we have to define data frame for data storage. Comments like # [2,2] +0, +1
corresponds to row and column of a
(loaded excel).
entry <- data.frame(
NOMBRE = character(), # [2,2] +0, +1
FECHA = character(), # [2,6] +0, +5
MUESTRA = character(), # [3,3] +1, +2
place = character(), # [3,5] +1, +4
COLOR = character(), # [6,3] +4, +2
ASPECTO = character(), # [7,3] +5< +2
DENSIDAD = double(), # [8,3] +6, +2
PH = character(), # [9,3] +7, +2
# ...
LEUCOCITOS = character(),#[19,3] +17, +2
BACTERIAS = character(), # [6,7] +4, +6
PIOCITOS = character() # [7,7] +5, +6
# ...
)
Now we have to find all rows with NOMBRE
nombre_rows <- which(a[,"X1"] == "NOMBRE")
and use it in loop like:
for (i in 1:length(nombre_rows)) {
x <- nombre_rows[i]
nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
for (j in 1:length(nombre_cols)) {
y <- nombre_cols[j]
entry <- data.frame(
NOMBRE = a[x, y+1],
FECHA = a[x, y+5],
MUESTRA = a[x+1, y+2],
place = a[x+1, y+4],
COLOR = a[x+4, y+2],
ASPECTO = a[x+5, y+2],
DENSIDAD = a[x+6, y+2],
PH = a[x+7, y+2],
# ...
LEUCOCITOS = a[x+17, y+2],
BACTERIAS = a[x+4, y+6],
PIOCITOS = a[x+5, y+6]
# ...
) |> rbind(entry)
}
}
And finally the data:
head(entry)
#> NOMBRE FECHA MUESTRA place COLOR ASPECTO
#> 1 RUANO EDITH 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> 2 CUNIN ELVIRA 44957.0 ORINA HOSPĂTALIZACION AMARILLO TURBIO
#> 3 LOACHAMIN MARIA 44957.0 ORINA CEXTERNA AMARILLO TRANSPARENTE
#> 4 MANZANO RAUL 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> 5 MERCHAN IVAN 44957.0 ORINA HIDRATACION ANARANJADO TURBIO
#> 6 ACERO ANTHONY 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> DENSIDAD PH LEUCOCITOS BACTERIAS PIOCITOS
#> 1 1005.0 8.0 NEGATIVO + 1-2/CAMPO
#> 2 1020.0 5.0 +++ ++ CAMPO LLENO
#> 3 1005.0 6.0 NEGATIVO OCASIONALES 1-2/CAMPO
#> 4 1010.0 7.0 NEGATIVO + 3-7/CAMPO
#> 5 1015.0 6.0 ++ ++ 50-60/CAMPO
#> 6 1010.0 5.0 NEGATIVO + 0-2/CAMPO
Created on 2023-06-18 with reprex v2.0.2
You should extend the entry
data frame to grab all variables from your data. And then loop it through all excels you have.