pythonrtext-miningnon-relational-database

How to detect and tabulate data from an excel file?


I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:

Non-relational data, available at Google Drive

And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:

DATE CLIENT NAME TEST MEASUREMENT TESTER
01-01-2023 JOHN SMITH METABOLYTE A 0.01 PhD. IMA GU
01-01-2023 JOHN SMITH METABOLYTE B 10 PhD. IMA GU
01-01-2023 JOHN SMITH PCR NEGATIVE PhD. IMA GU
01-01-2023 JOHN SMITH MUTATION +++ PhD. IMA GU
01-01-2023 ALBUS DUMBLE PREGNANT NEGATIVE TECH. GUIVER
01-01-2023 ALBUS DUMBLE GLUCOSE 121 TECH. GUIVER
02-01-2023 MAYDAY JUNE METABOLYTE A 0.01 PhD. IMA GU
02-01-2023 JOHN SMITH METABOLYTE A 0.01 TECH. GUIVER
02-01-2023 JOHN SMITH METABOLYTE B 10 TECH. GUIVER
02-01-2023 JOHN SMITH PCR NEGATIVE TECH. GUIVER
02-01-2023 JOHN SMITH MUTATION +++ TECH. GUIVER

So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer(). Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?


Solution

  • My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.

    a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
                        colNames = FALSE
                        )
    

    Now we have to define data frame for data storage. Comments like # [2,2] +0, +1 corresponds to row and column of a (loaded excel).

    entry <- data.frame(
      NOMBRE = character(),    # [2,2] +0, +1
      FECHA = character(),     # [2,6] +0, +5
      MUESTRA = character(),   # [3,3] +1, +2
      place = character(),     # [3,5] +1, +4
      COLOR = character(),     # [6,3] +4, +2
      ASPECTO = character(),   # [7,3] +5< +2
      DENSIDAD = double(),     # [8,3] +6, +2
      PH = character(),        # [9,3] +7, +2
    # ...
      LEUCOCITOS = character(),#[19,3] +17, +2
      BACTERIAS = character(), # [6,7] +4, +6
      PIOCITOS = character()   # [7,7] +5, +6
    # ...
    )
    

    Now we have to find all rows with NOMBRE

    nombre_rows <- which(a[,"X1"] == "NOMBRE")
    

    and use it in loop like:

    for (i in 1:length(nombre_rows)) {
      x <- nombre_rows[i]
      nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
      for (j in 1:length(nombre_cols)) {
        y <- nombre_cols[j]
    
        entry <- data.frame(
          NOMBRE = a[x, y+1],
          FECHA = a[x, y+5],
          MUESTRA = a[x+1, y+2],
          place = a[x+1, y+4],
          COLOR = a[x+4, y+2],
          ASPECTO = a[x+5, y+2],
          DENSIDAD = a[x+6, y+2],
          PH = a[x+7, y+2],
          # ...
          LEUCOCITOS = a[x+17, y+2],
          BACTERIAS = a[x+4, y+6],
          PIOCITOS = a[x+5, y+6]
          # ...
        ) |> rbind(entry)
      }
    }
    

    And finally the data:

    head(entry)
    #>             NOMBRE   FECHA MUESTRA            place      COLOR      ASPECTO
    #> 1      RUANO EDITH 44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
    #> 2    CUNIN ELVIRA  44957.0   ORINA HOSPÌTALIZACION   AMARILLO       TURBIO 
    #> 3 LOACHAMIN MARIA  44957.0   ORINA         CEXTERNA  AMARILLO  TRANSPARENTE
    #> 4    MANZANO RAUL  44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
    #> 5    MERCHAN IVAN  44957.0   ORINA      HIDRATACION ANARANJADO      TURBIO 
    #> 6   ACERO ANTHONY  44957.0   ORINA         CEXTERNA   AMARILLO  LIG.TURBIO 
    #>   DENSIDAD  PH LEUCOCITOS    BACTERIAS     PIOCITOS
    #> 1   1005.0 8.0   NEGATIVO            +    1-2/CAMPO
    #> 2   1020.0 5.0        +++           ++ CAMPO LLENO 
    #> 3   1005.0 6.0   NEGATIVO OCASIONALES    1-2/CAMPO 
    #> 4   1010.0 7.0   NEGATIVO            +   3-7/CAMPO 
    #> 5   1015.0 6.0         ++           ++ 50-60/CAMPO 
    #> 6   1010.0 5.0   NEGATIVO            +   0-2/CAMPO
    

    Created on 2023-06-18 with reprex v2.0.2

    You should extend the entry data frame to grab all variables from your data. And then loop it through all excels you have.