python pdf tabula

tabula extract table from pdf remove line break


I have a table with wrapped text in a pdf file

enter image description here

I used tabula to extract table from the pdf file

import tabula

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1, lattice=True)
table[0]

However, the end result looks like this:

enter image description here

Is there a way to have tabula treat a line break or wrapped text inside a table cell as part of its own row, not as extra rows?

The end result, using tabula, should look like this:

enter image description here


Solution

  • You need to add a parameter. Replace

    file1 = "path_to_pdf_file"
    table = tabula.read_pdf(file1, pages=1)
    table[0]
    

    with

    file1 = "path_to_pdf_file"
    table = tabula.read_pdf(file1, pages=1, lattice=True)
    table[0]
    

    This follows the tabula documentation.

    Here is an example:

    See the article "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"

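    import tabula
    
    # Local copy of the article linked above
    file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
    # lattice=True makes tabula use the ruling lines between cells to find cell boundaries
    table = tabula.read_pdf(file1, pages=3, lattice=True)
    
    # read_pdf returns a list of DataFrames; take the first table on the page
    df = table[0]
    # Drop the columns that are not needed for this illustration
    df = df.drop(['Unnamed: 1', 'Unnamed: 2', 'Description', 'Unnamed: 3'], axis=1)
    df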
    

    returns:

         Unnamed: 0  \
    0                                    NaN   
    1                        Spectrum effect   
    2                           Context bias   
    3                         Selection bias   
    4                                    NaN   
    5            Variation in test execution   
    6           Variation in test technology   
    7                      Treatment paradox   
    8               Disease progression bias   
    9                                    NaN   
    10     Inappropriate reference\rstandard   
    11        Differential verification bias   
    12             Partial verification bias   
    13                                   NaN   
    14                           Review bias   
    15                  Clinical review bias   
    16                    Incorporation bias   
    17                  Observer variability   
    18                                   NaN   
    19    Handling of indeterminate\rresults   
    20  Arbitrary choice of threshold\rvalue   
    
                                Source of Systematic Bias  
    0                                          Population  
    1   Tests may perform differently in various sampl...  
    2   Prevalence of the target condition varies acco...  
    3   The selection process determines the compositi...  
    4                Test Protocol: Materials and Methods  
    5   A sufficient description of the execution of i...  
    6   When the characteristics of a medical test cha...  
    7   Occurs when treatment is started on the basis ...  
    8   Occurs when the index test is performed an unu...  
    9       Reference Standard and Verification Procedure  
    10  Errors of imperfect reference standard bias th...  
    11  Part of the index test results is verified by ...  
    12  Only a selected sample of patients who underwe...  
    13                                     Interpretation  
    14  Interpretation of the index test or reference ...  
    15  Availability of clinical data such as age, sex...  
    16  The result of the index test is used to establ...  
    17  The reproducibility of test results is one det...  
    18                                           Analysis  
    19  A medical test can produce an uninterpretable ...  
    20  The selection of the threshold value for the i...  
    

    The three dots in the Source of Systematic Bias column are only pandas truncating the display. Everything that was in each cell, line breaks included, is considered a single cell (item), not multiple cells. Further evidence of that:

    df.iloc[2,1]
    

    returns the cell content:

    'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
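
    Since the cell above still contains \r characters where the text wrapped in the PDF, you may also want to strip them after extraction. Here is a minimal sketch, assuming the breaks show up as \r or \n as in the output above; clean_cell is a hypothetical helper name of mine, not part of tabula:

```python
import re

def clean_cell(value):
    """Replace line breaks inside a cell with single spaces (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # leave NaN and numeric cells untouched
    # Collapse each run of \r / \n (plus surrounding whitespace) into one space
    return re.sub(r"\s*[\r\n]+\s*", " ", value).strip()

cell = "Inappropriate reference\rstandard"
print(clean_cell(cell))  # Inappropriate reference standard
```

    To clean a whole DataFrame, the helper could be applied to every cell with df.applymap(clean_cell) (or df.map(clean_cell) on pandas 2.1 and later).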
    

    There must be something particular about your PDF. If it's available online, share the link and I'll take a look.