Tags: python, nan, data-cleaning, tabula, tabula-py

How do I remove 'NaN' values while reading a PDF using tabula in Python?


I am using tabula-py to read my class timetable PDF in Python, and the returned value 'data' contains a lot of 'NaN' values that I cannot seem to clean. Can someone suggest a solution? Should I be using something other than tabula-py? I've attached a link to a picture of the PDF. I have redacted some information from the PDF for privacy.

My code is as follows:

import tabula


class ClassTimetable:
    def __init__(self, filename):
        self.filename = filename

    def read_data(self):
        # read_pdf returns a list of pandas DataFrames, one per detected table
        data = tabula.read_pdf(self.filename, pages='all')
        # data1 = tabula.convert_into(self.filename, output_format="csv", output_path='file.csv')
        print(data)

My output is as follows:

[                                     Course Course Regn.  ... Unnamed: 2     Room
0                                Code Title Credit  Type  ...   GCR Code      No.
1                                     Critical and   NaN  ...        NaN      NaN
2                             1 18PDM202L Creative     0  ...         A-  wubaing
3                                  Thinking Skills   NaN  ...   ISOLATED      NaN
4                                       Management   NaN  ...        NaN      NaN
5                       2 18PDH102T Principles for     2  ...         A-      NaN
6                                        Engineers   NaN  ...   COMBINED      NaN
7   Professional Lab3 18EEC206J Analog Electronics     4  ...          B   boc5om
8                                      Generation,   NaN  ...        NaN      NaN
9                     4 18EEC208T Transmission & 3   NaN  ...        NaN      NaN
10                                    Distribution   NaN  ...          C  4qjaetp
11                                       Numerical   NaN  ...        NaN      NaN
12               5 18MAB202T Methods for Engineers     4  ...          D  vvbxlqp
13              6 18EEC205J Electrical Machines II     4  ...          E  drcfega
14                             7 18BTB101T Biology     2  ...          F      NaN
15                                  Electrical and   NaN  ...        NaN      NaN
16                                     Electronics   NaN  ...        NaN      NaN
17                    8 18EEC207J Measurements and     4  ...          G   koed72
18                                 Instrumentation   NaN  ...        NaN      NaN
19              9 18EEC205J Electrical Machines II     4  ...     P7-P8-  drcfega
20                                             NaN   NaN  ...        NaN      NaN
21                 10 18EEC206J Analog Electronics     4  ...     P3-P4-   boc5om
22                                  Electrical and   NaN  ...        NaN      NaN
23                                     Electronics   NaN  ...        NaN      NaN
24                       11 18EEC207J Measurements     4  ...        NaN      NaN
25                                             and   NaN  ...   P19-P20-      NaN
26                                 Instrumentation   NaN  ...        NaN      NaN
27                                        Total 23   NaN  ...        NaN      NaN

[28 rows x 8 columns]]

Also, what does the '...' in the output mean?


Solution

  • I figured it out. The problem was that the library was not reading the separations between the lines properly, so I set 'lattice=True'. That solved about half of the problem, and I realised that the program needed more specific instructions.
    I downloaded Tabula for Windows and used it to find the coordinates of the entire table and of the individual columns, then passed those values to tabula-py through the 'area=' and 'columns=' options. Using both options is probably overkill, but after converting to .csv all my data is neatly placed in separate columns with no 'NaN' values. My code is below:

        import tabula

        class ClassTimetable:
            def __init__(self, filename):
                self.filename = filename

            def read_data(self):
                # area is [top, left, bottom, right]; columns are the x coordinates of the column boundaries
                data = tabula.read_pdf(self.filename, pages='all', area=[162.498,141.6,546.248,538.736],
                                       columns=[140.55, 172.53, 217.161, 277.400, 300.454, 339.127, 384.492, 419.446,
                                                491.585, 542.157], lattice=True)

                # convert_into writes the CSV file and returns None, so its result is not assigned
                tabula.convert_into(self.filename, output_format="csv",  area=[162.498,141.6,546.248,538.736],
                                    columns=[140.559, 172.538, 217.161, 277.400, 300.454, 339.127, 384.492, 419.446,
                                             491.585, 542.157], lattice=True, output_path='file2.csv')
                return data
    

    The output is as follows:

    [    Unnamed: 0 Course\rTitle  ...                               Slot      GCR Code
    0          1.0     18PDM202L  ...  Mr. R. Prathap\rChandran (102275)  A-\rISOLATED
    1          2.0     18PDH102T  ...     Mr. Nizamudeen\rAnvar (102293)  A-\rCOMBINED
    2          3.0     18EEC206J  ...  Dr.T.M.Thamizh\rThentral (101436)             B
    3          4.0     18EEC208T  ...          Dr.S.Vidyasagar\r(100597)             C
    4          5.0     18MAB202T  ...            Dr. M. Suresh\r(101984)             D
    5          6.0     18EEC205J  ...     Dr. K. M, Ravi\rEswar (102699)             E
    6          7.0     18BTB101T  ...               Mr.T.Anand\r(100034)             F
    7          8.0     18EEC207J  ...        Mr.S.Raghavendran\r(102704)             G
    8          9.0     18EEC205J  ...     Dr. K. M, Ravi\rEswar (102699)        P7-P8-
    9         10.0     18EEC206J  ...  Dr.T.M.Thamizh\rThentral (101436)        P3-P4-
    10        11.0     18EEC207J  ...        Mr.S.Raghavendran\r(102704)      P19-P20-
    11         NaN            23  ...                                NaN           NaN
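The output above still contains '\r' line breaks inside cells and a final totals row that is mostly 'NaN'. These can probably be tidied up with pandas after extraction. The following is only a rough sketch: the sample frame is a hypothetical stand-in built from a few values in the output above, and `clean_table` and its `key_column` parameter are names made up for illustration, not part of tabula-py.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one DataFrame returned by tabula.read_pdf();
# headers and values are copied from the printed output above.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [1.0, 2.0, np.nan],
        "Course\rTitle": ["18PDM202L", "18PDH102T", "23"],
        "Slot": [
            "Mr. R. Prathap\rChandran (102275)",
            "Mr. Nizamudeen\rAnvar (102293)",
            np.nan,
        ],
        "GCR Code": ["A-\rISOLATED", "A-\rCOMBINED", np.nan],
    }
)


def clean_table(df, key_column):
    """Flatten in-cell line breaks and drop empty or key-less rows (illustrative helper)."""
    # Replace carriage returns inside string cells with a space.
    cleaned = df.replace(r"\r", " ", regex=True)
    # Do the same for the column headers.
    cleaned.columns = cleaned.columns.str.replace("\r", " ", regex=False)
    # Drop rows and columns that hold nothing but NaN.
    cleaned = cleaned.dropna(how="all").dropna(axis=1, how="all")
    # Drop rows without a value in the key column (here, the totals row).
    cleaned = cleaned.dropna(subset=[key_column])
    return cleaned.reset_index(drop=True)


cleaned = clean_table(raw, key_column="Unnamed: 0")
print(cleaned)
```

Since `read_pdf` returns a list of DataFrames, something like `[clean_table(t, "Unnamed: 0") for t in data]` would apply this to every table, assuming each one has that column.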
    

    I still don't know what the '...' in the output means.
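As for the '...': this is most likely just pandas shortening the printed table, not anything tabula put in the data. When a DataFrame has more columns than fit the display settings, its printed form hides the middle columns behind '...' and adds a summary such as '[28 rows x 8 columns]'. The sketch below uses a made-up wide frame (the `col_N` names are placeholders) to show the effect and one way to turn it off.

```python
import pandas as pd

# Hypothetical wide table standing in for a DataFrame returned by tabula.
wide = pd.DataFrame({f"col_{i}": ["some cell text"] * 3 for i in range(12)})

# With a cap on displayed columns, pandas hides the middle ones behind "...".
with pd.option_context("display.max_columns", 4):
    truncated = repr(wide)

# With no cap (None) and a generous width, every column is printed.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full = repr(wide)

print(truncated)
print(full)
```

To change this for a whole script, `pd.set_option("display.max_columns", None)` before printing should have the same effect. Printing a single table with `print(data[0].to_string())` also shows every column.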