python monkeypatching pdf-extraction

Overwrite a property in a used (but not imported) Class


I am using the fitz/PyMuPDF and pdf2docx Python packages to read tables from PDF files so that I can extract the data and model it appropriately for storage in a data lake.

The Converter class in pdf2docx seems to have really solid configuration options that let it find all tables and read them very cleanly. The one issue is that it occasionally replaces something, which I guess it takes for a nested table, with the string <NESTED TABLE>. I found this issue reported against it and have successfully replicated the fix on my local machine by editing the Cell class in the Cell.py file in the table directory of pdf2docx. However, I would like to use this in a Cloud Function on GCP without uploading the library's source code to the function as files. Is there any way to install pdf2docx via my requirements.txt file and then redefine the Cell.text property in my own code?


Solution

  • Apparently you can monkey patch the class from a script or submodule of your own. I did this at the top of the file from which I know the Cell.text property will eventually be called:

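    import numpy
    from pdf2docx.table.Cell import Cell  # Cell.py in pdf2docx's table directory

    def text(self):
        '''Text contained in this cell.'''
        if not self: return None
        # NOTE: a sub-table may exist in the cell

        text = []

        for block in self.blocks:
            if block.is_text_block:
                text.append(block.text)
            elif block.__class__.__name__ == "TableBlock":
                # Flatten the nested table's cell texts into a single string
                text.append(''.join(numpy.array(block.text).flatten()))
            else:
                text.append("<NESTED TABLE>")
        return '\n'.join(text)


    # Replace the library's property on the class with the patched version
    Cell.text = property(text)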
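
The reason this works without touching the installed package is that a property lives on the class, not on each instance, so reassigning it changes what every Cell object returns, including ones the library creates internally. Below is a minimal sketch of that mechanism. The Cell class and make_cells function here are hypothetical stand-ins written for illustration, not pdf2docx's real code, and nested tables are modelled as plain lists of rows.

```python
class Cell:
    """Hypothetical stand-in for a library class (not pdf2docx's real Cell)."""

    def __init__(self, blocks):
        self.blocks = blocks

    @property
    def text(self):
        # Original behaviour: anything that is not plain text becomes a placeholder.
        return "\n".join(
            b if isinstance(b, str) else "<NESTED TABLE>" for b in self.blocks
        )


def make_cells():
    """Stands in for library code that builds Cell objects internally."""
    return [Cell(["a", [["b", "c"], ["d", "e"]]])]


def patched_text(self):
    parts = []
    for b in self.blocks:
        if isinstance(b, str):
            parts.append(b)
        else:
            # Flatten the nested rows into a single string.
            parts.append("".join(item for row in b for item in row))
    return "\n".join(parts)


cells = make_cells()                 # instance created before the patch
before = cells[0].text               # uses the original property

Cell.text = property(patched_text)   # monkey patch the class attribute

after = cells[0].text                # same instance now uses the patched property
```

Because the lookup goes through the class, the patch only needs to run once, before the first access to text; placing it at module import time, as in the answer above, is one simple way to guarantee that.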