I have a specific requirement of writing selected attributes into an excel-sheet from the given PDF. The PDF is around 400 pages long hence I thought of automating this process. The screenshot of the PDF data schema of a sample page is shared below:
I am able to scrape the data on a given PDF page however I have to write down the specific value of the attributes of the data into excel-sheet under particular columns. The desired columns are shown in below table:
The sample screenshot of the text file I get post data scraping using "UiPath Document OCR" is shared below, yellow fields are repeated per company details in the document and some fields may be missing for a given company (OEM):
I need to write the value of the attributes per company in Excel file in the tabular schema shown above. Assuming I need the data only where the company name matches with the set of company name list, what is the best way to accomplish the same? The above image highlights the attributes desired with yellow color whose value I need to record in excel-sheet.
Note: The above data scraping was done with "UiPath Document OCR", with OCR engine as "Tesseract OCR" I get tidier look with appropriate line breaks where required as shown below.
In UiPath, you can use the Find Regex Matches activity, or you can write your own Regex statement to find the matches according to how each attribute will be structured in the mess of text. From the array of matches, you can simply match them up in sequential order in your Excel file.
The Regex Builder in UiPath is helpful in creating and testing your Regex (link here). If you haven't used it yet, UiPath Academy has a nice course with helpful instructions for using their Regex builder. Or, I like to build my own Regex statements and test them on RegexR.