I want to extract certain text and numbers from a PDF invoice, one of which is the total amount. The problem is that the position of the total amount changes from PDF to PDF depending on how many items the invoice contains. If there are a lot of items, the total amount field sits lower in the PDF; if there are fewer items, it sits higher up. See the image below for reference. There are only 2 items in that invoice, so the total field is at a higher position. But I also have invoices with 15 items, where the total field is either lower on the page or on the next page.
How do I extract it then? I tried using Anchor Base, but it is not working!
This is the work I have done till now:
1.) Use a for loop to open each PDF in the folder one by one.
2.) For each PDF, send a hotkey that fits one full page to the window.
3.) Use Anchor Base ("Total" in the image given below is the anchor and the amount is the value to be extracted).
4.) Use a Message Box to print the value.
5.) Close the PDF.
Two potential solutions.
Use UiPath Document Understanding
The Community License includes a certain amount of Document Understanding (DU) data. With it you can set up templates and use anchor bases, token selection, custom area selectors, etc.
Read Lines Approach
Convert the PDF to text. Look through the extracted text and find a phrase or keyword that you could use as your anchor. Going by your example, you might use "Total: ".
Then use Invoke Code (I'll use C# for the example below).
Arguments: in_text (the text from the PDF) | out_totalAmount
Code:
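// in_text holds the extracted text itself (not a file path), so split it into lines directly.
var invoiceTotal = in_text.Split('\n').Select(l => l.Trim()).LastOrDefault(l => l.StartsWith("Total:"));
// Keep the part after the last colon; the result is null when no "Total:" line was found.
out_totalAmount = invoiceTotal == null ? null : invoiceTotal.Split(':').Last().Trim();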
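If the label is not always written exactly as "Total: " (extra spaces, a currency symbol, different casing), a regular expression may be more forgiving than StartsWith. Below is a minimal sketch of that idea, not a tested UiPath workflow. The class name InvoiceTotals, the method name ExtractTotal and the assumed number format (for example 1,234.50) are hypothetical and would need adjusting to your invoices.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceTotals
{
    // Assumed line shape: "Total", an optional colon, an optional currency
    // symbol, then a number such as 1,234.50. Lines like "Subtotal: ..." are
    // skipped because "Total" must be the first word on the line.
    private static readonly Regex TotalLine = new Regex(
        @"^\s*Total\s*:?\s*[^\d\s-]?\s*(-?\d[\d,]*(?:\.\d+)?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the amount from the last matching line, or null if none is found.
    // Taking the last match covers invoices whose total lands on a later page.
    public static string ExtractTotal(string pdfText)
    {
        if (string.IsNullOrEmpty(pdfText)) return null;

        MatchCollection matches = TotalLine.Matches(pdfText);
        if (matches.Count == 0) return null;

        return matches[matches.Count - 1].Groups[1].Value;
    }
}
```

Inside Invoke Code you would only need the body of ExtractTotal, with in_text in place of pdfText and the result assigned to out_totalAmount. Because the match does not depend on where the line sits on the page, the number of items on the invoice should not matter.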