I'm working on a project at work to study whether we could automatically extract certain fields of information (e.g. contract parties, start and end dates) from contracts. I am quite new to working with text data. Could those fields be extracted with ML by using the whole contract as input and the target information as output, without tagging or annotating the whole text?
I understand that the extraction should be run separately for each targeted field.
Thanks!
First question - how are the contracts stored? Are they PDFs or text-based?
If they're PDFs, there are a handful of packages that can extract text from a PDF (e.g. pdftotext).
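As a rough sketch of that step, the snippet below calls the `pdftotext` command-line tool (shipped with Poppler) from Python. It assumes the tool is installed and on your PATH; the function names `pdftotext_command` and `pdf_to_text` are made up for this illustration. It will not help with scanned contracts, which would need OCR first.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext command line; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page,
        # which can help when fields sit in columns or tables.
        cmd.append("-layout")
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path):
    """Return the text of a PDF as a string (assumes pdftotext is installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return result.stdout
```

Calling `pdf_to_text("contract.pdf")` (a placeholder file name) would then give you one string per contract to search or feed into later steps.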
Second question - is the data you're looking for in the same place in every document?
If so, you can extract the information you're looking for (like start and end dates) from a known location in the contract. If not, you'll have to do something more sophisticated. For example, if the same terminology is used in every contract, you could search the text for a label such as "start date" and read the value that follows it. If the terminology differs from contract to contract, you may need to extract meaning from the text itself, which calls for more sophisticated natural language processing (NLP).
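To make the text-search option concrete, here is a minimal sketch using Python's `re` module. The label lists, the date formats covered by `DATE`, and the function name `find_labelled_date` are all assumptions for illustration; real contracts will likely need more labels and more date formats (e.g. "1 March 2021").

```python
import re

# Assumed date formats: ISO (2021-03-01) or day/month/year with . / - separators.
DATE = r"(\d{4}-\d{2}-\d{2}|\d{1,2}[./-]\d{1,2}[./-]\d{2,4})"


def find_labelled_date(text, labels):
    """Return the first date that directly follows one of the given labels."""
    for label in labels:
        # Label, optional colon or dash, optional whitespace, then the date.
        pattern = re.compile(
            re.escape(label) + r"\s*[:\-]?\s*" + DATE,
            re.IGNORECASE,
        )
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None  # no label matched; fall back to manual review or NLP


sample = "Start date: 2021-03-01\nEnd Date - 31.12.2022\n"
start = find_labelled_date(sample, ["start date", "commencement date"])
end = find_labelled_date(sample, ["end date", "expiry date"])
```

Passing several labels per field is a cheap way to cover small wording differences between contracts before reaching for heavier NLP; a `None` result flags the contracts where this simple approach is not enough.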
Without more knowledge of your problem or a concrete example, it's hard to say what your best option may be.