python recursion indexing beautifulsoup whoosh

Maximum recursion depth exceeded when building a Whoosh index


I am trying to index some documents using Whoosh. However, when I add the documents to the Whoosh index, Python eventually raises the following error:

RecursionError: maximum recursion depth exceeded while calling a Python object

I have tried adjusting the limitmb setting of the index writer, as well as changing how often the index is committed to disk. This seemed to change the number of documents that were indexed successfully, but the indexing still stops with the RecursionError after a short while.

My code is the following:

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from bs4 import BeautifulSoup
import os

schema = Schema(title=TEXT(stored=True), docID=ID(stored=True), content=TEXT(stored=True))
ix = create_in("index", schema)

writer = ix.writer(limitmb=1024, procs=4, multisegment=True);

for root, dirs, files in os.walk('aquaint'):
    for file in files:
        with open(os.path.join(root, file), "r") as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            for doc in soup.find_all('doc'):
                try:
                    t = doc.find('headline').string
                except:
                    t = "No title available"
                try:
                    d = doc.find('docno').string
                except:
                    d = "No docID available"
                try:
                    c = doc.find('text').string
                except:
                    c = "No content available"

                writer.add_document(title=t, docID=d, content=c)
        writer.commit()

The files I am loading are from the TREC Robust Track (https://trec.nist.gov/data/t14_robust.html) and have the following format (due to licensing I can't share an entire file):

<DOC>
<DOCNO> APW1XXXXXXXXX </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1998-01-06 00:17:00 </DATE_TIME>
<HEADER>
XXXX
</HEADER>
<BODY>
<SLUG> BC-Sports-Motorcycling-Grand Prix-Doohan </SLUG>
<HEADLINE>
Doohan calls for upgrade to 1000cc bikes 
</HEADLINE>
<TEXT>
       News article text here
</TEXT>
(PROFILE
(WS SL:BC-Sports-Motorcycling-Grand Prix-Doohan; CT:s; 
(REG:EURO;)
(REG:BRIT;)
(REG:SCAN;)
(REG:MEST;)
(REG:AFRI;)
(REG:INDI;)
(REG:ENGL;)
(REG:ASIA;)
(LANG:ENGLISH;))
)
</BODY>
<TRAILER>
AP-NY-06-01-98 0017EDT
</TRAILER>
</DOC>

Each file contains several of these documents, each one enclosed in <DOC> and </DOC> tags.

I don't understand what is causing this error. Could someone help me out? Your help is greatly appreciated!


Solution

  • I found the problem: I had wrongly assumed that doc.find('headline').string returns a plain Python string. It actually returns a BeautifulSoup NavigableString, a str subclass that keeps a reference to the whole parse tree it came from. Replacing it with str(doc.find('headline').string) seems to have fixed the issue for me, and Whoosh is now indexing correctly.
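
The conversion described above can be wrapped in a small helper so that every field goes through the same step. This is only a sketch: field_text is a hypothetical name, not part of BeautifulSoup or Whoosh, and the whitespace stripping is an extra assumption that the original code did not make. It also covers the case where .string is None (a tag with more than one child), which str() alone would turn into the literal text "None".

```python
def field_text(doc, tag_name, default):
    """Return the text of the first <tag_name> inside doc as a plain str.

    doc is expected to behave like a BeautifulSoup Tag: doc.find(name)
    returns either None or an object with a .string attribute.
    field_text is a hypothetical helper, not a BeautifulSoup or Whoosh API.
    """
    node = doc.find(tag_name)
    if node is None or node.string is None:
        # Missing tag, or a tag whose .string is None (several children).
        return default
    # str() copies the NavigableString into a plain str that no longer
    # points back at the parse tree.
    # strip() is optional (an assumption here); it trims the padding
    # around values such as "<DOCNO> APW1XXXXXXXXX </DOCNO>".
    return str(node.string).strip()
```

In the indexing loop this would replace the three try/except blocks, for example t = field_text(doc, 'headline', 'No title available'), with d and c obtained the same way before calling writer.add_document(title=t, docID=d, content=c).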