java python python-2.7 jvm boilerpipe

JVM crashes while implementing Python-Boilerpipe in Flask app


I'm writing a Flask app that uses boilerpipe to extract content. Initially I wrote the boilerpipe extraction as a standalone script to extract website content, but when I try to integrate it with my API, the JVM crashes when the boilerpipe extractor runs. This is the error I get: https://github.com/misja/python-boilerpipe/issues/17 (I have also raised the issue on GitHub).

import unicodedata

from boilerpipe.extract import Extractor


class ExtractingContent:

  @classmethod
  def processingContent(cls, sourceUrl, extractorType="DefaultExtractor"):
    extractor = Extractor(extractor=extractorType, url=sourceUrl)
    extractedText = extractor.getText()
    if extractedText:
      # Decompose accented characters, then drop anything that is not ASCII.
      toNormalString = unicodedata.normalize('NFKD', extractedText).encode('ascii', 'ignore')
      # Return a plain dict: json.loads() expects a JSON string, not a dict.
      return {"content": toNormalString, "url": sourceUrl, "status": "success", "publisher_id": "XXXXX", "content_count": str(len(toNormalString))}
    else:
      return {"response": {"message": "No data found", "url": sourceUrl, "status": "success", "content_count": "empty"}}

This is the script I'm trying to integrate into a Flask API that uses Flask-RESTful, SQLAlchemy, and PostgreSQL. I also updated Java, but that didn't fix the issue. Java version:

java version "1.7.0_45" 
javac 1.7.0_45

Any help would be appreciated

Thanks


Solution

  • (copy of what I wrote in https://github.com/misja/python-boilerpipe/issues/17)

    OK, I've reproduced the bug: the thread that calls the JVM is not attached to it, so the calls into JVM internals fail. The bug comes from boilerpipe (see below).

    First, the monkey patch: in the code you posted on Stack Overflow, you just have to add the following code before the extractor is created:

    import jpype
    from boilerpipe.extract import Extractor

    class ExtractingContent:
        @classmethod
        def processingContent(cls, sourceUrl, extractorType="DefaultExtractor"):
            print("State=%s" % jpype.isThreadAttachedToJVM())

            # Attach the calling thread to the JVM before any boilerpipe call.
            if not jpype.isThreadAttachedToJVM():
                print("Needs to attach...")
                jpype.attachThreadToJVM()
                print("Check Attached=%s" % jpype.isThreadAttachedToJVM())

            extractor = Extractor(extractor=extractorType, url=sourceUrl)
            # ... the rest of the method is unchanged
    

    About boilerpipe: the check "if threading.activeCount() > 1" in boilerpipe/extractor/__init__.py, line 50, is wrong. It skips attaching whenever only one thread is active, but the calling thread must always be attached to the JVM, even if it is the only one.
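
If several methods call into boilerpipe, the attach check can be factored out instead of being repeated. The following is only a rough sketch of that idea, not part of boilerpipe or JPype: attach_jvm_thread and FakeJVM are hypothetical names introduced here. FakeJVM is a stand-in exposing the same two calls used above (isThreadAttachedToJVM and attachThreadToJVM) so that the sketch runs without a JVM.

```python
import functools
import threading


def attach_jvm_thread(jvm):
    """Hypothetical decorator factory: attach the calling thread before each call.

    `jvm` is assumed to expose isThreadAttachedToJVM() and attachThreadToJVM(),
    the two calls used in the patch above.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Always check, even if this is the only thread in the process.
            if not jvm.isThreadAttachedToJVM():
                jvm.attachThreadToJVM()
            return func(*args, **kwargs)
        return wrapper
    return decorator


class FakeJVM(object):
    """Hypothetical stand-in for the jpype module, tracking attachment per thread."""

    def __init__(self):
        self._local = threading.local()
        self._lock = threading.Lock()
        self.attach_calls = 0

    def isThreadAttachedToJVM(self):
        return getattr(self._local, "attached", False)

    def attachThreadToJVM(self):
        self._local.attached = True
        with self._lock:
            self.attach_calls += 1
```

With the real library, the assumption is that the module itself can be passed in, i.e. decorating processingContent with attach_jvm_thread(jpype), placed below @classmethod so that it wraps the plain function. Each request thread would then be attached on its first call and skipped afterwards.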