
Fatal error using python-javabridge JVM in Celery thread with NLTK on Mac OS X


I am using the Python wrapper for Weka, which is based on python-javabridge. I have a long-running task to perform, so I run it with Celery. The problem is that I get:

A fatal error has been detected by the Java Runtime Environment:

  SIGSEGV (0xb) at pc=0x00007fff91a3c16f, pid=11698, tid=3587

JRE version:  (8.0_31-b13) (build )
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode bsd-amd64 compressed oops)
Problematic frame:
C  [libdispatch.dylib+0x616f]  _dispatch_async_f_slow+0x18b

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

If you would like to submit a bug report, please visit:
    http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

when starting the JVM inside the thread. The following two lines (from weka.core.jvm) are used to do so:

javabridge.start_vm(run_headless=True)
javabridge.attach()

From what I've read, it is probably caused by the fact that the JVM is not attached to the Celery thread. However, javabridge.attach() is indeed run inside it.

What am I missing?


EDIT: I identified the code that is causing the trouble. It has to do with an NLTK tokenizer. The following code (adapted from Vebjorn's answer) reproduces the error:

# hello.py
from nltk.tokenize import RegexpTokenizer
import javabridge
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False    

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True

    sentence = "This is a sentence with some numbers like 1, 2 or and some weird symbols like @, $ or ! :)"
    tokenizer = RegexpTokenizer(r'\w+')
    tokenized_sentence = tokenizer.tokenize(sentence.lower())
    print "Tokens:", tokenized_sentence

    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                             dict(greetee='world'))

Without starting the JVM, the code runs properly. It also works when not running as a Celery task. I don't understand why it crashes.


EDIT 2: It actually works in a clean Ubuntu environment (Dockerized) but not on Mac OS X Yosemite (v10.10.3).


EDIT 3: As mentioned in the comments, it works if the import (from nltk.tokenize import RegexpTokenizer) is done inside the task itself, that is, inside the hello() function.
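One possible reading of EDIT 3, which the thread does not confirm: with Celery's prefork pool, hello.py is imported by the main worker process before the pool processes are forked, so a module-level import runs in the parent, while an import inside hello() first runs in the child that executes the task. The sketch below illustrates only that difference in where the import runs, not the crash itself. It assumes Python 3 on a Unix-like system (it requests the fork start method), and eager_standin and lazy_standin are hypothetical throwaway modules standing in for nltk.

```python
import importlib
import multiprocessing as mp
import os
import sys
import tempfile

# Hypothetical stand-in for a heavy library: it records the PID of the
# process that executed its import.
STANDIN_SOURCE = "import os\nIMPORTED_IN_PID = os.getpid()\n"


def _report_import_pid(module_name, results):
    """Run in the child: import the module (a cache hit if the parent
    already imported it) and report which PID executed the import."""
    module = importlib.import_module(module_name)
    results.put(module.IMPORTED_IN_PID)


def import_pid_seen_by_child(module_name, import_in_parent):
    """Fork one child and return (pid_that_ran_the_import, child_pid)."""
    if import_in_parent:
        importlib.import_module(module_name)  # like a module-level import
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    results = ctx.Queue()
    child = ctx.Process(target=_report_import_pid, args=(module_name, results))
    child.start()
    import_pid = results.get()
    child.join()
    return import_pid, child.pid


def demo():
    """Return (eager import ran in parent, lazy import ran in child)."""
    names = ("eager_standin", "lazy_standin")
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            with open(os.path.join(tmp, name + ".py"), "w") as handle:
                handle.write(STANDIN_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            eager_pid, _ = import_pid_seen_by_child("eager_standin", True)
            lazy_pid, child_pid = import_pid_seen_by_child("lazy_standin", False)
        finally:
            sys.path.remove(tmp)
            for name in names:
                sys.modules.pop(name, None)
    return eager_pid == os.getpid(), lazy_pid == child_pid


if __name__ == "__main__":
    print(demo())
```

If importing a library at module level initializes native state that does not survive a fork, moving the import into the task would keep that initialization out of the parent. Whether this is what happens with NLTK on OS X is an assumption here.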


Solution

  • By default, Celery starts one worker process per CPU core, four on my machine. (See the -c command line option to celery worker.) You need to ensure that the JVM is started in each of them. This example works for me:

    # hello.py
    from celery import Celery
    import javabridge
    
    app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')
    
    started = False
    
    @app.task
    def hello():
        global started
        if not started:
            print 'Starting the VM'
            javabridge.start_vm(run_headless=True)
            started = True
        return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                                     dict(greetee='world'))
    

    and

    # client.py
    from hello import hello
    
    r = hello.delay()
    print r.get(timeout=1)
    
    1. Install on a fresh Ubuntu 14.04 machine:

      $ sudo apt-get update -y
      $ sudo apt-get install -y openjdk-7-jdk python-pip python-numpy python-dev rabbitmq-server
      $ sudo pip install celery javabridge
      $ sudo /etc/init.d/rabbitmq-server start
      
    2. Start worker:

      $ celery -A hello worker
      ...
       -------------- celery@a7cc1bedc40d v3.1.17 (Cipater)
      ---- **** ----- 
      --- * ***  * -- Linux-3.16.7-tinycore64-x86_64-with-Ubuntu-14.04-trusty
      -- * - **** --- 
      - ** ---------- [config]
      - ** ---------- .> app:         hello:0x7f5464766b50
      - ** ---------- .> transport:   amqp://guest:**@localhost:5672//
      - ** ---------- .> results:     amqp
      - *** --- * --- .> concurrency: 4 (prefork)
      -- ******* ---- 
      --- ***** ----- [queues]
       -------------- .> celery           exchange=celery(direct) key=celery
      
      
      [2015-04-21 10:04:31,262: WARNING/MainProcess] celery@a7cc1bedc40d ready.
      
    3. In another window, run a client five times:

       $ python client.py 
       Hello, world!
       $ python client.py 
       Hello, world!
       $ python client.py 
       Hello, world!
       $ python client.py 
       Hello, world!
       $ python client.py 
       Hello, world!
      
    4. Observe in the worker window that the JVM is started on the first four calls from the client (which go to four different processes) but not on the fifth:

      [2015-04-21 10:05:53,491: WARNING/Worker-1] Starting the VM
      [2015-04-21 10:05:55,028: WARNING/Worker-2] Starting the VM
      [2015-04-21 10:05:56,411: WARNING/Worker-3] Starting the VM
      [2015-04-21 10:05:57,318: WARNING/Worker-4] Starting the VM
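The log above reflects that the module-level started flag is per process, so each pool process starts its own JVM exactly once. A minimal sketch of that behaviour using only the standard library, assuming Python 3 on a Unix-like system. Here ensure_started() is a hypothetical placeholder for the javabridge.start_vm() call, and no JVM or Celery is involved:

```python
import multiprocessing as mp
import os

_started = False  # module-level flag: every process has its own copy


def ensure_started():
    """Return True only on the first call within the current process."""
    global _started
    if _started:
        return False
    _started = True  # placeholder for an expensive one-time start-up
    return True


def _worker(n_tasks, results):
    """Simulate one pool process handling n_tasks tasks."""
    for _ in range(n_tasks):
        results.put((os.getpid(), ensure_started()))


def run(n_workers=4, n_tasks=3):
    """Return one (pid, started_now) pair per simulated task."""
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    results = ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(n_tasks, results))
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    outcome = [results.get() for _ in range(n_workers * n_tasks)]
    for worker in workers:
        worker.join()
    return outcome


if __name__ == "__main__":
    outcome = run()
    print(sum(started for _, started in outcome), "start-ups in",
          len({pid for pid, _ in outcome}), "processes")
```

With Celery itself, an alternative to the lazy flag would be to start the JVM from a handler connected to the worker_process_init signal, which Celery dispatches in each pool process when it starts. That variant is not tested in this answer.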