python titan rexster bulbs

Why does Rexster Server (and Titan) stop responding?


Setup

I'm implementing a recommender system running on an Ubuntu 12.04 server using Titan with Rexster (titan-server-0.4.4.zip), Cassandra as the storage backend, and Elasticsearch as the index backend. To connect to the Rexster server I use the Bulbflow library for Python.

The beta seemed to run fine for 3 weeks, but as the load increased (still only about 10 users), the Rexster server stopped responding. I don't know whether my Rexster configuration is wrong or whether I'm using the Bulbflow library incorrectly.

Rexster / Titan Configuration

Here is my rexster-cassandra-es.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <rexster>
        <http>
            <server-port>8182</server-port>
            <server-host>0.0.0.0</server-host>
            <base-uri>http://MY_IP</base-uri>
            <web-root>public</web-root>
            <character-set>UTF-8</character-set>
            <enable-jmx>false</enable-jmx>
            <enable-doghouse>true</enable-doghouse>
            <max-post-size>2097152</max-post-size>
            <max-header-size>8192</max-header-size>
            <upload-timeout-millis>30000</upload-timeout-millis>
            <thread-pool>
                <worker>
                    <core-size>20</core-size>
                    <max-size>40</max-size>
                </worker>
                <kernal>
                    <core-size>10</core-size>
                    <max-size>20</max-size>
                </kernal>
            </thread-pool>
            <io-strategy>leader-follower</io-strategy>
        </http>
        <rexpro>
            <server-port>8184</server-port>
            <server-host>0.0.0.0</server-host>
            <session-max-idle>1790000</session-max-idle>
            <session-check-interval>3000000</session-check-interval>
            <connection-max-idle>180000</connection-max-idle>
            <connection-check-interval>3000000</connection-check-interval>
            <enable-jmx>false</enable-jmx>
            <thread-pool>
                <worker>
                    <core-size>8</core-size>
                    <max-size>8</max-size>
                </worker>
                <kernal>
                    <core-size>4</core-size>
                    <max-size>4</max-size>
                </kernal>
            </thread-pool>
            <io-strategy>leader-follower</io-strategy>
        </rexpro>
        <shutdown-port>8183</shutdown-port>
        <shutdown-host>127.0.0.1</shutdown-host>
        <script-engines>
            <script-engine>
                <name>gremlin-groovy</name>
                <reset-threshold>-1</reset-threshold>
                <imports>com.tinkerpop.gremlin.*,com.tinkerpop.gremlin.java.*,com.tinkerpop.gremlin.pipes.filter.*,com.tinkerpop.gremlin.pipes.sideeffect.*,com.tinkerpop.gremlin.pipes.transform.*,com.tinkerpop.blueprints.*,com.tinkerpop.blueprints.impls.*,com.tinkerpop.blueprints.impls.tg.*,com.tinkerpop.blueprints.impls.neo4j.*,com.tinkerpop.blueprints.impls.neo4j.batch.*,com.tinkerpop.blueprints.impls.orient.*,com.tinkerpop.blueprints.impls.orient.batch.*,com.tinkerpop.blueprints.impls.dex.*,com.tinkerpop.blueprints.impls.rexster.*,com.tinkerpop.blueprints.impls.sail.*,com.tinkerpop.blueprints.impls.sail.impls.*,com.tinkerpop.blueprints.util.*,com.tinkerpop.blueprints.util.io.*,com.tinkerpop.blueprints.util.io.gml.*,com.tinkerpop.blueprints.util.io.graphml.*,com.tinkerpop.blueprints.util.io.graphson.*,com.tinkerpop.blueprints.util.wrappers.*,com.tinkerpop.blueprints.util.wrappers.batch.*,com.tinkerpop.blueprints.util.wrappers.batch.cache.*,com.tinkerpop.blueprints.util.wrappers.event.*,com.tinkerpop.blueprints.util.wrappers.event.listener.*,com.tinkerpop.blueprints.util.wrappers.id.*,com.tinkerpop.blueprints.util.wrappers.partition.*,com.tinkerpop.blueprints.util.wrappers.readonly.*,com.tinkerpop.blueprints.oupls.sail.*,com.tinkerpop.blueprints.oupls.sail.pg.*,com.tinkerpop.blueprints.oupls.jung.*,com.tinkerpop.pipes.*,com.tinkerpop.pipes.branch.*,com.tinkerpop.pipes.filter.*,com.tinkerpop.pipes.sideeffect.*,com.tinkerpop.pipes.transform.*,com.tinkerpop.pipes.util.*,com.tinkerpop.pipes.util.iterators.*,com.tinkerpop.pipes.util.structures.*,org.apache.commons.configuration.*,com.thinkaurelius.titan.core.*,com.thinkaurelius.titan.core.attribute.*,com.thinkaurelius.titan.core.util.*,com.thinkaurelius.titan.example.*,org.apache.commons.configuration.*,com.tinkerpop.gremlin.Tokens.T,com.tinkerpop.gremlin.groovy.*</imports>
            <static-imports>com.tinkerpop.blueprints.Direction.*,com.tinkerpop.blueprints.TransactionalGraph$Conclusion.*,com.tinkerpop.blueprints.Compare.*,com.thinkaurelius.titan.core.attribute.Geo.*,com.thinkaurelius.titan.core.attribute.Text.*,com.thinkaurelius.titan.core.TypeMaker$UniquenessConsistency.*,com.tinkerpop.blueprints.Query$Compare.*</static-imports>
            </script-engine>
        </script-engines>
        <security>
            <authentication>
                <type>none</type>
                <configuration>
                    <users>
                        <user>
                            <username>rexster</username>
                            <password>rexster</password>
                        </user>
                    </users>
                </configuration>
            </authentication>
        </security>
        <metrics>
            <reporter>
                <type>jmx</type>
            </reporter>
            <reporter>
                <type>http</type>
            </reporter>
            <reporter>
                <type>console</type>
                <properties>
                    <rates-time-unit>SECONDS</rates-time-unit>
                    <duration-time-unit>SECONDS</duration-time-unit>
                    <report-period>10</report-period>
                    <report-time-unit>MINUTES</report-time-unit>
                    <includes>http.rest.*</includes>
                    <excludes>http.rest.*.delete</excludes>
                </properties>
            </reporter>
        </metrics>
        <graphs>
            <graph>
                <graph-name>newspaper</graph-name>
                <graph-type>com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration</graph-type>
                <!-- <graph-location>/tmp/titan</graph-location> -->
                <graph-read-only>false</graph-read-only>
                <properties>
                    <storage.backend>cassandra</storage.backend>
                    <storage.index.search.backend>elasticsearch</storage.index.search.backend>
                    <storage.index.search.hostname>localhost</storage.index.search.hostname>
                    <storage.index.search.client-only>true</storage.index.search.client-only>
                    <storage.index.search.local-mode>false</storage.index.search.local-mode>
                </properties>
                <extensions>
                  <allows>
                    <allow>tp:gremlin</allow>
                  </allows>
                </extensions>
            </graph>
        </graphs>
    </rexster>

I have changed the core-size and max-size of the thread pool for both the worker and the kernal. Without that change, the Rexster server would hang or stop responding even sooner.

What are appropriate values for the core-size and max-size?

Bulbflow Usage

With Bulbflow, I create a new Graph object every time I need to perform a request. There are a lot of requests, so these objects are created very often.

Should I really create a new Graph object for every new request?

Is it possible to create only one Graph object and reuse it for every request sent to the graph database, or would I run into session issues?
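
One way to avoid building a Graph per request is to create it once, lazily, and hand the same instance to every caller. The sketch below is a generic, thread-safe lazy holder and not a confirmed Bulbflow pattern: `LazyResource`, `make_graph` and `graph_holder` are hypothetical names, and whether a single Bulbflow Graph can safely be shared across threads is an assumption that would need to be checked against the library itself.

```python
import threading


class LazyResource(object):
    """Create an expensive object once and reuse it afterwards (hypothetical helper)."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._instance = None

    def get(self):
        # Double-checked locking: only the first caller pays for construction.
        if self._instance is None:
            with self._lock:
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


def make_graph():
    # Assumption: the real application would build its Bulbflow Graph here.
    # A plain object stands in so the sketch runs without Bulbflow installed.
    return object()


# Module-level holder shared by all request handlers.
graph_holder = LazyResource(make_graph)
```

Each request handler would then call `graph_holder.get()` instead of constructing a new Graph.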

Error message

When everything is stuck and I forcefully terminate the program (Ctrl-C), I get the following stack trace:

Exception happened during processing of request from ('my_ip', 57489)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 310, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 638, in __init__
    self.handle()
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/werkzeug/serving.py", line 200, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/werkzeug/serving.py", line 235, in handle_one_request
    return self.run_wsgi()
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/werkzeug/serving.py", line 177, in run_wsgi
    execute(self.server.app)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/werkzeug/serving.py", line 165, in execute
    application_iter = app(environ, start_response)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/user/dir/recommender/project/api/start.py", line 65, in put_user
    graphdb.insert_user(user_id)
  File "project/api/graphdb.py", line 14, in insert_user
    user_with_id = g.users.index.lookup(user_sqlid=user_id)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/bulbs/titan/index.py", line 270, in lookup
    resp = self.client.lookup_vertex(self.index_name,key,value)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/bulbs/titan/client.py", line 348, in lookup_vertex
    return self.request.get(path,params)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/bulbs/rest.py", line 101, in get
    return self.request(GET, path, params)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/bulbs/rest.py", line 184, in request
    http_resp = self.http.request(uri, method, body, headers)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1593, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1335, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/user/dir/env/venv_python/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1291, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
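
The trace ends in a blocking `recv` call, which suggests the HTTP client is waiting, with no timeout, for a response that Rexster never sends. A client-side timeout would not fix the server, but it would turn an indefinite hang into an error the application can handle. The following is a minimal sketch using only the Python 3 standard library (the application above runs Python 2.7 with httplib2); `probe`, the `/graphs` path and the timeout value are illustrative assumptions.

```python
import http.client
import socket


def probe(host, port, path="/graphs", timeout=2.0):
    """Return the HTTP status, or None if the server does not answer in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (socket.timeout, OSError, http.client.HTTPException):
        # socket.timeout: the server accepted the connection but stayed silent.
        # OSError: the connection was refused or reset.
        # HTTPException: the reply was not a valid HTTP response.
        return None
    finally:
        conn.close()
```

With the HTTP port from the configuration above, `probe("localhost", 8182)` would return `None` after the timeout instead of blocking when the server is stuck.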

Recovery

To recover, I have to shut down Rexster / Titan and restart it. Whenever I stop the Rexster server (./bin/titan -c cassandra-es stop), I receive the following output:

Killing Titan + Rexster (pid 26779)...
Rexster shutdown timeout exceeded (60 seconds)
Killing Cassandra (pid 26201)...

Rexster is completely stuck.

Looking forward to receiving some helpful guidance.


Solution

  • The following thread on the Titan mailing list could be useful to you: "Rexster REST API stops responding". However, I don't think that issue was ever solved, because the Titan and Rexster developers couldn't reproduce it.

    That being said, I strongly suggest upgrading to Titan v1.0.0, which uses the TinkerPop 3.0+ Gremlin Server instead of TinkerPop 2.x Rexster. You'll get fewer bugs, more features and, above all, much more expressive Gremlin queries (see the TinkerPop 3.0.1 documentation used by Titan v1.0.0). Titan v0.4.4 is a very old release, and I don't think fixing this specific issue is worth the trouble, especially if you're new to graphs.