
OpenSearch does not exit gracefully during custom Docker image build and ends up locked


I am trying to build a custom Docker image for OpenSearch based on the official image from the OpenSearch project (version 2.16.0).

My goal is to have an image for development and QA purposes that, when deployed, supplies me with a single-node OpenSearch cluster with a number of test indices and no authentication. Data should be baked into the image.

To that end, I take the official image, and during a docker build process, start the OpenSearch cluster and push the data for my indices into OpenSearch using elasticdump. I have done the same successfully for Elasticsearch in the past.

Here is a simplified version of my Dockerfile:

FROM opensearchproject/opensearch:2.16.0

USER root

ENV NODE_TLS_REJECT_UNAUTHORIZED=0

COPY test_mapping.json /etc/test_mapping.json
COPY test_data.json /etc/test_data.json

# procps needed for opensearch --daemonize
RUN yum -y install nodejs npm procps
RUN npm install elasticdump -g

# start as single node cluster with security disabled so that we need not setup certificates
RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml

# opensearch cannot be run as root
USER opensearch

RUN opensearch --daemonize -p /tmp/opensearch-pid \
    # wait for opensearch cluster to start up
    && sleep 30s \
    && elasticdump \
        --input=/etc/test_mapping.json \
        --output=http://localhost:9200/test_index \
        --type=mapping \
    && elasticdump \
        --input=/etc/test_data.json \
        --output=http://localhost:9200/test_index \
        --type=data \
        --limit=250 \
        --concurrencyInterval=2000 \
    # terminate opensearch gracefully
    && pkill -f /tmp/opensearch-pid

The build runs smoothly and without errors and creates a docker image. The problem occurs when I try to start a container from that image:

docker run --rm --name testopensearch -p 9200:9200 -p 9600:9600 -d mycustomopensearch:0.0.1

The endpoint doesn't come up and docker logs testopensearch shows that opensearch was started but then failed with an org.apache.lucene.store.AlreadyClosedException:

...
[2024-09-18T13:25:14,668][INFO ][o.o.e.NodeEnvironment    ] [215c1baf6b02] using [1] data paths, mounts [[/ (overlay)]], net usable_space [364.7gb], net total_space [914.6gb], types [overlay]
[2024-09-18T13:25:14,668][INFO ][o.o.e.NodeEnvironment    ] [215c1baf6b02] heap size [1gb], compressed ordinary object pointers [true]
[2024-09-18T13:25:14,670][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [215c1baf6b02] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.16.0.jar:2.16.0]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.16.0.jar:2.16.0]
Caused by: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1149) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:900) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.env.NodeEnvironment.assertCanWrite(NodeEnvironment.java:1373) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:376) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.node.Node.<init>(Node.java:550) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.node.Node.<init>(Node.java:432) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-2.16.0.jar:2.16.0]
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-2.16.0.jar:2.16.0]
uncaught exception in thread [main]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-2.16.0.jar:2.16.0]
    ... 6 more
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179)
    at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1149)
    at org.opensearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:900)
    at org.opensearch.env.NodeEnvironment.assertCanWrite(NodeEnvironment.java:1373)
    at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:376)
    at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301)
    at org.opensearch.node.Node.<init>(Node.java:550)
    at org.opensearch.node.Node.<init>(Node.java:432)
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181)
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172)
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
    at org.opensearch.cli.Command.main(Command.java:101)
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138)
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104)
For complete error details, refer to the log at /usr/share/opensearch/logs/docker-cluster.log

My theory is that the opensearch --daemonize call during the build does not terminate correctly and leaves something behind that causes this error. I have already tried adding an additional && sleep 30s after the pkill to give opensearch more time to shut down, but to no effect.
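
One way to check this theory is to make the shutdown explicit instead of relying on pkill. Note that pkill -f treats its argument as a pattern that is matched against process command lines, not as a PID file, and it returns without waiting for the process to exit. The following is only a sketch: stop_and_wait is a hypothetical helper name, and it assumes that the file written by -p contains the PID of the OpenSearch process.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenSearch): send SIGTERM to the PID
# stored in a PID file and block until that process has exited.
# Assumption: the PID file holds exactly one PID.
stop_and_wait() {
    pidfile=$1
    timeout=${2:-60}                      # seconds to wait before giving up
    pid=$(cat "$pidfile") || return 2     # PID file missing or unreadable
    kill "$pid" 2>/dev/null || return 0   # process is already gone
    while kill -0 "$pid" 2>/dev/null; do  # kill -0 only probes for existence
        [ "$timeout" -le 0 ] && return 1  # still running after the timeout
        sleep 1
        timeout=$((timeout - 1))
    done
    return 0
}
```

Inside a RUN step the same logic would have to be inlined or copied into the image as a script, for example as stop_and_wait /tmp/opensearch-pid 60 in place of the pkill line. A return code of 1 would indicate that the process was still alive when the step ended.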

I tried to take a look at the logs, but contrary to what the error message says, there is no file /usr/share/opensearch/logs/docker-cluster.log. I did not find the location of the logs of an opensearch -d run.

I also tried to start the image and interactively run opensearch from there:

docker run --rm --name testopensearch -p 9200:9200 -p 9600:9600 -it --entrypoint=/bin/bash mycustomopensearch:0.0.1

When I run opensearch -d from within the container, I get the same error. Interestingly, when I then terminate it using pkill and repeat the same commands again, I get the same error the second time, but the third time, opensearch starts correctly and the cluster comes up with my indices in place and everything. This leaves me puzzled.

I also checked whether elasticdump may be somehow involved, but this part works fine and is not the root of the problem. If we replace the elasticdump commands with a simple curl to the opensearch root, the build also works fine and the problem afterwards is the same (so the following would be an even more minimal example in case someone would like to reproduce the build).

FROM opensearchproject/opensearch:2.16.0

USER root

ENV NODE_TLS_REJECT_UNAUTHORIZED=0

# procps needed for opensearch --daemonize
RUN yum -y install procps

# start as single node cluster with security disabled so that we need not setup certificates
RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml

# opensearch cannot be run as root
USER opensearch

RUN opensearch --daemonize -p /tmp/opensearch-pid \
    # wait for opensearch cluster to start up
    && sleep 30s \
    && curl http://localhost:9200 \
    # terminate opensearch gracefully
    && pkill -f /tmp/opensearch-pid

I also checked whether the way opensearch is started when deploying the image may play a role, but if I introduce an additional

RUN opensearch

after the first RUN, this triggers the org.apache.lucene.store.AlreadyClosedException during the build as well.

Can anyone point me to how I could terminate opensearch -d during the build gracefully in a way that avoids this issue? Or is there a lock file I can delete manually or anything? Any help would be greatly appreciated.
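
Regarding the lock-file question: the error message names node.lock under /usr/share/opensearch/data/nodes/0. To see which lock files an image actually contains, something like the following sketch could be used. list_lock_files is a hypothetical helper name, and matching on the *.lock suffix is an assumption based on the file name in the error message.

```shell
#!/bin/sh
# Hypothetical helper: print every regular file ending in ".lock" below a
# data directory. The default is the data path shown in the error message.
list_lock_files() {
    data_dir=${1:-/usr/share/opensearch/data}
    find "$data_dir" -type f -name '*.lock'
}
```

Against the built image, the equivalent one-off check would be docker run --rm --entrypoint=/bin/bash mycustomopensearch:0.0.1 -c "find /usr/share/opensearch/data -type f -name '*.lock'".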


Solution

  • OK, I finally found a way. I manually deleted two lock files that seem to have been left behind during the build and that apparently caused the issue.

    RUN rm /usr/share/opensearch/data/nodes/0/node.lock \
        && rm /usr/share/opensearch/data/nodes/0/_state/write.lock
    

    Seems like a harsh solution to me, but I could not find a way to have opensearch -d shut down gracefully. Hope there will be no side effects.

    I also needed to switch the users because of access rights. Full working example Dockerfile:

    FROM opensearchproject/opensearch:2.16.0
    
    USER root
    
    ENV NODE_TLS_REJECT_UNAUTHORIZED=0
    
    # procps needed for opensearch --daemonize
    RUN yum -y install procps
    
    # start as single node cluster with security disabled so that we need not setup certificates
    RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml
    
    # opensearch cannot be run as root
    USER opensearch
    
    RUN opensearch --daemonize -p /tmp/opensearch-pid \
        # wait for opensearch cluster to start up
        && sleep 30s \
        && curl http://localhost:9200 \
        # terminate opensearch gracefully
        && pkill -f /tmp/opensearch-pid
    
    USER root
    # dirty workaround to resolve problem where opensearch won't start because it did not shut down cleanly before
    RUN rm /usr/share/opensearch/data/nodes/0/node.lock \
        && rm /usr/share/opensearch/data/nodes/0/_state/write.lock
    
    USER opensearch
    

    Still wondering if there is something else I might have done though, so if anyone knows, feel free to comment or answer...
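
Independent of the lock files, the fixed sleep 30s in the RUN step could probably be replaced by polling until the HTTP endpoint answers, so the build neither waits longer than needed nor continues too early on a slow machine. This is a sketch only: wait_until is a hypothetical helper name, and it has not been tested against the OpenSearch image.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is used up.
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0   # command succeeded, stop polling
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1                               # never succeeded
}

# Example call (needs a running node, so it is left commented out here):
# wait_until 60 curl -fsS http://localhost:9200
```

With curl -f, HTTP error responses count as failures, so the loop continues until the node answers successfully. Polling the _cluster/health endpoint with wait_for_status=yellow instead of the root URL might be a stricter check before loading data.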