python apache-tika tika-server

How to set TIKA_SERVER_ENDPOINT with the tika-python library


The documentation of the excellent tika-python library at https://github.com/chrismattmann/tika-python says it is possible to point the library at a local tika_server.jar file so that the jar does not have to be downloaded. Has anyone done this and can post the configuration?

The first time the code runs, tika_server.jar is downloaded so that the library can use it. I want to avoid this download by pointing the library at a local copy of the file.

Extract text from PDF

import tika
from tika import parser

def extraiPDF(f):
    # Client-only mode, set as shown in the tika-python README
    tika.TikaClientOnly = True
    raw = parser.from_file(f)
    metadados = raw["metadata"]
    # "content" is None when Tika extracts no text
    conteudo = raw["content"] or ""
    # Drop line breaks and backslashes; turn tabs into spaces
    conteudo = conteudo.replace('\n', '').replace('\r', '').replace('\\', '').replace('\t', ' ')
    # Return [text, metadata]
    return [conteudo, metadados]

Solution

  • This bash script downloads tika_server.jar if it is not already present, sets the environment variables, and starts the Tika server:

    #!/bin/bash
    
    TIKA_PORT=9998
    TIKA_HOST=localhost
    CURRENT_USER=$(whoami)
    TIKA_JAR_URL="http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar"
    TIKA_WORKSPACE="$HOME/tika"
    TIKA_FILE_NAME="tika_server.jar"
    
    echo "Current user: $CURRENT_USER"
    
    # Download the jar only if it is not already in the workspace
    if [ ! -f "$TIKA_WORKSPACE/$TIKA_FILE_NAME" ]; then
        echo "Downloading tika-server.jar"
    
        if [ ! -d "$TIKA_WORKSPACE" ]; then
            echo "making tika workspace"
            mkdir -p "$TIKA_WORKSPACE"
        fi
    
        wget -c "$TIKA_JAR_URL" -O "$TIKA_WORKSPACE/$TIKA_FILE_NAME"
    fi
    
    echo "## Setting environment vars"
    
    # These exports are visible only to this script and its child processes;
    # export the same values in the shell that runs the Python code.
    export TIKA_SERVER_ENDPOINT="http://$TIKA_HOST:$TIKA_PORT"
    echo "TIKA_SERVER_ENDPOINT to $TIKA_SERVER_ENDPOINT"
    
    export TIKA_CLIENT_ONLY=True
    echo "TIKA_CLIENT_ONLY to $TIKA_CLIENT_ONLY"
    
    echo "## Starting tika server on: $TIKA_WORKSPACE"
    cd "$TIKA_WORKSPACE" || exit 1
    
    # Pass the port explicitly so it matches TIKA_SERVER_ENDPOINT
    java -jar "$TIKA_FILE_NAME" -h "$TIKA_HOST" -p "$TIKA_PORT"
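
On the Python side, the same two variables have to be present in the process that calls tika-python. The following is only a rough sketch of one way to do that, not the library's official recipe. The helper names `tika_client_env`, `server_is_up` and `parse_with_running_server` are made up for this example, and it assumes that tika-python reads `TIKA_SERVER_ENDPOINT` and `TIKA_CLIENT_ONLY` when its modules are first imported, which is why the import happens after the variables are set.

```python
import os
import urllib.error
import urllib.request


def tika_client_env(host="localhost", port=9998):
    """Return the variables used for client-only mode (hypothetical helper)."""
    return {
        "TIKA_SERVER_ENDPOINT": f"http://{host}:{port}",
        "TIKA_CLIENT_ONLY": "True",
    }


def server_is_up(endpoint, timeout=3.0):
    """Return True if the Tika server answers on its /version resource."""
    try:
        with urllib.request.urlopen(endpoint + "/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, ...
        return False


def parse_with_running_server(path, host="localhost", port=9998):
    """Parse a file through an already running Tika server."""
    env = tika_client_env(host, port)
    if not server_is_up(env["TIKA_SERVER_ENDPOINT"]):
        raise RuntimeError("Tika server not reachable at " + env["TIKA_SERVER_ENDPOINT"])
    # Assumption: tika-python reads these variables at import time,
    # so they are set before the import below.
    os.environ.update(env)
    from tika import parser  # imported late on purpose
    return parser.from_file(path)
```

With the server from the script above running, `parse_with_running_server("file.pdf")` should return the usual dictionary with the `"metadata"` and `"content"` keys, and no jar download should be attempted because the library only acts as a REST client.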