computer-vision tesseract omr

Is there a way to "prime" Tesseract or other OCR engines for certain words?


Is there a way to prime Tesseract-OCR, or perhaps another engine, so that it is more sensitive to certain words or shapes? Priming is how humans become more sensitive to certain stimuli, and I wasn't sure whether OCR engines can do something similar. I know apps like Facebook and Instagram can become more sensitive to certain posts, or to certain behaviors from an account, if the account has exhibited that behavior in the past.


Solution

  • The user-words file is a bit finicky to get working.

    Here's a reduced version of the code I used to get it working:

    #include <iostream>
    #include <tesseract/baseapi.h>
    #include <tesseract/genericvector.h>
    .
    .
    .
    const char* TESSDATA = "C:/Tesseract/tessdata/";
    
    void TryTess() {
        tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
    
    
        GenericVector<STRING> pars_vec;
        pars_vec.push_back("load_system_dawg");
        pars_vec.push_back("load_freq_dawg");
        pars_vec.push_back("load_punc_dawg");
        pars_vec.push_back("load_number_dawg");
        pars_vec.push_back("load_unambig_dawg");
        pars_vec.push_back("load_bigram_dawg");
        //pars_vec.push_back("load_fixed_length_dawgs");
        pars_vec.push_back("language_model_penalty_non_dict_word");
        pars_vec.push_back("user_words_suffix");
        pars_vec.push_back("user_patterns_suffix");
    
    
        GenericVector<STRING> pars_values;
        pars_values.push_back("0");
        pars_values.push_back("0");
        pars_values.push_back("0");
        pars_values.push_back("0");
        pars_values.push_back("0");
        pars_values.push_back("0");
        //pars_values.push_back("F");
        pars_values.push_back("9999999999999999");
        pars_values.push_back("user-words");
        pars_values.push_back("user-patterns");
    
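        // Init returns 0 on success; the two vectors are matched up by index.
        if (api->Init(TESSDATA, "eng", tesseract::OEM_DEFAULT, NULL, 0, &pars_vec, &pars_values, false) != 0) {
            std::cerr << "Could not initialize Tesseract\n";
            delete api;
            return;
        }
    
        /// Some image preprocessing to improve detection, then api->SetImage(...)
    
        char* out = api->GetUTF8Text();
        std::cout << "Result: " << out;
        delete[] out;
        api->End();
        delete api;  // the API object itself was allocated with new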
    }
    

    Make sure TESSDATA points at your tessdata directory. The best few resources I could find were here as well as here.
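
    As far as I can tell, the suffix parameters are appended to the language prefix, so with "eng" and user_words_suffix set to "user-words" as in the code above, Tesseract looks for a plain-text file named eng.user-words in the tessdata directory, one word per line. Below is a minimal sketch for generating that file; UserWordsPath and WriteUserWords are hypothetical helpers of mine, not part of the Tesseract API, and the file layout is my understanding rather than something spelled out here.

    ```cpp
    #include <cassert>
    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical helper, not part of Tesseract:
    // builds "<tessdata_dir>/<lang>.<suffix>", e.g. ".../tessdata/eng.user-words".
    std::string UserWordsPath(const std::string& tessdata_dir,
                              const std::string& lang,
                              const std::string& suffix) {
        assert(!lang.empty() && !suffix.empty());
        std::string path = tessdata_dir;
        // Add a separator only if the directory does not already end with one.
        if (!path.empty() && path.back() != '/' && path.back() != '\\') {
            path += '/';
        }
        return path + lang + "." + suffix;
    }

    // Hypothetical helper: writes one word per line.
    // Returns false if the file cannot be opened or written.
    bool WriteUserWords(const std::string& path,
                        const std::vector<std::string>& words) {
        std::ofstream out(path);
        if (!out) {
            return false;
        }
        for (const std::string& word : words) {
            out << word << '\n';
        }
        out.flush();
        return out.good();
    }
    ```

    You would call something like WriteUserWords(UserWordsPath(TESSDATA, "eng", "user-words"), words) before Init, since the file has to exist by the time Tesseract loads its dictionaries.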

    The major hangup was not knowing where the genericvector.h header was, as Tesseract's Init method requires GenericVector (there don't seem to be any conversion methods). Since the user-words file must be passed in prior to initialization, this is the only way I could find to do it. Even reading from a config file must be done after initialization, which prevents you from using user-words.
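
    One thing that bit me with the two-vector interface: names and values are paired purely by position, so commenting out a line in one list but not the other silently shifts every later value. Here is a small sketch that keeps each name next to its value and only splits them at the end. ParamLists, SplitParams and kInitParams are names I made up for illustration, and I use std::vector<std::string> so it compiles without Tesseract.

    ```cpp
    #include <cassert>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical container for the two parallel lists that Init expects.
    struct ParamLists {
        std::vector<std::string> names;
        std::vector<std::string> values;
    };

    // Splits (name, value) pairs into two index-aligned lists.
    ParamLists SplitParams(
        const std::vector<std::pair<std::string, std::string>>& params) {
        ParamLists out;
        for (const auto& param : params) {
            out.names.push_back(param.first);
            out.values.push_back(param.second);
        }
        assert(out.names.size() == out.values.size());
        return out;
    }

    // The same settings as in the answer's code, written as pairs.
    const std::vector<std::pair<std::string, std::string>> kInitParams = {
        {"load_system_dawg", "0"},
        {"load_freq_dawg", "0"},
        {"load_punc_dawg", "0"},
        {"load_number_dawg", "0"},
        {"load_unambig_dawg", "0"},
        {"load_bigram_dawg", "0"},
        {"language_model_penalty_non_dict_word", "9999999999999999"},
        {"user_words_suffix", "user-words"},
        {"user_patterns_suffix", "user-patterns"},
    };
    ```

    With the GenericVector<STRING> interface used above, you would loop over the two lists and push_back each names[i].c_str() and values[i].c_str(). I believe newer Tesseract releases accept std::vector<std::string> in Init directly, but check the baseapi.h of the version you have installed.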

    Good luck!