javascript tesseract

How do I parse a string obtained by converting an image to text with Tesseract?


I am writing some simple JavaScript code to extract text from an image. For that I am using the Tesseract library.

But I found that Tesseract is not 100% accurate (or maybe I don't know how to use it correctly). For example, after converting the image text to an array of strings and scanning each string one by one, I get the following strings, which are not formatted consistently.

Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male

Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male

Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male

I split each string on + and space like this:

const ageAndGenderArray = line.split(" ") || line.split("+");

and I got the following output:

['Age', '+', '67', 'Gender', ':', 'Female', 'Age', ':', '45', 'Gender', ':', 'Female', 'Age', '+', '45', 'Gender', ':', 'Male']

['Age', ':', '44', 'Gender', ':', 'Male', 'Age', '36', 'Gender', ':', 'Female', 'Age', ':', '56', 'Gender', ':', 'Male']

['Age', '+63', 'Gender', ':', 'Male', 'Age', ':', '62', 'Gender', ':', 'Female', 'Age', ':', '37', 'Gender', ':', 'Male']

As you can see, the input strings are not all in the same format. Some have Age + 67 and some have Age +63; in some places there is a + and in others a :, and sometimes there is no separator at all. So I am not able to reliably extract the values from them.

I am expecting output like this:

63     Male          
62     Female        
37     Male 
       

So how can I parse such inconsistent strings?

My code:

const processImage = () => {
  Tesseract.recognize(file, "eng", { logger: (m) => console.log(m) }).then(
    ({ data: { text } }) => {
      console.log(text);
      const parsedCandidates = parseOCRResult(text);
      setCandidates(parsedCandidates);
    }
  );
  console.log(file);
};

const parseOCRResult = (text) => {
  // parsing logic of strings
};

Solution

  • Split on the regex Age[^\d]+. It matches 'Age' followed by one or more non-digit characters, so each match ends right before the first digit of the age.

    let input = `Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
    Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
    Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male`
    
    let results = input.split(/Age[^\d]+/)
    // clean the results: drop the ' Gender :' label, trim whitespace, and discard empty pieces
    results = results.map(item => item.replace(' Gender :', '').trim()).filter(i => i.length)
    console.log(results)
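
If the OCR also garbles the separator after Gender (for example a + instead of a :), the replace(' Gender :', '') step above will miss it. As a side note, line.split(" ") || line.split("+") in the question never reaches the second split, because split always returns an array and an array is always truthy. Here is a rough sketch of an alternative that pulls out the age and gender directly with String.prototype.matchAll, filling in the parseOCRResult stub from the question. The name RECORD_PATTERN and the returned object shape are my own assumptions, and it assumes the words Age, Gender, Male and Female themselves are recognized correctly.

```javascript
// Hypothetical body for the parseOCRResult stub from the question.
// Assumption: each record looks like "Age", some non-digit noise, a number,
// "Gender", optional punctuation, and then "Male" or "Female".
const RECORD_PATTERN = /Age\D*(\d+)\s*Gender\W*(Male|Female)/gi;

const parseOCRResult = (text) =>
  // matchAll yields one match per record; groups 1 and 2 hold age and gender.
  Array.from(text.matchAll(RECORD_PATTERN), ([, age, gender]) => ({
    age: Number(age),
    gender,
  }));

const sample = `Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male`;

// Print one tab-separated "age gender" pair per record.
for (const { age, gender } of parseOCRResult(sample)) {
  console.log(`${age}\t${gender}`);
}
```

Because the pattern only captures the digits and the gender word, whatever noise sits between the labels and the values is ignored. Records where a label itself is misread are silently skipped, so it may be worth comparing the number of parsed records with what you expect.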