I am writing some simple JavaScript code to extract text from an image, using the Tesseract image-processing library. However, I found that Tesseract is not 100% accurate (or maybe I don't know how to use it correctly).
For example, after converting the image text to an array of strings and scanning each string one by one, I get the following strings, which are not formatted consistently:
Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male
I split each string on + and space like this:
const ageAndGenderArray = line.split(" ") || line.split("+");
and I got the following output:
['Age', '+', '67', 'Gender', ':', 'Female', 'Age', ':', '45', 'Gender', ':', 'Female', 'Age', '+', '45', 'Gender', ':', 'Male']
['Age', ':', '44', 'Gender', ':', 'Male', 'Age', '36', 'Gender', ':', 'Female', 'Age', ':', '56', 'Gender', ':', 'Male']
['Age', '+63', 'Gender', ':', 'Male', 'Age', ':', '62', 'Gender', ':', 'Female', 'Age', ':', '37', 'Gender', ':', 'Male']
As you can see, the input strings are not all formatted the same way. Some have Age + 67 and some have Age +63. In some places the separator is +, in others it is :, and in one case (Age 36) it is missing entirely. So I am not able to extract the values reliably.
I am expecting output like this:
63 Male
62 Female
37 Male
So how can I parse such inconsistent strings?
My code :
const processImage = () => {
  Tesseract.recognize(file, "eng", { logger: (m) => console.log(m) }).then(
    ({ data: { text } }) => {
      console.log(text);
      const parsedCandidates = parseOCRResult(text);
      setCandidates(parsedCandidates);
    }
  );
  console.log(file);
};
const parseOCRResult = (text) => {
  // parsing logic of strings
}
Use the regex Age[^\d]+ to split. It matches 'Age' followed by any run of non-digit characters, so each match stops right where the digits begin and every resulting piece starts with the age.
let input = `Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male`
let results = input.split(/Age[^\d]+/)
// clean the results
results = results.map(item => item.replace(' Gender :', '').trim()).filter(i => i.length)
console.log(results)
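If you would rather get the age and gender as separate values instead of cleaned-up strings, one alternative is to match each pair directly with capture groups. The following is only a sketch: it fills in the parseOCRResult stub from the question with an assumed return shape ({ age, gender } objects, which the question does not specify), and it assumes the OCR always spells "Age", "Gender", "Male" and "Female" correctly and only garbles the separators.

```javascript
// Sketch of parseOCRResult: the returned { age, gender } shape is an assumption.
function parseOCRResult(text) {
  // "Age", any run of non-digits (":", "+", spaces), then the digits;
  // "Gender", any run of non-word characters, then Male or Female.
  const pattern = /Age\D*(\d+)\s*Gender\W*(Male|Female)/gi;
  // matchAll yields one match per Age/Gender pair; index 0 is the full match,
  // so skip it and keep only the two capture groups.
  return Array.from(text.matchAll(pattern), ([, age, gender]) => ({
    age: Number(age),
    gender,
  }));
}

const sample = `Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male`;

const candidates = parseOCRResult(sample);
// Print one "age gender" pair per line, as in the expected output.
console.log(candidates.map((c) => `${c.age} ${c.gender}`).join("\n"));
```

As a side note on the original attempt, line.split(" ") || line.split("+") only ever performs the first split: split always returns an array, and an array is truthy, so the right-hand side of || is never evaluated.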