node.js firebase google-cloud-firestore google-cloud-functions google-cloud-vision

Counting words after Cloud Vision OCR and saving the word count to Firestore produces a different number on each try with the same document


I OCR a PDF with Google Cloud Vision and read its JSON output from Firebase Storage. I then extract the full text from the JSON, count the words, and save the word count to Firestore. The weird thing is that on every try I get a different word count, even though the document is the same. On top of that, the word count seems to be written to the same Firestore document several times until some function (I am not sure which, perhaps the one counting the words) is done. Why Firestore saves data to the document more than once beats me. Does anyone have an idea of what is going on, and how I can get a single word count saved only once? Here's the code, which lives in a Firebase Cloud Function:

if (path.basename(object.name).startsWith('output') && path.basename(object.name).split('.').pop() === "json") {       
        // Get references
        const fileBucket = object.bucket; // The Storage bucket that contains the file.
        const filePath = object.name; // File path in the bucket.       

         // Download JSON     
        const bucket = admin.storage().bucket(fileBucket);   
        const downloadResponse = await bucket.file(filePath).download();       
        const bufferToJson = downloadResponse.toString();
        const jsObject = JSON.parse(bufferToJson);

        // Text
        const textArray = jsObject.responses.map(async (response) => {
          return response.fullTextAnnotation.text;
        });
        const readyArray = await Promise.all(textArray);
        const fullTextReady = readyArray.join();

        // Count words
        async function countWords(str) {
          return str.trim().split(/\s+/).length;
        }
        const words = await countWords(fullTextReady);       


        // Text confidence
        const textConfidenceArray = jsObject.responses.map(async (response) => {
          return response.fullTextAnnotation.pages.map((page) => {
            return page.confidence;
          })
        })
        const textConfidence = await Promise.all(textConfidenceArray);        
        const textConfidence2 = textConfidence.flat();
        const sum = textConfidence2.reduce((accumulator, currentValue) => {
          return accumulator + currentValue
        },0);
        const average = sum / textConfidence2.length;
        const textConfidence3 = Number(average).toFixed(2) * 100;      
        
        
        // Language and Language Confidence
        const pages = jsObject.responses.map((response) => {
          return response.fullTextAnnotation.pages.map((page) => {
            return page.property.detectedLanguages
          })
        });
        const pages2 = await Promise.all(pages);
        const detectedLanguages = pages2.flat(2);    

        const languageAndConfidenceArray = detectedLanguages.map((language) => {
               const langCode = language.languageCode;
               const confidence = Number((language.confidence).toFixed(1)) * 100;
                return {
                  languageCode: langCode,
                  languageConfidence: confidence
                }
        })

        const languages = await Promise.all(languageAndConfidenceArray);

             
        // Save to Firestore
        const jsonLocation = path.dirname(object.name);
        const fileName = path.basename(jsonLocation);
        const results = path.dirname(jsonLocation);
        const order = path.dirname(results);   
        const destination = `${order}/${fileName}`;
        const docRef = db.collection('Clients').doc(destination);
        await docRef.set({
          fullText: fullTextReady,
          textConfidence: textConfidence3,         
          type: "application/pdf",
          pageCount: jsObject.responses.length,
          languages: languages,
          fileName: fileName,
          location: jsonLocation,
          wordCount: words
          }, { merge: true });
            
}

Solution

  • I have figured it out. The Vision API writes a separate JSON for every 20 pages of a PDF file (20 is the default output batch size). So if your PDF has 34 pages it creates 2 JSONs, and if it has 100 pages it creates 5 JSONs. My Cloud Function ran once for every JSON that landed in the bucket, so each run simply overwrote the information from the previous JSON, and the saved word count depended on whichever JSON was processed last. The solution was to accumulate the values from all the JSONs instead of overwriting them. Keep in mind that Firebase Cloud Functions do not guarantee the ordering of events, so your function needs checks to know which JSON it is handling. Luckily, the Vision API numbers the JSONs like this: 1-3, 2-3, 3-3. Hopefully this helps someone. Here's my updated code:

    if (path.basename(object.name).startsWith('output') && path.basename(object.name).split('.').pop() === "json") {    
         
            // Get references
            const fileBucket = object.bucket; // The Storage bucket that contains the file.
            const filePath = object.name; // File path in the bucket.      
    
             // Download JSON     
            const bucket = admin.storage().bucket(fileBucket);   
            const downloadResponse = await bucket.file(filePath).download();
            // const url = await getDownloadURL(bucket.file(filePath));       
            const bufferToJson = downloadResponse.toString();
            const jsObject = JSON.parse(bufferToJson);
    
            // jsObject.responses.forEach((response) => {
            //   return response.fullTextAnnotation.pages.map((page) => {
            //     console.log(page.property);
            //   })
            // });
    
            // Text
            const textArray = jsObject.responses.map(async (response) => {
              return response.fullTextAnnotation.text;
            });
            const readyArray = await Promise.all(textArray);
            // join() with no argument inserts commas between pages; use a space instead
            const fullTextReady = readyArray.join(' ');
    
            // Count words
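            function countWords(str) {
              const trimmed = str.trim();
              // ''.split(/\s+/) yields [''], so an empty text must be counted as 0 explicitly
              return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
            }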
            const words = countWords(fullTextReady);       
    
    
            // Text confidence
            const textConfidenceArray = jsObject.responses.map(async (response) => {
              return response.fullTextAnnotation.pages.map((page) => {
                return page.confidence;
              })
            })
            const textConfidence = await Promise.all(textConfidenceArray);        
            const textConfidence2 = textConfidence.flat();
            const sum = textConfidence2.reduce((accumulator, currentValue) => {
              return accumulator + currentValue
            },0);
            const average = sum / textConfidence2.length;
            const textConfidence3 = Number(average).toFixed(2) * 100;      
            
            
            // Language and Language Confidence
            const pages = jsObject.responses.map((response) => {
              return response.fullTextAnnotation.pages.map((page) => {
                if (page.property && page.property.detectedLanguages) {
                    return page.property.detectedLanguages
                }            
              })
            });
            const pages2 = await Promise.all(pages);
            const detectedLanguages = pages2.flat(2);
            const filteredDetectedLanguages = detectedLanguages.filter((language) => language !== undefined)
            console.log(filteredDetectedLanguages);
    
                 
           const languageAndConfidenceArray = filteredDetectedLanguages.map((language) => {
                   const langCode = language.languageCode;
                   const confidence = Number((language.confidence).toFixed(1)) * 100;
                    return {
                      languageCode: langCode,
                      languageConfidence: confidence
                    }
            })
            
            
    
            const languages = await Promise.all(languageAndConfidenceArray);        
    
                 
            // Save to Firestore
            const jsonLocation = path.dirname(object.name);
            const fileName = path.basename(jsonLocation);
            const results = path.dirname(jsonLocation);
            const order = path.dirname(results);   
            const destination = `${order}/${fileName}`;
            const docRef = db.collection('Clients').doc(destination);
    
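           const doc = await docRef.get();
           // doc.data() is undefined when the document does not exist yet
           const existing = doc.data() || {};
          
              if (!existing.fullText) { 
                console.log('First JSON for this document', path.basename(object.name))              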
                await docRef.set({              
                  textConfidence: textConfidence3,         
                  type: "application/pdf",
                  pageCount: jsObject.responses.length,
                  languages: languages,
                  fileName: fileName,
                  location: jsonLocation,
                  wordCount: words,
                  fullText: fullTextReady,
                  jsonArray:  [path.basename(object.name)]          
                  }, { merge: true });
    
              }  else {
                console.log('Merging an additional JSON', path.basename(object.name))            
                // Combining texts from more than one json            
                const combinedText = fullTextReady + ' ' + doc.data().fullText;
                // Combining text confidences and averaging them out
                const averageTextConfidence = (Number(doc.data().textConfidence) + Number(textConfidence3)) / 2;            
                // Combining page counts
                const combinedPages = Number(doc.data().pageCount) + Number(jsObject.responses.length);
                // Combining language arrays
                const combinedLanguageArray = doc.data().languages.concat(languages);
                // Combining word counts
                const combinedWordCount = Number(doc.data().wordCount) + Number(words);
                // Update json array
                const array = [...doc.data().jsonArray, path.basename(object.name)]
    
                //Saving to Firestore
                await docRef.set({              
                  textConfidence: averageTextConfidence,    
                  pageCount: combinedPages,
                  languages: combinedLanguageArray,        
                  wordCount: combinedWordCount,
                  fullText: combinedText,
                  jsonArray: array   
                  }, { merge: true });
    
              }
            
    
            
                
    }
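
The accumulate-instead-of-overwrite idea above can also be pulled out into a small pure function, which makes it easier to test without Firestore. What follows is only a sketch under a few assumptions, not part of the original function: `mergeShard` and the `jsonName` field are hypothetical names, and the example file names are made up. It differs from the code above in two ways: it skips a JSON whose name is already in `jsonArray` (in case the same Storage event is delivered twice), and it weights the text confidence by page count, so that three or more JSONs average correctly instead of halving each time.

```javascript
// Hypothetical helper: folds the values extracted from one Vision output JSON
// ("shard") into the data already stored for the PDF. Pure function, no I/O.
function mergeShard(existing, shard) {
  const prev = existing || {};
  const seen = prev.jsonArray || [];

  // If this JSON was already merged (duplicate event), change nothing.
  if (seen.includes(shard.jsonName)) {
    return prev;
  }

  const prevPages = prev.pageCount || 0;
  const totalPages = prevPages + shard.pageCount;

  // Page-weighted average of the text confidence across all shards so far.
  const textConfidence = totalPages === 0
    ? 0
    : ((prev.textConfidence || 0) * prevPages +
        shard.textConfidence * shard.pageCount) / totalPages;

  return {
    ...prev,
    textConfidence,
    pageCount: totalPages,
    wordCount: (prev.wordCount || 0) + shard.wordCount,
    languages: (prev.languages || []).concat(shard.languages),
    // Text is appended in arrival order, which is not necessarily page order.
    fullText: prev.fullText ? `${prev.fullText} ${shard.fullText}` : shard.fullText,
    jsonArray: [...seen, shard.jsonName],
  };
}

// Example: two shards of a 34-page PDF (file names are invented for illustration).
const shardA = { jsonName: 'shard-a.json', pageCount: 20, wordCount: 500,
  textConfidence: 90, languages: [], fullText: 'first part' };
const shardB = { jsonName: 'shard-b.json', pageCount: 14, wordCount: 300,
  textConfidence: 80, languages: [], fullText: 'second part' };

const merged = mergeShard(mergeShard(undefined, shardA), shardB);
console.log(merged.pageCount, merged.wordCount); // 34 800
```

In the Cloud Function, the read and the write would still need to happen atomically, because two JSONs can be processed at the same time. One way to do that is to call `mergeShard` inside `db.runTransaction(...)`, reading the document with the transaction's `get` and writing the result with its `set`.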