javascript node.js amazon-web-services aws-sdk amazon-rekognition

AWS Rekognition timeout error happening randomly

I have an Electron application that performs facial recognition to decide whether or not a person can enter a location, and I'm using Amazon Rekognition for that.

Everything was working fine (for a few months) until, two days ago, a customer reported to me that the app was behaving strangely, like it wasn't responding to requests for facial recognition.

After several tests, I discovered that what is happening with it is a timeout error, which occurs in all API calls, whether they are looking for faces (SearchFacesByImage) or registering new faces (IndexFaces).

The error says:

{
    "message": "connect ETIMEDOUT 3.226.60.54:443",
    "errno": -4039,
    "code": "TimeoutError",
    "syscall": "connect",
    "address": "3.226.60.54",
    "port": 443,
    "time": "2022-12-14T13:50:10.909Z",
    "region": "us-east-1",
    "hostname": "rekognition.us-east-1.amazonaws.com",
    "retryable": true
}

What intrigued me was the fact that everything was working fine, until this behavior just started happening (and I didn't make any code changes/updates to the app running on my client's computer).

And what makes me even more intrigued is that this behavior occurs completely randomly and only on the machine of that client in question. Sometimes the API calls work correctly (returning whether the person was recognized or not), but most of the time, the calls take about 90 seconds to return the timeout error. When executing the same code on my machine (same methods and same CollectionId) everything runs normally and there was no timeout error at any time - while at the exact same moment on my client's machine the behavior continues.

I was using aws-sdk and then switched to @aws-sdk/client-rekognition (thinking that could solve the problem) but the code only worked on a few of the first calls to the API and a few minutes later it got the timeout errors again.

The code I'm using to configure and make calls to Rekognition is basically this:

const { RekognitionClient, IndexFacesCommand, SearchFacesByImageCommand } = require('@aws-sdk/client-rekognition')

const rekognitionClient = new RekognitionClient({
    credentials: {
        accessKeyId: 'accessKeyId',
        secretAccessKey: 'secretAccessKey'
    },
    region: 'us-east-1'
})

const registerFaceOnRekognition = async (bytes, userId) => {
    const params = {
        CollectionId: 'collectionId',
        Image: { Bytes: bytes },
        ExternalImageId: userId,
        MaxFaces: 1,
        QualityFilter: 'HIGH'
    }

    const command = new IndexFacesCommand(params)

    try {
        const { FaceRecords } = await rekognitionClient.send(command)

        if (!FaceRecords.length) {
            console.log('No faces detected.')

            return
        }

        console.log('Face created:')
        console.log(FaceRecords[0].Face.FaceId)
    } catch (error) {
        console.error(error) // timeout error
    }
}

const searchFaceByImageOnRekognition = async (bytes) => {
    const params = {
        CollectionId: 'collectionId',
        Image: { Bytes: bytes },
        MaxFaces: 1,
        FaceMatchThreshold: 99,
        QualityFilter: 'HIGH'
    }

    const command = new SearchFacesByImageCommand(params)

    try {
        const { FaceMatches } = await rekognitionClient.send(command)

        if (!FaceMatches.length) {
            console.log('This face has not been registered yet')

            return
        }

        console.log('Face found:')
        console.log(FaceMatches[0].Face.ExternalImageId)
    } catch (error) {
        console.error(error) // timeout error
    }
}

// Method called through the renderer process that has a canvas where the webcam view is reproduced
const onTakePicture = (event, data) => {
    const bytes = Buffer.from(data.dataURL.replace('data:image/jpeg;base64,', ''), 'base64')

    // If there is a userId, register the face in the image
    if (data.userId) {
        registerFaceOnRekognition(bytes, data.userId)

        return
    }

    // Else, search for the face in the image
    searchFaceByImageOnRekognition(bytes)
}

Note that during all tests on my client's computer the internet connection was stable and working properly.

What is the best way to investigate and resolve this issue?

UPDATE:

I enabled Rekognition debug logs and they can be found at: https://gist.github.com/IgorSamer/4e58e09f3fa615401f85ca325b794245

In it, the first three requests (2022-12-16T13:48:45.932Z, 2022-12-16T13:53:20.325Z and 2022-12-16T14:19:12.479Z) occur normally. However, all other consecutive requests start to give the timeout error, where, in fact, no data is returned after the [DEBUG] App: endpoints Resolved endpoint: step.

As previously mentioned, the internet connection is working fine. I also managed to reproduce the error via remote access, which means the machine's internet connection was OK at the time of the error.

Is there a possibility that there is a block made by my client's firewall/network that prevents requests from being sent by the SDK after a few successful requests? If yes, what is the best way to investigate this?

Solution

Exploration

This is what I would do initially to gather some info:

Verify whether this is happening ALL the time with that specific client.
Verify whether this is happening with ONLY one client, or with more.
Verify whether this is happening in one or in multiple regions (e.g. us-east-1).
Verify whether Amazon Rekognition has had (or still has) issues in the affected region during the time window of interest.
   1. Check Rekognition's status in the Health dashboard in your AWS console.
Use the Amazon Rekognition Guidelines and Quotas as a reference to determine whether your app's usage of Rekognition is under the set limits.
   1. Note there's a limit on TPS per operation (e.g. SearchFacesByImage, IndexFaces) per account.
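
The first checks above are easier to answer with data than with impressions. As a minimal sketch (not part of the original app), each SDK call could be wrapped so that its outcome and duration are appended to a local file. The names `withCallLog` and `LOG_FILE` and the log location are assumptions made for illustration; only Node.js built-in modules are used.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical log location; point it at any folder the app is allowed to write to.
const LOG_FILE = path.join(os.tmpdir(), 'rekognition-calls.log')

// Runs an async call, appends one JSON line describing its outcome,
// then returns the result or rethrows the original error.
const withCallLog = async (label, call, logFile = LOG_FILE) => {
    const startedAt = Date.now()
    const entry = { time: new Date(startedAt).toISOString(), label }

    try {
        const result = await call()
        entry.ok = true

        return result
    } catch (error) {
        entry.ok = false
        // Fields present on Node.js network errors such as ETIMEDOUT
        entry.code = error.code
        entry.syscall = error.syscall
        entry.address = error.address
        entry.message = error.message

        throw error
    } finally {
        entry.durationMs = Date.now() - startedAt
        fs.appendFileSync(logFile, JSON.stringify(entry) + '\n')
    }
}

// Example: withCallLog('SearchFacesByImage', () => rekognitionClient.send(command))
```

After a day of use, the file should show how often the calls fail, which addresses appear in the failures, and whether the failures cluster around particular times.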

Possible approaches

Verify whether there was a change in the client's network/firewall. Just ask.
Replicate your app's API call with the AWS CLI and study the logs.
   1. Access your client's device remotely.
   2. Set up temporary AWS credentials (remember to revoke access after the test).
   3. Send an API call to the Rekognition endpoint. Note that even a 4XX error will be good news, as it means you got at least some response.
Set up proper logging for your app (as CloudWatch logs may not be enough to troubleshoot).
   1. Consider an APM tool such as Splunk APM or New Relic APM.
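
Because the reported error is a `connect ETIMEDOUT` to one specific IP address, it may also help to test the TCP connection on its own, without the SDK involved. The following is a rough sketch using only Node.js built-in modules; `probeTcp` and `probeHost` are hypothetical helper names, and the 5 second timeout is an arbitrary choice.

```javascript
const dns = require('dns').promises
const net = require('net')

// Opens a plain TCP connection and reports whether the connect phase succeeded.
const probeTcp = (host, port, timeoutMs = 5000) => new Promise((resolve) => {
    const startedAt = Date.now()
    const socket = net.connect({ host, port })

    const finish = (ok, error) => {
        clearTimeout(timer)
        socket.destroy()
        resolve({ host, port, ok, ms: Date.now() - startedAt, error })
    }

    // Explicit timer so a silently dropped connection is reported quickly
    const timer = setTimeout(() => finish(false, 'timeout'), timeoutMs)

    socket.on('connect', () => finish(true))
    socket.on('error', (err) => finish(false, err.code))
})

// Resolves every address behind a hostname and probes each one.
const probeHost = async (hostname, port = 443, timeoutMs = 5000) => {
    const addresses = await dns.lookup(hostname, { all: true })

    return Promise.all(addresses.map(({ address }) => probeTcp(address, port, timeoutMs)))
}

// Example: probeHost('rekognition.us-east-1.amazonaws.com').then(console.table)
```

If some addresses connect while others time out, or if the probe starts failing only after a few successful calls, that would point toward the client's network or firewall rather than the app code. If the probe always succeeds while the SDK still times out, the cause is more likely further up the stack (for example a proxy or TLS inspection), though this is only a heuristic.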

I hope this helps you to at least put together a troubleshooting strategy.