arraysjsonamazon-web-servicesamazon-s3amazon-s3-select

AWS S3 - Search a specific text in nested JSON file


{
    "repo": [
        {
            "book": 47,
            "version": 1,
            "bookName": "Book1",
            "chapters": [
                {
                    "chapterId": 1,
                    "chapterContents": [
                        {
                            "line": 1,
                            "Text": "1. The first text of chapter 1 book 1"
                        },
                        {
                            "line": 2,
                            "Text": "2. The second text of chapter 1 book 1"
                        }
                    ]
                },
                {
                    "chapterId": 2,
                    "chapterContents": [
                        {
                            "line": 1,
                            "Text": "1. The first text of chapter 2 book 1"
                        }
                    ]
                }
            ]
        }
    ]
}

This is the format for the data model stored as JSON file in Amazon S3 . There are lot of books with lot of chapters and text contents. The requirement is to search a particular text across all books and list the line, chapter, book and the version where the text is found . How can I make S3 queries on nested array JSON file ?


Solution

  • You can use Amazon Athena to query nested JSON data from S3.

    Searching for values inside json should be possible. You can find an example here:

    WITH dataset AS (
      SELECT * FROM (VALUES
        (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
        (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
        (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
      ) AS t (users)
    )
    SELECT json_extract_scalar(users, '$.name') AS user
    FROM dataset
    WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')