amazon-web-servicesamazon-s3amazon-s3-select

AWS S3 Select - Retrieve data from 2 different levels of a json


I have this json stored in a S3 file (which is actually the output of a aws Comprehend EntitiesDetection job => meaning I have unfortunately no control of how this json is organized, it is uploaded to S3 by AWS Job itself, so I can't modify the structure of this json input):

{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 11, "Score": 0.9815415143966675, "Text": "5 start-up", "Type": "QUANTITY"}, {"BeginOffset": 61, "EndOffset": 183, "Score": 0.8883988261222839, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 212, "EndOffset": 327, "Score": 0.8162660002708435, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 337, "EndOffset": 339, "Score": 0.7018660306930542, "Text": "Trump, "Type": "PERSON"}, {"BeginOffset": 364, "EndOffset": 484, "Score": 0.8932908177375793, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 513, "EndOffset": 626, "Score": 0.8157837986946106, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 636, "EndOffset": 638, "Score": 0.6977631449699402, "Text": "Oprah Winfrey", "Type": "PERSON"}, {"BeginOffset": 963, "EndOffset": 971, "Score": 0.4658013880252838, "Text": "facebook", "Type": "ORGANIZATION"}, {"BeginOffset": 972, "EndOffset": 979, "Score": 0.6886632442474365, "Text": "twitter", "Type": "TITLE"}, {"BeginOffset": 985, "EndOffset": 993, "Score": 0.7970104813575745, "Text": "linkedin", "Type": "ORGANIZATION"}, {"BeginOffset": 994, "EndOffset": 998, "Score": 0.36566048860549927, "Text": "Menu", "Type": "TITLE"}
  ],
  "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"},


{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 13, "Score": 0.9995881915092468, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 27, "EndOffset": 69, "Score": 0.8302029371261597, "Text": "Constitution \u00e9conomique\" - African Manager", "Type": "TITLE"}, {"BeginOffset": 94, "EndOffset": 126, "Score": 0.48702114820480347, "Text": ".wpb_animate_when_almost_visible", "Type": "OTHER"}, {"BeginOffset": 290, "EndOffset": 298, "Score": 0.47538018226623535, "Text": "Fran\u00e7ais", "Type": "OTHER"}, {"BeginOffset": 299, "EndOffset": 306, "Score": 0.6746407747268677, "Text": "English", "Type": "OTHER"}, {"BeginOffset": 464, "EndOffset": 476, "Score": 0.9992197155952454, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 515, "EndOffset": 527, "Score": 0.9994662404060364, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 581, "EndOffset": 596, "Score": 0.6652442812919617, "Text": "African Manager", "Type": "ORGANIZATION"}, {"BeginOffset": 599, "EndOffset": 615, "Score": 0.8012278079986572, "Text": "09/09/2019 08:45", "Type": "DATE"}, {"BeginOffset": 674, "EndOffset": 685, "Score": 0.8724801540374756, "Text": "tunisiennes", "Type": "OTHER"}, {"BeginOffset": 689, "EndOffset": 701, "Score": 0.9975908398628235, "Text": "15 septembre", "Type": "DATE"}, {"BeginOffset": 753, "EndOffset": 781, "Score": 0.9481445550918579, "Text": "certain nombre d\u2019initiatives", "Type": "QUANTITY"}
  ],
  "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"}

//and so on ...

I need to find and retrieve all files where there is a type=PERSON with score >0.7 and retrieve the following data: the person and the file.

Today my query expression is:

select s.Text from s3object[*].Entities[*] s where s.Type= 'PERSON' AND s.Score > 0.7;

this outputs:

[

    {
        "Text": "Trump"
    },
    {
        "Text": "Oprah winfrey
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },

]

This is partly good but I need to associate with each "Text" (name of the person) the File where it comes from. So what I expect as the query output is:

[

    {
        "Text": "Trump",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Oprah winfrey,
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui"
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },

]

How to retrieve this ? Tried a lot of possibilities using https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html, but none worked.


Solution

  • There's a note in this page you shared:

    Note Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.

    I'd set up Athena for more complex queries against S3 directly (example from official doc). Another alternative is to restructure your JSON in a way that you can avoid joins, like duplicating "File" in the "Text" level. Of course you can also index this JSON in many other tools and formats to make data searchable/"queryable".