
Datastax Graph Loader - Loading non-uniform JSON files' meta-properties


Below are three sample JSON files and a graph loader script. The first file is the most complex, and most of that complexity should be ignored by the loading script. The second file is a simple variation that occurs often. The last file gives a sense of how widely the files can differ from one another and is the most direct example of where the problems currently are.

Before diving in, note that this is only a close approximation of the structure of the data I'm actually working with and of its loading script. There are better ways to model vertices for people, but this was the first example I could think of.

Sample Input JSON File 1

/*{
  "peopleInfo": [
    {
      "id": {
        "idProperty1": "property1Value",
        "idProperty2": "someUUID"
      }
    },
    {
      "people": [
        {
          "firstName": "person1FirstName",
          "lastName": "person1LastName",
          "sequence": 1
        },
        {
          "firstName": "person2FirstName",
          "lastName": "person2LastName",
          "sequence": 2
        },
        { //children and twins may be switched such that twins are sequence 3 & 4 and one or both of them have children with corresponding sequences
          "children": [
            {
              "firstName": "firstChildFirstName",
              "lastName": "firstChildLastName",
              "sequence": 3
            },
            {
              "firstName": "secondChildFirstName",
              "lastName": "secondChildLastName",
              "sequence": 4
            },
            {
              "twins": [
                {
                  "firstName": "firstTwinFirstName",
                  "lastName": "firstTwinLastName",
                  "sequence": 5
                },
                {
                  "firstName": "secondTwinFirstName",
                  "lastName": "secondTwinLastName",
                  "sequence": 6
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}*/

The second file doesn't contain any children.

Sample Input JSON File 2

/*{
  "peopleInfo": [
    {
      "id": {
        "idProperty1": "property1Value",
        "idProperty2": "someUUID"
      }
    },
    {
      "people": [
        {
          "firstName": "person1FirstName",
          "lastName": "person1LastName",
          "sequence": 1
        },
        {
          "firstName": "person2FirstName",
          "lastName": "person2LastName",
          "sequence": 2
        }
      ]
    }
  ]
}*/

The third file contains twins, but no single-born children.

Sample Input JSON File 3

    /*{
      "peopleInfo": [
        {
          "personsID": {
            "idProperty1": "property1Value",
            "idProperty2": "someUUID"
          }
        },
        {
          "people": [
            { // twins can exist without top level people(parents work well to define this) and without other children. Also, children can exist without twins and without parents as well.
              "twins": [
                {
                  "firstName": "firstTwinFirstName",
                  "lastName": "firstTwinLastName",
                  "sequence": 3
                },
                {
                  "firstName": "secondTwinFirstName",
                  "lastName": "secondTwinLastName",
                  "sequence": 4
                }
              ]
            }
          ]
        }
      ]
    }*/

Loading Script

inputBaseDir = "/path/to/directories"

import java.io.File as javaFile;
def list = []

new javaFile(inputBaseDir).eachDir() { dir ->
  list << dir.getAbsolutePath()
}
for (item in list){
  def fileBuilder = File.directory(item)
  def peopleInfoMapper = fileBuilder.map {
    it['idProperty1'] = it.peopleInfo.id.idProperty1[0]
    it['idProperty2'] = it.peopleInfo.id.idProperty2[0]

    def ppl = it.peopleInfo.people[1]
    people = ppl.collect{
      if ( it['firstName'] != null){
        it['firstName'] = it['firstName']
      } else if ( it['lastName'] != null){
        it['lastName'] = it['lastName']
      } else if ( it['sequence'] != null) {
        it['sequence'] = it['sequence']
      }

      //filling the null values below is the temporary non-solution to get the data to load
      if ( it['firstName'] == null){
        it['firstName'] = ''
      }
      if ( it['lastName'] == null){
        it['lastName'] = ''
      }
      if ( it['sequence'] == null){
        it['sequence'] = 0
      }
      it
    }
    it['people'] = people
    it.remove('peopleInfo')
    it
  }
  load(peopleInfoMapper).asVertices {
    label "peopleInfo"
    key 'idProperty2'
    vertexProperty 'people',{
      value 'firstName'
      value 'lastName'
      value 'sequence'
      ignore 'children'
      ignore 'twins'
    }
  }
}
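
For anyone puzzling over the `[0]` and `[1]` indexes in the mapper: `peopleInfo` is a list of two maps, so a property lookup on it is applied to each element and comes back as a list. The sketch below reproduces that navigation in plain Groovy, outside the loader, on cut-down copies of File 1 and File 3. It assumes the loader hands the mapper ordinary maps and lists, which I have not verified; the variable names are mine.

```groovy
import groovy.json.JsonSlurper

def slurper = new JsonSlurper()

// Cut-down File 1: the identifier map sits under "id"
def file1 = slurper.parseText('''
{"peopleInfo": [
  {"id": {"idProperty1": "property1Value", "idProperty2": "someUUID"}},
  {"people": [{"firstName": "person1FirstName", "lastName": "person1LastName", "sequence": 1}]}
]}''')

// Cut-down File 3: the identifier map sits under "personsID" instead
def file3 = slurper.parseText('''
{"peopleInfo": [
  {"personsID": {"idProperty1": "property1Value", "idProperty2": "someUUID"}},
  {"people": [{"twins": [{"firstName": "firstTwinFirstName", "lastName": "firstTwinLastName", "sequence": 3}]}]}
]}''')

// Property access on a list is spread over its elements, hence the trailing index
def key1 = file1.peopleInfo.id.idProperty2[0]
def people1 = file1.peopleInfo.people[1]

// Neither element of File 3's peopleInfo has an "id" key, so this lookup finds nothing
def key3 = file3.peopleInfo.id.idProperty2[0]

// Reading the key File 3 actually uses does find the value
def key3Alt = file3.peopleInfo.personsID.idProperty2[0]
```

If the real files differ in the identifier key the way File 3 does here, the mapper would have to check both names before the loader can key the vertex on `idProperty2`.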

Problems

1

Looking at the third file: although the twins contain the allowed values, they shouldn't affect loading, because ignoring the 'twins' key should ignore all of their meta-property values. I believe the exception below is thrown because the file has no top-level people who aren't children or twins, so once the 'twins' key is ignored, all that is left for the vertexProperty 'people' is an empty map. My non-answer simply fills that empty map with empty strings for the names and a zero for the sequence, and those placeholders are loaded into the database along with the actual data.

java.lang.IllegalArgumentException: [On field 'people'] Provided map does not contain property value on field [sequence]: {twin=[{firstName=firstTwinFirstName,lastName=firstTwinLastName, sequence=1},{firstName=secondTwinFirstName,lastName=secondTwinLastName,sequence=2}]}

2

Looking at the first file: when the 'twins' key is ignored, or removed directly, an empty map is still left behind as a placeholder. The same non-solution in the loading script fills it, and it is loaded into the database along with the actual data.
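
To make the leftover placeholder concrete, here is a small plain-Groovy sketch with no loader involved. The `people` list is a hand-written stand-in for one top-level person plus a wrapper map that only holds 'twins'; `leftover` and `missing` are names I made up for the illustration.

```groovy
// One top-level person plus a wrapper map that contains nothing but 'twins'
def people = [
  [firstName: 'person1FirstName', lastName: 'person1LastName', sequence: 1],
  [twins: [[firstName: 'firstTwinFirstName', lastName: 'firstTwinLastName', sequence: 3]]]
]

// Removing the key empties the wrapper map but does not remove the map from the list
people.each { it.remove('twins') }
def leftover = people[1]

// The meta-properties the vertexProperty block expects but the leftover map lacks
def missing = ['firstName', 'lastName', 'sequence'].findAll { !leftover.containsKey(it) }
```

The list still has two entries after the removal, and the second one has none of the three expected values, which matches the shape of the complaint in the exception.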

Is there a best practice for dealing with these issues?


Solution

  • I don't know if this is the grooviest solution, but this seems to do the trick: strip the 'children' and 'twins' keys inside the mapper, then keep only the non-empty maps.

    inputBaseDir = "/path/to/directories"
    
    import java.io.File as javaFile;
    def list = []
    
    new javaFile(inputBaseDir).eachDir() { dir ->
      list << dir.getAbsolutePath()
    }
    for (item in list){
      def fileBuilder = File.directory(item)
      def peopleInfoMapper = fileBuilder.map {
        it['idProperty1'] = it.peopleInfo.id.idProperty1[0]
        it['idProperty2'] = it.peopleInfo.id.idProperty2[0]
    
        def ppl = it.peopleInfo.people[1]
        def people = ppl.collect{
          // strip the nested groups; remove() is a no-op when the key is absent,
          // and a map that held only 'children' or 'twins' is left empty
          it.remove('children')
          it.remove('twins')
          it // return the map itself so collect yields maps, not property values
        }
        def kept = people.findAll() // only gathers non-empty maps from people
        if (kept){
          it['people'] = kept
        } else {
          /* removing people without desired meta-properties enables
             loader to proceed when empty maps from the removal of
             children and/or twins are present, while top-level
             persons aren't */
          it.remove('people')
        }
        it.remove('peopleInfo')
        it
      }
      load(peopleInfoMapper).asVertices {
        label "peopleInfo"
        key 'idProperty2'
        vertexProperty 'people',{
          value 'firstName'
          value 'lastName'
          value 'sequence'
          ignore 'children'
          ignore 'twins'
        }
      }
    }
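
The filtering step can be checked on its own in plain Groovy before running it through the loader. The sketch below is not part of the loader API: `topLevelPeople` is a hypothetical helper, and the two input lists are hand-written stand-ins for the 'people' arrays of File 1 and File 3. It relies on the no-argument `findAll()`, which keeps only elements that are true under Groovy truth, so empty maps drop out.

```groovy
// Hypothetical helper: copy each map without the nested groups, then drop empty results
List<Map> topLevelPeople(List<Map> people) {
    people.collect { person ->
        person.findAll { key, value -> !(key in ['children', 'twins']) }
    }.findAll()
}

// Shape of File 1: two top-level people followed by a wrapper that only holds children
def file1People = [
  [firstName: 'person1FirstName', lastName: 'person1LastName', sequence: 1],
  [firstName: 'person2FirstName', lastName: 'person2LastName', sequence: 2],
  [children: [[firstName: 'firstChildFirstName', lastName: 'firstChildLastName', sequence: 3]]]
]

// Shape of File 3: nothing but a twins wrapper
def file3People = [
  [twins: [[firstName: 'firstTwinFirstName', lastName: 'firstTwinLastName', sequence: 3]]]
]

def kept1 = topLevelPeople(file1People)
def kept3 = topLevelPeople(file3People)
```

Unlike the mapper above, this version copies the maps instead of mutating them, which makes it easier to test. An empty result, as for the File 3 shape, is the cue to leave 'people' off the record entirely.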