neo4jpathspring-datacypherdata-lineage

Is there a way to track end-to-end data lineage through Neo4j Cypher query?


I'm using Spring-Data along with SpringBoot to populate my Neo4j graph db.

I've the following Neo4j entities defined:

Source entity -->

@NodeEntity
public class Source implements Comparable<Source> {

    @GraphId private Long id;

    private String name;
    private SourceType type;
    private String dataStoreName;
    private String dataStoreDesc;

    private Source() {
        // Empty constructor required as of Neo4j API 2.0.5
    };

    public Source(String name, SourceType type, String dataStoreName, String dataStoreDesc) {
        this.name = name;
        this.type = type;
        this.dataStoreName = dataStoreName;
        this.dataStoreDesc = dataStoreDesc;
    }
    @Relationship(type = "CONTAINS", direction = Relationship.UNDIRECTED)
    public Set<Field> fields;

    public void contains(Field field) {
        if (fields == null) {
            fields = new HashSet<Field>();
        }
        fields.add(field);
    }


    /* Getter and Setters */

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public SourceType getType() {
        return type;
    }

    public void setType(SourceType type) {
        this.type = type;
    }

    public String getDataStoreName() {
        return dataStoreName;
    }

    public void setDataStoreName(String dataStoreName) {
        this.dataStoreName = dataStoreName;
    }

    public String getDataStoreDesc() {
        return dataStoreDesc;
    }

    public void setDataStoreDesc(String dataStoreDesc) {
        this.dataStoreDesc = dataStoreDesc;
    }

    public Set<Field> getFields() {
        return fields;
    }

    public void setFields(Set<Field> fields) {
        this.fields = fields;
    }

    @Override
    public int compareTo(Source other) {
        String name = other.getName();
        SourceType type = other.getType();
        if(this.name.equalsIgnoreCase(name) && this.type.equals(type))
            return 0;

        return -1;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((name == null) ? 0 : name.hashCode());
        result = prime * result + ((type == null) ? 0 : type.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Source other = (Source) obj;
        if (name == null) {
            if (other.name != null)
                return false;
        } else if (!name.equals(other.name))
            return false;
        if (type != other.type)
            return false;
        return true;
    }

}

Field entity-->

@NodeEntity
    public class Field implements Comparable<Field> {

        @GraphId private Long id;

        private String name;
        private FieldType fieldType;
        private SourceType sourceType;

        private String logicalName;
        private String dataType;
        private String dataSize;
        private String description;

        private Field() {
            // Empty constructor required as of Neo4j API 2.0.5
        };

        public Field(String name, FieldType fieldType, SourceType sourceType, String logicalName, String dataType, String dataSize, String description) {
            this.name = name;
            this.fieldType = fieldType;
            this.sourceType = sourceType;
            this.logicalName = logicalName;
            this.dataType = dataType;
            this.dataSize = dataSize;
            this.description = description;
        }
        @Relationship(type = "MAPS-TO", direction = Relationship.UNDIRECTED)
        public Set<Field> fields;

        public void mapsTo(Field field) {
            if (fields == null) {
                fields = new HashSet<Field>();
            }
            fields.add(field);
        }

        /* Getter and Setters */

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public FieldType getFieldType() {
            return fieldType;
        }

        public void setFieldType(FieldType fieldType) {
            this.fieldType = fieldType;
        }

        public SourceType getSourceType() {
            return sourceType;
        }

        public void setSourceType(SourceType sourceType) {
            this.sourceType = sourceType;
        }

        public String getLogicalName() {
            return logicalName;
        }

        public void setLogicalName(String logicalName) {
            this.logicalName = logicalName;
        }

        public String getDataType() {
            return dataType;
        }

        public void setDataType(String dataType) {
            this.dataType = dataType;
        }

        public String getDataSize() {
            return dataSize;
        }

        public void setDataSize(String dataSize) {
            this.dataSize = dataSize;
        }

        public String getDescription() {
            return description;
        }

        public void setDescription(String description) {
            this.description = description;
        }

        public Set<Field> getFields() {
            return fields;
        }

        public void setFields(Set<Field> fields) {
            this.fields = fields;
        }

        @Override
        public int compareTo(Field other) {
            String name = other.getName();
            FieldType fieldType = other.getFieldType();
            SourceType sourceType = other.getSourceType();
            if(this.name.equalsIgnoreCase(name) && this.fieldType.equals(fieldType) && this.sourceType.equals(sourceType))
                return 0;

            return -1;
        }

        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + ((fieldType == null) ? 0 : fieldType.hashCode());
            result = prime * result + ((name == null) ? 0 : name.hashCode());
            result = prime * result + ((sourceType == null) ? 0 : sourceType.hashCode());
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null)
                return false;
            if (getClass() != obj.getClass())
                return false;
            Field other = (Field) obj;
            if (fieldType != other.fieldType)
                return false;
            if (name == null) {
                if (other.name != null)
                    return false;
            } else if (!name.equals(other.name))
                return false;
            if (sourceType != other.sourceType)
                return false;
            return true;
        }


    }

So, a Source CONTAINS multiple Fields. And a Field is MAPS-TO one or more other Fields.

Each Source is of a SourceType.

My different SourceTypes are : PRODUCER, INBOUND, STAGING, INTERMEDIATE, OUTBOUND, CONSUMER.

public enum SourceType {
        PRODUCER, INBOUND, STAGING, INTERMEDIATE, OUTBOUND, CONSUMER;
    }

Each Field is of a FieldType.

My different FieldTypes are : FILE_FIELD, DB_COLUMN.

public enum FieldType {
    FILE_FIELD, DB_COLUMN;
}

My data lineage is as below: PRODUCER --> INBOUND --> STAGING --> INTERMEDIATE --> OUTBOUND --> CONSUMER

I'm now looking for an advanced Cypher query by which if I provide a Field in CONSUMER Source, I'm able to track its lineage back until the PRODUCER Source.

Similarly, I'm also looking for a query by which if I provide a Field in PRODUCER Source, I'm able to track its lineage forward until CONSUMER Source.

I've tried to build up the query using shortestPath and neighbors functions but it doesn't seem to pull up the results I'm looking for.

Any suggestions/pointers will be appreciated.

Thanks in advance !

UPDATE-1

Background on my data lineage: My application gets file from an external application (PRODUCE). I'm aware of what database tables/columns of external application populated the fields in file. So here, PRODUCER will be my Source node; each table.column of external application (that populated the file) is a Field node and PRODUCER Source node will have CONTAINS relationship with all the Field nodes (representing table.column of external application database table that populated the file).

The file from external application is called INBOUND. It is a comma delimited file. I'm aware of what are the field names coming in the file and in what order. So here, INBOUND will be my Source node; each field in file will be a Field node and INBOUND Source node will have CONTAINS relationship with all the Field nodes (representing the file fields in the inbound file). Also each of the Field nodes of INBOUND Source will have a MAPS_TO relationship with a Field node of PRODUCER Source (one-to-one mapping).

Going on similar workflow, my next stage is called STAGING wherein I load the inbound file fields into my database table/column. So here, STAGING will be my Source node and each column of the database table (into which I load the file fields) will represent a Field node. STAGING Source node will have CONTAINS relationship with all the Field nodes (that represent the db table.column of db table into which I load file fields). Also each of the Field nodes of STAGING Source will have a MAPS_TO relationship with a Field node of INBOUND Source (one-to-one mapping).

Similar my next stage is INTERMEDIATE. In this stage, I'm querying the table in which I loaded the fields of input file and then flushing the output into another file (based on my business use case, I may chose to query all or only a subset of table columns that were populated from input file). I'm aware of what fields and in what order will go into my INTERMEDIATE file. So here, INTERMEDIATE is my Source node and each field that goes into INTERMEDIATE file represents my Field node. Also INTERMEDIATE Source will have CONTAINS relationship with all the Field nodes that represent a field in the intermediate file. Also each of these Field nodes will have MAPS_TO relationship with the fields of STAGING Source (one-to-one mapping).

Similarly I've OUTBOUND stage and lastly the CONSUMER stage.

...(I hope you are now able to visualize the lineage)

The objective of my query is, say, if I give a Field name (that represent a table.column of PRODUCER) as input, then I should be able to track down its lineage till CONSUMER (i.e., the last stage of my lineage).


Solution

  • I was able to get the desired data lineage through below query:

    MATCH (f5:Field)-[:MAPS_TO]-(f4:Field)-[:MAPS_TO]-(f3:Field)-[:MAP‌​S_TO]-(f2:Field)-[:M‌​APS_TO]-(f1:Field)-[‌​:MAPS_TO]-(f:Field)<‌​-[:CONTAINS]-(s:Sour‌​ce {type: "SOURCE"}) WHERE f.name="<my input source field>" RETURN f,s,f1,f2,f3,f4,f5