
python - parse maven dependency tree


I want to take a Maven dependency tree as input and parse it to determine the groupId, artifactId, and version of each dependency, along with the same information for its children, their children, and so on. I'm not sure whether it makes the most sense to parse the mvn dependency tree and store the information as a nested dictionary before preparing the data for Neo4j.

I'm also unsure of the best way to parse the entire mvn dependency tree. The code below is the most progress I've made: it tries to strip the unnecessary prefix from each line and label each entry as a child or a parent.

tree= 
[INFO] +- org.antlr:antlr4:jar:4.7.1:compile
[INFO] |  +- org.antlr:antlr4-runtime:jar:4.7.1:compile
[INFO] |  +- org.antlr:antlr-runtime:jar:3.5.2:compile
[INFO] |  \- com.ibm.icu:icu4j:jar:58.2:compile
[INFO] +- commons-io:commons-io:jar:1.3.2:compile
[INFO] +- brs:dxprog-lang:jar:3.3-SNAPSHOT:compile
[INFO] |  +- brs:libutil:jar:2.51:compile
[INFO] |  |  +- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] |  |  +- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] |  |  |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile
[INFO] |  |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile
.
.
.


fileObj = open("tree", "r")

for line in fileObj.readlines():
    for word in line.split():
        if "[INFO]" in line.split():
            line = line.replace(line.split().__getitem__(0), "")
            print(line)

            if "|" in line.split():
                line = line.replace(line.split().__getitem__(0), "child")
                print(line)

                if "+-" in line.split() and "|" not in line.split():
                    line = line.replace(line.split().__getitem__(0), "")
                    line = line.replace(line.split().__getitem__(0), "parent")
                    print(line, '\n\n')

Output:

 |  |  \- com.google.protobuf:protobuf-java:jar:3.5.1:compile

 child  child  \- com.google.protobuf:protobuf-java:jar:3.5.1:compile

 |  +- com.h2database:h2:jar:1.4.195:compile

 child  +- com.h2database:h2:jar:1.4.195:compile

   parent com.h2database:h2:jar:1.4.195:compile

I would appreciate any insight on the best way to parse & return data in an organized way given that I'm relatively unfamiliar with the capabilities of python. Thank you in advance!


Solution

  • I don't know what your programming experience is, but this is not a trivial task.

    First, notice that the nesting level of a dependency is indicated by the number of | symbols on its line. The simplest approach is to build a stack that stores the dependency path from the root down to children, grandchildren, and so on:

    def build_stack(text):
        stack = []
        for line in text.split("\n"):
            if not line:
                continue

            line = line[7:]  # remove the leading "[INFO] "
            level = line.count("|")  # nesting depth of this dependency
            name = line.split("-", 1)[1].strip()  # the part after the first "-"
            stack = stack[:level] + [name]  # keep the ancestors up to level-1, then push name
            yield stack[:level], name  # this is a generator

    # DATA is assumed to hold the dependency tree text from the question
    for bottom_stack, name in build_stack(DATA):
        print(bottom_stack + [name])
    

    Output:

    ['org.antlr:antlr4:jar:4.7.1:compile']
    ['org.antlr:antlr4:jar:4.7.1:compile', 'org.antlr:antlr4-runtime:jar:4.7.1:compile']
    ['org.antlr:antlr4:jar:4.7.1:compile', 'org.antlr:antlr-runtime:jar:3.5.2:compile']
    ['org.antlr:antlr4:jar:4.7.1:compile', 'com.ibm.icu:icu4j:jar:58.2:compile']
    ['commons-io:commons-io:jar:1.3.2:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'commons-collections:commons-collections:jar:3.2.2:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile', 'com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile']
    ['brs:dxprog-lang:jar:3.3-SNAPSHOT:compile', 'brs:libutil:jar:2.51:compile', 'org.apache.commons:commons-collections4:jar:4.1:compile', 'com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile']
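
One caveat with counting | characters: Maven draws a | only while the parent still has siblings further down, so the children of a last entry (one introduced by \-) are indented with plain spaces and would get level 0. A minimal sketch of an alternative that reads the depth from the column of the branch marker instead, assuming the usual three-character indentation per level and an "[INFO] " prefix on every line (depth_and_name and build_paths are illustrative names, not part of the code above):

```python
def depth_and_name(line):
    """Return (depth, coordinates) for one tree line, or None if it is not one."""
    _, found, body = line.partition("[INFO] ")
    if not found:
        return None
    # The branch marker is either "+- " or "\- "; its column gives the depth.
    positions = [p for p in (body.find("+- "), body.find("\\- ")) if p >= 0]
    if not positions:
        return None  # e.g. the project's own line or a build banner
    pos = min(positions)
    return pos // 3, body[pos + 3:].strip()


def build_paths(text):
    """Yield the root-to-node path for every dependency line in text."""
    stack = []
    for line in text.splitlines():
        parsed = depth_and_name(line)
        if parsed is None:
            continue
        depth, name = parsed
        stack = stack[:depth] + [name]  # drop deeper entries, push this one
        yield list(stack)


# Small made-up sample: "g:d" hangs under the last top-level entry "g:c".
sample = "\n".join([
    "[INFO] +- g:a:jar:1.0:compile",
    "[INFO] |  \\- g:b:jar:1.0:compile",
    "[INFO] \\- g:c:jar:1.0:compile",
    "[INFO]    \\- g:d:jar:1.0:compile",
])
for path in build_paths(sample):
    print(path)
```

With this sample, g:d is attached to g:c even though its line contains no | at all.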
    

    Second, you can use this stack to build a tree made of nested dicts:

    def create_tree(text):
        tree = {}
        for stack, name in build_stack(text):
            temp = tree
            for n in stack:  # find or create...
                temp = temp.setdefault(n, {})  # ...the innermost dict
            temp[name] = {}
        return tree
    
    from pprint import pprint
    pprint(create_tree(DATA))
    

    Output:

    {'brs:dxprog-lang:jar:3.3-SNAPSHOT:compile': {'brs:libutil:jar:2.51:compile': {'commons-collections:commons-collections:jar:3.2.2:compile': {},
                                                                                   'org.apache.commons:commons-collections4:jar:4.1:compile': {'com.fasterxml.jackson.core:jackson-annotations:jar:2.9.0:compile': {},
                                                                                                                                               'com.fasterxml.jackson.core:jackson-core:jar:2.9.5:compile': {}}}},
     'commons-io:commons-io:jar:1.3.2:compile': {},
     'org.antlr:antlr4:jar:4.7.1:compile': {'com.ibm.icu:icu4j:jar:58.2:compile': {},
                                            'org.antlr:antlr-runtime:jar:3.5.2:compile': {},
                                            'org.antlr:antlr4-runtime:jar:4.7.1:compile': {}}}
    

    An empty dict represents a leaf of the tree.
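
To illustrate the empty-dict convention, here is a small sketch that walks such a nested dict and yields the path to every leaf. The function name iter_leaves and the example data are made up for illustration:

```python
def iter_leaves(tree, path=()):
    """Yield the full path (as a tuple) to every leaf of a nested-dict tree."""
    for name, subtree in tree.items():
        current = path + (name,)
        if subtree:
            # Non-empty dict: this node has children, keep descending.
            yield from iter_leaves(subtree, current)
        else:
            # Empty dict: this node is a leaf.
            yield current


example = {"a": {"b": {}, "c": {"d": {}}}, "e": {}}
for leaf_path in iter_leaves(example):
    print(" -> ".join(leaf_path))
```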

    Third, you need to format the tree, i.e. (1) extract the data and (2) group the children into lists. This is a simple tree traversal (depth-first here):

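    def format_tree(tree):  # named format_tree to avoid shadowing the built-in format()
        result = []
        for name, subtree in tree.items():
            # assumes exactly five colon-separated fields per dependency
            group, artifact, packaging, version, scope = name.split(":")
            d = {"artifact": artifact}  # you can add group, version, ...
            if subtree:  # children are present
                d["children"] = format_tree(subtree)
            result.append(d)
        return result
    
    pprint(format_tree(create_tree(DATA)))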
    

    Output:

    [{'artifact': 'antlr4',
      'children': [{'artifact': 'antlr4-runtime'},
                   {'artifact': 'antlr-runtime'},
                   {'artifact': 'icu4j'}]},
     {'artifact': 'commons-io'},
     {'artifact': 'dxprog-lang',
      'children': [{'artifact': 'libutil',
                    'children': [{'artifact': 'commons-collections'},
                                 {'artifact': 'commons-collections4',
                                  'children': [{'artifact': 'jackson-annotations'},
                                               {'artifact': 'jackson-core'}]}]}]}]
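
Note that unpacking name.split(":") into five variables raises a ValueError for dependencies that carry a classifier, which Maven prints as a sixth field between the packaging and the version. A hedged sketch of a more tolerant parser follows; the Coordinates type and parse_coordinates are hypothetical names introduced here, and only the five- and six-field forms are handled:

```python
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    group: str
    artifact: str
    packaging: str
    classifier: Optional[str]
    version: str
    scope: str


def parse_coordinates(name):
    """Split 'group:artifact:packaging[:classifier]:version:scope' into fields."""
    parts = name.split(":")
    if len(parts) == 5:
        group, artifact, packaging, version, scope = parts
        classifier = None
    elif len(parts) == 6:
        group, artifact, packaging, classifier, version, scope = parts
    else:
        raise ValueError(f"unexpected coordinate format: {name!r}")
    return Coordinates(group, artifact, packaging, classifier, version, scope)


coords = parse_coordinates("org.antlr:antlr4:jar:4.7.1:compile")
print(coords.group, coords.artifact, coords.version)
```

The resulting fields can then be copied into the dict built for each node, e.g. group and version alongside artifact.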
    

    You could probably combine some of these steps.
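
As one possible way of combining them, the sketch below builds the final list of dicts in a single pass over the text. It is an illustration under several assumptions, not a drop-in replacement: every dependency line starts with "[INFO] ", each level is indented by three characters, the input is well formed (no line jumps more than one level deeper), and the version is the second-to-last colon-separated field. The names LINE and parse_tree are made up here.

```python
import re

# "[INFO] ", then zero or more "|  " / "   " indent units, then "+- " or "\- ",
# then the coordinates (the first run of non-space characters).
LINE = re.compile(r"^\s*\[INFO\] ((?:\|  |   )*)[+\\]- (\S+)")


def parse_tree(text):
    """Parse dependency-tree text into a list of nested dicts in one pass."""
    roots = []
    nodes = []  # nodes[d] is the most recent node seen at depth d
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip banners, the project line, blank lines
        depth = len(match.group(1)) // 3
        parts = match.group(2).split(":")
        node = {"group": parts[0], "artifact": parts[1], "version": parts[-2]}
        if depth == 0:
            roots.append(node)
        else:
            # Attach to the latest node one level up, creating its list lazily.
            nodes[depth - 1].setdefault("children", []).append(node)
        nodes = nodes[:depth] + [node]
    return roots


# Small made-up sample, not the data from the question.
sample = "\n".join([
    r"[INFO] +- g:a:jar:1.0:compile",
    r"[INFO] |  \- g:b:jar:1.0:compile",
    r"[INFO] \- g:c:jar:1.0:compile",
    r"[INFO]    \- g:d:jar:1.0:compile",
])

from pprint import pprint
pprint(parse_tree(sample))
```

Because each node keeps a direct reference to its parent's dict, no intermediate nested-dict tree is needed; the trade-off is that the three steps are no longer separately testable.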