Build JSON object directory tree using Python os.walk


I've hit a mental roadblock on this problem, and none of the other questions I've looked at quite cover my use case. One was close, but I couldn't work out how to adapt it. Basically, I have a script that uses os.walk() to rename files within a target directory (and any sub-directories) according to user-defined rules. The specific problem is that I am trying to log the results of the operation in JSON format, with output like this:

{
    "timestamp": "2022-12-26 09:40:55.874718",
    "files_inspected": 512,
    "files_renamed": 256,
    "replacement_rules": {
        "%20": "_",
        " ": "_"
    },
    "target_path": "/home/example-user/example-folder",
    "data": [
        {
            "directory": "/home/example-user/example-folder",
            "files": [
                {
                    "original_name": "file 1.txt",
                    "new_name": "file_1.txt"
                },
                {
                    "original_name": "file 2.txt",
                    "new_name": "file_2.txt"
                },
                {
                    "original_name": "file 3.txt",
                    "new_name": "file_3.txt"
                }
            ],
            "children": [
                {
                    "directory": "/home/example-user/example-folder/sub-folder",
                    "files": [
                        {
                            "original_name": "file 1.txt",
                            "new_name": "file_1.txt"
                        },
                        {
                            "original_name": "file 2.txt",
                            "new_name": "file_2.txt"
                        },
                        {
                            "original_name": "file 3.txt",
                            "new_name": "file_3.txt"
                        }
                    ]
                }
            ]
        }
    ]
}

On the first iteration, the first item in the 3-tuple (dirpath) is the target directory, and the second item (dirnames) is a list of the directories within that dirpath (if any). What I think is tripping me up is that on the next iteration, dirpath becomes the first of the directories listed in the previous iteration's dirnames (if there were any). I am having trouble working out the logic for transforming this flat sequence of 3-tuples into the nested hierarchy above. Ideally, a directory object with no sub-directories (children) would omit the children key entirely, but having it set to an empty list would be fine.

I would really appreciate any advice or insight you might have on how to achieve that desired log structure from what os.walk() provides. Also open to any suggestions on improving the JSON object structure. Thank you!

https://github.com/dblinkhorn/file_renamer


Solution

  • One issue in your approach is that you want a hierarchical result, which is most naturally obtained by recursion, whereas os.walk flattens that hierarchy into a sequence of (dirpath, dirnames, filenames) tuples.

    For this reason, I would recommend using os.scandir instead. It is also one of the fastest ways to traverse a directory tree (os.walk itself has been built on os.scandir since Python 3.5).

    import os
    from datetime import datetime
    
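    def rename(topdir, rules, result=None, verbose=False, dryrun=False):
        """Recursively rename files under topdir; return a nested, JSON-ready log."""
        # `result` holds the run-wide counters. It is created by the top-level
        # call and shared with every recursive call.
        is_toplevel = result is None
        if is_toplevel:
            result = dict(
                timestamp=datetime.now().isoformat(sep=' ', timespec='microseconds'),
                dryrun=dryrun,
                directories_inspected=0,
                files_inspected=0,
                files_renamed=0,
                replacement_rules=rules,
                target_path=topdir,
            )
        files = []
        children = []
        with os.scandir(topdir) as it:
            # Snapshot the listing first: renaming entries while the directory
            # is still being iterated may cause files to be seen twice or skipped.
            entries = list(it)
        for entry in entries:
            if entry.is_dir():
                # Recurse; the child's log becomes one element of `children`.
                children.append(rename(entry.path, rules, result, verbose, dryrun))
            else:
                result['files_inspected'] += 1
                # Only the first matching rule is applied to each file.
                for old, new in rules.items():
                    if old in entry.name:
                        newname = entry.name.replace(old, new)
                        dst = os.path.join(topdir, newname)
                        if not dryrun:
                            os.rename(entry.path, dst)
                            result['files_renamed'] += 1
                        if verbose:
                            print(f'{"[DRY-RUN] " if dryrun else ""}rename {entry.path!r} to {dst!r}')
                        files.append(dict(original_name=entry.name, new_name=newname))
                        break
        result['directories_inspected'] += 1
        # Omit the 'files' and 'children' keys when they would be empty.
        res = dict(directory=topdir)
        if files:
            res.update(dict(files=files))
        if children:
            res.update(dict(children=children))
        if is_toplevel:
            # Put the summary fields first (dict union requires Python 3.9+).
            res = result | res
        return res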
    

    Example

    Let's build a reproducible example:

    d = {
        'example/example-folder': [
            'file 1.txt',
            'file 2.txt',
            'foo bar 1.txt',
            {
                'sub/folder': [
                    'file 1.txt',
                    'file 2.txt',
                    'foo bar 1.txt',
                ],
            },
        ],
    }
    
    def make_example(d, topdir='.'):
        if isinstance(d, str):
            print(f'make file: {topdir}/{d}')
            with open(os.path.join(topdir, d), 'w') as f:
                pass
        elif isinstance(d, dict):
            for dirname, specs in d.items():
                # Use a separate name so that sibling keys are not nested
                # under the previous key's directory.
                subdir = os.path.join(topdir, dirname)
                print(f'makedirs {subdir}')
                os.makedirs(subdir, exist_ok=True)
                make_example(specs, subdir)
        else:
            assert isinstance(d, list), f'got a weird spec: {d!r}'
            for specs in d:
                make_example(specs, topdir)
    
    >>> make_example(d)
    makedirs ./example/example-folder
    make file: ./example/example-folder/file 1.txt
    make file: ./example/example-folder/file 2.txt
    make file: ./example/example-folder/foo bar 1.txt
    makedirs ./example/example-folder/sub/folder
    make file: ./example/example-folder/sub/folder/file 1.txt
    make file: ./example/example-folder/sub/folder/file 2.txt
    make file: ./example/example-folder/sub/folder/foo bar 1.txt
    
    ! tree example
    example
    └── example-folder
        ├── file\ 1.txt
        ├── file\ 2.txt
        ├── foo\ bar\ 1.txt
        └── sub
            └── folder
                ├── file\ 1.txt
                ├── file\ 2.txt
                └── foo\ bar\ 1.txt
    
    3 directories, 6 files
    

    Now, using the rename() function above:

    rules = {'%20': '_', ' ': '_'}
    res = rename('example', rules, verbose=True, dryrun=True)
    # [DRY-RUN] rename 'example/example-folder/file 2.txt' to 'example/example-folder/file_2.txt'
    # [DRY-RUN] rename 'example/example-folder/file 1.txt' to 'example/example-folder/file_1.txt'
    # [DRY-RUN] rename 'example/example-folder/sub/folder/file 2.txt' to 'example/example-folder/sub/folder/file_2.txt'
    # [DRY-RUN] rename 'example/example-folder/sub/folder/file 1.txt' to 'example/example-folder/sub/folder/file_1.txt'
    # [DRY-RUN] rename 'example/example-folder/sub/folder/foo bar 1.txt' to 'example/example-folder/sub/folder/foo_bar_1.txt'
    # [DRY-RUN] rename 'example/example-folder/foo bar 1.txt' to 'example/example-folder/foo_bar_1.txt'
    
    >>> import json
    >>> print(json.dumps(res, indent=4))
    {
        "timestamp": "2022-12-29 15:24:06.930252",
        "dryrun": true,
        "directories_inspected": 4,
        "files_inspected": 6,
        "files_renamed": 0,
        "replacement_rules": {
            "%20": "_",
            " ": "_"
        },
        "target_path": "example",
        "directory": "example",
        "children": [
            {
                "directory": "example/example-folder",
                "files": [
                    {
                        "original_name": "file 2.txt",
                        "new_name": "file_2.txt"
                    },
                    {
                        "original_name": "file 1.txt",
                        "new_name": "file_1.txt"
                    },
                    {
                        "original_name": "foo bar 1.txt",
                        "new_name": "foo_bar_1.txt"
                    }
                ],
                "children": [
                    {
                        "directory": "example/example-folder/sub",
                        "children": [
                            {
                                "directory": "example/example-folder/sub/folder",
                                "files": [
                                    {
                                        "original_name": "file 2.txt",
                                        "new_name": "file_2.txt"
                                    },
                                    {
                                        "original_name": "file 1.txt",
                                        "new_name": "file_1.txt"
                                    },
                                    {
                                        "original_name": "foo bar 1.txt",
                                        "new_name": "foo_bar_1.txt"
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
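
If you would rather stay with os.walk, the nesting can still be rebuilt from its flat output. The following is only a minimal sketch under a few assumptions: `build_tree` is a hypothetical helper (it is not part of the code above), it only records file names rather than performing any renaming, and it relies on `topdown=False` so that every sub-directory is visited before its parent.

```python
import os


def build_tree(topdir):
    """Return a nested dict describing topdir, built from os.walk() (sketch)."""
    nodes = {}  # maps a dirpath to its finished node until the parent claims it
    # topdown=False yields sub-directories before their parents, so every
    # child node already exists by the time its parent is visited.
    for dirpath, dirnames, filenames in os.walk(topdir, topdown=False):
        node = {"directory": dirpath}
        if filenames:
            node["files"] = sorted(filenames)
        # Symlinked directories appear in dirnames but are not walked by
        # default, so only claim the children that were actually visited.
        children = [
            nodes.pop(os.path.join(dirpath, name))
            for name in sorted(dirnames)
            if os.path.join(dirpath, name) in nodes
        ]
        if children:
            node["children"] = children
        nodes[dirpath] = node
    # The top directory is visited last and is the only node left.
    return nodes[topdir]
```

The renaming and the `original_name` / `new_name` records would go where the sketch stores `sorted(filenames)`. As in the recursive version, the `files` and `children` keys are left out when they would be empty.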