python-3.xdocxdirectory-structurepydoc

Extract text from first page of word document and use it as a folder name, then move the file inside that folder


I have hundreds of word documents that needs to be processed but need to organized them first by versions in subfolders.

I basically get a drop of these word documents within a single folder and need to automate the organization moving forward before I get nuts.

So I have a script that basically creates a folder with the same name of the file and moves the file inside that folder, this part is done.

Now I need to go into each subfolder, and get the document version from within the first word page of each document, then create a sub-folder withe version number and move the word file into that subfolder.

The structure should be as follows (taking two folders as examples):

(Folder) Test
   (Subfolder) 12.0
        Test.docx

(Folder) Test1
   (Subfolder) 13.0
        Test1.docx

Luckily I was able to figure it out that "doc.paragraphs[6].text" will always return the version information in a single line as follows:

>>> doc.paragraphs[6].text
'Version Number: 12.0'

Would appreciate if someone can point me out to the right direction.

This is the script I have so far:

#!/usr/bin/env python3

import glob, os, shutil, docx, sys

folder = sys.argv[1]

#print(folder)

for file_path in glob.glob(os.path.join(folder, '*.docx')):
    new_dir = file_path.rsplit('.', 1)[0]
    #print(new_dir)
    try:
        os.mkdir(os.path.join(folder, new_dir))
    except WindowsError:
        # Handle the case where the target dir already exist.
        pass
    shutil.move(file_path, os.path.join(new_dir, os.path.basename(file_path)))

Solution

  • Please see below the complete solution to your requirement.

    Note: To know about re.search go through https://www.geeksforgeeks.org/python-regex-re-search-vs-re-findall/

    import docx, os, glob, re, shutil
    from pathlib import Path
    
    
    def create_dir(path): # function to check if a given path exist and create one if not
        # Check whether the specified path exists or not
        is_exist = os.path.exists(path)
    
        # Create a new directory the path does not exist
        if not is_exist:
            os.makedirs(path)
        
    folder = fr"C:\Users\rams\Documents\word_docs"  #my local folder
    
    for file in glob.glob(os.path.join(folder, '*.docx')):
        # Test, Test1, Test2 in your structure
        main_folder = os.path.join(folder,Path(file).stem)  
        file_name = os.path.basename(file)
        
        # Get the first line from the docx
        doc = docx.Document(file).paragraphs[0].text 
        
        #  group(1) = Version Number: (.*)
        version_no = re.search("(Version Number: (.*))", doc).group(1) 
        
        # extract the number portion from version_no
        sub_folder = version_no.split(':')[1].strip() 
        
        # path to actual sub_folder with version_no
        sub_folder = os.path.join(main_folder, sub_folder)
        
        # destination path
        dest_file_path = os.path.join(sub_folder, file_name)
        
        for i in [main_folder,sub_folder]:
            create_dir(i) # function call
            
        # to move the file to the corresponding version folder (overwrite if exists)
        if os.path.exists(dest_file_path):
            os.remove(dest_file_path)
            shutil.move(file, sub_folder)
        else:
            shutil.move(file, sub_folder)
        
    

    Before execution:

    enter image description here

    enter image description here

    After Execution

    enter image description here

    enter image description here

    enter image description here