Tags: beautifulsoup, wget, python-os

Accessing a folder through os.scandir() after changes are made to the folder


I am trying to iterate through a folder of HTML files and filter them according to whether or not their contents, read as strings, contain a keyword. I download the files into the folder with wget and BeautifulSoup, then iterate over the folder with os.scandir() and copy the matching files to a different folder. However, the first time I run the script, the files are only downloaded and nothing is copied to the destination directory. When I run it a second time, the filtering works properly.

My guess is that when I first run the script, os.scandir() captures the initial state of the folder (which is empty), and I couldn't get os.scandir() to see the final state of the folder with the HTML data. How can I make the script iterate through the data only after the download is done? Here is the snippet:

#pull HTML data from links with wget
subprocess.Popen(bash_commandList,stdout=subprocess.PIPE)

job_as_string = ""

#search for keyword in html as string to detect if the jobstelle has something I can do
with os.scandir('/Users/user/wgetLinks') as parent:
    for job_stelle in parent:
        with open(job_stelle, 'r') as f:
            if job_stelle.name.endswith(".html") and job_stelle.is_file():
                print(job_stelle.name)
                job_as_string = f.read()
        f.close()
        for keyword in keywords:
            if(keyword in job_as_string):
                popen_Command = '/Users/user/wgetLinks/' + job_stelle.name
                shutil.copy(popen_Command, '/Users/user/wgetInformatics')
                continue

Solution

  • The issue you're encountering is likely due to the asynchronous nature of the subprocess.Popen call, which starts the downloading process but doesn't wait for it to complete before proceeding to the filtering part of your script. When you start iterating through the directory with os.scandir, the files may not yet be fully downloaded.
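
    For illustration, here is what goes wrong: Popen starts wget and returns immediately, so the filtering code runs while the files are still being written. If you wanted to keep Popen for some reason, you could block explicitly with wait(). A minimal sketch; the command list below is a placeholder, not your real link list:

    import subprocess

    # Placeholder command list, standing in for your actual bash_commandList
    bash_commandList = ["wget", "-P", "/Users/user/wgetLinks", "http://example.com/file1.html"]

    # Popen returns immediately; wget keeps downloading in the background
    proc = subprocess.Popen(bash_commandList, stdout=subprocess.PIPE)

    # wait() blocks until wget exits, so the folder is complete before you scan it
    proc.wait()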

    To solve your problem, you can use subprocess.run instead of subprocess.Popen: subprocess.run does not return until the command has completed, so your script only starts iterating through the directory once the download has finished.

    Below you can find the modifications I've made to your script:

    import subprocess
    import os
    import shutil
    
    # List of keywords to search for
    keywords = ["keyword1", "keyword2", "keyword3"]
    
    # Pull HTML data from links with wget
    bash_commandList = ["wget", "-P", "/Users/user/wgetLinks", "http://example.com/file1.html", "http://example.com/file2.html"]
    subprocess.run(bash_commandList, stdout=subprocess.PIPE)
    
    # Search for keyword in HTML files and copy matching files to another directory
    with os.scandir('/Users/user/wgetLinks') as parent:
        for job_stelle in parent:
            if job_stelle.name.endswith(".html") and job_stelle.is_file():
                with open(job_stelle.path, 'r') as f:
                    job_as_string = f.read()
                for keyword in keywords:
                    if keyword in job_as_string:
                        shutil.copy(job_stelle.path, '/Users/user/wgetInformatics')
                        break
    
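    One optional hardening: subprocess.run returns a CompletedProcess, so you can inspect its returncode (or pass check=True to make a failed download raise CalledProcessError). A small sketch, reusing the command list from above:

    result = subprocess.run(bash_commandList, stdout=subprocess.PIPE)
    if result.returncode != 0:
        # wget exited with an error; some files may be missing or incomplete
        print(f"wget exited with status {result.returncode}")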

    Here is a short explanation of the changes:

    subprocess.run blocks until wget exits, so the HTML files are fully downloaded before the script attempts to filter and copy them. I also moved the .html check before opening the file (so the script never tries to open a non-HTML entry), used job_stelle.path instead of concatenating the path by hand, dropped the redundant f.close() (the with block already closes the file), and replaced continue with break so each matching file is copied only once, even if it contains several keywords.
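
    As a side note, the keyword loop can also be written with any(), which expresses "copy the file if at least one keyword matches" in a single condition. Same behavior as the loop-with-break above, just a sketch:

    # Copy the file as soon as any keyword is found in its contents
    if any(keyword in job_as_string for keyword in keywords):
        shutil.copy(job_stelle.path, '/Users/user/wgetInformatics')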

    Just try it out and let me know in the comments!