python  file-extension  os.walk  language-detection

How to get all types of file extensions in a directory using os.walk or glob.glob


I have code that detects the language of files in a directory. At the moment it only handles .txt files, because that is the extension set in the code. How can I detect the language of files with any extension (for example .pdf, .xlsx, .docx) in the directory? My code is attached for reference. I would like to know how this can be done using glob and os.walk.

import csv
from fnmatch import fnmatch
try:
    from langdetect import detect
except ImportError:
    detect = lambda _: '<dunno>'
import os

rootdir = '.'  # current directory
extension = '.txt'
file_pattern = '*' + extension

with open('output.csv', 'w', newline='', encoding='utf-8') as outfile:
    csvwriter = csv.writer(outfile)

    for dirpath, subdirs, filenames in os.walk(os.path.abspath(rootdir)):
        for filename in filenames:
            if fnmatch(filename, file_pattern):
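                # detect() expects text, so pass the file's contents rather than its path
                with open(os.path.join(dirpath, filename), encoding='utf-8', errors='ignore') as infile:
                    lang = detect(infile.read())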
                csvwriter.writerow([dirpath, filename, lang])

Solution

  • If I understand correctly, you could replace your fnmatch check with:

    eoi = ['*.pdf', '*.xlsx', '*.docx', '*.txt']     # extensions of interest list
    if any(fnmatch(filename, ext) for ext in eoi):
        lang = ...
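
    Since the question also asks about glob and about finding every extension in a directory, here is a minimal sketch of both approaches. The helper names (count_extensions, find_files_walk, find_files_glob) are made up for illustration and are not part of any library. Note that langdetect's detect() takes a text string, so binary formats such as .pdf, .xlsx and .docx would need their text extracted with a suitable library first; that step is not shown here.

```python
import glob
import os
from collections import Counter


def count_extensions(rootdir):
    """Count how often each file extension occurs under rootdir (via os.walk).

    Hypothetical helper: extensions are lower-cased, and files without an
    extension are counted under the empty string ''.
    """
    counts = Counter()
    for dirpath, subdirs, filenames in os.walk(rootdir):
        for filename in filenames:
            ext = os.path.splitext(filename)[1].lower()
            counts[ext] += 1
    return counts


def find_files_walk(rootdir, extensions):
    """Yield paths under rootdir whose extension is in `extensions` (os.walk).

    The comparison is case-insensitive, so '.PDF' matches '.pdf'.
    """
    wanted = {ext.lower() for ext in extensions}
    for dirpath, subdirs, filenames in os.walk(rootdir):
        for filename in filenames:
            if os.path.splitext(filename)[1].lower() in wanted:
                yield os.path.join(dirpath, filename)


def find_files_glob(rootdir, extensions):
    """Return matching paths using glob.glob with a recursive '**' pattern.

    Caveats: by default glob skips names that start with a dot, and matching
    is case-sensitive on case-sensitive filesystems.
    """
    paths = []
    for ext in extensions:
        pattern = os.path.join(rootdir, '**', '*' + ext)
        paths.extend(glob.glob(pattern, recursive=True))
    return paths
```

    For example, count_extensions('.') shows which extensions are present at all, and find_files_walk('.', ['.pdf', '.xlsx', '.docx', '.txt']) can replace the fnmatch loop in the question. The os.walk version is probably the safer default because it handles hidden files and mixed-case extensions.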