python regex os.path

Python f-string and regular expression to generate a customized file path in a directory


Given:

Files in a working directory:

WKDIR = "/scratch/project_2004072/Nationalbiblioteket/dataframes"
$ ls -l
nikeX_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_27450_vocabs.json
nikeX_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_tfidf_matrix_RF_large.gz
nikeX_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_tfidf_vectorizer_large.gz
nikeX_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_user_tokens_df_27452_BoWs.gz
nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_26042_vocabs.json
nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_tfidf_matrix_RF_large.gz
nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_tfidf_vectorizer_large.gz
nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_user_tokens_df_26050_BoWs.gz

Goal:

I'd like to build a path with a regular expression (in an fr-string) so that I only read files whose names end in user_tokens_df_XXXX_BoWs.gz, and then load them with a helper function later in my code. Right now I have a Python script that combines an f-string with a regex, but it does not work:

import re
import os
fprefix = f"nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_"
fpath = os.path.join(WKDIR, f'{fprefix}_user_token_sparse_df'fr'_user_tokens_df_(\d+)_BoWs.gz')
#fpath = os.path.join(WKDIR, f'{fprefix}_user_token_sparse_df'fr'(_user_tokens_df_(\d+)_BoWs.gz)') # did not work either!
print(fpath) # >>>> it's wrong! <<<<
try:
    # load via helper function
    df = load_pickle(fpath)
except Exception:
    # do something else
    pass

Is there a better approach? Am I right in thinking that re.search() does not help here, since fpath is passed to another function inside a try/except block in my code?



Solution

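  • os.path.join only concatenates strings, so a regex such as (\d+) embedded in the path is never expanded into a real filename. Based on the filename pattern described in the question, a shell-style wildcard matched against the files that actually exist should be enough. How about: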

    from glob import glob
        
    PATTERN = "/scratch/project_2004072/Nationalbiblioteket/dataframes/*user_tokens_df_*_BoWs.gz"
        
    for file in glob(PATTERN):
        ...
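
If the XXXX part must be digits only, or the number is needed later, the glob results can be filtered with a regex on the file name. The following is a minimal sketch under that assumption, not a definitive implementation: find_bow_files is a hypothetical helper name, and load_pickle is the asker's own helper, so it is only mentioned in a comment.

```python
import glob
import os
import re


def find_bow_files(wkdir, prefix):
    """Return (path, number) pairs for files named <prefix>user_tokens_df_<digits>_BoWs.gz.

    Hypothetical helper: glob narrows the candidates, the regex enforces digits.
    """
    # glob.escape protects any glob metacharacters in the directory or prefix.
    wildcard = os.path.join(
        glob.escape(wkdir),
        glob.escape(prefix) + "user_tokens_df_*_BoWs.gz",
    )
    # re.escape keeps the dots and other characters in the prefix literal.
    name_re = re.compile(re.escape(prefix) + r"user_tokens_df_(\d+)_BoWs\.gz")

    matches = []
    for path in sorted(glob.glob(wildcard)):
        m = name_re.fullmatch(os.path.basename(path))
        if m:
            matches.append((path, int(m.group(1))))
    return matches


if __name__ == "__main__":
    WKDIR = "/scratch/project_2004072/Nationalbiblioteket/dataframes"
    fprefix = "nikeY_docworks_lib_helsinki_fi_access_log_07_02_2021_lemmaMethod_stanza_"
    for fpath, number in find_bow_files(WKDIR, fprefix):
        # Each fpath is a real, existing file; pass it to your own loader here,
        # e.g. load_pickle(fpath) inside the try/except from the question.
        print(fpath, number)
```

Note that the prefix in the question already ends with an underscore, so the pattern appended to it should not start with another one.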