I want to normalize file paths (removing accents) on an external drive, and I use os.walk(). At one point the script freezes, and after I cancel it, I see this message:
^CTraceback (most recent call last):
File "~/normalize_filepaths.py", line 2
for root, dirs, files in os.walk(target, topdown = False):
File "<frozen os>", line 377, in walk
KeyboardInterrupt
Here is a snippet of the relevant code, the second
import os
import shutil
import unicodedata


def normalize(fp):
    """
    >>> normalize("/Volumes/MM_BUP/MIGUEL/Acólitos")
    '/Volumes/MM_BUP/MIGUEL/Acolitos'
    >>> normalize("/Volumes/MM_BUP/MIGUEL/Acólitos")
    '/Volumes/MM_BUP/MIGUEL/Acolitos'
    >>> normalize("'This is my cup.' _ ゼロコ ZEROKO _ 紅茶の遊び方 _ mime _ clowning-8MgRJAXn1tE.mp4")
    "'This is my cup.' _  ZEROKO _  _ mime _ clowning-8MgRJAXn1tE.mp4"
    >>> normalize(" großer Tag")
    ' grosser Tag'
    """
    # "ß" has no NFD decomposition, so replace it by hand first.
    fp = fp.replace("ß", "ss")
    # NFD splits accented letters into base letter + combining mark.
    name_clean = unicodedata.normalize('NFD', fp)
    # Dropping everything non-ASCII removes the combining marks.
    return name_clean.encode('ascii', 'ignore').decode("ascii")
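def main(target="/some/path"):
    # Bottom-up, so children are renamed before their parent directory.
    for root, dirs, files in os.walk(target, topdown=False):
        for name in files + dirs:
            filepath = os.path.join(root, name)
            clean = normalize(name)
            if clean == name:
                continue  # already ASCII, nothing to rename
            new_filepath = os.path.join(root, clean)
            shutil.move(filepath, new_filepath)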
How can I avoid this frozen OS error and visit all files and directories?
In order to reproduce the problem, I tried running your code on names that exercise normalize (such as the ones documented in that function's docstring). I tried ASCII-only names too, and it worked just fine every time (for a toy example file system hierarchy). So I failed to reproduce the freeze, which means I didn't try it under the conditions that actually trigger it.
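For reference, a toy hierarchy like the one I mention can be set up roughly as follows. This is only a sketch: the sample names, and the helpers build_toy_tree, rename_all and list_tree, are made up for illustration and are not from your script; normalize is copied from the question.

```python
import os
import shutil
import tempfile
import unicodedata


def normalize(fp):
    # Same logic as the function in the question.
    fp = fp.replace("ß", "ss")
    name_clean = unicodedata.normalize("NFD", fp)
    return name_clean.encode("ascii", "ignore").decode("ascii")


def build_toy_tree(base):
    # Hypothetical sample names, chosen only for illustration.
    os.makedirs(os.path.join(base, "Acólitos", "großer Tag"))
    for parts in (("Acólitos", "canción.txt"),
                  ("Acólitos", "großer Tag", "plain.txt")):
        with open(os.path.join(base, *parts), "w") as fh:
            fh.write("x")


def rename_all(target):
    # Same bottom-up walk as in the question, skipping unchanged names.
    for root, dirs, files in os.walk(target, topdown=False):
        for name in files + dirs:
            clean = normalize(name)
            if clean != name:
                shutil.move(os.path.join(root, name),
                            os.path.join(root, clean))


def list_tree(target):
    # Relative paths of every file and directory, sorted for stable output.
    found = []
    for root, dirs, files in os.walk(target):
        for name in dirs + files:
            found.append(os.path.relpath(os.path.join(root, name), target))
    return sorted(found)


with tempfile.TemporaryDirectory() as base:
    build_toy_tree(base)
    rename_all(base)
    result = list_tree(base)
    print(result)
```

On a tree this small the walk finishes immediately, which is consistent with my failing to reproduce the freeze.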
Your stack trace, though, does not necessarily indicate an error somewhere in the code (at least as far as I can see), but rather that you interrupted it at some point: you got a KeyboardInterrupt, which is raised when you cancel the script yourself, as your post states you did. So it seems the program was not responding. In my experience, when a code segment freezes (or seems to freeze), the first possible causes that come to mind to investigate are:
1. A deadlock or race condition inside the os, shutil, and unicodedata modules that you call. Since these are Python standard library modules, I trust that they do not have such problems. Even if they do, I don't know how they are implemented internally and cannot investigate them in a timely manner (it would be a high-effort attempt with a low chance of success), so I left this scenario to be investigated last, in favor of the simpler ones that follow.
2. An infinite loop. shutil.move-ing a file from its original name to a new name inside the same root path could make os.walk produce a new entry, which would then be moved again, and so on. I tried this with a purposely alternating file name but failed to produce an infinite loop. I then realized that this shouldn't happen in your code anyway. First, because of how you generate the new names: normalize(normalize(filepath)) should be the same as normalize(filepath) (according to some tests and some reading of the documentation, at least), so no new path would be generated the second time. Second, because according to the documentation of os.walk the yielded values are of types str and list, which indicates that they are filled beforehand and not while being iterated.
3. A lengthy operation. os.walk and shutil.move are the possible candidates here. The former I assumed (based on past observations) can easily be a lengthy process if you run the code on directories with many files. The latter can obviously be lengthy for large files. But since so far you are getting the KeyboardInterrupt inside os.walk and not inside shutil.move, many small files would explain both a lengthy os.walk and a short shutil.move, so I assumed the first case (os.walk-ing many files) first.
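The idempotence claim in the infinite-loop scenario can be checked with a few lines. This is a minimal sketch: normalize is copied from the question, and the sample names are arbitrary ones picked for illustration.

```python
import unicodedata


def normalize(fp):
    # Same logic as the function in the question.
    fp = fp.replace("ß", "ss")
    name_clean = unicodedata.normalize("NFD", fp)
    return name_clean.encode("ascii", "ignore").decode("ascii")


# Arbitrary sample names, used only for illustration.
samples = ["Acólitos", " großer Tag", "plain.txt", "ゼロコ ZEROKO"]

for s in samples:
    once = normalize(s)
    twice = normalize(once)
    # The first pass leaves pure ASCII, which a second pass cannot change.
    assert once == twice, (s, once, twice)
    print(repr(s), "->", repr(once))
```

Since the output of the first pass is pure ASCII, a second pass has nothing left to strip, so a renamed entry would not be renamed again even if os.walk did return it.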
ing many files) first.In order to have progress towards ruling out deadlocks' case, or verifying infinite loop case, or ruling out the case of os.walk
ing many files, the simplest way would be to add some print
statements inside the inner loop. Of course you can't detect as easily (with simple print
statements) the case of shutil.move
ing large files (unless you have access to shutil.move
), but I would suggest to start with the easy stuff first and just put some print
statements inside the inner loop. As a bonus, with print
statements, you can verify generated file paths are legal (as an effort to rule out the possibility of the problem being to normalize
an unusual path for example), as well as in the intended format.
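Something along these lines is what I have in mind. It is only a sketch: the dry_run and report_every parameters, the counters, and the demo file names are my own additions for illustration and are not part of your script.

```python
import os
import shutil
import tempfile
import unicodedata


def normalize(fp):
    # Same logic as the function in the question.
    fp = fp.replace("ß", "ss")
    name_clean = unicodedata.normalize("NFD", fp)
    return name_clean.encode("ascii", "ignore").decode("ascii")


def main(target, dry_run=True, report_every=1000):
    # dry_run and report_every are hypothetical knobs added for debugging.
    visited = 0
    renamed = 0
    for root, dirs, files in os.walk(target, topdown=False):
        for name in files + dirs:
            visited += 1
            clean = normalize(name)
            if clean != name:
                src = os.path.join(root, name)
                dst = os.path.join(root, clean)
                # repr() makes odd characters in the paths visible.
                print(f"{src!r} -> {dst!r}", flush=True)
                if not dry_run:
                    shutil.move(src, dst)
                renamed += 1
            if visited % report_every == 0:
                # Periodic heartbeat: shows the walk is still progressing.
                print(f"visited {visited} entries so far", flush=True)
    return visited, renamed


# Small demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as base:
    for fname in ("canción.txt", "plain.txt"):
        with open(os.path.join(base, fname), "w") as fh:
            fh.write("x")
    counts = main(base, dry_run=True)
    print(counts)
```

If the heartbeat lines keep appearing, the script is not frozen, only slow; running with dry_run=True first also lets you inspect every planned rename before anything on the drive is touched.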
To be honest, I didn't test a large number of files to stress the above reasoning, but since you confirmed in the comments that this is indeed the case and suggested that I post this as an answer, I did so. I am certain there are more problematic scenarios to think of for freezing code, but I am glad that the actual one was found and that I contributed to your finding it (hehe).