I have a playlist of slightly more than 10,000 MP3 files. The music library has a total of about 40,000 tracks. I decided to write a batch file to copy only the files on the playlist to a directory on a different drive. I used Notepad++ to modify the text playlist file so I would get the path/file names correct. I simply needed to add quotes around the path/file names, prefix the lines with “copy ” and suffix the lines with the destination drive/directory. I did all that, made sure the batch file was saved as UTF-8, and executed it.
After a few minutes the batch file completed. When I checked the destination directory I noticed that about 70 files had not been copied over. I used ‘Beyond Compare’ to compare the original playlist file against a playlist file I made from the files that did copy over. What I noticed was that the files that did not copy over had what I will call ‘European’ characters in the filename, such as “Dov' é L'Amore.mp3” and “José Feliciano - Feliciano! - 01 - California Dreamin'.mp3”. Other files with exclamation marks did not copy either.
I reran the file substituting ‘xcopy’ instead of ‘copy’ – same result. On to Robocopy - same result. At this point I decided to try and copy over one of the problem files using Robocopy at a command prompt to see what errors it reported. Surprise, surprise - it copied over, as did the others. So Robocopy at the command level will copy the files, but not in a batch file saved as UTF-8??
As a last resort I decided to try using Powershell. But as I am inexperienced in using it, I asked ChatGPT to write a script for me, and this is what it returned.
# Source and destination paths
$sourcePath = "F:\Directory\Music\Cher - Bob’s Cher Mix"
$destinationPath = "X:\BoboFMDrive"
# File name with special characters
$fileName = "Cher - My Cher Mix - 02 - Dov' é L'Amore.mp3"
# Full path of the source file
$sourceFile = Join-Path -Path $sourcePath -ChildPath $fileName
# Full path of the destination file
$destinationFile = Join-Path -Path $destinationPath -ChildPath $fileName
# Copy the file to the destination
Copy-Item -Path $sourceFile -Destination $destinationFile -Force
Write-Host "File copied successfully!"
And it worked! However, I am looking for a solution that lets me easily edit a text-based file with many lines/strings, as it would be onerous to create a script for each file. Does anyone have any thoughts on a solution? I ended up just using ‘Beyond Compare’ to copy over the dropped files manually, but I would like to find a better/easier solution for the future.
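The problem is the code page. By default, the Windows console does not use UTF-8. It uses the system's local (OEM) code page, which is 850 in the examples below, while GUI programs use the local ANSI code page, typically Windows-1252.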
The code page for UTF-8 is 65001.
Create some filenames using different codepages:
D:\Test> chcp
Active Codepage: 850.
D:\Test> echo . >"Dov' é L'Amore_ansi.mp3"
D:\Test> chcp 65001
Active Codepage: 65001
D:\Test> echo . >"Dov' é L'Amore_utf8.mp3"
D:\Test> chcp 850
Active Codepage: 850.
D:\Test> dir
11.08.2023 17:08 <DIR> .
11.08.2023 17:08 <DIR> ..
11.08.2023 16:56 4 Dov' é L'Amore_ansi.mp3
11.08.2023 16:58 4 Dov' é L'Amore_utf8.mp3
D:\Test> chcp 65001
Active Codepage: 65001
D:\Test> dir
11.08.2023 17:08 <DIR> .
11.08.2023 17:08 <DIR> ..
11.08.2023 16:56 4 Dov' é L'Amore_ansi.mp3
11.08.2023 16:58 4 Dov' é L'Amore_utf8.mp3
D:\Test>
As you can see, there is no difference. Evidently Windows converts the character set in use internally before the filename is written to the filesystem.
Therefore you have no problems when working on the command line or in a batch file, as long as no file content has to be evaluated.
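With the Windows Notepad.exe you can choose the file encoding in the Save as ... dialog.
Create three files containing the text Dov' é L'Amore.
Save them encoded as ANSI (ansi.txt), UTF-8 (utf8.txt) and UTF-8 with BOM (utf8_boom.txt), then display them under different code pages: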
D:\Test> chcp 850
Active Codepage: 850.
D:\Test> type ansi.txt
Dov' Ú L'Amore
D:\Test> type utf8.txt
Dov' ├® L'Amore
D:\Test> type utf8_boom.txt
´╗┐Dov' ├® L'Amore
D:\Test>
Please note the Ú in the content of ansi.txt!
This is the difference between code page 850 (DOS Latin 1) and code page 1252 (Windows-1252).
As a GUI app, Notepad.exe saved "ANSI" using the character set "Windows-1252".
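The garbled output above can be reproduced outside the console. The following is a minimal Python sketch, assuming Python 3 and its built-in `cp850`, `cp1252` and `utf-8` codecs; the helper name `show` is made up for illustration. It encodes "é" the way Notepad would store it and decodes the bytes the way a console set to a given code page would display them.

```python
# The single character "é" as Notepad would store it in each encoding.
text = "é"
ansi_bytes = text.encode("cp1252")    # one byte: 0xE9
utf8_bytes = text.encode("utf-8")     # two bytes: 0xC3 0xA9
bom_bytes = "\ufeff".encode("utf-8")  # the UTF-8 BOM: 0xEF 0xBB 0xBF


def show(raw: bytes, codepage: str) -> str:
    """Decode raw bytes the way a console using `codepage` would display them."""
    return raw.decode(codepage, errors="replace")


# Console set to code page 850
as_850_from_ansi = show(ansi_bytes, "cp850")
as_850_from_utf8 = show(utf8_bytes, "cp850")
as_850_bom = show(bom_bytes, "cp850")

# Console set to code page 1252 or 65001 (UTF-8)
as_1252_from_ansi = show(ansi_bytes, "cp1252")
as_utf8_from_ansi = show(ansi_bytes, "utf-8")
as_utf8_from_utf8 = show(utf8_bytes, "utf-8")

print("ANSI bytes  shown as cp850 :", as_850_from_ansi)
print("UTF-8 bytes shown as cp850 :", as_850_from_utf8)
print("BOM         shown as cp850 :", as_850_bom)
print("ANSI bytes  shown as cp1252:", as_1252_from_ansi)
print("ANSI bytes  shown as UTF-8 :", as_utf8_from_ansi)
print("UTF-8 bytes shown as UTF-8 :", as_utf8_from_utf8)
```

The bytes in the files never change; only the table used to turn them into characters does.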
D:\Test> chcp 1252
Active Codepage: 1252.
D:\Test> type ansi.txt
Dov' é L'Amore
D:\Test> type utf8.txt
Dov' é L'Amore
D:\Test> type utf8_boom.txt
Dov' é L'Amore
D:\Test>
D:\Test> chcp 65001
Active Codepage: 65001.
D:\Test> type ansi.txt
Dov' � L'Amore
D:\Test> type utf8.txt
Dov' é L'Amore
D:\Test> type utf8_boom.txt
Dov' é L'Amore
D:\Test>
(Note the leading space before the text in the output of utf8_boom.txt: that is the BOM.)
Unlike filenames in the filesystem, the content of a file is only interpreted correctly when its encoding matches the active code page.
If the two get out of sync, the filenames read from the file will differ from the ones found in the filesystem.
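To illustrate why such a mismatch makes a copy command fail, here is a small Python sketch (an illustration only, not what cmd.exe does internally). It creates a file with an accented name in a temporary folder, then looks the file up using the name as it would be read from a UTF-8 list under code page 850 and under UTF-8. All variable names are made up for the example.

```python
import os
import tempfile

real_name = "Dov' é L'Amore.mp3"

with tempfile.TemporaryDirectory() as folder:
    # Create the file; the filesystem stores the name as Unicode.
    open(os.path.join(folder, real_name), "w").close()

    # The same name as it sits, byte for byte, in a list file saved as UTF-8.
    list_bytes = real_name.encode("utf-8")

    # Interpret those bytes with the wrong code page and with the right encoding.
    wrong_name = list_bytes.decode("cp850")
    right_name = list_bytes.decode("utf-8")

    wrong_found = os.path.exists(os.path.join(folder, wrong_name))
    right_found = os.path.exists(os.path.join(folder, right_name))

print(repr(wrong_name), "found:", wrong_found)
print(repr(right_name), "found:", right_found)
```

The misread name simply does not exist on disk, which is consistent with only the files containing non-ASCII characters being skipped.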
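For scripts that read a UTF-8 text file, temporarily change the code page to UTF-8. Note that setlocal / endlocal only keeps changes to environment variables local to the batch; it does not restore the console code page. To limit the change to the runtime of the batch, remember the previous code page and set it back at the end:
@echo off
setlocal
rem Remember the current code page (second token of the chcp output)
for /f "tokens=2 delims=:." %%a in ('chcp') do set "oldcp=%%a"
chcp 65001 >nul
rem Your script ....
type utf8.txt
rem Restore the previous code page
chcp %oldcp% >nul
endlocal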
As seen here, storing UTF-8 with or without a BOM (byte order mark) makes no difference to the displayed characters, but the BOM adds three extra bytes at the start of the file. So it is better to store UTF-8 without a BOM, as those extra bytes can confuse programs, especially when files are exchanged with other operating systems.
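If a scripting language is an option, the same principle (state the encoding of the list file explicitly) can be applied without touching the console code page. The following is only a sketch in Python, under these assumptions: the playlist is a plain text file with one full path per line, saved as UTF-8 with or without BOM, and lines starting with `#` are M3U-style comments. The function name `copy_playlist` and the file name `playlist.m3u` are hypothetical.

```python
import shutil
from pathlib import Path


def copy_playlist(playlist: Path, destination: Path) -> list:
    """Copy every file listed in `playlist` into `destination`.

    Returns the entries that could not be found, so they can be reviewed.
    """
    destination.mkdir(parents=True, exist_ok=True)
    missing = []
    # "utf-8-sig" reads UTF-8 and silently drops a leading BOM if present.
    with playlist.open(encoding="utf-8-sig") as lines:
        for line in lines:
            entry = line.strip()
            # Skip blank lines and (assumed) M3U comment lines.
            if not entry or entry.startswith("#"):
                continue
            source = Path(entry)
            if source.is_file():
                shutil.copy2(source, destination / source.name)
            else:
                missing.append(entry)
    return missing


# Example call (paths are placeholders):
# missing = copy_playlist(Path("playlist.m3u"), Path(r"X:\BoboFMDrive"))
# print(len(missing), "entries could not be copied")
```

Because the list is decoded as UTF-8 inside the script, the result does not depend on which code page the console happens to use, and the returned list shows which entries need a second look.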