Batch File to Copy mp3 Files With "European" Characters in the Titles


I have a playlist of slightly more than 10,000 mp3 files. The music library has a total of about 40,000 tracks. I decided to write a batch file to copy only the files on the playlist to a directory on a different drive. I used Notepad++ to modify the text playlist file so I would get the path/file names correct. I simply needed to add quotes around the path/file names, prefix the lines with “copy ” and suffix the lines with the destination drive/directory. I did all that, made sure the batch file was saved as UTF-8, and executed it.

After a few minutes the batch file completed. When I checked the destination directory I noticed that about 70 files had not been copied over. I ran ‘Beyond Compare’ on the original playlist file against a playlist file I made from the files that did copy over. What I noticed was that the files that did not copy over had what I will call ‘European’ characters in the filename, for example “Dov' é L'Amore.mp3” and “José Feliciano - Feliciano! - 01 - California Dreamin'.mp3”. Other files with exclamation marks did not copy either.

I reran the file substituting ‘xcopy’ instead of ‘copy’ – same result. On to Robocopy - same result. At this point I decided to try and copy over one of the problem files using Robocopy at a command prompt to see what errors it reported. Surprise, surprise - it copied over, as did the others. So Robocopy at the command level will copy the files, but not in a batch file saved as UTF-8??

As a last resort I decided to try PowerShell. As I am inexperienced with it, I asked ChatGPT to write a script for me, and this is what it returned.

# Source and destination paths
$sourcePath = "F:\Directory\Music\Cher - Bob’s Cher Mix"
$destinationPath = "X:\BoboFMDrive"

# File name with special characters
$fileName = "Cher - My Cher Mix - 02 - Dov' é L'Amore.mp3"

# Full path of the source file
$sourceFile = Join-Path -Path $sourcePath -ChildPath $fileName

# Full path of the destination file
$destinationFile = Join-Path -Path $destinationPath -ChildPath $fileName

# Copy the file to the destination
Copy-Item -Path $sourceFile -Destination $destinationFile -Force

Write-Host "File copied successfully!"

And it worked! However, I am looking for a solution that lets me easily edit a text-based file with many lines/strings, as it would be onerous to create a script for each file. Does anyone have any thoughts on a solution? I ended up just using ‘Beyond Compare’ and copied over the dropped files manually, but I would like to find a better/easier solution for the future.


Solution

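  • The problem is the codepage. By default, Windows does not use UTF-8: the console uses the local OEM codepage (850 in the tests below) and GUI programs use the local ANSI codepage (Windows-1252 here).
    The codepage number for UTF-8 is 65001.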

    Commandline Test:

    Prepare:

    Create some filenames using different codepages:

    D:\Test>  chcp
    Active Codepage: 850.
    
    D:\Test>  echo . >"Dov' é L'Amore_ansi.mp3"
    D:\Test>  chcp 65001
    Active Codepage: 65001
    
    D:\Test>  echo . >"Dov' é L'Amore_utf8.mp3"
    
    Check for Differences:
    D:\Test>  chcp 850
    Active Codepage: 850.
    
    D:\Test>  dir
    11.08.2023  17:08    <DIR>          .
    11.08.2023  17:08    <DIR>          ..
    11.08.2023  16:56                 4 Dov' é L'Amore_ansi.mp3
    11.08.2023  16:58                 4 Dov' é L'Amore_utf8.mp3
    
    D:\Test>  chcp 65001
    Active Codepage: 65001
    
    D:\Test>  dir
    11.08.2023  17:08    <DIR>          .
    11.08.2023  17:08    <DIR>          ..
    11.08.2023  16:56                 4 Dov' é L'Amore_ansi.mp3
    11.08.2023  16:58                 4 Dov' é L'Amore_utf8.mp3
    
    D:\Test>  
    

    As you can see, there is no difference. Windows stores filenames as Unicode, so the name is converted from the active character set before it is written to the filesystem.

    Result:

    Therefore you will have no problems on the command line, or in a batch file, as long as no file content has to be evaluated.
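
The point can also be illustrated away from the console. The following Python sketch is not part of the test above; it only assumes Python's standard `cp850`, `cp1252` and `utf-8` codecs. It shows that one Unicode name has a different byte representation in each codepage, and that decoding with the matching codepage always leads back to the same name.

```python
# Sketch: one Unicode filename, three different byte representations.
name = "Dov' é L'Amore.mp3"

# Encode the name with the codepages that appear in this answer.
encoded = {cp: name.encode(cp) for cp in ("cp850", "cp1252", "utf-8")}

for cp, data in encoded.items():
    # The bytes for "é" differ from codepage to codepage ...
    print(cp, data.hex())

# ... but decoding with the matching codepage always gives the same name.
round_trip = {data.decode(cp) for cp, data in encoded.items()}
print(round_trip == {name})
```

Trouble only starts when bytes written under one codepage are read under another, which is what the file test examines.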

    File Test:

    Prepare:

    With Windows' Notepad.exe you can choose the file encoding in the Save as ... dialog.

    Create three files with the text Dov' é L'Amore.
    Save them encoded as ANSI (ansi.txt), UTF-8 (utf8.txt) and UTF-8 with BOM (utf8_boom.txt).

    Check for Differences:
    D:\Test>  chcp 850
    Active Codepage: 850.
    
    D:\Test>  type ansi.txt
    Dov' Ú L'Amore
    
    D:\Test>  type utf8.txt
    Dov' ├® L'Amore
    
    D:\Test>  type utf8_boom.txt
    ´╗┐Dov' ├® L'Amore
    
    D:\Test>  
    

    Please note the Ú in the ansi.txt content!
    This is the difference between codepage 850, which the console is using, and Windows-1252, in which the file was written.

    As a GUI app, Notepad.exe saves "ANSI" using the character set Windows-1252.

    D:\Test>  chcp 1252
    Active Codepage: 1252.
    
    D:\Test>  type ansi.txt
    Dov' é L'Amore
    
    D:\Test>  type utf8.txt
    Dov' Ã© L'Amore
    
    D:\Test>  type utf8_boom.txt
    ï»¿Dov' Ã© L'Amore
    
    D:\Test>  
    
    D:\Test>  chcp 65001
    Active Codepage: 65001.
    
    D:\Test>  type ansi.txt
    Dov' � L'Amore
    
    D:\Test>  type utf8.txt
    Dov' é L'Amore
    
    D:\Test>  type utf8_boom.txt
     Dov' é L'Amore
    
    D:\Test>  
    

    (Note the space before the text in utf8_boom.txt's content; that is where the BOM sits.)

    In contrast to the filesystem, the content of a file depends on both its encoding and the active codepage.
    If the two get out of sync, the filenames read from the file will differ from the ones found in the filesystem.
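
The mismatch can be reproduced without Notepad or `chcp`. The following is a minimal Python sketch, assuming only Python's standard codecs: it builds the bytes of the three test files in memory and decodes each of them as codepage 850, Windows-1252 and UTF-8. The dictionary keys reuse the file names from the test; no files are read or written.

```python
# Sketch: decode the bytes of each test file with three different codepages.
text = "Dov' é L'Amore"

files = {
    "ansi.txt": text.encode("cp1252"),          # Notepad "ANSI"
    "utf8.txt": text.encode("utf-8"),           # UTF-8 without BOM
    "utf8_boom.txt": text.encode("utf-8-sig"),  # UTF-8 with BOM (EF BB BF)
}

for codepage in ("cp850", "cp1252", "utf-8"):
    print(f"--- decoded as {codepage} ---")
    for filename, data in files.items():
        # errors="replace" substitutes U+FFFD for bytes that are invalid
        # in the chosen codepage instead of raising an exception.
        print(f"{filename}: {data.decode(codepage, errors='replace')}")
```

Only the combinations in which the decoding codepage matches the file's encoding give back the original text. The plain `utf-8` codec keeps the BOM as an invisible leading character, whereas Python's `utf-8-sig` codec would drop it when reading.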


    Result:

    The practical part:

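    For scripts that involve a UTF-8 text file, temporarily change the codepage to UTF-8. Note that setlocal / endlocal only limits changes to environment variables, not the codepage, so the previous codepage is saved first and restored at the end:

    @echo off
    setlocal
      rem   Remember the active codepage; the number is the token after the colon
      for /f "tokens=2 delims=:." %%A in ('chcp') do set "oldcp=%%A"
      chcp 65001 >nul
    
      rem   Your script ....
      type utf8.txt
    
      rem   Restore the previous codepage
      chcp %oldcp% >nul
    endlocal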
    

    As seen here, storing UTF-8 with or without a BOM makes no difference to the displayed characters, but the BOM adds extra bytes at the start of the file. So it is better to store UTF-8 without a BOM, as those bytes can confuse programs, especially when exchanging files with other operating systems.
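
For the original goal of copying every file named in a playlist, the same idea (read the text file with a known encoding) can be sketched outside of batch. The following Python example is an illustration under stated assumptions, not a tested replacement for the batch approach: the helper name `copy_playlist` is hypothetical, the playlist is assumed to be UTF-8 (with or without BOM) with one absolute path per line, and lines starting with `#` are treated as comments.

```python
import shutil
from pathlib import Path


def copy_playlist(playlist, destination):
    """Copy every file listed in a UTF-8 playlist.

    Returns the list of playlist entries that could not be found.
    """
    playlist = Path(playlist)
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)

    missing = []
    # "utf-8-sig" reads UTF-8 with or without a BOM and drops the BOM.
    with playlist.open(encoding="utf-8-sig") as handle:
        for line in handle:
            entry = line.strip()
            if not entry or entry.startswith("#"):
                continue  # skip blank lines and comment lines
            source = Path(entry)
            if source.is_file():
                # copy2 also preserves timestamps where possible
                shutil.copy2(source, destination / source.name)
            else:
                missing.append(entry)
    return missing


# Example call (hypothetical paths):
# for entry in copy_playlist("playlist.txt", r"X:\BoboFMDrive"):
#     print("not found:", entry)
```

Because the encoding is stated explicitly when the playlist is opened, the result does not depend on the console's active codepage, and the returned list shows which entries were skipped.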