
Batch: how to add a Unicode header or write hex values to a file, or any other way around this?


I have a batch script that uses drag and drop and creates some html code based on the filenames of the dropped files/folders. With

chcp 65001

I get this to write Unicode. All fine and well, in the Notepad editor at least, while the browser only shows garbage. When I re-save the file in Notepad, it works fine in the browser, too. Unfortunately, it seems the created Unicode file is missing the two "Unicode header" bytes (0xFF and 0xFE) at the very start of the file, as a comparison with hexdump (http://www.fileformat.info/tool/hexdump.htm) showed.

On this topic I found this: http://www.robvanderwoude.com/type.php#Unicode

The file linked from there sort of doesn't work (parameter format error). The examples from that site that rely on non-native echo tools etc. are out of the question. Copying an empty Unicode-header helper file and appending my file to it works fine but is far from ideal, since every folder from which files are dragged and dropped would need to contain this helper file. That cannot be assumed and is impractical, so it is no good.

Using type is also out of the question, as it creates a whole lot of whitespace between the characters.

So I was thinking of writing the file with missing header into a temp file, add the two hex values into a file and append the temp file to it. So basically writing the hex chars directly instead of copying them from the empty unicode helper file.

I found this: http://www.dostips.com/forum/viewtopic.php?f=3&t=3857 and moreover this: Writing characters > 7F (127) as hex strings according to code page 1252 in windows batch file

I thought I could just replace the example hex values with 0xFF and 0xFE and make it echo to a file:

@echo off
call :hex2Char 0xFF char_FF
call :hex2Char 0xFE char_FE
echo %char_FF% %char_800%
exit /b

:hex2Char  hexString  rtnVar
  for /f delims^=^ eol^= %%A in (
    'forfiles /p "%~dp0." /m "%~nx0" /c "cmd /c echo(%~1"'
  ) do set "%~2=%%A" >> temp.txt 
exit /b

But it turned out not to be as simple as that. Two issues come out of it:

  1. It writes some Unicode characters in there, but the result is not the same as the Unicode helper file, as hexdump shows.

file name: UniHeader.txt
mime type: 

0000-0003:  ef bb bf                                   


file name: temp.txt
mime type: 

0000-0000:                                                   

In fact, I can change the FF or FE and it still only prints 0000-0000 in the hexdump output.

  2. Whatever I add after that code (such as the code that writes my header-less file and appends it to the created one) never runs: the script stops at the second exit /b and writes nothing more. But removing that exit /b makes the whole thing not work at all, and moving it to the end of the file makes the script unable to find the file dropped onto the .bat. In all honesty, I cannot make sense of these few lines of code at the moment. exit /b marks the end of the command, if I understand it correctly, so why does the script continue after the first exit /b but stop at the second one? I also tried labels and a goto, which didn't work either.

I am at a loss right now. Is there any elegant way to solve this?


Solution

  • Include them inside your batch file.

    @echo off
    
        for /f "tokens=2 delims=:" %%f in ('findstr /b /c:"BOFM:" "%~dpnx0"') do echo %%f
    
    exit /b
    rem Here starts the special characters part
    BOFM:ÿþ:
    

    The characters on the line that starts with BOFM: are typed as ALT+character code to get the desired bytes.

    EDITED -

    I give up. I'm not able to make it work consistently with multiple code pages across batch file, data files and editors. There is no way to guarantee what will be generated. So I took @foxidrive's answer (awesome!) to generate the file prefix and tried it.

    What I've found is that if we use FF FE as a prefix for a file generated from a cmd that is not in Unicode mode (no /u parameter) but uses a Unicode code page (65001), we get a file that is marked as Unicode (by the prefix) but whose content is not: only one byte per character is generated. So we get the "Chinese"-looking characters, which are just a bad translation of a single-byte character stream into two-byte characters.
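
    The mismatch described above can be reproduced with a short sketch. It assumes certutil is available, and the output name mojibake-demo.txt is hypothetical, chosen only for illustration. The file gets an FF FE prefix followed by single-byte text, so an editor that trusts the prefix pairs up the bytes: 48 69 for "Hi" is read as the single code unit U+6948, which lies in the CJK range.

    ```batch
    @echo off
    setlocal

    rem Hypothetical output name, used only for this illustration.
    set "demoFile=mojibake-demo.txt"

    rem //4= is the Base64 form of FF FE; certutil decodes it in place.
    >"%demoFile%" echo //4=
    certutil -f -decode "%demoFile%" "%demoFile%" >nul

    rem A normal (non /u) cmd appends one byte per ASCII character,
    rem so the content contradicts the two-byte-per-character prefix.
    >>"%demoFile%" echo Hi

    endlocal
    ```

    Opening the resulting file in Notepad should show the garbled characters rather than "Hi".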

    If we use the same prefix but from a Unicode cmd (with the /u parameter) and a Unicode code page (65001), then a real Unicode file is generated, and the content is shown correctly on the command line, in Notepad and in browsers (tested in IE and Firefox). But this is a real Unicode file, so two bytes per character are generated.
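
    A minimal sketch of this variant, assuming certutil is available and using the hypothetical file name utf16-demo.txt. It only illustrates the prefix plus /u combination with plain ASCII text and leaves the code page untouched, so it is not the full script.

    ```batch
    @echo off
    setlocal

    rem Hypothetical output name, used only for this illustration.
    set "demoFile=utf16-demo.txt"

    rem Write the FF FE prefix (Base64 //4=) and decode it in place.
    >"%demoFile%" echo //4=
    certutil -f -decode "%demoFile%" "%demoFile%" >nul

    rem cmd /u makes internal commands such as echo write two-byte
    rem characters when their output is redirected, matching the prefix.
    cmd /u /c "echo Hi>>%demoFile%"

    endlocal
    ```

    Here the content after the prefix is written as two bytes per character, so prefix and content agree.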

    Instead of FF FE, we can send a UTF-8 BOM (EF BB BF) from a non-Unicode cmd but with the Unicode code page. This generates a UTF-8 file with a BOM prefix and one or more bytes per character (depending on the UTF-8 encoding of each character), which is shown correctly in editors and browsers but not on the command line.

    The code I've been trying (adapted from the OP's attached files, to be run from a non-Unicode cmd) is:

    @echo off
    
        if ["%~1"]==[""] goto :EOF
    
        setlocal enableextensions enabledelayedexpansion
    
        rem File to generate
        set "myFile=aText.txt"
    
        rem save current pagecode
        for /f "tokens=2 delims=:" %%f in ('chcp') do set "cp=%%f"
    
        rem Generate BOM
        call :generateBOM "%myFile%"
    
        rem change to unicode 
        chcp 65001 > nul 
    
    :loop
        echo %1 >> "%myFile%"
        for %%a in ("%~1") do (
            echo %%~nxa 
            echo   ^<br^>^<img src='%%~nxa'^>^<br^> 
        ) >> "%myFile%"
    
        shift
        if ["%~1"]==[""] goto showData
        goto loop   
    
    :showData
    
        "%myFile%"
    
    :endProcess
        rem Cleanup and restore pagecode
        endlocal & chcp %cp% > nul 
    
        exit /b 
    
    :generateBOM file
        rem [ EF BB BF ] utf8 bom     encoded value = 77u/
        rem [ FF FE ]    unicode bom  encoded value = //4=
        echo 77u/>"%~1"
    
        rem Yes, certutil allows decode inplace, so no temporary file needed
        certutil -f -decode "%~1" "%~1" >nul
    
        goto :EOF
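
    To confirm locally which prefix a generated file starts with, instead of using an online hexdump, something like the following sketch may help. It assumes certutil supports the -encodehex verb on the machine in question; the temporary file name is hypothetical, and the file to inspect is passed as the first argument.

    ```batch
    @echo off
    setlocal

    rem Sketch only: the file to inspect is passed as the first argument.
    if "%~1"=="" exit /b 1

    rem Hypothetical temporary file that receives the textual dump.
    set "dumpFile=%temp%\bomcheck_%random%.txt"

    rem certutil -encodehex writes a hex dump of the input file.
    certutil -f -encodehex "%~1" "%dumpFile%" >nul

    rem Print the dump; the leading bytes show EF BB BF or FF FE if a prefix exists.
    type "%dumpFile%"

    del "%dumpFile%"
    endlocal
    ```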