ruby vbscript ole

Word Document.SaveAs ignores encoding when called through OLE from Ruby or VBS


I have a script (VBS or Ruby) that saves a Word document as 'Filtered HTML' (FileFormat 10), but the Encoding parameter is ignored. I pass 65001 (UTF-8), yet the HTML file is always encoded in Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.

Ruby Example:

require 'win32ole'
word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit

VBS Example:

Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing

Documentation:

Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx

msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx

Any suggestions on how to make Word save the HTML file in UTF-8?


Solution

  • My solution was to open the HTML file using the same character set that Word used to save it (Windows-1252) and transcode it to UTF-8 while reading. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.

    require 'nokogiri'
    require 'sanitize'
    
    # ... add some code converting a Word file to HTML.
    
    # Post export cleanup.
    # Read as Windows-1252 (what Word wrote), transcode to UTF-8, and close the file handle.
    html = File.open(html_file_name, 'r:windows-1252:utf-8') { |f| '<!DOCTYPE html>' + f.read }
    html_document = Nokogiri::HTML::Document.parse(html)
    Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
    html_document.at_css('html')['lang'] = 'en-US'
    html_document.at_css('meta[name="Generator"]').remove()
    
    # ... add more cleaning up of Word's HTML noise.
    
    sanitized_html = html_document.to_html({encoding: 'utf-8', indent: 0})
    # writing output to (new) file
    sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
    File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
        f.write sanitized_html
    end
    

    HTML Sanitizer: https://github.com/rgrove/sanitize/

    HTML parser and modifier: http://nokogiri.org/

    In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

    I haven't tested SaveAs2, since I don't have Word 2010.
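
The re-encoding step in the cleanup code above can be tried on its own, without Word, Sanitize, or Nokogiri. The following is only a minimal sketch: the helper name `transcode_to_utf8` and the sample file names are made up for illustration, and it assumes the input really is Windows-1252, as it was for the files Word 2007 produced here.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of the original script): reads a file that
# was written as Windows-1252 and writes a UTF-8 copy of it.
def transcode_to_utf8(source_path, target_path)
  # 'r:windows-1252:utf-8' sets the external encoding to Windows-1252 and the
  # internal encoding to UTF-8, so Ruby transcodes while reading.
  text = File.open(source_path, 'r:windows-1252:utf-8', &:read)
  File.open(target_path, 'w:UTF-8') { |f| f.write(text) }
  text
end

Dir.mktmpdir do |dir|
  source = File.join(dir, 'word_export.html')      # assumed sample name
  target = File.join(dir, 'word_export_utf8.html') # assumed sample name

  # In Windows-1252, 0x93 and 0x94 are curly quotes and 0xE9 is "e" with acute.
  File.binwrite(source, "<p>\x93caf\xE9\x94</p>".b)

  transcode_to_utf8(source, target)

  # Check that the bytes on disk are now well-formed UTF-8.
  puts File.binread(target).force_encoding('UTF-8').valid_encoding?
end
```

If the source file is not actually Windows-1252, the transcoding still succeeds for most bytes but produces the wrong characters, so it is worth checking a few non-ASCII characters in the result by eye.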