
AppleScript Replace All HTML Character Codes in HTML String with Characters


I am converting HTML text to plain text and need to replace the HTML character codes with the actual characters the codes represent.

The following sample code does that for two character codes, but requires a separate line for each character code.

Is there a method of replacing all of these without hard coding each character code?


set theText to replaceText("&amp;", "&", theText)
set theText to replaceText("&nbsp;", " ", theText)

on replaceText(find, replace, textString)
    set prevTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to find
    set textString to text items of textString
    set AppleScript's text item delimiters to replace
    set textString to "" & textString
    set AppleScript's text item delimiters to prevTIDs
    return textString
end replaceText

EDIT BASED ON SOLUTION By Jerry Stratton

I am using this AppleScript with GarageSale, which is a Mac-only application that manages eBay listings. The Preview Mode tab includes a WYSIWYG editor for an eBay listing's description. The Editor Mode tab displays the HTML version of the Preview Mode tab.

I needed an AppleScript that would output the description to the clipboard. The problem is that GarageSale's AppleScript library only refers to the HTML version of the description, so I needed a way to convert that to plain text before copying it to the clipboard.

I needed to add some additional code that trims any empty lines at the beginning or end of the string. This is due to how the HTML is generated in the Preview Mode. For instance, it might add an empty division in the HTML before the text, or add a division with only a BR tag inside of it. This funky HTML gets converted to empty lines in the plain text conversion, which are then removed by the added trim code.

Since I have several versions of GarageSale installed I typically rename the app by adding its version. The "tell application" line needs to reference the specific name you are using on your Mac, which is typically just "GarageSale".

tell application "GarageSale 9.8.1"
    repeat with theListing in (get selected ebay listings)
        set des to get the description of theListing
    end repeat
end tell

set theText to des

set theText to do shell script "/usr/bin/textutil -stdin -stdout -format html -convert txt <<< " & quoted form of theText

set the_text to theText

repeat while the_text starts with return
    set the_text to the_text's text 2 thru -1
end repeat
repeat while the_text ends with return
    set the_text to the_text's text 1 thru -2
end repeat

set the clipboard to the_text


Solution

  • It depends a lot on what you mean by “without hard coding each character code” as well as on the real-world text you’ll be using. @red_menace’s suggestion of textutil is a very good one. The textutil command is pre-installed in /usr/bin, and can be used with AppleScript via do shell script.

    For example:

    set theText to "These donuts &amp; pastries include some &eacute;clairs."
    set theText to do shell script "/usr/bin/textutil -stdin -stdout -format html -convert txt <<< " & quoted form of theText
    

    The options, in order, are:

    1. -stdin: This tells textutil to, instead of getting its input from a file, accept its input from standard input, which is how AppleScript will be sending it. This is the same as if you were to “pipe” text from one command to another on the command line.
    2. -stdout: This tells textutil to, instead of printing its output to a file, print its output to standard output, which is how AppleScript expects it to be sent.
    3. -format html: This tells textutil to interpret its input as HTML. This is necessary to tell it to look for entities such as &nbsp; and &amp;.
    4. -convert txt: This tells textutil to convert its input to text. So that when it looks for, say, &amp; it will convert it to &.

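    The three left-pointing angle brackets (<<<) form a shell “here-string”: the quoted text that follows them is fed to the command’s standard input, much as if it had been piped in from another command.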

    If I run this on the arbitrary input text These donuts &amp; pastries include some &eacute;clairs. I end up with:

    These donuts & pastries include some éclairs.
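
    If you call this from several places, it may be convenient to wrap the command in a handler. The following is only a sketch: the handler name htmlToPlainText is my own invention, and I have added textutil’s -inputencoding and -encoding options on the assumption that your HTML may contain literal non-ASCII characters that you want read and written as UTF-8. Drop them if the plain command already behaves correctly for your text.

    ```applescript
    -- Hypothetical wrapper around the textutil command shown above.
    on htmlToPlainText(htmlText)
        -- -inputencoding / -encoding are assumptions: they ask textutil to
        -- treat the input, and write the output, as UTF-8.
        set shellCommand to "/usr/bin/textutil -stdin -stdout -format html -inputencoding UTF-8 -convert txt -encoding UTF-8 <<< " & quoted form of htmlText
        return do shell script shellCommand
    end htmlToPlainText

    htmlToPlainText("These donuts &amp; pastries include some &eacute;clairs.")
    ```

    Using quoted form of inside the handler keeps apostrophes and other shell-sensitive characters in the HTML from breaking the command.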

    If you wish to perform the full conversion in AppleScript, there really isn’t a direct way of doing it.

    1. There may well be a way to do it by means of use framework "Foundation" but that’s not something I’m familiar with.
    2. You might also find using Automator to create a Quick Action a more reliable means of doing what you want.
    3. And a more complicated solution would be to turn the conceptualization of the solution on its head. Instead of using AppleScript to call out to an environment that can do this, you could start with an environment that can do this, such as by creating a Swift app in Xcode, and have it call out to AppleScript. This other environment could get the text via AppleScript (or Apple Events) and then modify the text using its own tools.
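
    For the first option, use framework "Foundation", a rough sketch might look like the following. I have not verified it against your GarageSale text, so treat it as a starting point. It assumes AppleScript 2.4 or later (OS X 10.10+), relies on AppKit’s initWithHTML:documentAttributes: addition to NSAttributedString, and should be run on the main thread (the default when you run it in Script Editor). The handler name htmlToPlainText is hypothetical.

    ```applescript
    use AppleScript version "2.4"
    use framework "Foundation"
    use framework "AppKit"
    use scripting additions

    -- Hypothetical handler: lets AppKit's HTML importer do the decoding.
    on htmlToPlainText(htmlText)
        set sourceString to current application's NSString's stringWithString:htmlText
        -- UTF-16 data carries a byte-order mark, so the importer does not
        -- have to guess the encoding of the HTML fragment.
        set htmlData to sourceString's dataUsingEncoding:(current application's NSUnicodeStringEncoding)
        set attributedText to current application's NSAttributedString's alloc()'s initWithHTML:htmlData documentAttributes:(missing value)
        if attributedText is missing value then error "The HTML could not be parsed."
        -- Discard the styling and keep only the characters.
        return (attributedText's |string|()) as text
    end htmlToPlainText

    htmlToPlainText("These donuts &amp; pastries include some &eacute;clairs.")
    ```

    Like textutil, this interprets the whole string as HTML, so tags are stripped and entities are decoded in one pass, without spawning a shell.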

    However, one way of doing such a conversion in AppleScript in a similar but less hard-coded way than in your example code is to place the entities in a tab-delimited file. Such a file could be constructed using any list of entities. Here is a simple file:

    amp[tab]&
    nbsp[tab][ ] 
    eacute[tab]é
    

    Replace [tab] with a tab character and [ ] with a space (or non-breaking space).

    An AppleScript to use this file might look like this:

    property codeFile : POSIX file "/Users/USER/PATH/TO/codes.txt"
    
    set theText to "These donuts &amp; pastries include some &eacute;clairs."
        
    set AppleScript's text item delimiters to tab
    repeat with codeLine in paragraphs of (read codeFile as «class utf8»)
        if (count codeLine) is greater than 1 then
            set theCode to text item 1 of codeLine
            set theCharacter to text item 2 of codeLine
            set theCode to "&" & theCode & ";"
            set theText to replaceText(theCode, theCharacter, theText)
        end if
    end repeat
    theText
    
    on replaceText(find, replace, textString)
        set prevTIDs to AppleScript's text item delimiters
        set AppleScript's text item delimiters to find
        set textString to text items of textString
        set AppleScript's text item delimiters to replace
        set textString to "" & textString
        set AppleScript's text item delimiters to prevTIDs
        return textString
    end replaceText
    

    As you can see, it uses your replaceText handler verbatim. It performs the same conversion as performed by textutil, albeit in a much slower and less versatile manner. Depending on the size of your text this may or may not be an issue.

    It counts the characters in each line because the final line is often an empty one (or empty except for a line delimiter). It manipulates the text item delimiter in the same manner as your replaceText handler does.
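
    Two details are worth watching with this table-driven approach. AppleScript’s text item delimiters ignore case by default, so &Eacute; and &eacute; would be treated as the same code, and replacing &amp; before the other codes would turn &amp;nbsp; into a space instead of the literal text &nbsp;. The sketch below is one way to guard against both. To keep it self-contained it uses an inline list instead of the file, and the names entityTable and replaceTextCaseSensitive are hypothetical.

    ```applescript
    -- Hypothetical inline table of {entity name, character} pairs.
    -- "amp" is listed last so that "&amp;nbsp;" decodes to the literal "&nbsp;".
    set entityTable to {{"nbsp", " "}, {"eacute", "é"}, {"Eacute", "É"}, {"amp", "&"}}

    set theText to "&Eacute;clairs &amp; donuts; type &amp;nbsp; for a space."

    repeat with entityPair in entityTable
        set {entityName, entityCharacter} to contents of entityPair
        set theText to replaceTextCaseSensitive("&" & entityName & ";", entityCharacter, theText)
    end repeat
    theText

    -- Same technique as replaceText, but the search respects case.
    on replaceTextCaseSensitive(find, replace, textString)
        set prevTIDs to AppleScript's text item delimiters
        considering case
            set AppleScript's text item delimiters to find
            set textItems to text items of textString
        end considering
        set AppleScript's text item delimiters to replace
        set textString to textItems as text
        set AppleScript's text item delimiters to prevTIDs
        return textString
    end replaceTextCaseSensitive
    ```

    The same ordering applies to the tab-delimited file: put the amp line at the end.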

    And if this is a subset of a much longer script, you’ll want to save and reset AppleScript’s text item delimiters around the loop, as you already do in replaceText.
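
    As a minimal sketch of that save-and-restore pattern, here is a small handler that splits one line of the codes file. The handler name splitCodeLine is hypothetical, and the try block is my own addition so that the delimiters are restored even if the split raises an error.

    ```applescript
    -- Hypothetical handler: splits one tab-delimited line into its fields.
    on splitCodeLine(codeLine)
        -- Remember whatever delimiters the rest of the script was using.
        set prevTIDs to AppleScript's text item delimiters
        set AppleScript's text item delimiters to tab
        try
            set theParts to text items of codeLine
        on error errMsg number errNum
            -- Restore the delimiters before passing the error along.
            set AppleScript's text item delimiters to prevTIDs
            error errMsg number errNum
        end try
        set AppleScript's text item delimiters to prevTIDs
        return theParts
    end splitCodeLine

    set theParts to splitCodeLine("amp" & tab & "&")
    -- theParts is {"amp", "&"}
    ```

    Keeping the delimiter change inside a handler like this means the rest of the script never sees the temporary value.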