I am converting HTML text to plain text and need to replace the HTML character codes with the actual characters the codes represent.
The following sample code does that for two character codes, but requires a separate line for each character code.
Is there a method of replacing all of these without hard coding each character code?
set theText to replaceText("&amp;", "&", theText)
set theText to replaceText("&nbsp;", " ", theText)

on replaceText(find, replace, textString)
    set prevTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to find
    set textString to text items of textString
    set AppleScript's text item delimiters to replace
    set textString to "" & textString
    set AppleScript's text item delimiters to prevTIDs
    return textString
end replaceText
EDIT BASED ON SOLUTION By Jerry Stratton
I am using this AppleScript with GarageSale, which is a Mac-only application that manages eBay listings. The Preview Mode tab includes a WYSIWYG editor for an eBay listing's description. The Editor Mode tab displays the HTML version of the Preview Mode tab.
I needed an AppleScript that would output the description to the clipboard. The problem is that GarageSale's AppleScript library only refers to the HTML version of the description, so I needed a way to convert that to plain text before copying it to the clipboard.
I needed to add some additional code that trims any empty lines at the beginning or end of the string. This is due to how the HTML is generated in the Preview Mode. For instance, it might add an empty division in the HTML before the text, or add a division with only a BR tag inside of it. This funky HTML gets converted to empty lines in the plain text conversion, which are then removed by the added trim code.
Since I have several versions of GarageSale installed I typically rename the app by adding its version. The "tell application" line needs to reference the specific name you are using on your Mac, which is typically just "GarageSale".
tell application "GarageSale 9.8.1"
    repeat with theListing in (get selected ebay listings)
        set des to get the description of theListing
    end repeat
end tell

-- Convert the HTML description to plain text
set theText to des
set theText to do shell script "/usr/bin/textutil -stdin -stdout -format html -convert txt <<< " & quoted form of theText

-- Trim empty lines from the beginning and end
set the_text to theText
repeat while the_text starts with return
    set the_text to the_text's text 2 thru -1
end repeat
repeat while the_text ends with return
    set the_text to the_text's text 1 thru -2
end repeat

set the clipboard to the_text
It depends a lot on what you mean by “without hard coding each character code” as well as on the real-world text you’ll be using. @red_menace’s suggestion of textutil is a very good one. The textutil command is pre-installed in /usr/bin, and can be used with AppleScript via do shell script.
For example:
set theText to "These donuts &amp; pastries include some &eacute;clairs."
set theText to do shell script "/usr/bin/textutil -stdin -stdout -format html -convert txt <<< " & quoted form of theText
The options, in order, are:
-stdin: This tells textutil to accept its input from standard input instead of from a file, which is how AppleScript will be sending it. This is the same as if you were to “pipe” text from one command to another on the command line.
-stdout: This tells textutil to print its output to standard output instead of to a file, which is how AppleScript expects it to be sent.
-format html: This tells textutil to interpret its input as HTML. This is necessary to tell it to look for entities such as &nbsp; and &amp;.
-convert txt: This tells textutil to convert its input to text, so that when it finds, say, &amp; it will convert it to &.
The three left-pointing angle brackets (<<<) are a means of piping text: they form a here-string, which feeds the quoted text that follows to the command's standard input.
If I run this on the arbitrary input text These donuts &amp; pastries include some &eacute;clairs. I end up with:
These donuts & pastries include some éclairs.
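If the conversion is needed in more than one place, the call can be wrapped in a handler. This is only a minimal sketch: the handler name htmlToPlainText is hypothetical, and the -inputencoding UTF-8 option is my own addition, on the assumption that it is needed for input that already contains non-ASCII characters and carries no charset declaration.

```applescript
-- Hypothetical helper: converts an HTML fragment to plain text with textutil.
-- Assumption: -inputencoding UTF-8 is an addition not present in the call above;
-- it asks textutil to read the incoming text as UTF-8.
on htmlToPlainText(htmlText)
	set shellCommand to "/usr/bin/textutil -stdin -stdout -inputencoding UTF-8 -format html -convert txt <<< " & quoted form of htmlText
	return do shell script shellCommand
end htmlToPlainText

set theText to htmlToPlainText("These donuts &amp; pastries include some &eacute;clairs.")
```

For this sample input the handler should give the same result as the inline call, since the entities themselves are plain ASCII.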
If you wish to perform the full conversion in AppleScript, there really isn’t a direct way of doing it. It may be possible with use framework "Foundation", but that’s not something I’m familiar with.
However, one way of doing such a conversion in AppleScript, in a similar but less hard-coded way than in your example code, is to place the entities in a tab-delimited file. Such a file could be constructed using any list of entities. Here is a simple file:
amp[tab]&
nbsp[tab][ ]
eacute[tab]é
Replace [tab] with a tab character and [ ] with a space (or non-breaking space).
An AppleScript to use this file might look like this:
property codeFile : POSIX file "/Users/USER/PATH/TO/codes.txt"

set theText to "These donuts &amp; pastries include some &eacute;clairs."
set AppleScript's text item delimiters to tab
repeat with codeLine in paragraphs of (read codeFile as «class utf8»)
    if (count codeLine) is greater than 1 then
        set theCode to text item 1 of codeLine
        set theCharacter to text item 2 of codeLine
        set theCode to "&" & theCode & ";"
        set theText to replaceText(theCode, theCharacter, theText)
    end if
end repeat
theText

on replaceText(find, replace, textString)
    set prevTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to find
    set textString to text items of textString
    set AppleScript's text item delimiters to replace
    set textString to "" & textString
    set AppleScript's text item delimiters to prevTIDs
    return textString
end replaceText
As you can see, it uses your replaceText handler verbatim. It performs the same conversion as performed by textutil, albeit in a much slower and less versatile manner. Depending on the size of your text this may or may not be an issue.
It counts the characters in each line because the final line is often an empty one (or empty except for a line delimiter). It manipulates the text item delimiter in the same manner as your replaceText handler does.
And if this is a subset of a much longer script, you’ll want to save and reset AppleScript’s text item delimiters, as you already do in replaceText.
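As a rough sketch of that last point, the loop could be moved into a handler that saves the delimiters on entry and restores them on exit, even if reading the file fails. The names decodeEntities, savedTIDs, and lineText are hypothetical; replaceText is your handler, repeated here only so the sketch is self-contained.

```applescript
-- Hypothetical handler: applies every entity listed in the tab-delimited file.
on decodeEntities(theText, codeFile)
	set savedTIDs to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to tab
		repeat with codeLine in paragraphs of (read codeFile as «class utf8»)
			set lineText to contents of codeLine
			-- Skip empty lines and lines that do not contain a tab
			if (count of text items of lineText) is greater than 1 then
				set theCode to "&" & (text item 1 of lineText) & ";"
				set theCharacter to text item 2 of lineText
				set theText to replaceText(theCode, theCharacter, theText)
			end if
		end repeat
	on error errMsg number errNum
		-- Restore the caller's delimiters before passing the error along
		set AppleScript's text item delimiters to savedTIDs
		error errMsg number errNum
	end try
	set AppleScript's text item delimiters to savedTIDs
	return theText
end decodeEntities

on replaceText(find, replace, textString)
	set prevTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to find
	set textString to text items of textString
	set AppleScript's text item delimiters to replace
	set textString to "" & textString
	set AppleScript's text item delimiters to prevTIDs
	return textString
end replaceText
```

It would be called as decodeEntities(theText, codeFile), with codeFile pointing at the same tab-delimited file as before.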