html gedit ultraedit gedit-plugin

How to apply style to all paragraphs with upper case text?


I have a very large HTML document containing plenty of paragraphs. Headings are written as UPPER CASE text within paragraphs.

How can I find all paragraphs containing UPPER CASE text and apply a style to these paragraphs?

There is also plenty of extra spacing before the text in most paragraphs. Sample of existing headings:

<p>                                                   </p>
<p>                      USU EA EUISMOD HONESTATIS DETERRUISSET.</p>
<p>Qualisque mnesarchum no nam, usu cu fastidii delicata. Eu mei nonumy libris, quas movet vivendo vim at. Prima epicuri conceptam pro ad, in suas nonumes similique duo. Qui mundi essent complectitur eu. Ei laudem veritus democritum vis, te ferri appareat eos. Ceteros pertinacia ea eum, quo integre theophrastus ex, eum et sint omnes detracto. Ea vim brute labore. Vim te esse libris erroribus, ex minimum tacimates dissentiet duo. Ignota iisque in mei, pri sanctus albucius omnesque id. Laoreet docendi theophrastus ei pri, duo wisi tollit decore ea, tempor doctus vivendo sed ad. </p>
<p>Usu ea euismod honestatis deterruisset. Ne quo malis meliore, duo viris liberavisse no, mea an vide mutat quodsi. Vis an vidit debitis, et noster aliquam pri, case iudicabit te sea. Cum sadipscing consectetuer cu, an nominavi consulatu adversarium sea, nam ad dico evertitur voluptaria. Id justo viderer bonorum per, in ius impedit tincidunt, nec et quis scaevola. Cu congue iriure scaevola usu. Ei elit reformidans suscipiantur eos, cum ut doming iracundia.  </p>
<p>                                                                             </p>
<p>                       CU CONGUE IRIURE SCAEVOLA   --
   UT DOMING IRACUNDIA. </p>
<p>                                  DICO TEMPOR HABEMUS.</p>
<p>Homero everti ei nam. An liber euripidis vis, pericula persecuti deseruisse ad mea. Dicant offendit sea et, per esse timeam deserunt ut. In pri enim sadipscing, ei movet soleat suavitate vim. Mea et omnesque phaedrum, paulo luptatum concludaturque vim ea. -- LIBER. </p>

I want to apply a style to the UPPER CASE text (headings) inside paragraph tags to make it bold.

The block above should look like this after running the regular expression replace(s) or the UltraEdit macro:

<p>                                                   </p>
<p class="bold">                      USU EA EUISMOD HONESTATIS DETERRUISSET.</p>
<p>Qualisque mnesarchum no nam, usu cu fastidii delicata. Eu mei nonumy libris, quas movet vivendo vim at. Prima epicuri conceptam pro ad, in suas nonumes similique duo. Qui mundi essent complectitur eu. Ei laudem veritus democritum vis, te ferri appareat eos. Ceteros pertinacia ea eum, quo integre theophrastus ex, eum et sint omnes detracto. Ea vim brute labore. Vim te esse libris erroribus, ex minimum tacimates dissentiet duo. Ignota iisque in mei, pri sanctus albucius omnesque id. Laoreet docendi theophrastus ei pri, duo wisi tollit decore ea, tempor doctus vivendo sed ad. </p>
<p>Usu ea euismod honestatis deterruisset. Ne quo malis meliore, duo viris liberavisse no, mea an vide mutat quodsi. Vis an vidit debitis, et noster aliquam pri, case iudicabit te sea. Cum sadipscing consectetuer cu, an nominavi consulatu adversarium sea, nam ad dico evertitur voluptaria. Id justo viderer bonorum per, in ius impedit tincidunt, nec et quis scaevola. Cu congue iriure scaevola usu. Ei elit reformidans suscipiantur eos, cum ut doming iracundia.  </p>
<p>                                                                             </p>
<p class="bold">                       CU CONGUE IRIURE SCAEVOLA   --
   UT DOMING IRACUNDIA. </p>
<p class="bold">                                  DICO TEMPOR HABEMUS.</p>
<p>Homero everti ei nam. An liber euripidis vis, pericula persecuti deseruisse ad mea. Dicant offendit sea et, per esse timeam deserunt ut. In pri enim sadipscing, ei movet soleat suavitate vim. Mea et omnesque phaedrum, paulo luptatum concludaturque vim ea. -- LIBER. </p>

As some paragraphs contain mixed upper case and lower case text, the regex needs to be limited to paragraphs containing only UPPER CASE text, without lower case letters. There can also be line breaks within a paragraph.

How can this be accomplished using a macro or code in UltraEdit for Linux? (Or the Windows version, as the regexes are the same anyway.)

I want to apply a class to the paragraphs (instead of making them headers H1, H2, etc.) because ebook readers (Kindle, etc.) may display headers in unpredictable ways. The document encoding is UTF-8 and the text is Cyrillic.


Solution

  • Regular expression support in UltraEdit

    UltraEdit v11.20, as mentioned in the original question before editing, is very old and does not support regular expression finds/replaces in Perl syntax, only in UltraEdit and Unix syntax. Unix syntax is similar to Perl but very limited in its capabilities.

    Support for Perl regular expression finds/replaces was introduced with UltraEdit for Windows v12.00, released on 2006-03-15. There have been many minor and a few major updates to UltraEdit's Perl regular expression support. The minor updates were bug fixes. The major updates, for example in UE v19.00 and UE v21.20, introduced a newer version of the Boost regular expression library embedded in UltraEdit for Windows, with enhancements to the regular expression engine itself.

    I don't know which Perl-syntax regular expression library is used by UltraEdit on Mac and on Linux. The regular expression libraries on the various platforms and in their various versions have much in common, but there are also differences. So the platform and the version of UltraEdit, or rather the version of the regular expression library it uses, must be taken into account for complex Perl regular expression finds/replaces. There is no single Perl regular expression library used by all applications on all platforms in all versions over the last 20 years.

  • Character set (code page) dependent solutions

    With UltraEdit for Windows v11.20 or any later version of UltraEdit, use UltraEdit regular expressions for this task with the following search and replace strings, with Match Case additionally checked in the replace window:

    Find what: <p^(>[~A-Za-z<>]++[A-Z][^t^r^n -`{-~]++</p>^)
    Replace with: <p class="bold"^1

    This is a tagged expression in UltraEdit syntax.

    It searches for <p followed by >, then 0 or more characters that are neither an ASCII letter in any case nor an angle bracket, then at least 1 upper case ASCII letter, then 0 or more ASCII characters other than the lower case ASCII letters, up to </p>. The third character class expects that < in paragraph text is already encoded as &lt; and > as &gt;, as required by the HTML/XHTML and XML standards.

    The third character class [^t^r^n -`{-~] contains two unusual character range definitions that require knowledge of the ASCII table. The first range runs from space to grave accent and includes many frequently used punctuation marks, the digits 0-9 and the upper case ASCII letters. The second range runs from left curly bracket to tilde and covers the remaining non-word characters in the ASCII range.
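
    To make the two ranges concrete, here is a small Python sketch that expands them from their ASCII code points. The variable names are only illustrative and Python is used merely as a convenient calculator, not because UltraEdit needs it.

```python
# Expand the two ASCII ranges used in the third character class.
first_range = "".join(chr(c) for c in range(ord(" "), ord("`") + 1))
second_range = "".join(chr(c) for c in range(ord("{"), ord("~") + 1))

print(first_range)   # space, punctuation, 0-9, A-Z, [ \ ] ^ _ and `
print(second_range)  # {|}~

# The lower case letters a-z (0x61-0x7A) sit between the two ranges,
# so neither range contains any of them.
allowed = set(first_range + second_range)
print(any(ch in allowed for ch in "abcdefghijklmnopqrstuvwxyz"))  # False
```

    Note that the first range also contains < and >, which is why unencoded angle brackets in paragraph text are a problem for this expression.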

    The same regular expression replace in Unix/Perl syntax:

    Find what: <p(>[^A-Za-z<>]*[A-Z][\t\r\n -`{-~]*</p>)
    Replace with: <p class="bold"\1
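
    Outside UltraEdit, the same ASCII-only replace can be tried with Python's re module. This is only a sketch: the function name and sample text are made up for illustration. It also makes one deliberate change: Python's quantifiers are greedy and the third class covers < and >, so the sketch uses the lazy *? to keep two consecutive heading paragraphs from being swallowed by a single match.

```python
import re

# ASCII-only pattern from the Unix/Perl variant, with a lazy third class
# (an adaptation for Python, see the note above).
HEADING = re.compile(r"<p(>[^A-Za-z<>]*[A-Z][\t\r\n -`{-~]*?</p>)")

def mark_headings(html: str) -> str:
    """Add class="bold" to paragraphs whose text has no lower case ASCII letter."""
    return HEADING.sub(r'<p class="bold"\1', html)

sample = (
    "<p>   </p>\n"
    "<p>   USU EA EUISMOD HONESTATIS DETERRUISSET.</p>\n"
    "<p>Usu ea euismod honestatis deterruisset. </p>\n"
    "<p>   CU CONGUE IRIURE SCAEVOLA   --\n"
    "   UT DOMING IRACUNDIA. </p>\n"
    "<p>   DICO TEMPOR HABEMUS.</p>\n"
    "<p>Homero everti ei nam. -- LIBER. </p>\n"
)

print(mark_headings(sample))
```

    With this sample, the three all-upper-case paragraphs get the class, while the empty paragraph and the two mixed-case paragraphs stay unchanged.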

    Other upper case characters, like the German characters ÄÖÜ, can also be added to the character classes inside the 3 square brackets. In that case the language-specific lower case characters, like äöüß, must also be added to the first character class definition so that they prevent a positive match.

    A negative character class can also be used instead of a positive character class, with the option Match Case checked.

    Example in UltraEdit syntax:

    Find what: <p^(>[~A-Za-z<>ÄÖÜäöüß]++[A-ZÄÖÜ][~a-z<>äöüß]++</p>^)
    Replace with: <p class="bold"^1

    This has the advantage that all characters other than the lower case characters listed in the negative character classes and the angle brackets are treated as valid heading characters, which includes many characters from the upper half of the character set / code page in use.

    This task would be easier with a version of UltraEdit newer than v11.20 because the Perl regular expression engine has a predefined character class for lower case characters and another one for upper case characters according to the Unicode definition.
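
    The negative character class idea can be sketched in Python as well. The pattern below allows zero or more non-letter characters before the first upper case letter, so a heading that starts directly after <p> also matches. The function name and test strings are hypothetical, and the umlaut lists only cover German.

```python
import re

# Third class is negative: anything except lower case letters and angle brackets.
# Extend both letter lists for languages other than German.
HEADING_DE = re.compile(r"<p(>[^A-Za-z<>ÄÖÜäöüß]*[A-ZÄÖÜ][^a-z<>äöüß]*</p>)")

def mark_headings_de(html: str) -> str:
    """Add class="bold" to paragraphs without lower case letters (German list)."""
    return HEADING_DE.sub(r'<p class="bold"\1', html)

print(mark_headings_de("<p>  ÜBER DEN FLUSS: TEIL 2.</p>"))
print(mark_headings_de("<p>Über den Fluss.</p>"))
```

    Because the negative class excludes <, a match cannot run past the closing </p> here, so no lazy quantifier is needed.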

  • Unicode solutions using Perl

    A Perl regular expression replace is required for a solution that does not depend on local character sets / code pages, because it uses the character definitions of the Unicode standard.

    However, not all Perl regular expression libraries in all versions may support the expressions as written below.

    The posted Perl regular expressions were tested with UltraEdit for Windows v22.20.0.49 (the last public version of UE for Windows XP) and v23.20.0.28 (currently the latest version of UE for Windows Vista and later Windows versions).

    The Boost Perl regular expression library used by UltraEdit for Windows supports several character classes. The most interesting ones here are [:upper:] for any upper case character and [:lower:] for any lower case character.

    Examples with Perl regular expressions:

    Find what: <p(>\W*?[[:upper:]][^[:lower:]]+?</p>)
    Replace with: <p class="bold"\1

    Find what: <p(>\W*?[[:upper:]][[:upper:]\W]*?</p>)
    Replace with: <p class="bold"\1

    \W is a common "single character" character class for any non-word character.

    The "single character" character class for all lower case characters is \l, and \u is the "single character" character class for all upper case characters. These shorter character classes can also be used in the search strings:

    Find what: <p(>\W*?\u[^\l]+?</p>)
    Replace with: <p class="bold"\1

    Find what: <p(>\W*?\u[\u\W]*?</p>)
    Replace with: <p class="bold"\1

    All expressions posted here make sure that the paragraph contains at least 1 upper case character.
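
    For readers without a Boost-based editor, the same Unicode-aware rule can be approximated in Python. Python's re module does not support [[:upper:]], \u or \l as character classes, so this sketch matches every plain paragraph and then checks the text with str.isupper() and str.islower(). It is a rough equivalent and not a translation of the expressions above (for example, it also accepts digits before the first upper case letter). All names are illustrative.

```python
import re

# A paragraph without attributes or nested tags; [^<>] also spans line breaks.
PARAGRAPH = re.compile(r"<p>([^<>]*)</p>")

def is_heading(text: str) -> bool:
    """True if text has at least one upper case and no lower case character."""
    has_upper = any(ch.isupper() for ch in text)
    has_lower = any(ch.islower() for ch in text)
    return has_upper and not has_lower

def mark_unicode_headings(html: str) -> str:
    """Add class="bold" to every paragraph that qualifies as a heading."""
    def repl(match: re.Match) -> str:
        if is_heading(match.group(1)):
            return '<p class="bold">' + match.group(1) + "</p>"
        return match.group(0)  # leave all other paragraphs untouched
    return PARAGRAPH.sub(repl, html)

print(mark_unicode_headings("<p>   ГЛАВА ПЕРВАЯ.</p>\n<p>Глава первая.</p>"))
```

    Since str.isupper() and str.islower() follow the Unicode character database, this works for Cyrillic text without listing any letters explicitly.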