htmlalgorithmformatting

Join texts and optimize formatting


I have text objects (text1, text2, etc) with formatting information (eg. [bold, italic]). Now I want to concat these texts and format them in HTML.

For a simple case it would transform the following

text1[bold,italic]
text2[]
text3[bold]

into this HTML:

<b><i>text1</i></b>
text2
<b>text3</b>

But I want to join formattings where possible, not format each text individually. For example the following

text1[bold,italic]
text2[bold]
text3[]

Should result in this HTML:

<b>
  <i>text1</i>
  text2
</b>
text3

Additionally I want to wrap the "longest" formatting to the "outside" of the DOM. For example in the following case, italics is the longer formatting chain and thus wraps around the bold element:

text1[bold,italic]
text2[italic]
text3[]

Result:

<i>
  <b>text1</b>
  text2
</i>
text3

What would be a good algorithmic approach to solve this problem? Answers in pseudocode or any programming language would be helpful.


Solution

  • Here is the solution I used in the end. Thanks for all input here and elsewhere. It got me on the right track.

    The solution has two parts.

    Output

    To output the tags, a stack based approach is used. You loop over the items and close all formattings that are not in the current item (removing the formatting and all intermediaries from the stack):

    // close all formatting that is not in the current text
    $toclose = array_diff($fStack, $formatting);
    foreach($toclose as $f) {
        // we need to make sure all formatting is closed, but we close by popping the
        // stack. This ensures we don't create invalid nesting
        while (in_array($f, $fStack)) {
            $result .= $this->closeFormatting(array_pop($fStack));
        }
    }
    

    Then we open all formatting that is still missing on the stack and add it to the stack:

    // open formatting that is in the current text but not yet open
    $new = array_diff($formatting, $fStack);
    foreach ($new as $f) {
        $result .= $this->openFormatting($f);
        $fStack[] = $f;
    }
    

    Finally you can output the text itself.

    $result .= $text->__toString();
    ```php
    
    After the loop close all remaining formattings on the stack.
    
    ```php
    // close remaining formatting
    while ($fStack) {
        $result .= $this->closeFormatting(array_pop($fStack));
    }
    

    Find the longest chains

    The above approach will already work, but the tags might not be ideally nested. Eg. for

    text1[bold,italic]
    text2[italic]
    text3[]
    

    It might produce <b><i>text1</i></b><i>text2</i>text3. It depends on which formatting you found first in $formatting for the first item.

    To solve this problem, we need to sort the formatting information by longest chain.

    Eg. the above information should be:

    text1[italic, bold]
    text2[italic]
    text3[]
    

    To achieve that, I do a simple chain sorting step beforehand. You go backwards through the texts (starting at the second to last) and give it the information of how many same formattings come afterwards. You then sort the formatting by chain length (highest first).

    The above would end up as something like this:

    text1[italic:2, bold:1]
    text2[italic:1]
    text3[]
    

    loop:

    function updateFormattingScores()
    {
        $len = count($this->texts);
        if($len < 2) return;
        for($i = $len - 2; $i >= 0; $i--) {
             $this->texts[$i]->updateFormattingScores($this->texts[$i + 1]);
        }
    }
    

    update:

    public function updateFormattingScores(TextRun $nextRun)
    {
        $next = $nextRun->getFormatting();
        foreach ($next as $key => $value) {
            if($this->formatting[$key] === 0) continue; // we don't have this format
            $this->formatting[$key] += $value; // add follow up chain
        }
    
        // sort by value, longest chains first
        arsort($this->formatting);
    }
    

    With this run before the output, the output will always find the longer chain formatting first.