
Strip HTML tags from within the title and alt attributes of an image tag


Some of our articles contain images whose title/alt attributes mistakenly have links hardcoded into them, which breaks the display of the image. For example:

<img src="/imgs/my-image.jpg" title="This is a picture of a <a href="/blob.html">blob</a>." />

I've tried using preg_replace_callback, but it's difficult to match the full title value because the link adds its own double quotes inside it.

I'd like to be able to do this programmatically on the fly for any string to ensure proper output. Ideas?


Solution

  • You can try this kind of pattern:

    $pattern = <<<'EOD'
    ~
    (?:
        \G(?!\A)                 # second entry point
        (?:                        # content up to the next alt/title attribute (optional)
            [^><"]* "                 # end of the previous attribute
            (?> [^><"]* " [^"]* " )*? # other attributes (optional)
            [^><"]*                   # spaces or attributes without values (optional)
            \b(?:alt|title)\s*=\s*"   # the next alt/title attribute
        )?+                        # make the whole group optional (possessive)
      |
        <img\s[^>]*?             # first entry point
        \b(?:alt|title)\s*=\s*"
    )
    [^<"]*+\K
    (?:              # two possibilities:
        </?a[^>]*>     # an "a" tag (opening or closing)
      |                # OR
        (?=")          # followed by the closing quote
    )
    ~x
    EOD;
    
    $result = preg_replace($pattern, '', $html);
    

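    This kind of pattern relies on the \G anchor, which matches only at the position where the previous match ended, so each repeated match has to be contiguous with the one before it. The (?!\A) lookahead stops \G from also succeeding at the very start of the string, and \K drops everything matched so far from the result, so only the "a" tag (or an empty string before the closing quote) is replaced.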
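To see the \G contiguity in isolation, here is a minimal sketch of a reduced variant of the pattern. It is an illustration, not the pattern from the answer: the function name stripLinksInTitle is hypothetical, and it only cleans the first title attribute of each img tag (the optional group in the full pattern is what lets the match chain continue into a following alt/title attribute).

```php
<?php
// Reduced, hypothetical variant: handles only the first title attribute
// of each <img> tag, to show how \G chains contiguous matches.
function stripLinksInTitle(string $html): string
{
    $pattern = '~
        (?:
            \G(?!\A)                        # continue right after the previous match
          |
            <img\s[^>]*?\btitle\s*=\s*"     # or enter at a title attribute of an img tag
        )
        [^<"]*+ \K                          # skip plain text, then drop it from the match
        </?a[^>]*>                          # an opening or closing "a" tag
    ~x';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<img src="/imgs/my-image.jpg" title="This is a picture of a '
      . '<a href="/blob.html">blob</a>." /> <a href="/x">x</a>';

echo stripLinksInTitle($html), "\n";
// <img src="/imgs/my-image.jpg" title="This is a picture of a blob." /> <a href="/x">x</a>
```

The link after the image is left alone: once the chain reaches the closing quote of the title, no "a" tag follows contiguously, \G cannot match anywhere else, and the only other way in is a new img tag.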