
Regex match wrapped phrase with same delimiter (among many)


I need to parse the style attribute from an HTML corpus (it contains many different HTML entries with many different style attributes).

An example of the HTML may be the following:

<span style="font-size:0.58em;font-family:'Times New Roman';">
<span style='font-size:0.58em;font-family:"Times New Roman";'>

So the style attribute content is some text wrapped in single (') or double (") quotes. If the text starts with a single quote, it should be read until the closing single quote is met. If it starts with a double quote, it should be read until the closing double quote is met.

I have produced the following regex, which works quite well:

/style\s*=\s*(?:'|").+?(?:'|")/gmi

The problem is that my solution fails to check consistency between the opening quote and the closing quote, so it produces results like:

style="font-size:0.58em;font-family:'Times New Roman'  --missing--> ;"
style='font-size:0.58em;font-family:"Times New Roman"  --missing--> ;'

Is there a way to check both cases with one regex, or is the only option to split the current regex into two regexes, one for single quotes and one for double quotes?


Solution

  • From my above comment ...

    "Have a look into backreferences ... /(["']).*?\1/g ..."

    const markup = `
      <span style="font-size:0.58em;font-family:'Times New Roman';">foo</span>
      <span style='font-size:0.58em;font-family:"Times New Roman";'>bar</span>
    `;
    const regXStyleAttr = /style\s*=\s*(["']).*?\1/gi;
    
    console.log(
      'regex based ... markup.match(regXStyleAttr) ...',
      markup.match(regXStyleAttr)
    );
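    The backreference idea can also return just the attribute value instead of the whole `style=...` match. The following is a minimal sketch, not part of the original comment: the helper name `extractStyleValues` and the second capture group are assumptions added for illustration. It relies on `\1` to require that the closing quote equals whichever quote opened the value.

```javascript
// Hypothetical helper (an assumption, not from the original answer):
// collects the raw values of quoted style attributes from a markup string.
function extractStyleValues(markup) {
  // Group 1 captures the opening quote, group 2 the attribute value.
  // The backreference \1 only matches the same quote character as group 1.
  const regXStyleValue = /style\s*=\s*(["'])(.*?)\1/gi;
  return Array.from(markup.matchAll(regXStyleValue), match => match[2]);
}

const sampleMarkup = `
  <span style="font-size:0.58em;font-family:'Times New Roman';">foo</span>
  <span style='font-size:0.58em;font-family:"Times New Roman";'>bar</span>
`;
const styleValues = extractStyleValues(sampleMarkup);

console.log(styleValues);
```

    Since `.` does not match line breaks without the `s` flag, this sketch only finds values that sit on a single line, and it would also match `style=` inside longer attribute names such as `data-style`.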

    "... but a far more reliable approach would be to utilize DOMParser and parseFromString"

    const markup = `
      <span style="font-size:0.58em;font-family:'Times New Roman';">foo</span>
      <span style='font-size:0.58em;font-family:"Times New Roman";'>bar</span>
    `;
    console.log(
      'dom parser based ... and element node mapping ...',
      Array
        .from(
          new DOMParser()
            .parseFromString(markup, "text/html")
            .body
            .getElementsByTagName('*')
        )
        .map(elmNode => `style="${ elmNode.style.cssText }"`)
    );

    Edit ... according to the OP's following comment ...

    "... is it as tolerant as a browser will be? Is it comparable to or better than regex in performance? And can any use of the parser on the string be done without affecting the page?" – Skary

    Answer: yes - most probably - apparently.

    const markup_1 = `
      <span style="font-size:0.58em;font-family:'Times New Roman';">foo</span>
      <span style='font-size:0.58em;font-family:"Times New Roman";'>bar</span>
    `;
    const markup_2 = `
      <span style=font-size:0.58em;font-family:'Times New Roman'>foo</span>
      <span style=font-size:0.58em;font-family:"Times New Roman">bar</span>
    `;
    const markup_3 = `
      <span style=font-size:0.58em;font-family:'Times New Roman'>foo
      <span style=font-size:0.58em;font-family:"Times New Roman">bar
    `;
    console.log(
      'valid markup ...',
      Array
        .from(
          new DOMParser()
            .parseFromString(markup_1, "text/html")
            .body
            .getElementsByTagName('*')
        )
        .map(elmNode => `style="${ elmNode.style.cssText }"`)
    );
    console.log(
      'invalid markup ...',
      Array
        .from(
          new DOMParser()
            .parseFromString(markup_2, "text/html")
            .body
            .getElementsByTagName('*')
        )
        .map(elmNode => `style="${ elmNode.style.cssText }"`)
    );
    console.log(
      'even more broken markup ...',
      Array
        .from(
          new DOMParser()
            .parseFromString(markup_3, "text/html")
            .body
            .getElementsByTagName('*')
        )
        .map(elmNode => `style="${ elmNode.style.cssText }"`)
    );
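    For comparison, the regex from the first snippet requires a quote character right after `style=`, so it finds nothing in markup like `markup_2`, where the attribute value is unquoted. The following is a small sketch of that limitation; the variable names are made up for illustration and the markup lines are shortened copies of the ones above.

```javascript
// Same pattern as in the regex based snippet above.
const regXQuotedStyleAttr = /style\s*=\s*(["']).*?\1/gi;

// Illustrative one-line samples (assumed names, taken from the markup above).
const quotedMarkup =
  `<span style="font-size:0.58em;font-family:'Times New Roman';">foo</span>`;
const unquotedMarkup =
  `<span style=font-size:0.58em;font-family:'Times New Roman'>foo</span>`;

// match() with a global regex returns null when nothing matches,
// so fall back to an empty array.
const quotedMatches = quotedMarkup.match(regXQuotedStyleAttr) ?? [];
const unquotedMatches = unquotedMarkup.match(regXQuotedStyleAttr) ?? [];

console.log('quoted value ...', quotedMatches.length, 'match(es)');
console.log('unquoted value ...', unquotedMatches.length, 'match(es)');
```

    The parser based snippets above still return one entry per element for such markup, which is the tolerance the comment asks about.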