php, html, regex, html-content-extraction

How do I properly get all HTML elements inside a table using regex in PHP?


So I am using regex101.com to test my string and I can't get the output I need. The sample I made can be viewed here https://regex101.com/r/YQTW4c/2.

So my regex is this:

<table class=\"datatable\s\">(.*?)<\/table>

and the sample string:

<table class="datatable"><thead><tr><tr></thead></table>

I want to get everything inside the table with class datatable, which in this example is <thead><tr><tr></thead>.

Am I missing something here? Any help would be much appreciated.


Solution

  • Your problem (as described by regex101) is that

    "\s matches any whitespace character (equal to [\r\n\t\f\v ])"
    

    So your regex requires a whitespace character between the e in datatable and the ", which doesn't exist. If you want to allow for zero or more spaces between that e and the ", you need to change your regex to

    <table class=\"datatable\s*\">(.*?)<\/table>
    

    Note that escaping " in a regex is not necessary (I presume the backslashes are there because your regex is written inside a quoted string).

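    What others have been saying about not using regex to parse HTML is very true. For example, this regex will fail if two tables with class "datatable" are nested, because the lazy (.*?) stops at the first </table> it finds, which is the inner table's closing tag. It will also fail if a datatable is declared with additional classes (for example class="datatable striped"), and unless you add the s modifier, . will not match newlines, so a table spanning several lines will not match either. It is far better to use PHP tools built for the purpose, such as DOMDocument.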
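
As a rough illustration of the points above, here is a minimal sketch comparing the corrected regex with PHP's built-in DOMDocument and DOMXPath. The sample markup (with an extra "striped" class and a well-formed row) and the function name extractDatatableInnerHtml are assumptions made for this example, not part of the original question. Exact whitespace in the serialized output can vary with the libxml version, so treat the returned strings as approximate.

```php
<?php
// Hypothetical sample: the table carries an additional class ("striped").
$html = '<div><table class="datatable striped"><thead><tr><th>Name</th></tr></thead></table></div>';

// The corrected regex from the answer, with the s modifier so . also matches newlines.
$pattern = '/<table class="datatable\s*">(.*?)<\/table>/s';
// The extra class means the pattern no longer matches this markup.
$regexMatched = preg_match($pattern, $html, $m) === 1;

/**
 * Return the inner HTML of every <table> whose class list contains "datatable".
 * (Illustrative helper; the name is made up for this sketch.)
 */
function extractDatatableInnerHtml(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "datatable" as a whole class token, regardless of other classes.
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " datatable ")]';

    $results = [];
    foreach ($xpath->query($query) as $table) {
        $inner = '';
        // Serialize each child node to get the table's inner HTML.
        foreach ($table->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

$tables = extractDatatableInnerHtml($html);
var_dump($regexMatched);   // whether the regex approach found the table
print_r($tables);          // inner HTML found by the DOM approach
```

Because the XPath query selects table elements rather than scanning text for </table>, nested datatables are each returned as their own entry, and extra classes or line breaks inside the table do not affect the match.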