php, html, web-crawler, simpledom

How to find all elements with PHP Simple HTML DOM Parser?


// Find all elements that have an id attribute
$ret = $html->find('*[id]');

This example finds all elements that have an id attribute. Is there a way to find all elements, regardless of their attributes? I tried the following, but it does not work:

// Find all elements
$ret = $html->find('*');

Additional details:

I want to walk through every element in $html, parents and children alike. Example:

<div>
    <span>
        <div>World!</div>
        <div>
            <span>Hello!</span>
            <span>
                <div>Hello World!</div>
            </span>
        </div>
    </span>
</div>

Now I want to remove every <span> together with the plain text directly inside it, and keep every <div>. Expected result:

<div>
    <div>World!</div>
    <div>
        <div>Hello World!</div>
    </div>
</div>

Solution

    /**
     * Refine the input HTML (string) and keep only the specified elements.
     *
     * @param string $string  Input HTML
     * @param array  $allowed Tag names to keep
     * @return bool|simple_html_dom
     */
    function crl_parse_html($string, $allowed = array())
    {
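        // String --> DOM elements (str_get_html() returns false for empty or oversized input)
        $string = str_get_html($string);
        if ($string === false) {
            return false;
        }
        // Visit every element in the document (one by one)
        foreach ($string->find('*') as $child) {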
            if (
                // Current inner-text contains one or more elements
                preg_match('/<[^<]+?>/is', $child->innertext) and
                // Current element tag is in maintained elements array
                in_array($child->tag, $allowed)
            ) {
                // Assign current inner-text to current filtered inner-text
                $child->innertext = crl_parse_html($child->innertext, $allowed);
            } else if (
                // Current inner-text contains one or more elements
                preg_match('/<[^<]+?>/is', $child->innertext) and
                // Current element tag is NOT in maintained elements array
                !in_array($child->tag, $allowed)
            ) {
                // Assign current inner-text to the set of inner-elements (if exists)
                $child->innertext = preg_replace('/(?<=^|>)[^><]+?(?=<|$)(<[^\/]+?>.+)/is', '$1', $child->innertext);
                // Assign current outer-text to current filtered inner-text
                $child->outertext = crl_parse_html($child->innertext, $allowed);
            } else if (
                (
                    // Current inner-text is only plaintext
                    preg_match('/(?<=^|>)[^><]+?(?=<|$)/is', $child->innertext) and
                    // Current element tag is NOT in maintained elements array
                    !in_array($child->tag, $allowed)
                ) or
                // Current plain-text is empty
                trim($child->plaintext) == ''
            ) {
                // Assign current outer-text to empty string
                $child->outertext = '';
            }
        }
        return $string;
    }
    

    This is my own solution. I'm posting it here in case someone needs it, and to close this question.
    Note: this function is recursive and re-parses the inner HTML on every call, so very large inputs can be a real problem. Consider that carefully before deciding to use it.
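
For anyone worried about the recursion, here is a minimal non-recursive sketch of the same idea. It is not the function above and does not use Simple HTML DOM: it assumes PHP's bundled DOM extension (DOMDocument, DOMXPath) is available, and `strip_disallowed_elements` is a hypothetical name made up for this illustration. It takes one snapshot of all elements in document order, then replaces each non-allowed element with its child elements and drops the text that sat directly inside it.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM); uses PHP's bundled DOM extension.
function strip_disallowed_elements(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML encoding declaration is a common way to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // One snapshot of every element, in document order (parents before children).
    $xpath = new DOMXPath($doc);
    $elements = iterator_to_array($xpath->query('.//*', $body));

    foreach ($elements as $element) {
        if (in_array($element->nodeName, $allowed, true)) {
            continue; // allowed tag: keep it together with its own text
        }
        // Move child elements up into the parent; text nodes stay behind and are dropped.
        foreach (iterator_to_array($element->childNodes) as $child) {
            if ($child instanceof DOMElement) {
                $element->parentNode->insertBefore($child, $element);
            }
        }
        $element->parentNode->removeChild($element);
    }

    // Serialize only what is left inside <body>.
    $out = '';
    foreach ($body->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

// Usage with the markup from the question (whitespace removed for brevity).
$html = '<div><span><div>World!</div><div><span>Hello!</span>'
      . '<span><div>Hello World!</div></span></div></span></div>';
echo strip_disallowed_elements($html, array('div'));
```

Because every element is visited exactly once from a flat list, the depth of the markup does not affect the call stack. The serializer may insert line breaks between block elements, so compare results with whitespace between tags ignored.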