php html zend-framework2 domdocument phpquery

Keep only the content of a form tag from a whole HTML page


I'm using Zend Framework 2 and trying to extract the content of a <form> tag from a whole HTML page.

I'm scraping the page from a different site. The page takes some time to load, and a large full-page loader is shown in the meantime.

I have tried both DOMDocument and phpQuery, without success.

This is with DOMDocument:

$htmlForm = new \DOMDocument();
$htmlForm->loadHTML($formData);
$onlyForm = $htmlForm->getElementById('#Frmswift');
echo $htmlForm->saveHTML($onlyForm);

This is with phpQuery:

$doc = phpQuery::newDocument($formData);
$doc->find('#Frmswift')->parent()->siblings()->remove();
echo pq($doc)->html();

Where am I wrong?


Solution

  • EDIT

    Okay, check this YouTube video. It explains well how to use Chrome's developer tools, specifically the Network tab (Firefox works in much the same way). Since the page shows a loader first, the form is most likely fetched by a separate request after the initial page load, so it may not be in the HTML you scraped. Go to the website that holds the <form> from your question, right-click and choose Inspect Element, then:

    1. On the Network tab, filter the list to show only XHR requests.

    2. Go through the list of requests and check the result of each one in the Response sub-tab (in the video it is at the bottom right of the screen). You should be able to find which request returns the HTML of this form.

    3. Once you have found it, you know where the form is coming from. Select this request in the developer tools (still on the Network tab) and, again at the bottom right, go to the Headers sub-tab.

    4. Copy the Request URL. This is where the form HTML comes from.

    5. Check the Request Method.

      5.1. If it is GET, use PHP's $htmlForm = file_get_contents(URL from point 4); and proceed with the ORIGINAL POST, replacing $sampleHtml with $htmlForm.

      5.2. If it is POST, refer to this link or google search or this stackoverflow answer, and again proceed with the ORIGINAL POST using the result.
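
    Step 5 can be sketched roughly as below. This is only a sketch under a few assumptions: PHP 7.1 or newer, allow_url_fopen enabled, and a POST body sent as ordinary URL-encoded form fields. The function names buildHttpOptions and fetchFormHtml, the $fields array and the example URL are made up for illustration; the real URL, method and fields are whatever the Headers sub-tab shows.

```php
<?php
/**
 * Build stream-context options for a GET or POST request.
 * $fields is a hypothetical name => value map for the POST body.
 */
function buildHttpOptions(string $method, array $fields = []): array
{
    $http = ['method' => strtoupper($method), 'timeout' => 10];

    if ($http['method'] === 'POST') {
        // Assumption: the endpoint accepts URL-encoded form data.
        $http['header']  = "Content-Type: application/x-www-form-urlencoded\r\n";
        $http['content'] = http_build_query($fields);
    }

    return ['http' => $http];
}

/**
 * Fetch the HTML that the request returns, or null when the request fails.
 */
function fetchFormHtml(string $url, string $method = 'GET', array $fields = []): ?string
{
    $context = stream_context_create(buildHttpOptions($method, $fields));
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Usage (placeholder URL; use the Request URL copied in step 4):
// $htmlForm = fetchFormHtml('https://example.com/path/to/form', 'POST', ['key' => 'value']);
```

    If the endpoint also expects cookies or extra headers, they would have to be added to the header option as well; the Headers sub-tab shows what the browser sent.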

    ORIGINAL POST

    Hello, mate.

    I see a mistake in your code snippet: you don't need the # when using getElementById. The method expects the bare id value, not a CSS selector.
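
    To see the difference concretely, here is a minimal sketch. The one-line HTML string is made up for this illustration and is not the page from the question.

```php
<?php
// Hypothetical one-line page, made up for this illustration.
$html = '<html><body><form id="Frmswift"><input name="input1" type="text"></form></body></html>';

$dom = new \DOMDocument();
$dom->loadHTML($html);

// With the leading "#", no element has that id, so the lookup returns null.
$withHash = $dom->getElementById('#Frmswift');

// With the bare id, the <form> element is returned.
$withoutHash = $dom->getElementById('Frmswift');

var_dump($withHash === null);      // bool(true)
var_dump($withoutHash->nodeName);  // string(4) "form"
```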

    Check the following code snippet and let me know if it helps you (see the comments for details):

    $sampleHtml = ' 
        <!DOCTYPE html>
        <html>
        <head>
            <title>External Page Content</title>
        </head>
        <body>
            <h1>Some header</h1>
            <p>Some lorem text ....</p>
            <form id="Frmswift">
                <input name="input1" type="text">
                <input name="input2" type="text">
                <textarea name="mytextarea"></textarea>
            </form>
        </body>
        </html>';
    
    $dom = new \DOMDocument();
    $dom->loadHTML($sampleHtml);
    
    // When you use getElementById, do not put # in front of the id.
    // This method works like JavaScript's getElementById().
    $form = $dom->getElementById('Frmswift');
    
    // Use a second, blank document that will hold
    // the previously selected form.
    $blankDoc = new \DOMDocument();
    $blankDoc->appendChild($blankDoc->importNode($form, true));
    
    // htmlspecialchars is used here only to display the markup as text;
    // otherwise the browser would render the inputs. This is just
    // for testing purposes. I suppose what you need is $blankDoc,
    // which holds only the form.
    echo htmlspecialchars($blankDoc->saveHTML());
    exit;
    

    Output:

    <form id="Frmswift"> 
        <input name="input1" type="text">
        <input name="input2" type="text">
        <textarea name="mytextarea"></textarea>
    </form>
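
    As a possible alternative to the second document, DOMDocument::saveHTML() also accepts a node and then serializes only that subtree, which is what the code in the question was already attempting. The sketch below wraps that in a helper; the name extractElementHtml is hypothetical, and the libxml error handling is an addition of mine on the assumption that scraped pages often contain markup the parser warns about. It assumes PHP 7.1 or newer and a non-empty $html string.

```php
<?php
/**
 * Return the outer HTML of the element with the given id,
 * or null when no such element exists.
 */
function extractElementHtml(string $html, string $id): ?string
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new \DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementById($id);
    if ($node === null) {
        return null;
    }

    // Passing a node makes saveHTML() serialize only that subtree.
    return $dom->saveHTML($node);
}

// Usage: $onlyForm = extractElementHtml($sampleHtml, 'Frmswift');
```

    The null check matters for scraping: if the form is not in the fetched HTML at all (for example because it is loaded later by an XHR request), the helper returns null rather than failing on a missing node.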