javascript jquery node.js web-scraping

Get HTML between two tags


I am trying to fetch some HTML source from an internal forum. To stay independent, we are experimenting with Node.js, Express and similar tools.

When I open the page directly, I get the following HTML back:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="content-type" content="text/html; charset=us-ascii" />
    <meta name="description" content="myForum" />
    <meta name="viewport" content="width=320; user-scalable=no" />
    <title>myForum</title>
</head>

<body>
        <table>
            <tr>
                <td align="left" valign="top" width="100%">
                    <center>
                        <h1><img class="banner" src=
                        "./img/myForum.jpg" width="730"
                        height="117" border="0" alt="myForum" /></h1>
                    </center>
                    <hr />

                    <center>
                        [ <a href="answer.php?id=975710">Antworten</a> ]&nbsp;&nbsp;[
                        <a href="index.php">Forum</a> ]&nbsp;&nbsp;[ <a href=
                        "newEntries.php">Neue Beitr&auml;ge</a> ]
                    </center>
                    <hr />

                    <h1>sCHween</h1>geschrieben von&nbsp;<font color=
                    "#FFFFFF">User1</font>&nbsp;&nbsp;am&nbsp;18.06.2014&nbsp;um&nbsp;21:26:15
                    <hr />
                    This is my text! It could contain images and links!
                    <img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
                    <a href="http://www.google.com/">Google</a>
                    <br />
                    <hr />
                    <b>Antworten:</b><br />
                    <a href="thread.php?id=9752">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:56:27<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9756">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:14:44<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9753">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:02:21<br />
                    <a href="showentry.php?id=975713">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:46:13<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9720">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User3</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:22:25<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9755">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User4</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:52:51<br />
                    <hr />
                    <span>
                        <a href="answer.php?id=975">Antworten</a><br />
                        <a href="recent.php">Neue Beitr&auml;ge</a><br />
                    </span>
                    <hr />
                </td>
            </tr>
        </table>
</body>
</html>

What we want to extract is the HTML source between the third and fourth hr tags:

This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
<a href="http://www.google.com/">Google</a>

Is there an easy way to get the source between these two hr tags? If not, what would be the cleanest way to extract this content?


Solution

  • Not sure if this is what you want:

    jQuery:

    // Walk every child node of the cell, including bare text nodes.
    var allContent = $("td").contents();
    var hrCount = 0;
    var result = "";
    allContent.each(function () {
        if ($(this).is("hr")) {
            // Count the separators; the wanted content follows the third one.
            hrCount++;
        } else if (hrCount === 3) {
            // Between the third and fourth <hr />: keep the markup of
            // elements and the plain text of everything else.
            if (this.nodeType === 1) {
                result += this.outerHTML;
            } else {
                result += $(this).text();
            }
        }
    });

    alert(result);
    

    There must be a cleaner solution...
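
Since the question mentions Node.js, where no DOM is available by default, one possible alternative is to treat the page as a plain string and split it on the hr tags. The following is only a minimal sketch under some assumptions: `htmlBetweenHrs` and `sample` are hypothetical names that do not appear in the original answer, and the regular expression assumes that no hr attribute value contains a `>` and that no hr tag sits inside a comment or script. It is not a general HTML parser.

```javascript
// Hypothetical helper (not part of the original answer): returns the markup
// between the n-th and the (n+1)-th <hr> tag of an HTML string (n is 1-based),
// or null if there is no such pair of tags.
function htmlBetweenHrs(html, n) {
    // Matches <hr>, <hr/>, <hr /> and <hr ...attributes...>, case-insensitively.
    var hrPattern = /<hr\b[^>]*>/i;
    var segments = html.split(hrPattern);
    // segments[0] is the text before the first <hr>, and the last segment is
    // the text after the last <hr>, so only the ones in between qualify.
    if (n < 1 || n > segments.length - 2) {
        return null;
    }
    return segments[n].trim();
}

// Simplified stand-in for the forum page from the question (assumed data).
var sample = [
    "<h1>Title</h1>",
    "<hr />",
    "navigation",
    "<hr />",
    "author line",
    "<hr />",
    "This is my text!<br />",
    "<a href=\"http://www.google.com/\">Google</a>",
    "<hr />",
    "replies"
].join("\n");

// Prints the two lines between the third and fourth <hr />.
console.log(htmlBetweenHrs(sample, 3));
```

The string split keeps text, images and links exactly as they appear in the source, which the DOM walk above has to reassemble node by node. If the markup becomes less regular, a real HTML parser is likely the safer choice.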