javascript, node.js, web-scraping, x-ray

Break tags removed on X-ray scrape


I am new to JS. I am scraping a URL with X-ray. As expected, the tags are removed when scraped, but I want each <br> tag to be replaced with something like ;.

For example, if I scrape something like 'span#scraped-portion':

<span id="scraped-portion"><span class="bold">NodeJS</span><br>
    <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
</span>

I get a result similar to the following:

NodeJS \n Version: 8Date released: 2017 JanDescription: Some other text

The text around the <br> tags gets joined together, which makes it difficult to tell the fields apart. So I want each <br> tag to be replaced with something like ;.

Is this possible, or should I use another library?


Solution

  • UPDATE

    I found a pure X-Ray solution that does not require replacing the <br> tags in the HTML before passing it to X-Ray (see the original solution below).

    This approach uses X-Ray's filter functions and nests one X-Ray call inside another.

    First, we replace the <br> tags in the original HTML using a custom filter function (called replaceLineBreak) defined for X-Ray. Second, we rebuild the original HTML structure around the filtered result (by re-adding <span id="scraped-portion">) and pass it as the first argument of a second X-Ray call.

    Hope you'll like it!

    var Xray = require('x-ray');

    // Wrapped in a request handler (as in the original solution below) so that `res` is defined
    function tst(req, res) {
        var x = Xray({
            filters: {
                // Custom filter: replace every '<br>' with ';'
                replaceLineBreak: function (value) { return value.replace(/<br>/g, ';'); }
            }
        });
        var html =
        `
        <span id="scraped-portion"><span class="bold">NodeJS</span><br>
            <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
        </span>
        `;

        // Step 1: take the inner HTML of the span and run it through the filter
        x(html, '#scraped-portion@html | replaceLineBreak')(function (err, obj) {
            if (err) { return res.end(String(err)); }
            // Step 2: restore the original HTML structure (outer span with id 'scraped-portion') and scrape its text
            x(`<span id="scraped-portion">${obj}</span>`, '#scraped-portion')(function (err2, obj2) {
                if (err2) { return res.end(String(err2)); }
                res.header("Content-Type", "text/html; charset=utf-8");
                res.write(obj2);
                res.end();
            });
        });
    }
    

    This results in the following string:

    NodeJS;   Version: 8;Date released: 2017 Jan;Description:Some other text
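
    The whitespace around the separators in that string is uneven. If that matters, the result can be tidied up afterwards in plain JavaScript. The following is only a minimal sketch: tidyFields is a hypothetical helper name, not part of X-Ray, and it assumes ; never occurs inside the field text itself.

```javascript
// Hypothetical helper (not part of X-Ray): split the scraped text on the
// separator, collapse runs of whitespace, trim each piece, drop empty pieces.
function tidyFields(text, separator = ';') {
  return text
    .split(separator)
    .map((part) => part.replace(/\s+/g, ' ').trim())
    .filter((part) => part.length > 0);
}

// The string produced by the nested X-Ray calls above.
const scraped = 'NodeJS;   Version: 8;Date released: 2017 Jan;Description:Some other text';
const fields = tidyFields(scraped);

console.log(fields.join('; '));
// prints "NodeJS; Version: 8; Date released: 2017 Jan; Description:Some other text"
```

    Returning an array rather than a joined string leaves the choice of final separator to the caller.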
    

    ORIGINAL SOLUTION

    Why not replace all occurrences of <br> tags before X-Ray processes the HTML?

    var Xray = require('x-ray');

    function tst(req, res) {
        var x = Xray();
        // Replace every '<br>' with ';' before X-Ray sees the HTML
        var html =
        `
        <span id="scraped-portion"><span class="bold">NodeJS</span><br>
            <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
        </span>
        `.replace(/<br>/g, ';');

        // Scrape the text of every matching span (the array selector returns a list)
        x(html, ['span#scraped-portion'])(function (err, obj) {
            if (err) { return res.end(String(err)); }
            res.header("Content-Type", "text/html; charset=utf-8");
            res.write(JSON.stringify(obj, null, 4));
            res.end();
        });
    }
    

    Your code would then produce something like this

    NodeJS;\n Version: 8;Date released: 2017 Jan;Description:Some other text\n

    which seems to meet your requirements.
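
    Both solutions match only the literal string <br>. Real pages often also contain <br/>, <br /> or upper-case <BR>, so a slightly more tolerant pattern may be worth using. The following is a minimal sketch in plain JavaScript: replaceLineBreaks is a hypothetical helper name, and the pattern deliberately does not handle <br> tags that carry attributes.

```javascript
// Hypothetical helper: replace <br>, <br/>, <br /> (in any letter case)
// with a separator. Tags with attributes, e.g. <br class="x">, are NOT matched.
function replaceLineBreaks(html, separator = ';') {
  return html.replace(/<br\s*\/?>/gi, separator);
}

const sample = 'NodeJS<br>Version: 8<br/>Date released: 2017 Jan<BR />Description';
const replaced = replaceLineBreaks(sample);

console.log(replaced);
// prints "NodeJS;Version: 8;Date released: 2017 Jan;Description"
```

    The same function body could serve as the replaceLineBreak filter in the X-Ray based solution, or be applied to the HTML string before scraping as in the original one.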