How to scrape Hindi text from the web using PHP


I am trying to scrape data from a web page (the URL is in the code below) that is in Hindi, but I am getting a response like this:

\u093f\u0938\

How do I decode this Unicode? Please suggest what I should change in my PHP script.

The script works correctly with English text, and I have already scraped data with it, so what is going wrong with Hindi? I know this response is Devanagari Unicode, but how do I decode it?

I am new to PHP. Thanks in advance.

$i= 1;
for($i; $i < 6; $i++)
{
    $html file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $nodes = $dom->getElementsByTagName('p');
    $item = array();
    $articles = array();
    foreach ($nodes as $node) {
         $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
         $item['cat_id'] = 1;
         if($item['msg'] !="")
         $articles[] = array_unique($item);
    }
    $articles = json_encode($articles);
    print_r($articles);
}

Solution

  • PHPhil's answer is good and I upvoted it. However, running only the PHP part is not enough: the page also needs the right meta tag (see the code below) so that the browser displays the Devanagari properly. I also wanted to correct the missing "=" in the assignment to $html. Unfortunately my edit was rejected, so I am adding a new answer with the corrected code.

    <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
    <?php
    
    $i= 1;
    for($i; $i < 6; $i++)
    {
        $html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        $nodes = $dom->getElementsByTagName('p');
        $item = array();
        $articles = array();
        foreach ($nodes as $node) {
             $item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
             $item['cat_id'] = 1;
             if($item['msg'] !="")
             $articles[] = array_unique($item);
        }
        $articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
    //--------------------add-this---------------------^
        print_r($articles);
    }
    ?>
    </body>
    </html>
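
To see what the flag changes without hitting the site, here is a minimal sketch that uses a hard-coded sample instead of scraped data. The sample string and the variable names `$escaped`, `$readable` and `$decoded` are my own assumptions for illustration, not part of the original script. It shows that the `\uXXXX` output is valid JSON rather than corrupted text: `json_decode` turns it back into the original Devanagari, and `JSON_UNESCAPED_UNICODE` only changes how the string is written out.

```php
<?php
// Hypothetical sample standing in for one scraped <p> node (UTF-8 source file assumed).
$articles = [['msg' => 'हिंदी', 'cat_id' => 1]];

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($articles);

// With the flag: the characters are written as raw UTF-8.
$readable = json_encode($articles, JSON_UNESCAPED_UNICODE);

// The escaped form is still valid JSON; decoding restores the original text.
$decoded = json_decode($escaped, true);

echo $escaped, PHP_EOL;           // contains \u.... escapes
echo $readable, PHP_EOL;          // contains the Devanagari characters themselves
echo $decoded[0]['msg'], PHP_EOL; // the original Hindi string again
```

So if the JSON is consumed by another program, either form should work. The flag mainly matters when the JSON is printed directly for a person to read, as in the script above.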