linuxbashhttphttp-redirectx-robots-tag

Bash shell script to find Robots meta tag value


I've found this bash script to check status of URLs from text file and print the destination URL when having redirections :

#!/bin/bash
while read url
do
    dt=$(date '+%H:%M:%S');
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url" )
    echo "$url $urlstatus $dt" >> urlstatus.txt

done < $1

I'm not that good in bash : I'd like to add - for each url - the value of its Robots meta tag (if is exists)


Solution

  • Actually I'd really suggest a DOM parser (e.g. Nokogiri, hxselect, etc.), but you can do this for instance (Handles lines starting with <meta and "extracts" the value of the robots' attribute content):

    curl -s "$url" | sed -n '/\<meta/s/\<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^"]*\)"*\>/\1/p'
    

    This will print the value of the attribute or the empty string if not available.

    Do you need a pure Bash solution? Or do you have sed?