
How to extract HTML elements from inside the "content:encoded" part of an RSS feed?


I am trying to generate a newsletter that, among other things, includes news entries that also appear on the website. The website is built with WordPress and has an RSS feed, which is not actively used but now comes in handy for parsing the news entries.

I am writing a simple generator script in Bash using xmlstarlet. In particular, I am able to get the title, the description and the URL of each news entry (I iterate over them using $itemnum as the index):

TITLE=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "title" feed.xml);
DESC=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "description" feed.xml);
URL=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "link" feed.xml);

But now I also want to get the URL of the thumbnail and the date of the news entry. Those are basically two different questions, so I am only asking about the thumbnail URL here (regarding the date: it is easy to get from <pubDate>...</pubDate>, but it is not localized). The URL sits inside the <content:encoded>...</content:encoded> element, which contains a lot of different HTML tags.

I know that xmlstarlet has an HTML option, but I don't know how to use it when the HTML is embedded inside an XML element. If I try to parse the output of

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml | xmlstarlet sel -t -c "//img[@class='size-medium wp-image-2821 alignright'][1]"

it gives errors:

-:1.1: Start tag expected, '<' not found
&lt;div&gt;
^

The reason might be that when getting

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml

it translates all tag brackets < and > into &lt; and &gt; and I don't know how to work around it.

edit:

Here is what a news entry looks like:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    
    xmlns:georss="http://www.georss.org/georss"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    >

<channel>
    <title>This is the title</title>
    <atom:link href="https://link.to/feed" rel="self" type="application/rss+xml" />
    <link>https://website.url</link>
    <description>This is the description</description>
    <lastBuildDate>Wed, 20 Dec 2023 04:49:30 +0000</lastBuildDate>
    <language>de-DE</language>
    <sy:updatePeriod>
    hourly  </sy:updatePeriod>
    <sy:updateFrequency>
    1   </sy:updateFrequency>
    <generator>https://wordpress.org/?v=6.3.2</generator>
<site xmlns="com-wordpress:feed-additions:1">124249965</site>   

<item>
        <title>A title</title>
        <link>https://link.to/the-news-entry</link>
        
        <dc:creator><![CDATA[HP-Admin]]></dc:creator>
        <pubDate>Wed, 20 Dec 2023 04:49:30 +0000</pubDate>
                <category><![CDATA[Uncategorized]]></category>
        <guid isPermaLink="false">https://perma.link/p123</guid>

                    <description><![CDATA[a short description]]></description>
                                        <content:encoded><![CDATA[<p>A paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p><img decoding="async" fetchpriority="high" class="size-medium wp-image-2821 alignright" src="https://link.to/first-image.jpg" alt="" width="200" height="300" srcset="https://link.to/first-image.jpg 200w, https://link.to/first-image.jpg 683w, https://link.to/first-image.jpg 768w, https://link.to/first-image.jpg 1024w, https://link.to/first-image.jpg 1365w, https://link.to/first-image.jpg 367w, https://link.to/first-image.jpg 16w, https://link.to/first-image.jpg 24w, https://link.to/first-image.jpg 32w, https://link.to/first-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p><img decoding="async" class="size-medium wp-image-2820 alignright" src="https://link.to/second-image.jpg" alt="" width="200" height="300" srcset="https://link.to/second-image.jpg 200w, https://link.to/second-image.jpg 683w, https://link.to/second-image.jpg 768w, https://link.to/second-image.jpg 1024w, https://link.to/second-image.jpg 1365w, https://link.to/second-image.jpg 367w, https://link.to/second-image.jpg 16w, https://link.to/second-image.jpg 24w, https://link.to/second-image.jpg 32w, https://link.to/second-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
                    
        
        
        <post-id xmlns="com-wordpress:feed-additions:1">2818</post-id>  </item>
    </channel>
</rss>

Now I have noticed that neither of the two images is actually the header image I need. The URL of the header image does not appear in the feed XML at all, and I'm really puzzled why this happens.


Solution

  • To extract the simple variables, for example:

    # shellcheck shell=sh disable=SC2016
    
    xmlstarlet select -T -t \
      --var idx -o "${itemnum:-1}" -b \
      --var q1 -o "'" -b \
      -m '/rss/channel/item[$idx]' \
        -v 'concat("title=",$q1,str:replace(title,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
        -v 'concat("desc=",$q1,str:replace(description,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
        -v 'concat("url=",$q1,link,$q1)' -n \
        -v 'concat("pubdate=",$q1,pubDate,$q1)' -n \
    feed.xml
    

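    where -T (--text) selects plain-text output, so markup characters are not escaped as &lt; and &gt;; --var NAME -o VALUE -b defines an XPath variable from a shell value; and the EXSLT function str:replace rewrites each single quote as '\'' so that the resulting assignments can be evaluated by the shell.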

    Output (with the sample title changed to A 'modified' title to demonstrate the quoting):

    title='A '\''modified'\'' title'
    desc='a short description'
    url='https://link.to/the-news-entry'
    pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
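
Because every output line is a single-quoted shell assignment, the result can be loaded into variables with eval. The following is a minimal sketch under that assumption: it feeds the sample output from a here-document instead of calling xmlstarlet, and the variable name assignments is made up for illustration. Only title and desc go through str:replace in the command above, so a single quote inside url or pubdate would still break the eval.

```shell
#!/bin/sh
# Sample output copied from above. In a real script this text would come from
# the xmlstarlet command itself, e.g. assignments=$(xmlstarlet select ... feed.xml).
# "assignments" is a hypothetical variable name, not part of the original answer.
assignments=$(cat <<'EOF'
title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
EOF
)

# eval executes the assignments in the current shell. This relies on every
# value being single-quoted with embedded quotes escaped as '\''.
eval "$assignments"

printf 'Title: %s\n' "$title"
printf 'Desc:  %s\n' "$desc"
printf 'Link:  %s\n' "$url"
printf 'Date:  %s\n' "$pubdate"
```

After the eval, $title holds the unescaped text A 'modified' title, so the variables can be used like the TITLE, DESC and URL variables in the question.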
    

    With GNU date, the publication date can be converted to local time with, e.g., date -Isec -d "${pubdate}".
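
As a hedged sketch of the date conversion, the snippet below reformats the sample pubDate with GNU date. It assumes GNU coreutils (the -d option and the %-d format are not portable to BSD or busybox date). TZ=UTC is used only to keep the result reproducible, and the de_DE.UTF-8 locale is an assumption based on the feed's language; if that locale is not installed, date falls back to English names.

```shell
#!/bin/sh
# Requires GNU date; -d parses the RFC 822 style date used by RSS.
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'

# Numeric day.month.year, which does not depend on installed locales.
# Replace TZ=UTC with your own zone (or drop it to use the system zone).
numeric=$(TZ=UTC date -d "$pubdate" '+%d.%m.%Y %H:%M')

# Weekday and month names in the requested locale.
# de_DE.UTF-8 is an assumption; names stay English if it is unavailable.
long=$(LC_ALL=de_DE.UTF-8 TZ=UTC date -d "$pubdate" '+%A, %-d. %B %Y')

printf '%s\n' "$numeric"
printf '%s\n' "$long"
```

The numeric form is the safer choice for a generated newsletter, since it gives the same result whether or not the locale is installed on the machine running the script.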


  • To extract the image URLs from the embedded HTML, for example:

    # shellcheck shell=sh disable=SC2016
    
    xmlstarlet select -T -t \
      --var idx -o "${itemnum:-1}" -b \
      -v '/rss/channel/item[$idx]/content:encoded' \
    feed.xml |
    xmlstarlet format -R -H -D |
    # tee /dev/stderr |
    xmlstarlet select -T -t \
      --var cls -o "${class:-wp-image-2821}" -b \
      --var q1 -o "'" -b \
       -m 'str:split(//img[contains(@class,$cls)]/@srcset,",")' \
         --var url_sz='str:split(.," ")' \
         -v 'concat("url_",$url_sz[2],"=",$q1,$url_sz[1],"?width=",substring-before($url_sz[2],"w"),$q1)' -n
    

    Output:

    url_200w='https://link.to/first-image.jpg?width=200'
    url_683w='https://link.to/first-image.jpg?width=683'
    url_768w='https://link.to/first-image.jpg?width=768'
    url_1024w='https://link.to/first-image.jpg?width=1024'
    url_1365w='https://link.to/first-image.jpg?width=1365'
    url_367w='https://link.to/first-image.jpg?width=367'
    url_16w='https://link.to/first-image.jpg?width=16'
    url_24w='https://link.to/first-image.jpg?width=24'
    url_32w='https://link.to/first-image.jpg?width=32'
    url_1707w='https://link.to/first-image.jpg?width=1707'
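
If only one thumbnail URL is needed, one candidate can be picked from these lines with standard tools. The sketch below is an illustration rather than part of the original answer: it reads the sample output from a here-document (in a real script it would be the output of the pipeline above), and the names urls and widest are hypothetical. It selects the entry with the largest width.

```shell
#!/bin/sh
# Sample output copied from above; "urls" is a hypothetical variable name.
urls=$(cat <<'EOF'
url_200w='https://link.to/first-image.jpg?width=200'
url_683w='https://link.to/first-image.jpg?width=683'
url_768w='https://link.to/first-image.jpg?width=768'
url_1024w='https://link.to/first-image.jpg?width=1024'
url_1365w='https://link.to/first-image.jpg?width=1365'
url_367w='https://link.to/first-image.jpg?width=367'
url_16w='https://link.to/first-image.jpg?width=16'
url_24w='https://link.to/first-image.jpg?width=24'
url_32w='https://link.to/first-image.jpg?width=32'
url_1707w='https://link.to/first-image.jpg?width=1707'
EOF
)

# Turn each line into "<width> <url>", sort numerically by width,
# keep the last (largest) entry, and drop the width column again.
widest=$(printf '%s\n' "$urls" |
  sed -n "s/^url_\([0-9]*\)w='\(.*\)'\$/\1 \2/p" |
  sort -n |
  tail -n 1 |
  cut -d' ' -f2-)

printf '%s\n' "$widest"
```

Using head -n 1 instead of tail -n 1 would select the smallest candidate, which may be the better fit for a newsletter thumbnail.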