I have a function that retrieves a paragraph from a website. It strips the HTML tags, but the <a> tags are
still in the output.
function getExplanation ()
{
    link="https://apod.nasa.gov/apod/ap"$(echo $1 | cut -c 3-)".html"
    content=$(curl -s $link | sed -n -e '/<b> Explanation: <\/b>/,/<p> <center>/p' | sed -e 's/<[^>]*>//g')
    echo $content
}
getExplanation "20190102"
Explanation: The Great Nebula in Orion is an intriguing place. Visible to the unaided eye, it appears as a small fuzzy patch in the <a href="http://www.astro.wisc.edu/~dolan/constellations/ constellations/Orion.html">
constellation of Orion. But this image, an illusory-color four-panel mosaic taken in different bands of <a href="http://coolcosmos.ipac.caltech.edu/cosmic_classroom/ir_tutorial/" >
infrared light with the Earth orbiting <a href="http://www.nasa.gov/mission_pages/WISE/mission/index.html" >
WISE observatory, shows the Orion Nebula to be a bustling <a href="http://www.jpl.nasa.gov/news/news.php?release=2013-046" >
neighborhood of recently formed stars, hot gas, and dark dust. The power behind much of the Orion Nebula (M42) is the stars of the Trapezium star cluster, seen near the center of the featured image. The orange glow surrounding the bright stars pictured here is their own starlight reflected by intricate dust filaments that cover much of the region. The current Orion Nebula cloud complex, which includes the Horsehead Nebula, will slowly disperse over the next 100,000 years.
I've already tried sed 's/<a href=.*>//g' as well,
but the result is still the same.
Sometimes you cannot pipe the output directly into another command; in particular, I've run into issues with curl
in the past where it passes partial content. I believe the solution here is to separate your sed
commands instead of chaining them on a single line:
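# First pass: keep only the explanation paragraph.
html_content=$(curl -s "$link" | sed -n -e '/<b> Explanation: <\/b>/,/<p> <center>/p')
# Second pass: strip tags. $html_content is left unquoted so that echo joins its lines into one.
content=$(echo $html_content | sed -e 's/<[^>]*>//g')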
Try this and see if it helps.
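One possible reason the split version behaves differently: sed processes its input one line at a time, so 's/<[^>]*>//g' cannot match a tag whose < and > sit on different lines, and the unquoted echo joins the lines with spaces before the second sed runs. The spaces before > in your output suggest the <a> tags on that page are split this way, though I have not checked the page source. A minimal sketch of the effect, using a made-up two-line snippet (the sample text and the example.com URL are assumptions, not the real page):

```shell
#!/bin/sh
# Hypothetical sample: an <a> tag whose closing '>' is on the next line.
sample='in the <a href="http://example.com/"
>constellation</a> of Orion.'

# Line by line: only </a> is removed; the split <a ...> tag survives.
per_line=$(printf '%s\n' "$sample" | sed -e 's/<[^>]*>//g')

# Join the lines with spaces first (paste -s), then strip the tags.
joined=$(printf '%s\n' "$sample" | paste -s -d ' ' - | sed -e 's/<[^>]*>//g')

printf '%s\n' "$per_line"
printf '%s\n' "$joined"
```

If that is what is happening, joining the lines explicitly (with paste -s here, or tr '\n' ' ') before the tag-stripping sed should give the same result without relying on word splitting.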