I'm using the following Perl code to traverse and format some HTML:
#!/usr/bin/env perl
use v5.38;
use HTML::TreeBuilder;
my $indent = 3;
my $content = do {local $/; <DATA>};
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);
visit($tree);
sub visit($x) {
    my $depth = $x->depth;
    my $in = ' ' x ($indent * $depth);
    foreach my $e ($x->content_list) {
        # element
        if (ref ($e)) {
            say $in . $e->starttag;
            visit($e);
            say $in . $e->endtag;
        }
        # text
        else {
            say $in . $e;
        }
    }
}
__DATA__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<font size=3><strong>
5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
</strong></font>
<br>
<img src="poster.png" alt="poster/ad" title="poster/ad">
<i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
<br><br>
<font size=3><strong>
5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
</strong></font>
<br>
Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
<br>
<i>(*included on
<a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
</a>;
<a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
</a>;
<a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
</a>
)
</i>
<br><br>
My problem is that each <br> is output as:
<br />
</br>
Both <br /> and </br> cause new lines to be rendered. I was surprised that endtag generated anything at all in the case of tag br (and img).
I avoided using HTML::Tree::traverse because the doc discourages its use:
[I]f you want to recursively visit every node in the tree, it's almost always simpler to write a subroutine does just that, than it is to bundle up the pre- and/or post-order code in callbacks for the traverse method.
There are no examples given, so the above is what I cooked up.
Am I using starttag and endtag correctly? Should I detect when I'm displaying a tag that doesn't take an end tag and avoid calling endtag? What's the right/best/simplest way to traverse an HTML tree and prettify it?
Update:
As suggested by Stephen Ullrich, I tried to use as_HTML() for formatting:
#!/usr/bin/env perl
use v5.38;
use HTML::TreeBuilder;
say "\%HTML::Element::optionalEndTag= ",
join ', ', keys %HTML::Element::optionalEndTag;
my $content = do {local $/; <DATA>};
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);
# don't encode any entities; indent with three spaces
say $tree->as_HTML('', '   ');
__DATA__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<font size=3><strong>
5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
</strong></font>
<br>
<img src="poster.png" alt="poster/ad" title="poster/ad">
<i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
<br><br>
<font size=3><strong>
5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
</strong></font>
<br>
Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
<br>
<i>(*included on
<a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
</a>;
<a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
</a>;
<a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
</a>
)
</i>
<br><br>
Output:
%HTML::Element::optionalEndTag= dt, dd, li, p
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
</head>
<body><font size="3"><strong> 5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA </strong></font><br /><img alt="poster/ad" src="poster.png" title="poster/ad" /> <i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i><br />
<br /><font size="3"><strong> 5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA </strong></font><br /> Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway <br /><i>(*included on <a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead </a>; <a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry </a>; <a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape </a> ) </i><br />
<br />
</body>
</html>
Unfortunately, this isn't "pretty" enough. I don't understand why the indenting leaves off after the first couple of levels. However, I do note that it doesn't generate </br> or </img>, despite the fact that neither of these tags is mentioned in %HTML::Element::optionalEndTag!
Update 2:
(Although they are listed in %HTML::Tagset::emptyElement, which as_HTML checks.)
<br> and <img> (among others) are empty elements; they aren't intended to surround anything, so there is no point in having a separate end tag. Nevertheless, HTML::Element::endtag always generates the string </tag>, whether or not tag is an empty element.
(Note that starttag is smart enough to write <tag attr=... /> for empty tags like <img ... /> and <br />.)
Therefore the programmer must explicitly test whether or not an end tag is appropriate. Fortunately there's a variable, %HTML::Tagset::emptyElement, that maps each empty element to 1 (true).
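As a quick check of the behavior described above, the sketch below builds a few bare elements and prints what starttag and endtag return alongside the %HTML::Tagset::emptyElement lookup. The helper name needs_endtag is hypothetical (it is not part of HTML::Element), and the three tag names are just sample inputs.

```perl
use v5.36;
use HTML::Element;
use HTML::Tagset;

# Hypothetical helper (not part of HTML::Element): true when the element
# is not listed as an empty element and so should get an explicit end tag.
sub needs_endtag ($e) {
    return !$HTML::Tagset::emptyElement{ $e->tag };
}

# Print the start tag, the (unconditional) end tag, and the verdict
# for two empty elements and one ordinary element.
for my $name (qw(br img strong)) {
    my $e = HTML::Element->new($name);
    say join ' ', $name, $e->starttag, $e->endtag,
        needs_endtag($e) ? 'close it' : 'skip endtag';
}
```

Wrapping the lookup in a small sub keeps the traversal code readable and gives one place to change if you later decide to treat other tags specially.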
The following code will print the HTML supplied in the OP in a simple, indented format with each tag on a separate line.
#!/usr/bin/env perl
use v5.38;
use HTML::TreeBuilder;
my $indent = 3;
my $content = do {local $/; <DATA>};
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);
visit($tree);
sub visit($x) {
    use HTML::Tagset;
    my $depth = $x->depth;
    my $in = ' ' x ($indent * $depth);
    for my $e ($x->content_list) {
        if (ref ($e)) { # element
            say $in . $e->starttag;
            # empty elements (br, img, ...) get no children and no end tag
            if (! $HTML::Tagset::emptyElement{$e->tag}) {
                visit($e);
                say $in . $e->endtag;
            }
        }
        else { # text
            # for extra prettiness
            use Text::Wrap;
            $Text::Wrap::columns = 132;
            say wrap($in, $in, $e);
        }
    }
}
__DATA__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<font size=3><strong>
5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto, CA
</strong></font>
<br>
<img src="poster.png" alt="poster/ad" title="poster/ad">
<i>(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)</i>
<br><br>
<font size=3><strong>
5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo Park, CA
</strong></font>
<br>
Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The Railway
<br>
<i>(*included on
<a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">Before The Dead
</a>;
<a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">birthday doodle for Barbara by Jerry
</a>;
<a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">the master tape
</a>
)
</i>
<br><br>
Output:
<head>
<meta charset="utf-8" />
</head>
<body>
<font size="3">
<strong>
5/5/61 Bob & Jerry - Arroyo Lounge, Stanford University, Palo Alto,
CA
</strong>
</font>
<br />
<img alt="poster/ad" src="poster.png" title="poster/ad" />
<i>
(Robert Hunter and Jerry Garcia; source: McNally, Jackson research)
</i>
<br />
<br />
<font size="3">
<strong>
5/26/61 Bob & Jerry - Barbara Meier's 16th birthday party, Menlo
Park, CA
</strong>
</font>
<br />
Follow The Drinking Gourd, John Henry, Santy Anno*, Poor Paddy Works On The
Railway
<br />
<i>
(*included on
<a href="https://www.garciafamilyprovisions.com/product/JY148COMBO/before-the-dead-4cd-set?cp=640_62123_100764" target="_blank">
Before The Dead
</a>
;
<a href="https://gdsets.com/63posters/1961_05_26.jpg" target="_blank">
birthday doodle for Barbara by Jerry
</a>
;
<a href="https://gdsets.com/63posters/1961_05_26a.jpg" target="_blank">
the master tape
</a>
)
</i>
<br />
<br />
</body>
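A possible variation on the traversal, offered as a sketch rather than a definitive version: pass the depth down as a parameter instead of calling depth on every node, and return the lines instead of printing them, which makes the result easier to test or post-process. The name pretty_lines, its default arguments, and the short sample document are assumptions made for illustration; text wrapping is left out to keep it short.

```perl
use v5.36;
use HTML::TreeBuilder;
use HTML::Tagset;

# Hypothetical helper: returns the indented lines for everything below $node.
# $depth and $indent are passed explicitly rather than derived from the tree.
sub pretty_lines ($node, $depth = 0, $indent = 3) {
    my $in = ' ' x ($indent * $depth);
    my @lines;
    for my $child ($node->content_list) {
        if (ref $child) {                       # element
            push @lines, $in . $child->starttag;
            # empty elements have no content and no end tag
            next if $HTML::Tagset::emptyElement{ $child->tag };
            push @lines, pretty_lines($child, $depth + 1, $indent);
            push @lines, $in . $child->endtag;
        }
        else {                                  # text
            push @lines, $in . $child;
        }
    }
    return @lines;
}

# Sample input (an assumption, much shorter than the data in the question).
my $tree = HTML::TreeBuilder->new_from_content('<p>one<br>two <b>three</b></p>');
say for pretty_lines($tree);
$tree->delete;
```

Because the function returns a list, the caller decides whether to print it, join it into a string, or compare it against expected output.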