A basic example of parsing HTML in Perl

This program uses the HTML::TreeBuilder module to parse an HTML file into a tree structure, then walks the tree and prints it as an indented outline.

#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# This is the file we are going to read.

my $file = 'test.html';

# Parse all of the contents of $file.

my $parser = HTML::TreeBuilder->new ();
$parser->parse_file ($file) or die "Cannot parse '$file': $!";

# Now display the contents of $parser.

recurse ($parser, 0);

exit;

# Print $node and, recursively, any children it has. $depth is the
# nesting level, which sets the amount of indentation.

sub recurse
{
    my ($node, $depth) = @_;

    # Print indentation according to the level of recursion.

    print "  " x $depth;

    # If $node is a reference, then it is an HTML::Element.

    if (ref $node) {

        # Print the tag associated with $node, for example "html" or
        # "li".

        print $node->tag (), "\n";

        # $node->content_list () returns a list of child nodes of
        # $node, which we store in @children.

        my @children = $node->content_list ();
        for my $child_node (@children) {
            recurse ($child_node, $depth + 1);
        }
    }
    else {

        # If $node is not a reference, then it is just a piece of text
        # from the HTML file.

        print $node, "\n";
    }
}


On the following HTML:

<html>
<body>
<ol>
<li>Giant panda</li>
<li>Kangaroo</li>
</ol>
</body>
</html>

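it produces the output below, in which the head element appears even though the input has none, because HTML::TreeBuilder adds implied elements by default: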

html
  head
  body
    ol
      li
        Giant panda
      li
        Kangaroo
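
The same traversal also works on HTML held in a string rather than a file. The following is a minimal sketch, assuming the parse and eof methods that HTML::TreeBuilder provides for feeding in text directly. The outline subroutine and the @lines array are names invented for this illustration. It collects the indented lines in an array instead of printing them during the recursion, and should give the same outline as above.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# The same HTML as in test.html, held in a string.

my $html = <<'EOF';
<html>
<body>
<ol>
<li>Giant panda</li>
<li>Kangaroo</li>
</ol>
</body>
</html>
EOF

# Feed the string to the parser, then signal the end of the input.

my $tree = HTML::TreeBuilder->new ();
$tree->parse ($html);
$tree->eof ();

# Collect the outline (hypothetical helper, defined below) and print it.

my @lines = outline ($tree, 0);
print "$_\n" for @lines;

# Release the tree once it is no longer needed.

$tree->delete ();

# Return one indented line for $node followed by the lines for its
# children. $depth is the nesting level.

sub outline
{
    my ($node, $depth) = @_;
    my $indent = "  " x $depth;

    # A non-reference is a piece of text from the HTML.

    if (! ref $node) {
        return $indent . $node;
    }
    my @out = ($indent . $node->tag ());
    for my $child ($node->content_list ()) {
        push @out, outline ($child, $depth + 1);
    }
    return @out;
}
```

Returning the lines rather than printing them makes the result easy to check or to post-process, at the cost of holding the whole outline in memory.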

The documentation for HTML::TreeBuilder is spread over several pages, because the module inherits from both HTML::Parser and HTML::Element. The tag and content_list methods used in the example above are documented in HTML::Element#tag and HTML::Element#content_list respectively.
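
HTML::Element documents other methods that can replace a hand-written recursion when only certain elements are of interest. As a small sketch, assuming the new_from_content constructor of HTML::TreeBuilder and the look_down and as_text methods of HTML::Element, the text of each list item could be extracted like this (the @items array is a name chosen for this illustration):

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Build the tree directly from a string of HTML.

my $tree = HTML::TreeBuilder->new_from_content (
    '<html><body><ol><li>Giant panda</li><li>Kangaroo</li></ol></body></html>'
);

# look_down returns every element matching the criteria, here all
# elements whose tag is "li"; as_text returns the text inside each one.

my @items = map { $_->as_text () } $tree->look_down (_tag => 'li');
print "$_\n" for @items;

# Release the tree once it is no longer needed.

$tree->delete ();
```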