A basic example of parsing HTML in Perl
This program demonstrates the basic use of the HTML::TreeBuilder module to parse HTML and convert the parsed input into a tree structure.
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# This is the file we are going to read.
my $file = 'test.html';

# Parse all of the contents of $file.
my $parser = HTML::TreeBuilder->new ();
$parser->parse_file ($file);

# Now display the contents of $parser.
recurse ($parser, 0);
exit;

# This displays the contents of $node and any children it may
# have. The variable $depth is the indentation used.
sub recurse
{
    my ($node, $depth) = @_;
    # Print indentation according to the level of recursion.
    print " " x $depth;
    # If $node is a reference, then it is an HTML::Element.
    if (ref $node) {
        # Print the tag associated with $node, for example "html" or
        # "li".
        print $node->tag (), "\n";
        # $node->content_list () returns a list of child nodes of
        # $node, which we store in @children.
        my @children = $node->content_list ();
        for my $child_node (@children) {
            recurse ($child_node, $depth + 1);
        }
    }
    else {
        # If $node is not a reference, then it is just a piece of text
        # from the HTML file.
        print $node, "\n";
    }
}
On the following HTML:
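<html>
<body>
<ol>
<li>Giant panda</li>
<li>Kangaroo</li>
</ol>
</body>
</html>

it produces:

html
 head
 body
  ol
   li
    Giant panda
   li
    Kangaroo

The head element appears in the output although the input has none, because HTML::TreeBuilder adds implied elements such as head by default.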
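The same traversal can be tried without a separate test.html file. The following is a minimal sketch, assuming the new_from_content constructor of HTML::TreeBuilder is used to parse a string. The names $html, $tree, @lines and outline are invented here for illustration and are not part of the program above. Instead of printing as it goes, this variation returns the indented lines as a list.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# The same HTML as above, held in a string instead of a file.
my $html = <<'EOF';
<html>
<body>
<ol>
<li>Giant panda</li>
<li>Kangaroo</li>
</ol>
</body>
</html>
EOF

# new_from_content parses the string and completes the tree in one
# call.
my $tree = HTML::TreeBuilder->new_from_content ($html);

# Collect the indented lines, then print them.
my @lines = outline ($tree, 0);
print "$_\n" for @lines;

# Release the memory used by the tree.
$tree->delete ();

# Return one indented line for $node, followed by the lines for each
# of its children. (Hypothetical helper, named for this sketch only.)
sub outline
{
    my ($node, $depth) = @_;
    my $indent = ' ' x $depth;
    # A text node is a plain string, not a reference.
    if (! ref $node) {
        return $indent . $node;
    }
    my @out = ($indent . $node->tag ());
    for my $child ($node->content_list ()) {
        push @out, outline ($child, $depth + 1);
    }
    return @out;
}
```

Returning a list rather than printing directly makes it easier to check or post-process the result, at the cost of holding all the lines in memory.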
The documentation for HTML::TreeBuilder is scattered over a number of different pages, because the module inherits its methods from both HTML::Parser and HTML::Element. The tag and content_list methods used in the above example are documented in HTML::Element#tag and HTML::Element#content_list respectively.
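Other methods documented on the HTML::Element page can be called on the parsed tree in the same way as tag and content_list. As a rough illustration only, the sketch below uses look_down and as_text, both HTML::Element methods, to pull out the text of each li element. The HTML fragment and the variable names $tree, @items and @names are assumptions made for this example.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Parse a small fragment held in a string.
my $tree = HTML::TreeBuilder->new_from_content (
    '<ol><li>Giant panda</li><li>Kangaroo</li></ol>'
);

# In list context, look_down returns every element at or below $tree
# which matches the criteria, here all elements with the tag "li".
my @items = $tree->look_down (_tag => 'li');

# as_text returns the text contained in an element.
my @names = map { $_->as_text () } @items;
print "$_\n" for @names;

# Release the memory used by the tree.
$tree->delete ();
```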
Copyright © Ben Bullock 2009-2024. All
rights reserved.
For comments, questions, and corrections, please email
Ben Bullock
(benkasminbullock@gmail.com).