Regex for charset from an HTML page's meta tag

This page shows how to use a regular expression to extract the value of charset from the meta tag of an HTML page. The subroutine decode_charset ($text) expects the HTML in $text and returns the charset it finds, or UTF-8 if the meta tag does not match.

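The regular expression starts at m! and ends at !xi; the exclamation marks are its delimiters. The x flag tells Perl to ignore whitespace and # comments inside the pattern, which is what makes it possible to spread the regular expression over several lines and annotate each part. The i flag makes the match case-insensitive, so the pattern matches either small or capital letters.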
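The effect of the two flags can be seen in isolation with a much smaller pattern. The following is only a sketch: the sample string and the variable names are made up for illustration and are not part of decode_charset.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $text = 'CHARSET = utf-8';

# One-line form: without x, every character in the pattern counts.
my ($one_line) = $text =~ m!charset\s*=\s*([\w-]+)!i;

# Multi-line form: with x, whitespace and comments in the pattern are ignored.
my ($multi_line) = $text =~ m!charset \s* = \s*   # attribute name and equals sign
                              ([\w-]+)            # the value, captured
                             !xi;

# Without i, the lower-case pattern does not match the upper-case input.
my $matched_without_i = $text =~ m!charset!;

print "one-line: $one_line\n";
print "multi-line: $multi_line\n";
print "without i: ", ($matched_without_i ? "match" : "no match"), "\n";
```

Both forms capture the same value, since x changes only how the pattern is written, not what it matches. Comments inside an x pattern should avoid the delimiter character (here !) and the $ and @ sigils, because Perl looks for the closing delimiter and interpolates variables before it discards the comments.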

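sub decode_charset
{
    my ($text) = @_;
    # Default to UTF-8 if no matching meta tag is found.
    my $charset = 'UTF-8';
    # Sub-patterns reconstructed from the comments; adjust as needed.
    if ($text =~ m!(<\s*
                    # <meta
                    meta\s+
                    # http-equiv="Content-Type"
                    http-equiv\s*=\s*["']?Content-Type["']?\s+
                    # content = "mime/type
                    content\s*=\s*["']?[^;"'>]+;\s*
                    # charset = something ">
                    charset\s*=\s*([\w-]+)\s*["']?\s*/?\s*>)
        !xi) {
        # $1 is the whole tag; $2 is the charset name.
        $charset = $2;
    }
    return $charset;
}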
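
The subroutine reads $2 rather than $1 because Perl numbers capture groups by the position of their opening parenthesis. The group that opens at (< comes first and is $1, so a group nested inside it for the charset value is presumably the second one. A minimal sketch of that numbering, using a short sample tag and a simplified pattern (both are assumptions for illustration, not the full regular expression):

```perl
use strict;
use warnings;

# Hypothetical sample tag, chosen only for illustration.
my $tag = '<meta charset=EUC-JP>';

# Simplified pattern, not the full one: the outer parentheses open first,
# so they become group 1, and the inner parentheses become group 2.
my ($whole, $value);
if ($tag =~ m!(<\s* meta \s+ charset \s* = \s* ([\w-]+) \s* >)!xi) {
    ($whole, $value) = ($1, $2);
    print "group 1: $whole\n";
    print "group 2: $value\n";
}
```

Group 1 holds the whole tag and group 2 holds only the charset name, which is the part decode_charset returns.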

Copyright © Ben Bullock 2009-2023. All rights reserved.