Regex for charset from an HTML page's meta tag

This page shows how to use a regular expression to extract the charset value from the meta tag of an HTML page. The subroutine decode_charset ($text) takes the HTML in $text and returns the charset it finds there, or UTF-8 if it finds none.

In the subroutine, the regular expression starts at m! and ends at !xi. The x flag makes it possible to spread the regular expression over several lines: whitespace in the pattern is ignored, and anything from # to the end of a line is a comment. The i flag makes the match case-insensitive, so both META and meta match.
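To see the two flags on their own, here is a minimal sketch that is separate from decode_charset. The sample string and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, made up for this illustration.
my $sample = 'CharSet = UTF-8';

# No flags: the match fails, because "CharSet" contains capital letters.
my $plain = $sample =~ m!charset! ? 1 : 0;

# With i only, the pattern is written on a single line.
my $compact = $sample =~ m!charset\s*=\s*([\w-]+)!i ? $1 : undef;

# With x as well, whitespace and "#" comments in the pattern are ignored.
my $spread;
if ($sample =~ m!
        charset     # the attribute name
        \s* = \s*   # an equals sign, with optional spaces around it
        ([\w-]+)    # the value, captured as group one
    !xi) {
    $spread = $1;
}

print "$plain $compact $spread\n";
```

This prints "0 UTF-8 UTF-8": the pattern without flags does not match, and the one-line and multi-line patterns capture the same value.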

sub decode_charset
{
    my ($text) = @_;
    my $charset = 'UTF-8';
    if ($text =~ m!(<\s*
                    # <meta
                    meta\s+
                    # http-equiv="Content-Type"
                    http-equiv\s*=\s*["']?content-type["']?\s*
                    # content = "mime/type
                    content=["']?\w+/\w+["']?;\s*
                    # charset = something ">
                    charset=)([^"'\s/>]+)(["']?\s*/?>)
        !xi) {
        $charset = $2;
    }
    return $charset;
}
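The following is a rough sketch of how the subroutine might be called. The three sample pages are invented for this illustration, and the subroutine is repeated in a compact form only so that the script runs on its own. In this copy the captured value stops at a quote, whitespace, "/" or ">".

```perl
use strict;
use warnings;

# Compact copy of decode_charset so that this script runs on its own.
sub decode_charset
{
    my ($text) = @_;
    my $charset = 'UTF-8';
    if ($text =~ m!<\s*meta\s+
                   http-equiv\s*=\s*["']?content-type["']?\s*
                   content=["']?\w+/\w+["']?;\s*
                   charset=([^"'\s/>]+)["']?\s*/?>
        !xi) {
        $charset = $1;
    }
    return $charset;
}

# Hypothetical sample pages, made up for this illustration.
my @pages = (
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">',
    "<META HTTP-EQUIV='content-type' CONTENT='text/html;charset=Shift_JIS' />",
    '<html><head><title>No meta tag here</title></head></html>',
);

my @found;
for my $page (@pages) {
    push @found, decode_charset ($page);
}
print "$_\n" for @found;
```

This prints ISO-8859-1, Shift_JIS and UTF-8, one per line. The last page has no meta tag, so the default value is returned.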

Copyright © Ben Bullock 2009-2024. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com).