Regex for charset from an HTML page's meta tag
This example uses a regular expression to extract the value of charset from the meta tag of an HTML page. It is a subroutine, decode_charset ($text), which assumes that the HTML is in $text. If no charset is found, it returns the default, UTF-8.
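The regular expression starts at m! and runs to !xi. The x flag tells Perl to ignore whitespace and # comments inside the pattern, which is what makes it possible to spread the regular expression over several lines. The i flag makes the match case-insensitive, so the pattern matches both lower-case and upper-case letters.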
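As a small illustration of the two flags, the following sketch matches the same made-up tag twice, once with a single-line pattern and once with the pattern spread over several lines under x. The sample string and the variable names are assumptions for this example only, and the pattern is deliberately simpler than the one in decode_charset.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from a real page.
my $tag = '<META CHARSET="utf-8">';

# Without the x flag, the pattern has to be written on one line.
# The i flag lets "meta" match "META".
my $compact = $tag =~ m!<meta\s+charset=["']?([^"'>]+)!i ? $1 : undef;

# With the x flag, whitespace and "#" comments in the pattern are ignored,
# so the same pattern can be laid out over several lines.
my $spaced;
if ($tag =~ m!<meta \s+        # start of the tag
              charset=         # attribute name
              ["']?            # optional opening quote
              ([^"'>]+)        # the captured value
             !xi) {
    $spaced = $1;
}

print "$compact $spaced\n";
```

One thing to watch when commenting a pattern like this: a comment must not contain the delimiter character (here !), because Perl finds the end of the pattern before it looks at the comments.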
sub decode_charset
{
    my ($text) = @_;
    my $charset = 'UTF-8';
    if ($text =~ m!(<\s*
                   # <meta
                   meta\s+
                   # http-equiv="Content-Type"
                   http-equiv\s*=\s*["']?content-type["']?\s*
                   # content = "mime/type
                   content=["']?\w+/\w+["']?;\s*
                   # charset = something ">
                   charset=)([^"']+)(["']?\s*/?>)
                  !xi) {
        $charset = $2;
    }
    return $charset;
}
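The following is a usage sketch rather than part of the original listing. It repeats the subroutine so that it runs on its own, then calls it on three made-up HTML fragments; the fragments and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# The subroutine from above, repeated so this example is self-contained.
sub decode_charset
{
    my ($text) = @_;
    my $charset = 'UTF-8';
    if ($text =~ m!(<\s*
                   meta\s+
                   http-equiv\s*=\s*["']?content-type["']?\s*
                   content=["']?\w+/\w+["']?;\s*
                   charset=)([^"']+)(["']?\s*/?>)
                  !xi) {
        $charset = $2;
    }
    return $charset;
}

# Hypothetical inputs.
my $legacy = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">';
my $plain  = '<html><head><title>No meta tag here</title></head></html>';
my $short  = '<meta charset="iso-8859-1">';

print decode_charset ($legacy), "\n";   # charset taken from the meta tag
print decode_charset ($plain), "\n";    # no meta tag, so the default is returned
print decode_charset ($short), "\n";    # short form without http-equiv, also the default
```

The third case shows a limit of the pattern as written: it requires the http-equiv and content attributes, so the shorter meta charset form is not recognised and the subroutine falls back to UTF-8.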
Copyright © Ben Bullock 2009-2024. All rights reserved.
For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com).