I wrote a function to handle it:
    function getURLTitle($url){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        curl_close($ch);

        //fall back to the url if the request failed or the page has no title
        //(the "s" modifier lets the title span several lines)
        if(!is_string($content) || !preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $content, $matches)){
            return $url;
        }
        $title = trim($matches[1]);

        //order:
        //http header content-type
        //meta http-equiv content-type
        //meta charset
        $charsetPattern = '/\bcharset\s*=\s*["\']?([\w.:-]+)/i';
        $charset = '';
        if($contentType && preg_match($charsetPattern, $contentType, $ms)){
            $charset = $ms[1];
        }
        if(!$charset && preg_match_all('/<meta\b[^>]*>/i', $content, $matches)){
            //$matches[0] holds the full text of every meta tag
            foreach($matches[0] as $match){
                //strpos/stripos can return 0, so compare against false explicitly
                if(stripos($match, 'content-type') !== false && preg_match($charsetPattern, $match, $ms)){
                    $charset = $ms[1];
                    break;
                }
            }
            if(!$charset){
                //meta charset=utf-8
                //meta charset='utf-8'
                foreach($matches[0] as $match){
                    if(preg_match($charsetPattern, $match, $ms)){
                        $charset = $ms[1];
                        break;
                    }
                }
            }
        }
        if($charset){
            //iconv returns false on an unknown charset; keep the raw title in that case
            $converted = @iconv($charset, 'utf-8', $title);
            if($converted !== false){
                $title = $converted;
            }
        }
        return $title;
    }
It fetches the webpage content and tries to determine the document's character encoding by checking, from highest priority to lowest:
- An HTTP "charset" parameter in a "Content-Type" field.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
- The charset attribute set on an element that designates an external resource.
(see http://www.w3.org/TR/html4/charset.html)
It then uses iconv to convert the title to UTF-8.
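The lookup order can be tried without a network request. This is only a minimal sketch: `detectCharset` is a hypothetical helper name (it is not part of the function above), and it assumes the header value and the HTML are already available as strings.

```php
<?php
// Hypothetical helper: returns the first charset found, checking the
// sources in priority order, or '' when none of them declares one.
function detectCharset($contentTypeHeader, $html)
{
    $pattern = '/\bcharset\s*=\s*["\']?([\w.:-]+)/i';

    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ($contentTypeHeader && preg_match($pattern, $contentTypeHeader, $m)) {
        return strtolower($m[1]);
    }

    // Collect every <meta ...> tag; $tags[0] holds the full tag text.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return '';
    }

    // 2. <meta http-equiv="Content-Type" content="...; charset=...">
    foreach ($tags[0] as $tag) {
        if (stripos($tag, 'content-type') !== false && preg_match($pattern, $tag, $m)) {
            return strtolower($m[1]);
        }
    }

    // 3. <meta charset="...">
    foreach ($tags[0] as $tag) {
        if (preg_match($pattern, $tag, $m)) {
            return strtolower($m[1]);
        }
    }

    return '';
}

$html = '<html><head><meta charset="gbk"><title>x</title></head></html>';

// The header wins when it declares a charset.
var_dump(detectCharset('text/html; charset=ISO-8859-1', $html)); // string(10) "iso-8859-1"

// Without a charset in the header, the meta tag is used.
var_dump(detectCharset('text/html', $html)); // string(3) "gbk"

// iconv then converts from the detected charset to UTF-8.
$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
var_dump(iconv('ISO-8859-1', 'UTF-8', $latin1) === "caf\xC3\xA9"); // bool(true)
```

Keeping the detection separate from the cURL call makes the priority rules easy to check against sample pages.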