Get title of website via link

Notice how Google News has sources on the bottom of each article excerpt.

The Guardian - ABC News - Reuters - Bloomberg

I'm trying to imitate that.

For example, upon submitting the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ I want to return The Washington Times

How is this possible with PHP?

Lenoralenore answered 3/12, 2010 at 19:0 Comment(1)
Google News probably manages a lookup table for known domains, and perhaps analyzes the HTML for unknown ones. A lookup table should be trivial to implement, so I've submitted an answer that does the latter.All

My answer is expanding on @AI W's answer of using the title of the page. Below is the code to accomplish what he said.

<?php

function get_title($url){
  $str = @file_get_contents($url);
  if ($str !== false && strlen($str) > 0) {
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    if (preg_match('/\<title\>(.*?)\<\/title\>/i', $str, $title)) { // ignore case, non-greedy
      return $title[1];
    }
  }
  return null; // fetch failed or no <title> found
}
//Example:
echo get_title("http://www.washingtontimes.com/");

?>

OUTPUT

Washington Times - Politics, Breaking News, US and World News

As you can see, this is not exactly what Google displays, which leads me to believe that they take a URL's hostname and match it against their own list of publication names.

http://www.washingtontimes.com/ => The Washington Times
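
A minimal sketch of that idea. The hostname map below is hypothetical; you would maintain your own list of known domains:

```php
<?php
// Hypothetical lookup table mapping hostnames to publication names.
$sources = [
    'www.washingtontimes.com' => 'The Washington Times',
    'www.theguardian.com'     => 'The Guardian',
    'www.reuters.com'         => 'Reuters',
];

function source_name($url, array $sources) {
    $host = parse_url($url, PHP_URL_HOST); // e.g. "www.washingtontimes.com"
    return isset($sources[$host]) ? $sources[$host] : $host; // fall back to the raw hostname
}

echo source_name('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/', $sources);
// The Washington Times
```

For unknown hosts you could then fall back to fetching the page title as shown in this answer.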

Trait answered 3/12, 2010 at 19:20 Comment(11)
Thanks, the code works but how would you get the same main title if say the link was washingtontimes.com/news/2010/dec/3/… ? I think that's what AI W suggestedLenoralenore
You would use parse_url to get the hostname and use getTitle($host); instead.Vaso
any other way than parsing html with regex ?Taking
The pattern specified here needs to be improved, as this code won't work if the title tag has any attributes set, e.g. on facebook.comCulbreth
The regex matching ought to be: preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); Some sites have the <title> in all caps, so the check should ignore case.Heidiheidie
@Jose, how would you account for http 500 and other header errors. The function breaks if a page returns an error? Can you show how those conditions would be added to the if statement maybe with an if else else etc?Pasteur
Remember: file_get_contents() can work locally, so it could be a security risk, e.g. file_get_contents('./passwords.txt'). This function may only return the contents of <title>, but it could be used maliciously.Vehicle
Some websites don't allow file_get_contents() and produce an Access Denied error. I found a workaround by setting this - ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11'); More info hereConnaught
Make sure to make the regex non-greedy, since some websites use more than one <title> tag: preg_match("/\<title\>(.*?)\<\/title\>/i", $str, $title);Subdued
When I try to use that function with this page[1] it loads the whole content and not only the title.. zeit.de articleJubbah
This saved a startup. Cheers!Stringent
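
On the HTTP-error and Access Denied questions raised above: file_get_contents() populates $http_response_header in the calling scope, so you can inspect the status line before parsing. A minimal sketch; the user-agent string and function names are my own, not from this answer:

```php
<?php
// Parse the status code out of a header block like the one
// file_get_contents() leaves in $http_response_header.
function http_status_code(array $headers) {
    // The first line looks like "HTTP/1.1 200 OK".
    if (isset($headers[0]) && preg_match('{^HTTP/\S+\s+(\d{3})}', $headers[0], $m)) {
        return (int)$m[1];
    }
    return 0; // missing or unrecognised status line
}

function fetch_html($url) {
    $context = stream_context_create(['http' => [
        'user_agent'    => 'Mozilla/5.0 (compatible; TitleFetcher/1.0)', // avoids some Access Denied responses
        'ignore_errors' => true, // still return the body on 4xx/5xx so we can decide ourselves
    ]]);
    $html = @file_get_contents($url, false, $context);
    if ($html === false || http_status_code($http_response_header ?? []) >= 400) {
        return null; // network failure or an HTTP error page
    }
    return $html;
}
```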
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->nodeValue."\n";

Output:

Debt commission falls short on test vote - Washington Times

Obviously you should also implement basic error handling.
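
A sketch of what that error handling might look like; the null returns and libxml calls are my additions:

```php
<?php
function dom_title($source) {
    libxml_use_internal_errors(true); // collect HTML parse warnings instead of printing them
    $doc = new DOMDocument();
    $ok = @$doc->loadHTMLFile($source);
    libxml_clear_errors();
    if (!$ok) {
        return null; // fetch or parse failure
    }
    $nodes = (new DOMXPath($doc))->query('//title');
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}
```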

All answered 3/12, 2010 at 19:15 Comment(3)
@All When I changed the URL to facebook.com it is showing "Update Your Browser | Facebook". Is there any solution for this?Formenti
@Enve, without looking at it, I would assume it's because they are using a lot of Javascript to generate the page. The "Update Your Browser" is probably the default title. So you're probably out of luck in terms of any simple solution.All
Thanks! The accepted answer didn't work for me. It just returned localhost. This answer worked for me :)Owens

Using get_meta_tags() on the domain home page brings back something which might need truncating but could be useful.

$b = "http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/" ;

$url = parse_url( $b ) ;

$tags = get_meta_tags( $url['scheme'].'://'.$url['host'] );
var_dump( $tags );

includes the description 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.'
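
If the goal is a short source name rather than the full title or description, one rough heuristic (my assumption, not part of this answer) is to keep only the last segment after a common title separator:

```php
<?php
// Many pages format titles as "Article - Site Name"; take the last segment.
// This fails for sites that put their name first, so treat it as a guess.
function site_name_from_title($title) {
    $parts = preg_split('/\s+[|–-]\s+/u', $title);
    return trim(end($parts));
}

echo site_name_from_title('Debt commission falls short on test vote - Washington Times');
// Washington Times
```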

Synonymous answered 3/12, 2010 at 20:30 Comment(0)

You could fetch the contents of the URL and do a regular expression search for the content of the title element.

<?php
$urlContents = file_get_contents("http://example.com/");
preg_match("/<title>(.*)<\/title>/i", $urlContents, $matches);

print($matches[1] . "\n"); // "Example Web Page"
?>

Or, if you don't want to use a regular expression (to match something very near the top of the document), you could use a DOMDocument object:

<?php
$urlContents = file_get_contents("http://example.com/");

$dom = new DOMDocument();
@$dom->loadHTML($urlContents);

$title = $dom->getElementsByTagName('title');

print($title->item(0)->nodeValue . "\n"); // "Example Web Page"
?>

I leave it up to you to decide which method you like best.

Madisonmadlen answered 3/12, 2010 at 19:3 Comment(2)
Aaargh! Regexp... for... getting... data... from... HTMLLaodicea
@thejh: You don't know in general what kind of HTML pages are out there. I guess DOMDocument may have larger memory footprint than the regexp. (You may exceed PHP memory limit.) This is the case where it is maybe justifiable to use a regex or a simple strpos function.Geelong

PHP manual on cURL

<?php

$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>

PHP manual on Perl regex matching

<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>

And putting those two together:

<?php 
// create curl resource 
$ch = curl_init(); 

// set url 
curl_setopt($ch, CURLOPT_URL, "example.com"); 

//return the transfer as a string 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

// $output contains the output string 
$output = curl_exec($ch); 

$pattern = '/[<]title[>]([^<]*)[<][\/]titl/i';

preg_match($pattern, $output, $matches);

print_r($matches);

// close curl resource to free up system resources 
curl_close($ch);      
?>

I can't promise this example will work since I don't have PHP here, but it should help you get started.

Catchings answered 3/12, 2010 at 19:3 Comment(5)
A) Curl is overkill. B) Using regular expressions to parse HTML/XML is generally less reliable than using XPath queries or the DOM.All
For traversing a document definitely. However a title tag is simple to extract. Another concern is that XPath is for XML. Assuming that a webpage is well formed XML is a leap of faith, imho. I've only used DOMXPath once and I'm not sure how well it deals with a typical trainwreck of a webpage.Catchings
DOMDocument::loadHTML will do an adequate job of converting HTML into XML, especially for finding a single tag. Using regexp to find something as simple as a title tag isn't even as trivial as you may think. For instance, yours will fail with <title > due to the space. (If the XPath fails, you could always fall back to a regexp.)All
Yes, this is true. '/[<][ ]*title[ ]*[>]([^<]*)/i' Anything that will break that will most likely break any DOM parser that wasn't designed for use in a web browser.Catchings
Hmm.. while CURL works perfectly I agree that I can use something more simplified for retrieving a title. However I also want to avoid webpage errors. I'm in a dilemma..Lenoralenore
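
The fallback idea from the comments above, sketched out: try the DOM first and only fall back to a tolerant regex if that yields nothing. The function name and null convention are mine:

```php
<?php
function title_from_html($html) {
    if (is_string($html) && $html !== '') {
        $dom = new DOMDocument();
        if (@$dom->loadHTML($html)) {
            $nodes = $dom->getElementsByTagName('title');
            if ($nodes->length > 0) {
                return trim($nodes->item(0)->nodeValue);
            }
        }
        // Regex fallback: tolerant of attributes and stray spaces, e.g. "<title >".
        if (preg_match('/<\s*title[^>]*>(.*?)<\s*\/\s*title\s*>/is', $html, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}
```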

I try to avoid regular expressions when they aren't necessary, so I have made a function below that gets the website title with cURL and DOMDocument.

function website_title($url) {
   $ch = curl_init();
   curl_setopt($ch, CURLOPT_URL, $url);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   // some websites like Facebook need a user agent to be set.
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
   $html = curl_exec($ch);
   curl_close($ch);

   $dom  = new DOMDocument;
   @$dom->loadHTML($html);

   $title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
   return $title;
}

echo website_title('https://www.facebook.com/');

above returns the following: Welcome to Facebook - Log In, Sign Up or Learn More

Wivern answered 18/9, 2014 at 23:0 Comment(0)

Alternatively you can use Simple Html Dom Parser:

<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');

echo $html->find('title', 0)->innertext . "<br>\n";

echo $html->find('div[class=entry-content]', 0)->innertext;
Rennarennane answered 3/12, 2010 at 19:25 Comment(4)
Hmm I never tried HTML dom Parser. It sure looks simpler. Tho I'm not sure if it takes longer to process compared to other methodsLenoralenore
@Lenoralenore It's much slower than DOMDocument (see here), but it runs without any PHP warning on this page (but I recommend konforce's solution with some error handling).Papeete
@IstvánUjj-Mészáros you can disable PHP warnings using LIBXML_NOWARNING | LIBXML_NOERROR options.Tauromachy
Example: @$doc->loadHTMLFile($link, LIBXML_NOWARNING | LIBXML_NOERROR);Tauromachy

I wrote a function to handle it:

function getURLTitle($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $content = curl_exec($ch);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    $charset = '';
    if ($contentType && preg_match('/\bcharset=([\w\-]+)/i', $contentType, $matches)) {
        $charset = $matches[1];
    }

    if (is_string($content) && preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $content, $matches)) {
        $title = $matches[1];

        if (!$charset && preg_match_all('/<meta\b[^>]*>/i', $content, $metaTags)) {
            // charset lookup order (highest priority first):
            //   1. HTTP header Content-Type
            //   2. meta http-equiv Content-Type
            //   3. meta charset
            foreach ($metaTags[0] as $tag) { // [0] holds the full tag matches
                $tag = strtolower($tag);
                if (strpos($tag, 'content-type') !== false && preg_match('/\bcharset=([\w\-]+)/', $tag, $ms)) {
                    $charset = $ms[1];
                    break;
                }
            }

            if (!$charset) {
                // <meta charset=utf-8>, <meta charset='utf-8'> or <meta charset="utf-8">
                foreach ($metaTags[0] as $tag) {
                    $tag = strtolower($tag);
                    if (preg_match('/\bcharset=[\'"]?([\w\-]+)/', $tag, $ms)) {
                        $charset = $ms[1];
                        break;
                    }
                }
            }
        }

        return $charset ? iconv($charset, 'utf-8', $title) : $title;
    }

    return $url;
}

It fetches the webpage content and tries to detect the document's character encoding, checking (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

(see http://www.w3.org/TR/html4/charset.html)

and then uses iconv to convert title to utf-8 encoding.
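
When no charset is declared anywhere, one possible fallback (my addition, not part of the function above; the candidate list is a guess) is to ask mb_detect_encoding() before converting:

```php
<?php
// Convert $text to UTF-8; if $charset is unknown, guess it.
// The candidate list is an assumption and mb_detect_encoding() is only a heuristic.
function to_utf8($text, $charset = '') {
    if ($charset === '') {
        $charset = mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1', 'Windows-1251'], true);
        if ($charset === false) {
            return $text; // give up, return unchanged
        }
    }
    return strcasecmp($charset, 'UTF-8') === 0 ? $text : iconv($charset, 'UTF-8//IGNORE', $text);
}
```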

Unmindful answered 3/12, 2012 at 3:54 Comment(0)

Get title of website via link and convert title to utf-8 character encoding:

https://gist.github.com/kisexu/b64bc6ab787f302ae838

function getTitle($url)
{
    // get html via url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // get title
    preg_match('/(?<=<title>).+(?=<\/title>)/iU', $html, $match);
    $title = empty($match[0]) ? 'Untitled' : $match[0];
    $title = trim($title);

    // convert title to utf-8 character encoding
    if ($title != 'Untitled') {
        preg_match('/(?<=charset\=).+(?=\")/iU', $html, $match);
        if (!empty($match[0])) {
            $charset = str_replace('"', '', $match[0]);
            $charset = str_replace("'", '', $charset);
            $charset = strtolower( trim($charset) );
            if ($charset != 'utf-8') {
                $title = iconv($charset, 'utf-8', $title);
            }
        }
    }

    return $title;
}
Fuhrman answered 13/7, 2013 at 7:22 Comment(0)

Simple but it takes some time:

$tags = get_meta_tags('https://google.com');
if (array_key_exists('title', $tags)) {
    # Do something with it
    echo nl2br("Page Title: $tags[title]\n");
}

I haven't tried the proposed answers by others here to compare for performance, but you should.
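
A minimal harness for such a comparison. The function and URL in the example comment are placeholders for whichever implementations you want to test:

```php
<?php
// Average wall-clock time of $fn over $runs calls.
function benchmark(callable $fn, $runs = 5) {
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $runs; // seconds per run
}

// e.g. printf("get_meta_tags: %.3fs\n", benchmark(function () { get_meta_tags('https://google.com'); }, 3));
```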

Bubalo answered 21/3, 2022 at 3:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.