How do I get the final, redirected, canonical URL of a website using PHP?

In the days of link shorteners and Ajax, there can be many links that ultimately point to the same content. I was wondering what the best way is to get the final, best link for a web site in PHP, hopefully with a library. I was unable to find anything on Google or GitHub.

I have seen this example code, but it doesn't handle things like rel="canonical" link tags or default SSL ports: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/

Facebook seems to handle this pretty well: you can see how it follows 301s, rel="canonical", etc. To see examples of the way Facebook handles it, use their Open Graph tool:

https://developers.facebook.com/tools/debug

and enter these links:

http://dlvr.it/xxb0W
https://twitter.com/#!/twitter/statuses/136946408275193856

Is there a PHP library out there that already has this pre-built, where it will check for these headers, resolve 301 redirects, parse rel="canonical", detect redirect loops and properly just grab the best resulting URL to use?

As an alternative, I am open to APIs that can be used, but would prefer something that runs on my own server.

Costanza answered 1/12, 2011 at 8:4 Comment(5)
Check this, #4455105 - Posturize
I don't know if I understand your question, but I think you should check this: php.net/manual/es/reserved.variables.server.php - Swafford
Thanks Srisa, that is the general idea, but curl does not follow meta tag redirects, as the accepted answer notes... The solution is going to require some parsing of the HTML for the final redirected link, and then potentially more redirects until a loop is located or we reach the end of the redirect and rel="canonical" chain... Was just hoping someone already wrote this so I don't have to. :) - Costanza
PHP HTML Parser - Transponder
Thanks guys, I know how to parse the HTML or use preg_match() to just quickly pull that tag out. Maybe it's overkill to be looking for a library, but I was really hoping there was someone out there who had taken the time to do this "right"... For instance, even taking into account the hashbang and Google's escaped fragment code (and maybe other things I haven't even thought of relating to URL redirection). - Costanza

Since I wasn't able to find any libraries that really did what I was looking for, and I was hoping to do more than just follow HTTP redirects, I have gone ahead and created a library that accomplishes the goals and released it under the MIT license. You can get it here:

https://github.com/mattwright/URLResolver.php

URLResolver.php is a PHP class that attempts to resolve URLs to a final, canonical link:

  • Follows 301 and 302 redirects found in HTTP headers
  • Follows Open Graph URL <meta> tags found in web page <head>
  • Follows Canonical URL <link> tags found in web page <head>
  • Aborts download quickly if content type is not an HTML page

I am certainly not an expert on the rules of HTTP redirection, so if anyone has suggestions on how to improve this library, it would be greatly appreciated. I have tested it on thousands of URLs and it seems to do pretty well. I followed Mario's advice and used the PHP Simple HTML DOM Parser library where needed.
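
For readers who only need the <head> parsing step, here is a minimal sketch of how the canonical and Open Graph lookups in the list above could be done with PHP's built-in DOM extension. It is not taken from URLResolver.php: the function name find_declared_url() is hypothetical, and preferring rel="canonical" over og:url is an assumption made for this example.

```php
<?php
/**
 * Hypothetical helper (not part of URLResolver.php): returns the URL a page
 * declares for itself in <head>, or null if it declares none.
 */
function find_declared_url(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // <link rel="canonical" href="..."> (rel may hold several tokens)
    foreach ($xpath->query('//head/link[@href]') as $link) {
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('canonical', $rels, true)) {
            return trim($link->getAttribute('href'));
        }
    }

    // <meta property="og:url" content="...">
    foreach ($xpath->query('//head/meta[@content]') as $meta) {
        if (strtolower(trim($meta->getAttribute('property'))) === 'og:url') {
            return trim($meta->getAttribute('content'));
        }
    }

    return null;
}
```

The returned value may be a relative URL, so a real resolver would still have to resolve it against the page URL and then fetch it again to see whether it redirects further.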

Costanza answered 4/12, 2011 at 7:48 Comment(0)

Using Guzzle (a well-known and robust HTTP client), you can do it like this (the code uses the Guzzle 3 API):

<?php
use Guzzle\Http\Client as GuzzleClient;
use Guzzle\Plugin\History\HistoryPlugin;

// Declared as a plain function; add "public" back if you paste it into a class.
function resolveUrl($url)
{
    $client   = new GuzzleClient($url);
    $history  = new HistoryPlugin();
    $client->addSubscriber($history);

    $response = $client->head($url)->send();

    if (!$response->isSuccessful()) {
        throw new \Exception(sprintf("Url %s is not a valid URL or website is down.", $url));
    }

    return $response->getEffectiveUrl();
}
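
If you would rather not depend on Guzzle, roughly the same idea can be sketched with PHP's cURL extension. This is only an illustration: the function name resolve_url_with_curl(), the 10-second timeout and the default redirect limit are assumptions, not something from the answer above. Like the Guzzle version, it follows HTTP redirects only and sends a HEAD request, which some servers reject.

```php
<?php
/**
 * Hypothetical cURL-based equivalent: follows HTTP redirects and returns
 * the last URL cURL ended up requesting.
 */
function resolve_url_with_curl(string $url, int $maxRedirects = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,          // HEAD request, no body download
        CURLOPT_FOLLOWLOCATION => true,          // follow Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop on long chains or loops
        CURLOPT_RETURNTRANSFER => true,          // do not echo the response
        CURLOPT_TIMEOUT        => 10,            // seconds (arbitrary choice)
    ]);

    if (curl_exec($ch) === false) {
        throw new RuntimeException(
            sprintf('Could not resolve %s: %s', $url, curl_error($ch))
        );
    }

    // The URL of the final request after all redirects were followed.
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

When the redirect limit is exceeded, curl_exec() returns false, so a redirect loop surfaces as the RuntimeException rather than as an endless request.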
Devalue answered 24/7, 2014 at 13:10 Comment(0)

I wrote you a little function to do it. It's simple, but it may be a starting point for you. Note: the http://dlvr.it/xxb0W URL returns an invalid URL in its Location response header.

You'll need the Altumo PHP library for it to work. It's a library that I wrote, but it's MIT-licensed, as is this function.

See: https://github.com/homer6/altumo

Also, you'll have to wrap the function call in a try/catch.

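/**
* Gets the final URL of a URL that will be redirected.
* 
* @param string $url_string
* @param int $max_redirects             //maximum number of Location hops to follow
* @throws \Exception                    //on error, or if there are too many redirects
* @return string
*/
function get_final_url( $url_string, $max_redirects = 10 ){

    //cap the number of hops so that a redirect loop cannot run forever
    for( $hops = 0; $hops <= $max_redirects; $hops++ ){

        //validate URL
            $url = new \Altumo\String\Url( $url_string );

        //get the Location response header of the URL
            $client = new \Altumo\Http\OutgoingHttpRequest( $url_string );
            $response = $client->sendAndGetResponseMessage();
            $location = $response->getHeader( 'Location' );

        //return the URL if no Location header was found, else continue
            if( is_null($location) ){
                return $url_string;
            }else{
                $url_string = $location;
            }

    }

    //a chain this long is most likely a redirect loop
        throw new \Exception( 'Too many redirects while resolving: ' . $url_string );

}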

echo get_final_url( 'your url here' );
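
Since the question also asks about detecting redirect loops, here is a rough sketch of one way to do that: remember every URL already visited and stop as soon as one repeats. It does not use the Altumo library. The function name follow_redirects() and the $getLocation callback (which takes a URL and returns its Location header, or null when there is none) are hypothetical and exist only so the logic can be shown and tried without network access.

```php
<?php
/**
 * Hypothetical sketch: follows Location headers and detects loops.
 *
 * @param string   $url         starting URL
 * @param callable $getLocation function(string $url): ?string
 * @param int      $maxHops     maximum number of redirects to follow
 * @throws RuntimeException     on a loop or an overly long chain
 */
function follow_redirects(string $url, callable $getLocation, int $maxHops = 10): string
{
    $seen = [];

    for ($hop = 0; $hop <= $maxHops; $hop++) {
        // Visiting the same URL twice means the chain can never end.
        if (isset($seen[$url])) {
            throw new RuntimeException("Redirect loop detected at $url");
        }
        $seen[$url] = true;

        $next = $getLocation($url);
        if ($next === null) {
            return $url; // no Location header: this is the final URL
        }
        $url = $next;
    }

    throw new RuntimeException("Gave up after $maxHops redirects");
}

// Usage with a fake lookup table standing in for real HTTP requests.
$map = ['http://a.test/' => 'http://b.test/'];
$lookup = function (string $u) use ($map) {
    return $map[$u] ?? null;
};
echo follow_redirects('http://a.test/', $lookup); // prints http://b.test/
```

In real use the callback would send the HTTP request and read the Location header, for example with the library calls shown above.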

Please let me know if you'd like further modifications or help getting it going.

Mcclung answered 3/12, 2011 at 4:49 Comment(2)
Thanks Homer, I appreciate the effort. Since I am not getting any library suggestions, I decided to start writing my own and I will post it here (and on GitHub) when it is done in the next couple of days... I am actually looking for something a little more advanced than following just Location header redirects. I want it to parse the page's <head> to get canonical and Open Graph URLs, follow those, etc. The library is up to around 500 lines of code so far, but it is close to working as I desire. :) - Costanza
Sounds good Matt... looking forward to seeing what you have. Cheers. - Mcclung
