Counting number of views for a page ignoring search engines?

6

10

I notice that StackOverflow has a views count for each question and that these view numbers are fairly low and accurate.

I have a similar thing on one of my sites. It basically logs a "hit" in the backend code whenever the page is loaded. Unfortunately, it also does this for search engine hits, giving bloated and inaccurate numbers.

I guess one way to not count a robot would be to do the view counting with an AJAX call once the page has loaded, but I'm sure there are other, better ways to ignore search engines in your hit counters whilst still letting them in to crawl your site. Do you know any?

Pulchia answered 5/9, 2008 at 13:31 Comment(0)
G
5

An AJAX call will do it, but search engines usually do not load images, JavaScript, or CSS files, so it may be easier to include one of those files in the page and pass the URL of the page you want to log a request against as a parameter in the file request.

For example, in the page...

http://www.example.com/example.html

You might include in the head section

<link href="empty.css?log=example.html" rel="stylesheet" type="text/css" />

And have your server side log the request, then return an empty CSS file. The same approach would apply to a JavaScript or image file, though in all cases you'll want to look carefully at what caching might take place.
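As a rough sketch of the server side, assuming a Node.js backend: the handler below counts the `log` parameter from the `empty.css` request and returns an empty stylesheet with caching disabled. The `hitCounts` map and `handleRequest` name are made up for illustration, and a real site would persist the counts somewhere durable.

```javascript
// In-memory hit counts keyed by page name (illustrative; a real site would use a database).
const hitCounts = new Map();

// Request handler with the (req, res) shape that Node's http.createServer expects.
// The /empty.css path and the "log" parameter mirror the <link> tag above.
function handleRequest(req, res) {
  const url = new URL(req.url, 'http://localhost'); // base is only needed for parsing
  if (url.pathname !== '/empty.css') {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
    return;
  }
  const page = url.searchParams.get('log');
  if (page) {
    hitCounts.set(page, (hitCounts.get(page) || 0) + 1);
  }
  // no-store asks browsers and proxies not to reuse the response,
  // so each page view produces a fresh request.
  res.writeHead(200, {
    'Content-Type': 'text/css',
    'Cache-Control': 'no-store',
  });
  res.end('');
}

// Wiring it up (not run here):
// require('http').createServer(handleRequest).listen(8080);
```

Disabling caching means one extra request per page view, which is the price of getting a hit logged every time.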

Another option would be to eliminate the search engines based on their user agent. There's a big list of possible user agents at http://user-agents.org/ to get you started. Of course, you could go the other way, and only count requests from things you know are web browsers (covering IE, Firefox, Safari, Opera and this newfangled Chrome thing would get you 99% of the way there).
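Both directions of the User-Agent idea can be combined in a few lines. This is only a sketch: the two patterns are short illustrative lists, not a complete catalogue of crawlers or browsers, and `shouldCountHit` is a hypothetical helper name.

```javascript
// Substrings that commonly appear in crawler User-Agent headers.
// Illustrative only, not exhaustive.
const BOT_PATTERN = /bot|crawler|spider|slurp/i;

// Tokens used by the browsers named above (the "only count known browsers" approach).
const BROWSER_PATTERN = /MSIE|Firefox|Safari|Opera|Chrome/;

// Returns true when a request with this User-Agent should increment the counter.
function shouldCountHit(userAgent) {
  if (!userAgent) return false;                   // no header: do not count
  if (BOT_PATTERN.test(userAgent)) return false;  // looks like a crawler
  return BROWSER_PATTERN.test(userAgent);         // count only recognised browsers
}
```

User-Agent headers are trivially spoofed, so this filters well-behaved crawlers rather than determined ones.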

Even easier would be to use a log analytics tool like awstats or a service like Google analytics, both of which have already solved this problem.

Guano answered 5/9, 2008 at 13:36 Comment(3)
We've changed our increment method to an ajax post - although users without javascript won't affect a question's view count, we didn't want to have a bot blacklist, either! – Miscible
Search engines do access css files: free-seo-news.com/newsletter246.htm ... also when you check some sites in google cache, they're styled, this confirms that they scan and save css files. – Piecemeal
I'm pretty sure search engines execute Javascript now – Sterner
S
2

To solve this problem I implemented a simple filter that would look at the User-Agent header in the HTTP request and compare it to a list of known robots.

I got the robot list from www.robotstxt.org. It's downloadable in a simple text-format that can easily be parsed to auto-generate the "blacklist".
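A sketch of that parsing step is below. It assumes the downloaded file uses one `field: value` pair per line and that the User-Agent lives in a field called `robot-useragent`; check the actual file before relying on either assumption. The function names are hypothetical.

```javascript
// Collect the values of every "robot-useragent:" line (assumed field name).
function parseRobotList(text) {
  const agents = [];
  for (const line of text.split(/\r?\n/)) {
    const match = /^robot-useragent:\s*(.+)$/i.exec(line);
    if (match) {
      const value = match[1].trim();
      if (value) agents.push(value.toLowerCase()); // skip entries with no value
    }
  }
  return agents;
}

// True when the request's User-Agent contains any listed robot string.
function isKnownRobot(userAgent, agents) {
  const ua = (userAgent || '').toLowerCase();
  return agents.some((agent) => ua.includes(agent));
}
```

Substring matching is the simplest choice here; entries in the list that are very short or generic may need to be pruned by hand to avoid false positives.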

Scourings answered 5/9, 2008 at 14:30 Comment(0)
U
1

You don't really need to use AJAX, just use JavaScript to add an iFrame off screen. KEEP IT SIMPLE

<script type="text/javascript">
document.write('<iframe src="myLogScript.php" style="visibility:hidden" width="1" height="1" frameborder="0"></iframe>');
</script>
Uncommonly answered 5/9, 2008 at 13:39 Comment(0)
V
1

An extension to Matt Sheppard's answer might be something like the following:

  <script type="text/javascript">
  var thePg=window.location.pathname;
  var theSite=window.location.hostname;
  var theImage=new Image();
  // Join the second parameter with "&"; the log excerpt below was captured with "?site=" here.
  theImage.src="/test/hitcounter.php?pg=" + thePg + "&site=" + theSite;
  </script>

which can be plugged into a page header or footer template without needing to substitute the page name server-side. Note that if you include the query string (window.location.search), a robust version of this should encode the string to prevent evildoers from crafting page requests that exploit vulnerabilities based on weird stuff in URLs. The nice thing about this vs. a regular <img> tag or <iframe> is that the user won't see a red x if there is a problem with the hitcounter script. In some cases, it's also important to know the URL that was seen by the browser, before any rewrites that happen server-side, and this gives you that. If you want it both ways, then add another parameter server-side that inserts that version of the page name into the query string as well.

An example of the log files from a test of this page:

10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/testpage.html HTTP/1.1" 200 306 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/hitcounter.php?pg=/test/testpage.html?site=www.home.***.com HTTP/1.1" 301 - "http://www.home.***.com/test/testpage.html" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
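For the encoding point mentioned above, one possible approach is to let `URLSearchParams` build the query string so every value is percent-encoded. The `/test/hitcounter.php` path and the `pg`/`site` names come from the snippet; the `qs` parameter and the `buildHitUrl` name are hypothetical additions for illustration.

```javascript
// Build the hit-counter URL with every parameter value encoded.
// "loc" is any object shaped like window.location.
function buildHitUrl(loc) {
  const params = new URLSearchParams({
    pg: loc.pathname,
    site: loc.hostname,
    qs: loc.search, // hypothetical extra parameter carrying the raw query string
  });
  return '/test/hitcounter.php?' + params.toString();
}

// In a browser (not run here):
// new Image().src = buildHitUrl(window.location);
```

The server still has to treat the decoded values as untrusted input; encoding only keeps them from breaking the structure of the request URL.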
Vow answered 14/9, 2008 at 2:38 Comment(0)
I
0

The reason Stack Overflow has accurate view counts is that it only counts each view/user once.
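A minimal sketch of counting each page/user pair once is shown below. It is not Stack Overflow's actual implementation; `visitorId` stands in for whatever identifies a visitor on your site (a session cookie or an IP address, for example), and the in-memory structures would normally be replaced by a database or cache.

```javascript
const seenViews = new Set();   // page/visitor pairs that were already counted
const viewCounts = new Map();  // page -> number of unique views

// Returns true if this call incremented the counter, false if it was a repeat.
function recordView(page, visitorId) {
  const key = JSON.stringify([page, visitorId]); // unambiguous composite key
  if (seenViews.has(key)) return false;
  seenViews.add(key);
  viewCounts.set(page, (viewCounts.get(page) || 0) + 1);
  return true;
}
```

A crawler that revisits the same page many times then adds at most one view instead of one per request.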

Third-party hit counter (and web statistics) applications often filter out search engines and display them in a separate window/tab/section.

Indistinguishable answered 5/9, 2008 at 13:33 Comment(0)
M
0

You are going to have to either do what you said in your question with AJAX, or exclude User-Agent strings that are known search engines. The only sure way to stop bots is with AJAX.
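The AJAX variant might look roughly like this. The `/api/view` endpoint and the JSON body shape are assumptions made for the example, and a crawler that executes JavaScript would still trigger it.

```javascript
// Send one POST to a hypothetical /api/view endpoint.
// fetchFn is injectable so the function can be exercised outside a browser.
function reportView(pagePath, fetchFn = (url, options) => fetch(url, options)) {
  return fetchFn('/api/view', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ page: pagePath }),
  });
}

// In a browser (not run here), fire it once the page has loaded:
// window.addEventListener('load', () => reportView(window.location.pathname));
```

Using POST keeps link prefetchers and most caches from triggering or swallowing the request.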

Marcelmarcela answered 5/9, 2008 at 13:34 Comment(0)
