Stop search engines from indexing specific parts of the page

I have a PHP page that renders a book of, let's say, 100 pages. Each page has a specific URL (e.g. /my-book/page-one, /my-book/page-two etc.).

When flipping the pages, I change the URL with the history API, using url.js.

Since all the book content is rendered on the server side, the problem is that all of it is indexed by search engines (I'm especially referring to Google), but under the wrong URLs (e.g. Google finds a snippet from page-two but the URL is page-one).

How can I stop search engines (at least Google) from indexing all the content on the page, and have them index only the visible book page?

Would it work if I rendered the content in a different way: for example, <div data-page-number="1" data-content="Lorem ipsum..."></div> and then converted that into the needed format on the JavaScript side? That would make the page slower, and in fact I'm not sure Google wouldn't index the JavaScript-generated content anyway.
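A rough sketch of that alternative (the element shape and function name here are hypothetical, just to illustrate the idea):

```javascript
// Hypothetical sketch: the server emits empty divs carrying their text in
// data-* attributes, and the client turns one such div's dataset back into
// visible markup for the page matching the current URL.
function expandPage(el) {
  // el stands in for a DOM element; only its dataset is used here
  return `<div data-page="${el.dataset.pageNumber}">${el.dataset.content}</div>`;
}
```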

My current code looks like this:

<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<div data-page="3" class="current-page">Page 3</div>
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>

The only visible div is the .current-page one. The same content is served on multiple URLs because that's needed so the user can flip between pages.

For example, /book/page/3 will render this piece of HTML while /book/page/4 renders the same thing, the only difference being the current-page class which is added to the 4th element.

Google did index the different URLs, but it did it wrong: for example, the snippet Page 5 links to /book/page/2, which shows the user Page 2 (not Page 5).

How can I tell Google (and other search engines) that I'm only interested in indexing the content in .current-page?

Bohemia answered 6/5, 2016 at 9:46 Comment(6)
You can use robots.txt to tell Google. AFAIK Google respects it. Most probably it would be better to build a sitemap.xml and tell Google what to index and what not. You can also use Google's Webmaster Tools to push the changes and see how Google is crawling your site.Hollis
The question is how? I'm not sure if any of these would work. In short, I serve the same HTML on different urls, but I show only a specific part of it depending on the url.Wolfson
Can you give an example of a URL that is wrongly indexed? Or do you make the change onClick on the element?Sector
@Sector Let's suppose I have Hello World on page 42 (under the URL /my-book/page/42). It's very possible that Google indexes this content under another URL (and obviously another page), for example /my-book/page/7. That happens because I serve the same content on multiple URLs. I have no idea how this can be fixed...Wolfson
Do you mean that /my-book/page/42 and /my-book/page/7 have the same content?Sector
@Sector Exactly. But the visible area to the user is different.Wolfson

Save the content in a JSON file which you do not render in the HTML. From the server, serve only the correct page: the content which is visible to the user.

When the user clicks the buttons (prev/next page links etc.), render the content you have in the JSON file using JavaScript and change the URL like you're already doing.

That way you know you always serve the right content from the server, and the Google bot will index the pages correctly.
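A minimal sketch of the client side of this, assuming a pages.json file holding an array of page strings, a #book container element, and the /my-book/page/<n> URL scheme from the question (all the names are illustrative, not from the answer):

```javascript
// Maps a page number to its path, matching the question's URL scheme.
function pageUrl(n) {
  return `/my-book/page/${n}`;
}

// Renders page n from the already-fetched JSON array of page strings and
// updates the address bar without a reload, as the question already does.
// doc and hist default to the browser globals but can be injected.
function flipTo(pages, n, doc = document, hist = history) {
  doc.querySelector('#book').textContent = pages[n - 1];
  hist.pushState({ page: n }, '', pageUrl(n));
}
```

The pages array itself would be loaded once, e.g. with `fetch('/my-book/pages.json').then(r => r.json())`, so the server never embeds the hidden pages in the HTML.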

Heathenish answered 17/5, 2016 at 4:11 Comment(1)
This doesn't seem likely to work. The rise of SPAs has made search engines put a lot of effort into indexing JS-generated content.Persuader

As I understand it, the issue is that you have the same content under many URLs, like:

www.my-awesome-domain.com/my-book/page/42

www.my-awesome-domain.com/my-book/page/7

And the visible content of the page is adjusted by JavaScript that executes when the user clicks some elements on your site.

In this case you need to do two things:

  1. Mark your URLs as canonical pages in any of the ways described in this Google document: https://support.google.com/webmasters/answer/139066?hl=en
  2. Add a feature so that each page loads to the same state after a full page refresh; for example, you can use a hash parameter when navigating, as described in the article here: or here is the overview of the technique
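For step 1, a self-referencing canonical tag on each served page tells Google which URL the visible content belongs to. A sketch with a hypothetical helper, using the example domain above (how you emit it into the page `<head>` depends on your templating):

```javascript
// Hypothetical helper: builds the <link rel="canonical"> tag each rendered
// page should include in its <head>, pointing at its own URL, so search
// engines associate the visible page's content with exactly that URL.
function canonicalTag(pageNumber) {
  const href = `https://www.my-awesome-domain.com/my-book/page/${pageNumber}`;
  return `<link rel="canonical" href="${href}">`;
}
```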

Today Googlebot executes JavaScript, as announced on their official blog: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

So if you achieve proper page behavior when hitting refresh (F5) and specify the canonical page property, the pages will be crawled correctly, and when you follow a link you will get to the linked page.

If you need more guidance on how to do it in url.js, please post another question (so it will be properly documented for others) and I will be glad to help.

Sector answered 8/5, 2016 at 13:9 Comment(14)
Can you give me an example of how the code would look? I'm not sure how canonical URLs would help here. How do I make the link between the URL and the right part of the page that is visible?Wolfson
A canonical URL will eliminate the penalty for duplicate content across many pages; you need to make one page per book list, and the others will be canonical to that page. What code do you use to hide and show the per-book content? I will suggest how to modify it.Sector
Let's suppose I have hidden divs and one of them is visible, containing the page content. I'm not sure what you mean by make 1 page per books list.Wolfson
OK, so make them visible on page load. Regarding "1 page per books list": do all pages have the same content? Or do you have, for example, a category that has those many divs and then one div is displayed per book?Sector
I can't make them visible because it's not what I want. I want to display one page, depending on the url and then allow the user to navigate through the pages.Wolfson
So, as I told you, make only one visible per URL, but load the page via hash-tag URLs... and change the hash tag on user click as well. Had you read this article that I linked, blog.mgm-tp.com/2011/10/… ?Sector
More info here: github.com/browserstate/history.js/wiki/…Sector
I still don't understand what you mean: the user does not click anything on the page. After opening the web page, the book page appears on the screen. The user has the possibility to go to the next/prev pages by clicking two buttons. When they do that, the pages are flipped and the URL is updated using HTML5 states. Summarizing: what do I have to do to fix the problem?Wolfson
What do you mean by "HTML5 states"? Make it use hash-tag navigation; this will solve the issue, as I wrote before.Sector
That's not possible because there are a lot of URLs that people access, so the book needs to keep these URLs. HTML5 history states allow us to change the pathname without reloading the page.Wolfson
What I say doesn't conflict with that... Each book will have, for example, mysuperbookstore.com/allbooks#bookTitle, where bookTitle is your div identifier, and on click you will change the URL to mysuperbookstore.com/allbooks#AnotherBookTitle. This way navigation and the back button will work, and SEO will work :-)Sector
No, it won't, because we need non-hash URLs (which are already heavily used by the users).Wolfson
Maybe I just don't understand your point, but I do not want to use hashes. I need pathnames.Wolfson
Let us continue this discussion in chat.Sector

The answer is really simple: you can't do it. There is no technical way to keep the same content under different URLs and ask search engines to index only part of it.

If you are OK with having only one page indexed you can use, as suggested before, canonical URLs. You place the canonical URL that links to the main page on every sub-page.

There is also a "hack" that uses special tags understood by the Google Search Appliance: googleon and googleoff.

https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/preparing.html
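Applied to the markup from the question, the tags would wrap the hidden pages like this (only the Google Search Appliance documents these comment tags; this is a sketch of the hack, not something Googlebot is guaranteed to honor):

```html
<!--googleoff: index-->
<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<!--googleon: index-->
<div data-page="3" class="current-page">Page 3</div>
<!--googleoff: index-->
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>
<!--googleon: index-->
```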

The only issue is that this will most likely not work with Googlebot (at least no one will guarantee that it will) or any other search engine.

Plagiarism answered 10/5, 2016 at 12:22 Comment(1)
I may fall back to rendering the content on user interaction (from JS), so there should be a solution anyway. I'm interested in the best one.Wolfson

I don't think you will be able to achieve what you are looking for.

I can't see how robots.txt would have any effect. Canonical tags don't work on divs.

Google has spoken about sites like these in the past and made some suggestions for indexing; here are a couple of links that may help:

https://www.seroundtable.com/seo-single-page-12964.html

https://www.seroundtable.com/google-on-crawling-javascript-sites-progressive-web-apps-21737.html

Misdoing answered 16/5, 2016 at 12:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.