Is there a way to make search bots ignore certain text? [closed]
C

9

41

I have a blog (you can find it through my profile). It is new, and so are the results of Google's robots parsing it.

The results alarmed me. Apparently the two most common words on my site are "rss" and "feed", because I use link text like "Comments RSS", "Post Feed", etc. These two words are present in every post, while other words are much rarer.

Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.

I found some old discussions on Google from back in 2007. (I think many things could have changed in 3 years, hopefully this too.)

This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of a page, or transforming those parts so that they are visible to humans but invisible to robots.

Crofoot answered 8/7, 2010 at 19:30 Comment(4)
similar question: webmasters.stackexchange.com/questions/16390/…Beneficent
I voted to close this question because it is not a programming question and it is off-topic on Stack Overflow. These days, non-programming questions about a website can be asked on Webmasters. In this case the question has already been asked and answered there: Preventing robots from crawling specific part of a pageNewsreel
The top two answers here about using googleoff and data-nosnippet are DANGEROUSLY WRONG. Neither of these two methods causes search bots to ignore the text.Newsreel
In the spirit of full disclosure: many comments arguing about the moderation of this question have been deleted. While we understand that it feels a bit weird to get a notification about a 12-year-old question being closed, that does not change the fact that this question is off-topic for Stack Overflow (certainly by today's standards), and we continue to enforce our standards, even on old questions. Having a question closed is not a punishment. Also, the way Stephen's comment was originally phrased was sub-par, since, as noted, Webmasters didn't really exist when this question was posted.Foray
U
18

There is a simple way to tell Google not to index parts of your documents: use googleon and googleoff.

<p>This is normal (X)HTML content that will be indexed by Google.</p>

<!--googleoff: index-->

<p>This (X)HTML content will NOT be indexed by Google.</p>

<!--googleon: index-->

In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:

  • index - content surrounded by “googleoff: index” will not be indexed by Google
  • anchor - anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
  • snippet - content surrounded by “googleoff: snippet” will not be used to create snippets for search results
  • all - content surrounded by “googleoff: all” is treated as if all of the above (index, anchor and snippet) were set

source

Unciform answered 23/9, 2014 at 15:1 Comment(3)
This is for the GSA (Google Search Appliance), not Googlebot. From Wikipedia (en.wikipedia.org/wiki/Noindex): Google's main indexing spider, Googlebot, is not known to recognize any of these techniques.Marj
The googleon and googleoff tags are only supported by the Google Search Appliance (when you host your own search results). So this won't prevent Googlebot from crawling that text.Demit
linkrot fix for the first link in these comments from @AlexanderMP web.archive.org/web/20121024043825/http://google.utk.edu/…Daff
B
14

Google ignores HTML tags which have data-nosnippet:

<p>
   This text can be included in a snippet
   <span data-nosnippet>and this part would not be shown</span>.
</p>

Source: Special tags that Google understands - Inline directives

Betsey answered 5/8, 2020 at 23:40 Comment(2)
Are you sure it excludes it from indexing or just excludes it from being shown as snippets? The source says: "you can exclude parts of an HTML page from snippets"Beneficent
data-nosnippet won't prevent indexing; it only prevents the text from showing up in the search results as part of the snippet.Newsreel
G
10

I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):

  • Move the stuff you want to downplay to the bottom of your HTML and use CSS to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
  • Replace those links with images (you say you don't want to do that, but don't explain why not)
  • Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
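As a rough illustration of the third option, the sketch below strips feed links from a page before it is served to a crawler. It is a naive regex-based example, not a robust HTML rewriter; the function name strip_feed_links and the sample markup are assumptions made for illustration, and the rest of the page is deliberately left unchanged so the content stays fundamentally the same.

```python
import re

# Hypothetical pattern: plain-text anchors whose visible text mentions "RSS" or "Feed".
# A regex is fragile on real-world HTML; a proper parser would be safer.
FEED_LINK = re.compile(
    r"<a\b[^>]*>[^<]*\b(?:rss|feed)\b[^<]*</a>",
    re.IGNORECASE,
)


def strip_feed_links(html: str) -> str:
    """Return the markup with RSS/feed anchors removed and everything else intact."""
    return FEED_LINK.sub("", html)


page = '<p>Hello world.</p><a href="/feed/">Post Feed</a><a href="/about/">About</a>'
print(strip_feed_links(page))
```

Only the technical links are dropped; headings, body text and ordinary navigation links pass through untouched.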

That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?

If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.

Gerrilee answered 9/7, 2010 at 5:29 Comment(1)
Thank you. It's just that I could make my blog appear in the top results if I search for a strange combination of category names and 2 post topics, plus the "rss" and "feed" keywords. Without "rss" and "feed" it's way down at the end. I'll read the rules again and pay attention to the clauses about serving slightly different content to bots.Crofoot
B
6

Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.

Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria live may help you get around this issue, but I'm not an expert on accessibility.

Options:

  • Use JavaScript to write that bit of content (maybe ajax it in after load). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
  • Re-word the content or remove duplicates of it; one prominent RSS feed link may be better than several smaller ones dotted around the page.
  • Use the CSS content property with the :before or :after pseudo-elements to add your content. I'm not sure whether bots will index words in CSS content properties and know that content's value in relation to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't add much/any value to it. For example, the HTML and CSS could be:

    <a href="/my-feed.rss" class="add-text"></a>
    
    .add-text:after { content:'View my RSS feed'; }
    

Note the above will not work in older versions of IE, so you may need some IE version comments if you care about that.

Banna answered 9/8, 2013 at 13:15 Comment(2)
I like this technique, it's clean.Magnetic
For anybody interested in compatibility with older IE versions, it seems to work with IE 8 - IE 11: caniuse.com/css-gencontentPolyglot
M
4

"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).

They are not supported by Google's web search at all, so please don't rely on them. I also think that answer should not be marked as correct, as it may cause confusion.

Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.

The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.

Middlesworth answered 24/1, 2019 at 14:0 Comment(0)
C
2

The only control you have over the indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.

You can basically prohibit certain links and URLs, but not necessarily keywords.
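To make the URL-level nature of robots.txt concrete, here is a small sketch using Python's standard urllib.robotparser module. The rules and the example.com URLs are made up for illustration; the point is only that a rule matches URL paths, so it can keep a crawler away from a feed URL but has no say over the words inside a page that is allowed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a blog with feed URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
Disallow: /comments/feed/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The feed URL itself is blocked for crawlers that honor robots.txt ...
print(parser.can_fetch("Googlebot", "https://example.com/feed/"))  # False
# ... but a post page is allowed, including any "RSS" link text it contains.
print(parser.can_fetch("Googlebot", "https://example.com/2010/07/my-post/"))  # True
```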

Create answered 8/7, 2010 at 19:51 Comment(1)
Yes, I know about robots.txt. That's implemented. Russian search engines provide certain tags, like <noindex></noindex>, and anything that's between gets ignored by the search engine. Yahoo provides something based on class names. Doesn't Google offer anything?Crofoot
M
1

Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
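One way to look into where those words come from is to count them separately for link text and for body text. The sketch below uses only Python's standard library; the TextCollector class and the SAMPLE markup are illustrative assumptions, not anything taken from the asker's blog, and the counts say nothing about how a search engine actually weights words.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Counts words in visible text, keeping link text apart from body text."""

    def __init__(self):
        super().__init__()
        self.in_link = 0  # nesting depth of <a> elements
        self.body_words = Counter()
        self.link_words = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        words = re.findall(r"[a-z']+", data.lower())
        target = self.link_words if self.in_link else self.body_words
        target.update(words)


# Hypothetical page fragment with two technical links.
SAMPLE = """
<h1>Parsing logs with awk</h1>
<p>Awk makes parsing logs quick.</p>
<a href="/feed/">Post Feed</a> <a href="/comments/feed/">Comments RSS</a>
"""

collector = TextCollector()
collector.feed(SAMPLE)
collector.close()
print("link text:", collector.link_words.most_common(3))
print("body text:", collector.body_words.most_common(3))
```

Run over several real posts, a comparison like this shows whether "rss" and "feed" come only from repeated link text, which is the part that can be reworded or reduced.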

It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )

Martinelli answered 9/7, 2010 at 3:45 Comment(2)
That is very interesting. So if I replace text using tools like Cufon, Googlebot will parse that JS, transform the text, and then ignore it because it will only be a canvas?Crofoot
No guarantees. Google is tight-lipped about what the bot can and cannot do, so it probably won't work. You can, however, start with the canvas rather than having Cufon do a replace.Martinelli
D
1

Google's crawlers are smart, but the people who program them are smarter. Humans always see what is sensible on a page; they spend time on blogs that have good, rare and unique content. It comes down to common sense: how people visit your blog and how much time they spend there. Google measures search results in the same way. Your page ranking also increases as daily visits increase and as the site content gets better and is updated every day. This page has the word "Answer" repeated multiple times, but that doesn't mean it will not get indexed. What matters is how useful it is to everyone. I hope this gives you some idea.

Dionne answered 17/3, 2014 at 9:31 Comment(1)
This does not address the question.Beneficent
S
-4

You have to manually detect Googlebot from the request's user agent and serve it slightly different content than you normally serve to your users.
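A minimal sketch of the detection step might look like the following. The token list and the is_crawler name are assumptions for illustration, the list is far from exhaustive, and a User-Agent header can be spoofed, so this is only a first-pass guess rather than proof that a request really comes from a search engine.

```python
# Assumed, non-exhaustive list of substrings found in crawler User-Agent headers.
CRAWLER_TOKENS = ("googlebot", "bingbot", "yandex")


def is_crawler(user_agent: str) -> bool:
    """Guess from the User-Agent header whether a request comes from a search crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"))  # False
```

The server would then choose which variant of the page to render based on that boolean, keeping the two variants as close as possible.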

Selfexecuting answered 9/7, 2010 at 3:26 Comment(4)
That is horrible advice. It's a good way to get google-spanked.Martinelli
I don't think it is that bad. What if you have a site which is subscription based but you still want Google to index the content? I don't think you will get 'google-spanked'Nowhere
@Aaron Harun, it's not black-hat SEO; it's completely white hat as long as you don't serve completely different content.Selfexecuting
@AaronHarun, This is white hat. Read Christopher's reply for more infoEpitasis
