What's the best way to write robots.txt for github pages using multiple repos?
I am using GitHub Pages to build my personal website with Jekyll. I have a head site in the username.github.io repo, a project A site in the projectA repo, a project B site in the projectB repo, and so on. I have put a CNAME file in the username.github.io repo so that all of my sites live under the custom domain (www.mydomain.com). I have noticed that, with the robots.txt file pointing to the sitemap.txt file in each repo, each sitemap.txt can only contain links for the pages in that one repo. So, I have a couple of questions:

  1. Since my site is structured as www.mydomain.com, www.mydomain.com/projectA, www.mydomain.com/projectB and so on, corresponding to the pages in the individual repos, will search engines recognize all of my site's pages even though the sitemap.txt under the username.github.io head repo only has the page links generated in that single repo?

  2. What is the best way to write the robots.txt file in my case?

Thanks! Qi

Christine answered 8/12, 2016 at 7:1 Comment(0)

Standards and disclaimer

The Sitemap: directive in robots.txt is a nonstandard extension according to Wikipedia. Remember that:

Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Wikipedia also lists allow: as a nonstandard extension.

Multiple sitemaps in robots.txt

You can specify more than one Sitemap file per robots.txt file. When specifying more than one sitemap, this is the format:

Sitemap: http://www.example.com/sitemap-host1.xml
Sitemap: http://www.example.com/sitemap-host2.xml

An index of sitemaps

There is also a type of sitemap file that is an index of sitemap files.

If you have a Sitemap index file, you can include the location of just that file; you don't need to list each individual Sitemap contained in the index file.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

<lastmod> is optional.

About excluding content

The Sitemaps protocol enables you to let search engines know what content you would like indexed. To tell search engines the content you don't want indexed, use a robots.txt file or robots meta tag. See robotstxt.org for more information on how to exclude content from search engines.

If you want search engines not to index something, list it in the robots.txt file (in the User Page repository), for example:

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Alternatively, you can use the robots meta tag.
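
For example, a minimal sketch of the meta tag approach (the page path here is hypothetical; the tag goes in the <head> of the specific page you want excluded):

<!-- in the <head> of e.g. /projectname/page_to_disallow.html -->
<meta name="robots" content="noindex">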

Suggestions

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/projectA/sitemap.xml
Sitemap: http://www.example.com/projectB/sitemap.xml

or, if you are using a sitemap index file

User-agent: *
Disallow: /project_to_disallow/
Disallow: /projectname/page_to_disallow.html

Sitemap: http://www.example.com/siteindex.xml

where http://www.example.com/siteindex.xml looks like

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectA/sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/projectB/sitemap.xml</loc>
   </sitemap>
</sitemapindex>
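
As a side note (an assumption on my part, since the question doesn't say which plugins are in use): each project repo's Jekyll site can generate its own sitemap.xml automatically with the jekyll-sitemap plugin, which GitHub Pages supports. A minimal _config.yml sketch for a project repo:

# _config.yml in the projectA repo
baseurl: /projectA
plugins:
  - jekyll-sitemap   # generates sitemap.xml, served at www.example.com/projectA/sitemap.xml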

For info on how to set up robots.txt with GitHub Pages, see my answer here.
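
As a rough sketch of one way to do it (not necessarily what that answer describes): commit a robots.txt to the root of the username.github.io repo and GitHub Pages will serve it at the root of the custom domain. If you give it front matter, Jekyll will also expand Liquid variables such as site.url (assuming url is set in _config.yml; the paths below are the illustrative ones from above):

---
---
User-agent: *
Disallow: /project_to_disallow/

Sitemap: {{ site.url }}/sitemap.xml
Sitemap: {{ site.url }}/projectA/sitemap.xml
Sitemap: {{ site.url }}/projectB/sitemap.xml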

Crispas answered 5/12, 2017 at 14:3 Comment(0)

Where to put it? The short answer: in the top-level directory of your web server. Source: http://www.robotstxt.org/robotstxt.html

You can also read in Google's documentation that a robots.txt file at a URL like www.mydomain.com/folder/robots.txt will not be used by crawlers; only the one at the domain root is checked.

The basic www.mydomain.com/robots.txt can be:

User-agent: *
Disallow:

This allows crawlers to go through the entire www.mydomain.com file hierarchy by following links.

If no page of www.mydomain.com references your project pages, you can add:

User-agent: *
Allow: /projectA
Allow: /projectB
Petronel answered 8/12, 2016 at 11:13 Comment(1)
I have special rules for each project repo, and I put those allowed links in the sitemap.txt file under each project repo. Can I use something like sitemap: https://www.example.com/sitemap.txt; sitemap: https://www.example.com/ProjectA/sitemap.txt; sitemap: https://www.example.com/ProjectB/sitemap.txt (on three lines, of course)? That way I don't need to update the top-level repo if a robots rule changes in a project repo. Thank you for replying. - Christine
