robots.txt file for different domains of same site

I have an ASP.NET MVC 4 web application that can be accessed from multiple different domains. The site is fully localized based on the domain in the request (similar in concept to this question).

I want to include a robots.txt file and I want to localize the robots.txt file based on the domain, but I am aware that I can only have one physical "robots.txt" text file in a site's file system directory.

What is the easiest/best way (and is it even possible) to use the ASP.NET MVC framework to achieve a robots.txt file on a per-domain basis so that the same site installation serves content to every domain, but the content of the robots file is localized depending on the domain requested?

Puny answered 10/6, 2013 at 22:21 Comment(1)
I don't believe this question should have been closed: it is a programming question relevant to ASP.NET MVC, and exactly the kind of problem the ASP.NET pipeline is suited to solve: how to make contextual decisions about what content to serve. It is definitely not off-topic.Rosenfeld

The process is reasonably simple:

The controller/action approach

  • Using your routes table, map your robots.txt path to an action in a controller (I use controller and action as a simple example to get you started), just as you would any other controller and view for a given path.
  • Inside the Action, check the domain in the request and choose your robots.txt content for that domain.
  • Return the appropriate file from disk using something like:

The following sample assumes a single top level robots.txt file:

// In App_Start/RouteConfig:
public static void RegisterRoutes(RouteCollection routes)
{
  routes.IgnoreRoute("{resource}.axd/{*pathInfo}");
  routes.MapRoute(
    name: "robots",
    url: "robots.txt",
    defaults: new { controller = "Seo", action = "Robots" }
  );
}

// The controller:
public class SeoController : Controller {
  public ActionResult Robots() {
    var robotsFile = "~/robots-default.txt";
    switch (Request.Url.Host.ToLower()) {
      case "stackoverflow.com":
        robotsFile = "~/robots-so.txt";
        break;
      case "meta.stackoverflow.com":
        robotsFile = "~/robots-meta.txt";
        break;
    }
    return File(robotsFile, "text/plain");
  }
}

One of the easiest ways to get this to work is to ensure that the routing module is called for all requests by setting runAllManagedModulesForAllRequests in web.config (don't actually use this; see the next paragraph):

<system.webServer>
  <handlers>
    ...
  </handlers>
  <modules runAllManagedModulesForAllRequests="true" />
</system.webServer>

This is not a good thing in general, as all static files (css, js, txt) now go through managed handlers before being diverted to the static file handler. IIS is really good at serving static files fast (a largely static file website will max out your disk I/O way before the CPU), so to avoid this performance hit the recommended approach is to register a handler just for robots.txt, as in the web.config sample below. Note the similarity to the ExtensionlessUrlHandler-Integrated-4.0 handler in the Visual Studio MVC 4 template applications:

<system.webServer>
  <handlers>
    <add name="Robots-Integrated-4.0"
         path="/robots.txt" verb="GET" 
         type="System.Web.Handlers.TransferRequestHandler" 
         preCondition="integratedMode,runtimeVersionv4.0" />
    ... the original handlers ...
  </handlers>
  <modules runAllManagedModulesForAllRequests="false" />
</system.webServer>       

Benefits/drawbacks

The advantages of this type of approach become apparent once you start using it:

  • You can dynamically generate your robots.txt content, using the Url helpers to generate action URLs that you then add, in whole or in part, to a template robots.txt file (a short sketch follows this list).
  • You can inspect the robot's user agent and return different robots content per crawler.
  • You can use the same controller to output sitemap.xml files for web crawlers.
  • You could manage the robots content from a database table that can easily be administered by site users.
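
For example, here is a minimal sketch of the dynamic-generation and per-crawler ideas, written as an action in the same style as the controller above. The bot name and the disallowed route are placeholders, not part of the original answer:

// Sketch only: build the robots content in code instead of reading a file.
public ActionResult Robots() {
  var sb = new System.Text.StringBuilder();
  sb.AppendLine("User-agent: *");

  // Use the routing helpers so disallowed paths stay correct if routes change.
  sb.AppendLine("Disallow: " + Url.Action("Index", "Search"));

  // Vary the content per crawler by inspecting the user agent (placeholder bot name).
  var userAgent = Request.UserAgent ?? string.Empty;
  if (userAgent.IndexOf("SomeAggressiveBot", StringComparison.OrdinalIgnoreCase) >= 0) {
    sb.AppendLine("Crawl-delay: 10");
  }

  return Content(sb.ToString(), "text/plain");
}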

On the downside,

  • your robots file is now complicating your routes table, when it doesn't really need to
  • you will need to optimise caching to prevent constant disk reads; this is true whichever approach you take, though (a caching sketch follows this list).
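
For the caching point, one simple option is output caching on the action. This is a minimal sketch, assuming the content only varies by host; the duration is arbitrary:

// Sketch only: cache the response per host so the robots file is not read
// from disk on every crawler request.
[OutputCache(Duration = 3600, VaryByParam = "none", VaryByHeader = "Host")]
public ActionResult Robots() {
  // ... choose the per-domain robots file exactly as in the controller above ...
  return File(Server.MapPath("~/robots-default.txt"), "text/plain");
}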

Remember also that different robots.txt files can be used for different subdirectories. This gets tricky with the route and controller approach, so the IHttpHandler approach (below) is easier for this situation.

The IHttpHandler approach

You can also do this with a custom IHttpHandler registered in your web.config. I emphasise custom as this avoids the need to make ALL controllers see ALL requests (with runAllManagedModulesForAllRequests="true"), unlike adding a custom route handler into your route table.

This is also potentially a more lightweight approach than the controller, but you would have to have enormous site traffic to notice the difference. Its other benefit is a reusable piece of code you can use for all your sites. You could also add a custom configuration section to configure the robot user agent/domain name/path mappings to the robots files (a sketch of the mapping idea follows the handler code below).

<system.webServer>
  <handlers>
    <add name="Robots" verb="*" path="/robots.txt"
         type="MyProject.RobotsHandler, MyAssembly" 
         preCondition="managedHandler"/>
  </handlers>
  <modules runAllManagedModulesForAllRequests="false" />
</system.webServer>

public class RobotsHandler: IHttpHandler
{
  public bool IsReusable { get { return false; } }
  public void ProcessRequest(HttpContext context) {
    string domain = context.Request.Url.Host;
    // set the response code, content type and appropriate robots file here
    // also think about handling caching, sending error codes etc.
    context.Response.StatusCode = 200;
    context.Response.ContentType = "text/plain";

    // return the robots content
    context.Response.Write("my robots content");
  }
}
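
As a sketch of the mapping idea mentioned above, a fuller ProcessRequest might look like this. The host names, file names and cache time are placeholders, and the dictionary could just as easily come from appSettings or a custom configuration section:

// Sketch only: map host names to physical robots files inside the handler.
// Assumes usings for System, System.Collections.Generic and System.Web.
public void ProcessRequest(HttpContext context) {
  var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
    { "stackoverflow.com",      "~/robots-so.txt"   },
    { "meta.stackoverflow.com", "~/robots-meta.txt" }
  };

  string robotsFile;
  if (!map.TryGetValue(context.Request.Url.Host, out robotsFile))
    robotsFile = "~/robots-default.txt";

  context.Response.StatusCode = 200;
  context.Response.ContentType = "text/plain";
  context.Response.Cache.SetCacheability(HttpCacheability.Public);
  context.Response.Cache.SetExpires(DateTime.UtcNow.AddHours(1));
  context.Response.TransmitFile(context.Server.MapPath(robotsFile));
}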

robots.txt in subdirectories

To serve robots for subdirectories as well as the site root you can't use the controller approach easily; the handler approach is simpler in this scenario. This can be configured to pick up robots.txt file requests to any subdirectory and handle them accordingly. You might then choose to return 404 for some directories, or a subsection of the robots file for others.

I specifically mention this here as this approach can also be used for the sitemap.xml files, to serve different sitemaps for different sections of the site, multiple sitemaps that reference each other etc.
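
A rough sketch of that kind of branching inside the handler, assuming the handler registration is broadened so that subdirectory robots.txt requests reach it; the "/private/" prefix, the file name and the fallback rules are only examples:

// Sketch only: refuse robots.txt for one area and serve reduced rules elsewhere.
public void ProcessRequest(HttpContext context) {
  var path = context.Request.Url.AbsolutePath;

  if (path.StartsWith("/private/", StringComparison.OrdinalIgnoreCase)) {
    // No robots file is published for this area.
    context.Response.StatusCode = 404;
    return;
  }

  context.Response.ContentType = "text/plain";
  if (path.Equals("/robots.txt", StringComparison.OrdinalIgnoreCase)) {
    // The full file for the site root (placeholder file name).
    context.Response.TransmitFile(context.Server.MapPath("~/robots-default.txt"));
  } else {
    // A subsection of the robots rules for other subdirectories.
    context.Response.Write("User-agent: *\nDisallow: /");
  }
}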


Rosenfeld answered 11/6, 2013 at 6:40 Comment(3)
This was super helpful, thanks for making this awesome answer, Andy. One small note I'd like to add: you need to remove robots.txt from the root directory, or you will get a 500 error (recursive depth exceeded).Finance
Can I ask what type="System.Web.Handlers.TransferRequestHandler" and preCondition="integratedMode,runtimeVersionv4.0" means? I hate seeing version numbers in there. It makes me feel like I'll need to rewrite my code when I upgrade to a new version. (And, surprise, I'd prefer not to have to do that.)Hoberthobey
I agree with @JonathanWood, how do we know which version numbers to use, especially in a cloud environment, and how do we handle changes in the version?Pooh

Andy Brown's System.Web.Handlers.TransferRequestHandler approach in web.config did not work for me due to the environment I was working in, resulting in 500 errors.

An alternative using a web.config URL rewrite rule worked for me instead:

<rewrite>
    <rules>
        <rule name="Dynamic robots.txt" stopProcessing="true">
            <match url="robots.txt" />
            <action type="Rewrite" url="/DynamicFiles/RobotsTxt" />
        </rule>
    </rules>
</rewrite>
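
For completeness, here is a sketch of a controller action the rewrite above could target. The controller and action names simply mirror the /DynamicFiles/RobotsTxt URL in the rule, and the host check and content are placeholders:

// Sketch only: a standard MVC controller (using System.Web.Mvc) that the
// rewritten request lands on; choose the content per host as in the answers above.
public class DynamicFilesController : Controller {
  public ActionResult RobotsTxt() {
    var content = Request.Url.Host.EndsWith("example.com", StringComparison.OrdinalIgnoreCase)
      ? "User-agent: *\nDisallow: /"
      : "User-agent: *\nAllow: /";

    return Content(content, "text/plain");
  }
}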
Technology answered 22/11, 2021 at 2:12 Comment(0)
