What is a Robots.txt File?

This post is another in the spirit of my Sitemaps post, where I explained how to create an XML sitemap for submission to Google and Bing Webmaster Tools.

Today I want to talk about the robots.txt file, which is an extremely powerful and oft-misunderstood part of websites. I misunderstood it myself recently.

Overview

The robots.txt file has been in existence since 1994, which in Internet time is since the beginning of time.

The robots.txt file is used to restrict specific search engine bots from accessing specific parts of your website. The file works on a per-bot basis. The typical search engine bots that you can restrict are called:

Googlebot
MSNbot (Bing)

There are other specific bots as well, such as Googlebot-Image and Googlebot-News. For a fairly comprehensive list, go here.

What’s the format?

Here is the basic format of a robots.txt file:

User-agent: *
Disallow: /admin

Sitemap: http://www.examplesite.com/sitemap.xml

Let’s walk through these one at a time:

User-agent: This is where you specify the name of the search engine bot that you want to restrict. To apply all of your settings to all search engine bots, simply use a *, so the line will look like “User-agent: *”.

Allow: This is where you specify which pages you WANT crawled (and so eligible to be indexed). If you want all areas of your site crawled, you simply put a “/”. So the line will look like “Allow: /”, or you can simply leave this line out (recommended).

Disallow: This is where you specify the parts of your site that you want to keep bots out of. For example, if you want to restrict the crawling of your Admin section, you could put “/admin”, which will disallow every URL whose path begins with /admin. This would include URLs like “http://www.examplesite.com/admin/login” or “http://www.examplesite.com/admin/secretfile”. Also, remember that each disallowed URL or folder must go on its own Disallow line.

Sitemap: This is where you can specify the full URL of your sitemap.xml. So the line will look like “Sitemap: http://www.examplesite.com/sitemap.xml”. You can also specify multiple sitemaps, each on its own Sitemap line. Here is CNN’s robots.txt, which specifies multiple sitemaps such as News and Video.
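To see how a crawler might read these directives, here is a minimal Python sketch using the standard library's urllib.robotparser. The file contents, the examplesite.com URLs, and the is_allowed helper are my own illustration rather than an official tool, and real search engine bots may interpret some rules differently.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, mirroring the directives described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin

Sitemap: http://www.examplesite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, agent="*"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(is_allowed("http://www.examplesite.com/admin/login"))  # blocked by /admin
print(is_allowed("http://www.examplesite.com/about"))        # no rule matches
print(parser.site_maps())  # list of Sitemap URLs (Python 3.8+)
```

Note that “/admin” is matched as a prefix, so in this sketch a URL such as /administrator would be blocked as well. Use “/admin/” if you only mean the folder.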

Where is the file placed?

The robots.txt file is placed in the root folder of your website, so that it can be found at the path “http://www.yoursite.com/robots.txt”.
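Because the file always lives at the root, you can work out its location from any page URL on the same site. Here is a small Python sketch of that idea; the robots_url function name is my own, and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/blog/some-post?page=2"))
```

Keep in mind that a subdomain such as blog.yoursite.com is a different host, so it needs its own robots.txt at its own root.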

How do I build a robots.txt file?

There are at least two different ways to build a robots.txt file. They are:

By hand (more difficult)
Using Google Webmaster Tools (GWT), which requires being signed up for it (and there is no good excuse not to be)

I recommend using the GWT functionality to set up your robots.txt file, because of the ease and simplicity. For a good step-by-step tutorial, I recommend the official Google tutorial.

Advanced Uses of Robots.txt

There are some advanced operators that most search engine bots will recognize. Use these with caution (the text below is taken directly from the Google Webmaster site):

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
   User-agent: Googlebot
   Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
   User-agent: Googlebot
   Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
   User-agent: Googlebot
   Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn’t crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
   User-agent: *
   Allow: /*?$
   Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
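If you want to test patterns like these before publishing them, here is a rough Python sketch of the matching described above. Everything in it is my own illustration: the function names are made up, and the precedence rule (the longest matching pattern wins, with Allow winning ties) is an assumption modeled on how Google documents its matching. Other bots may behave differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (* and trailing $) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if `path` (including any query string) matches from its start."""
    return pattern_to_regex(pattern).match(path) is not None


def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs for one user-agent.

    Assumed precedence: longest matching pattern wins, Allow wins ties.
    A path that matches no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        if matches(pattern, path):
            key = (len(pattern), directive == "Allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]


rules = [("Allow", "/*?$"), ("Disallow", "/*?")]
for path in ["/page", "/page?", "/page?session=abc"]:
    print(path, is_allowed(rules, path))
```

With these two rules, a URL ending in a bare ? stays crawlable while anything with a session ID after the ? is blocked, which is the behavior the quoted text describes.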

Resources for more reading

If you are interested in reading more into robots.txt files and what is possible and recommended, please check out the following resources:

Robotstxt.org

Google Webmaster Central on Robots.txt

Search News Central on Duplicate Content

Some last words of advice

One must understand that the robots.txt file is simply a directive for webcrawler bots, and not all bots will adhere to your file. Some bad bots will still crawl the pages you have restricted through your robots.txt file.

Also, the robots.txt file found at http://www.examplesite.com/robots.txt is different from robots metatags. Until I have the time to write about robots metatags, I suggest checking out this explanation.

BE CAREFUL

I close this post with a word of caution. I recently put what I thought was a correct, minimal robots.txt on this site. I did it like this:

User-agent: *
Allow: *
Disallow: *

Sitemap: http://johnfdoherty.wpengine.com/sitemap.xml

However, this apparently disallowed my whole site! All of my top-level pages were being removed from the index! My traffic was plummeting and I could not figure out why. Of course, I was traveling, and once I finally logged back into GWT it told me what was wrong, but by then the site had been disallowed for a few days. Whoops!

So learn from my mistake. Only Disallow exactly what you need disallowed. If you aren’t disallowing anything, leave the Disallow line blank or leave it out entirely.
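One way to guard against this kind of accident is a quick sanity check before you upload the file. Below is a minimal Python sketch, again using the standard library's urllib.robotparser. The blocked_urls helper, the sample files, and the URLs are all hypothetical. As far as I can tell, this parser does plain prefix matching and does not interpret wildcards the way Googlebot does, so treat it as a rough first check and still confirm in GWT.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, agent="*"):
    """Return the URLs from `urls` that `robots_txt` blocks for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Pages that must stay crawlable (placeholders for your own key URLs).
IMPORTANT = [
    "http://www.examplesite.com/",
    "http://www.examplesite.com/blog/",
]

SAFE = "User-agent: *\nDisallow:\n"     # blank Disallow: nothing is blocked
RISKY = "User-agent: *\nDisallow: /\n"  # blocks the entire site

print(blocked_urls(SAFE, IMPORTANT))
print(blocked_urls(RISKY, IMPORTANT))
```

If the list that comes back is not empty, stop and reread your Disallow lines before the file goes live.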