Robots.txt Secrets From Matt Cutts
Finally, some robots.txt questions have been answered. The most interesting thing is how simple robots.txt files really are, and how useful they can be for directing your PageRank wherever you want it. The robots.txt secret is that you use the robots.txt file as the first, not too restrictive layer of control. Then you use the XHTML meta tags NoIndex and NoFollow in the <head> section of your HTML. Finally, you mark up individual links in your source code by adding rel="nofollow" to control PageRank flow.
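To make those three layers a bit more concrete, here is a rough Python sketch that reads each one using only the standard library. The robots.txt rules, the page markup, and the paths in it are made up for illustration, and the RobotsHintParser class is my own helper, not anything Google provides. It only shows where each signal lives, not how Googlebot actually processes it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Layer 1: a deliberately loose robots.txt (hypothetical paths).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Layers 2 and 3: a page with a robots meta tag in <head>
# and one link marked rel="nofollow" (hypothetical markup).
PAGE_HTML = """\
<html>
<head>
<meta name="robots" content="noindex" />
</head>
<body>
<a href="/about/" rel="nofollow">About us</a>
<a href="/products/">Products</a>
</body>
</html>
"""


class RobotsHintParser(HTMLParser):
    """Collects robots meta directives and sorts links by rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.meta_directives = set()
        self.nofollow_links = []
        self.followed_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.meta_directives.add(part.strip().lower())
        elif tag == "a" and attrs.get("href"):
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "nofollow" in rel_tokens:
                self.nofollow_links.append(attrs["href"])
            else:
                self.followed_links.append(attrs["href"])


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

page = RobotsHintParser()
page.feed(PAGE_HTML)

print(robots.can_fetch("*", "/private/report.html"))  # layer 1: crawl allowed?
print(sorted(page.meta_directives))                   # layer 2: page-level tags
print(page.nofollow_links)                            # layer 3: link-level nofollow
```

The point of keeping the three checks separate is the same as in the post: robots.txt is decided before a page is fetched, the meta tags only count once the page has been crawled, and rel="nofollow" applies to one link at a time.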
These hounding robots.txt questions were answered by none other than Matt Cutts in this upbeat and not too technical interview conducted by Eric Enge of STC. The interview transcript is fairly long and touches on multiple issues that are, quite frankly, out of my league, so I just grabbed the parts of the interview that answer some robots.txt questions.
Eric Enge: Does the NoFollow metatag imply a NoIndex on a page?
Matt Cutts: No. The NoIndex and NoFollow metatags are independent. The NoIndex metatag, for Google at least, means don't show this page in Google's index. The NoFollow metatag means don't follow the outgoing links on this entire page.
...
Eric Enge: Right. So there are two levels of NoFollow. There is the attribute on a link, and then there is the metatag, right?
Matt Cutts: Exactly.
Eric Enge: What we've been doing is working with clients and telling them to take pages like their about us page, and their contact us page, and link to them from the Homepage with a NoFollow attribute, and then link to them using NoFollow from every other page. It's just a way of lowering the amount of link juice they get. These types of pages are usually the highest PageRank pages on the site, and they are not doing anything for you in terms of search traffic.
Matt Cutts: Absolutely. So, we really conceive of NoFollow as a pretty general mechanism. The name, NoFollow, is meant to mirror the fact that it's also a metatag. As a metatag NoFollow means don't crawl any links from this entire page. NoFollow as an individual link attribute means don't follow this particular link, and so it really just extends that granularity down to the link level. We did an interview with Rand Fishkin over at SEOmoz where we talked about the fact that NoFollow was a perfectly acceptable tool to use in addition to robots.txt. NoIndex and NoFollow as a metatag can change how Googlebot crawls your site. It's important to realize that typically these things are more of a second order effect. What matters the most is to have a great site and to make sure that people know about it, but, once you have a certain amount of PageRank, these tools let you choose how to develop PageRank amongst your pages.
Eric Enge: Right. Another example scenario might be if you have a site and discover that you have a massive duplicate content problem. A lot of people discover that because something bad happened. They want to act very promptly, so they might NoIndex those pages, because that will get them out of the index, removing the duplicate content. Then, after they are out of the index, you can either just leave in the NoIndex, or you can go back to robots.txt to prevent the pages from being crawled. Does that make sense in terms of thinking about it?
...
Matt Cutts: In general, Google does a relatively good job of following the 301s, and 302s, and even Meta Refreshes and JavaScript. Typically what we don't do would be to follow a chain of redirects that goes through a robots.txt that is itself forbidden.
Eric Enge: Let's talk a bit about the various uses of NoIndex, NoFollow, and Robots.txt. They all have their own little differences to them. Let's review these with respect to 3 things: (1) whether it stops the passing of link juice; (2) whether or not the page is still crawled; and (3) whether or not it keeps the affected page out of the index.
Matt Cutts: I will start with robots.txt, because that's the fundamental method of putting up an electronic no trespassing sign that people have used since 1996. Robots.txt is interesting, because you can easily tell any search engine to not crawl a particular directory, or even a page, and many search engines support variants such as wildcards, so you can say don't crawl *.gif, and we won't crawl any GIFs for our image crawl. We even have additional standards such as Sitemap Support, so you can say here's a link to where my Sitemap can be found. I believe the only robots.txt extension in common use that Google doesn't support is the crawl-delay. And, the reason that Google doesn't support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.
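(An aside from me, not from the interview: the crawl-delay arithmetic is easy to check. The sketch below assumes the delay is read as seconds between page fetches, which is how the transcript describes it, and uses Python's standard robots.txt parser on a made-up file. The days_per_page function is my own helper name.)

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical robots.txt with the kind of value mentioned in the interview.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 100000
Disallow: /tmp/
"""


def days_per_page(robots_txt, user_agent="*"):
    """Return how many days pass between page fetches, or None if no delay is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay_seconds = parser.crawl_delay(user_agent)
    if delay_seconds is None:
        return None
    return delay_seconds / SECONDS_PER_DAY


print(round(days_per_page(ROBOTS_TXT), 2))  # 1.16 days between pages
```

So a crawl-delay of 100,000 works out to a little over one day per page, which is in the same ballpark as the "every other day or something like that" in the transcript.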
We have even seen people who set a crawl-delay such that we'd only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central, but crawl-delay is the inverse; it's saying crawl me once every "n" seconds. In fact what you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean 2 Googlebots are allowed to be crawling the site at once. Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results.
Eric Enge: Based on the links from other sites to those pages.
Matt Cutts: Exactly. So, we would return the un-crawled reference to eBay.
Matt Cutts: Exactly. The funny thing is that we could sometimes rely on the ODP description (Editor: also known as DMOZ). And so, even without crawling, we could return a reference that looked so good that people thought we crawled it, and so that caused a little bit of earlier confusion. So, robots.txt was one of the most long standing standards. Whereas for Google, NoIndex means we won't even show it in our search results. So, with robots.txt for good reasons we've shown the reference even if we can't crawl it, whereas if we crawl a page and find a Meta tag that says NoIndex, we won't even return that page. For better or for worse that's the decision that we've made. I believe Yahoo and Microsoft might handle NoIndex slightly differently which is a little unfortunate, but everybody gets to choose how they want to handle different tags.
Eric Enge: Can a NoIndex page accumulate PageRank?
Matt Cutts: A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page.
Eric Enge: So, it can accumulate and pass PageRank.
Matt Cutts: Right, and it will still accumulate PageRank, but it won't be showing in our Index. So, I wouldn't make a NoIndex page that itself is a dead end. You can make a NoIndex page that has links to lots of other pages. For example you might want to have a master Sitemap page and for whatever reason NoIndex that, but then have links to all your sub Sitemaps.
...
Eric Enge: Another example is if you have pages on a site with content that from a user point of view you recognize that it's valuable to have the page, but you feel that it is too duplicative of content on another page on the site. That page might still get links, but you don't want it in the Index and you want the crawler to follow the paths into the rest of the site.
Matt Cutts: That's right. Another good example is, maybe you have a login page, and everybody ends up linking to that login page. That provides very little content value, so you could NoIndex that page, but then the outgoing links would still have PageRank. Now, if you want to you can also add a NoFollow metatag, and that will say don't show this page at all in Google's Index, and don't follow any outgoing links, and no PageRank flows from that page. We really think of these things as trying to provide as many opportunities as possible to sculpt where you want your PageRank to flow, or where you want Googlebot to spend more time and attention.
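To keep the three cases from the interview straight, here is how I would summarize them as a small Python function. This is only my reading of what Matt Cutts describes above, not Google's actual logic, and the names (PageOutcome, outcome, and the field names) are my own. One part is an inference on my side: a page blocked in robots.txt is never crawled, so I assume its meta tags are never read and its outgoing links are never seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageOutcome:
    crawled: bool
    may_appear_in_results: bool
    can_accrue_pagerank: bool
    outgoing_links_followed: bool


def outcome(robots_txt_disallowed, meta_noindex, meta_nofollow):
    """Summarize the interview's description for one page (a reading, not a spec)."""
    if robots_txt_disallowed:
        # Not crawled, but it can still accrue PageRank from inbound links
        # and be returned as an un-crawled reference.
        # Assumption: meta tags and outgoing links are never seen.
        return PageOutcome(
            crawled=False,
            may_appear_in_results=True,
            can_accrue_pagerank=True,
            outgoing_links_followed=False,
        )
    return PageOutcome(
        crawled=True,
        may_appear_in_results=not meta_noindex,    # NoIndex: not shown at all
        can_accrue_pagerank=True,                  # a NoIndex page still accumulates
        outgoing_links_followed=not meta_nofollow, # NoFollow meta: no links followed
    )


# The login page example: NoIndex it, but let its links pass PageRank.
print(outcome(robots_txt_disallowed=False, meta_noindex=True, meta_nofollow=False))
```

The link-level rel="nofollow" attribute is left out on purpose, since it applies to a single link rather than to the whole page.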
Matt Cutts Blog - Full-length transcript at Stone Temple Consulting here, or you can read Eric's blog post about it.