
Robots.txt Secrets From Matt Cutts

by AskApache · 8 comments

Watch out, Googlebot's got a weapon! Finally, some robots.txt questions have been answered. The most interesting thing is how simple robots.txt files really are, and how incredibly useful they can be at directing your PageRank wherever you want it. The robots.txt secret is to use the robots.txt file as the first, not-too-restrictive control. Then you use the NoIndex and NoFollow meta tags in the head section of your XHTML. Finally, you mark up individual links in your source code by adding rel="nofollow" to control the flow of PageRank.
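As a minimal sketch of those three layers (the directory and page names below are placeholders, just for illustration):

    # robots.txt -- the broad, not-too-restrictive first control
    User-agent: *
    Disallow: /cgi-bin/

    <!-- per-page control: meta tags in the <head> of an individual page -->
    <meta name="robots" content="noindex,nofollow">

    <!-- per-link control: the nofollow attribute on a single link -->
    <a href="/about-us/" rel="nofollow">About Us</a>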

These hounding robots.txt questions were answered by none other than Matt Cutts in an upbeat and not-too-technical interview done by Eric Enge of Stone Temple Consulting. The interview transcript is fairly long and touches on multiple issues that are quite frankly out of my league, so I just grabbed the parts of the interview that answer some robots.txt questions.

Eric Enge: Does the NoFollow metatag imply a NoIndex on a page?

Matt Cutts: No. The NoIndex and NoFollow metatags are independent. The NoIndex metatag, for Google at least, means don't show this page in Google's index. The NoFollow metatag means don't follow the outgoing links on this entire page. ...

Eric Enge: Right. So there are two levels of NoFollow. There is the attribute on a link, and then there is the metatag, right?

Matt Cutts: Exactly.

Eric Enge: What we've been doing is working with clients and telling them to take pages like their about us page and their contact us page, and link to them from the homepage with a NoFollow attribute, and then link to them using NoFollow from every other page. It's just a way of lowering the amount of link juice they get. These types of pages are usually the highest PageRank pages on the site, and they are not doing anything for you in terms of search traffic.

Matt Cutts: Absolutely. So, we really conceive of NoFollow as a pretty general mechanism. The name, NoFollow, is meant to mirror the fact that it's also a metatag. As a metatag, NoFollow means don't crawl any links from this entire page. NoFollow as an individual link attribute means don't follow this particular link, and so it really just extends that granularity down to the link level. We did an interview with Rand Fishkin over at SEOmoz where we talked about the fact that NoFollow is a perfectly acceptable tool to use in addition to robots.txt. NoIndex and NoFollow as a metatag can change how Googlebot crawls your site. It's important to realize that typically these things are more of a second-order effect. What matters the most is to have a great site and to make sure that people know about it, but once you have a certain amount of PageRank, these tools let you choose how to develop PageRank amongst your pages.

Eric Enge: Right. Another example scenario might be if you have a site and discover that you have a massive duplicate content problem. A lot of people discover that because something bad happened. They want to act very promptly, so they might NoIndex those pages, because that will get them out of the index, removing the duplicate content. Then, after it's out of the index, you can either just leave in the NoIndex, or you can go back to robots.txt to prevent the pages from being crawled. Does that make sense in terms of thinking about it? ...

Matt Cutts: In general, Google does a relatively good job of following 301s and 302s, and even Meta Refreshes and JavaScript. Typically what we don't do is follow a chain of redirects that goes through a robots.txt that is itself forbidden.

Eric Enge: Let's talk a bit about the various uses of NoIndex, NoFollow, and robots.txt. They all have their own little differences to them. Let's review these with respect to three things: (1) whether it stops the passing of link juice; (2) whether or not the page is still crawled; and (3) whether or not it keeps the affected page out of the index.

Matt Cutts: I will start with robots.txt, because that's the fundamental method of putting up an electronic no-trespassing sign that people have used since 1996. Robots.txt is interesting, because you can easily tell any search engine not to crawl a particular directory, or even a page, and many search engines support variants such as wildcards, so you can say don't crawl *.gif, and we won't crawl any GIFs for our image crawl. We even have additional standards such as Sitemap support, so you can say here's where my Sitemap can be found.
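(Editor's note: as a rough sketch, the wildcard and Sitemap lines Matt describes would look something like this in a robots.txt file; the domain and file names are placeholders, and wildcard support varies a bit between engines.)

    # don't crawl any GIF files (Googlebot understands the * and $ wildcards)
    User-agent: Googlebot
    Disallow: /*.gif$

    # tell the engines where the XML Sitemap can be found
    Sitemap: http://www.example.com/sitemap.xml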
Matt Cutts (continuing): I believe the only robots.txt extension in common use that Google doesn't support is crawl-delay. And the reason that Google doesn't support crawl-delay is that way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and that means we get to crawl one page every other day or something like that. We have even seen people set a crawl-delay such that we'd only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central. Crawl-delay is the inverse; it's saying crawl me once every "n" seconds. What you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean two Googlebots are allowed to be crawling the site at once.

Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results.

Eric Enge: Based on the links from other sites to those pages.

Matt Cutts: Exactly. So, we would return the un-crawled reference to eBay. ...

Matt Cutts: Exactly. The funny thing is that we could sometimes rely on the ODP description (Editor: also known as DMOZ). And so, even without crawling, we could return a reference that looked so good that people thought we had crawled it, and that caused a little bit of confusion early on. So, robots.txt is one of the longest-standing standards, whereas for Google, NoIndex means we won't even show the page in our search results. So, with robots.txt, for good reasons we've shown the reference even if we can't crawl it, whereas if we crawl a page and find a meta tag that says NoIndex, we won't even return that page. For better or for worse, that's the decision we've made. I believe Yahoo and Microsoft might handle NoIndex slightly differently, which is a little unfortunate, but everybody gets to choose how they want to handle different tags.

Eric Enge: Can a NoIndex page accumulate PageRank?

Matt Cutts: A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page.

Eric Enge: So, it can accumulate and pass PageRank.

Matt Cutts: Right, and it will still accumulate PageRank, but it won't be shown in our index. So, I wouldn't make a NoIndex page that is itself a dead end. You can make a NoIndex page that has links to lots of other pages. For example, you might want to have a master Sitemap page and for whatever reason NoIndex that, but then have links to all your sub-Sitemaps. ...

Eric Enge: Another example is if you have pages on a site with content that, from a user point of view, you recognize is valuable to have, but you feel is too duplicative of content on another page on the site. That page might still get links, but you don't want it in the index, and you want the crawler to follow the paths into the rest of the site.

Matt Cutts: That's right. Another good example is, maybe you have a login page, and everybody ends up linking to that login page. That provides very little content value, so you could NoIndex that page, but then the outgoing links would still have PageRank. Now, if you want to, you can also add a NoFollow metatag, and that will say don't show this page at all in Google's index, and don't follow any outgoing links, and no PageRank flows from that page.
We really think of these things as trying to provide as many opportunities as possible to sculpt where you want your PageRank to flow, or where you want Googlebot to spend more time and attention.
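To make the login-page example concrete, here is a minimal sketch of the two metatag variants Matt contrasts, placed in the head of the page you want kept out of the index (the values are standard, but how other engines treat them can differ):

    <!-- keep the page out of the index, but still follow its links,
         so PageRank can flow out to the rest of the site -->
    <meta name="robots" content="noindex,follow">

    <!-- keep it out of the index AND stop following its links,
         so no PageRank flows from this page at all -->
    <meta name="robots" content="noindex,nofollow">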

You can read the full-length transcript at Stone Temple Consulting, Eric's blog post about it, or Matt Cutts' blog.


December 14th, 2007

Comments Welcome

  • John H. Gohde

    Leak juice? I prefer not to humanize the SEO process. Your post once again brings up the importance of internal link structure. Use NoFollow on your "about us page"? An absolute waste of time, IMHO. Webmasters should use NoFollow to artificially fix an inherent weakness of the Google PageRank system? No, I have better things to do with my time. Like making my "about us page" into something important and worth reading. And, personally, I think webmasters have better things to do than waste their time on time-consuming, meaningless busy work.

  • Steve Walker

    I have a page on my site that is simply designed to assist clients in funding their accounts, but it seems to be drawing a lot of traffic from very odd and irrelevant terms. Will the use of noindex tags alone work, or do I need to block the URL in my robots.txt file as well? If so, should I then block every link pointing to the page too? It is an important page so I link to it site-wide, I just don't want it indexed.

    Also, I'm ranked 12th or so for my key phrase and hoping to land on the first page of results soon. Will this adversely affect my ranking in any way, or could it help? It's taken a while to get here, so I hate to rock the boat unless it will likely help me.
    Thanks

  • AskApache

    @ JOHN

    Nice advice, thanks.


  • R. Richard Hobbs

    This made for interesting reading and gave me some food for thought. I'm not a full-time webmaster but more like a business owner and "chief bottle washer and cook" (read: I publish my own website(s) and at the bottom of it all just want my content to get seen by the right people), so I have to be careful not to get in over my head... I agree with @John's comment about not wanting to spend too much time as a "Google Detective"...

    In an effort to reduce potential dupe content, I applied a robots.txt recently based on one I copied from an article elsewhere, in which the author was using Disallow for his index.php. Not sure what I was thinking, but that was one of the lines that ended up in my robots.txt, and unless there is some other unknown aspect involved (no messages from Google...), for the time being I seem to be literally wiped off the face of Google. Hopefully this is just temporary and the Googlebot will revisit and start indexing from my home page again... My home page (index.php) was in fact my highest-ranked page and was ranked very attractively in my desired search results - argggh

    So, after reading around, including the interview you posted, I started thinking (and have little to lose for the moment). This is probably the eternal conflict for anyone wishing to SEO their site on anything but the most basic of levels: "how do I get ALL my content seen but avoid dupe content penalties?" I have decided to try the following: using noindex, follow tags on the content most likely to be flagged as dupe content (I use an excerpting plugin as it is for anything resembling an archive...), and only using Disallow for administrative or site code areas (i.e. the wp-* folders) and server areas (i.e. cgi-bin), as they are surely duplicates of many other sites, it's boring stuff that has no business showing up in content search results, and it just generally feels best to keep as much of the administrative and code stuff private for security reasons.

    So my thinking is: don't ask the Googlebot to index all the flotsam and jetsam (noindex the archives and other likely "dupy" content...) but tell it to follow so any linky goodness is maximized, i.e. outbound links contained in any of the content. This all seems pretty wholesome?

    Or am I just somehow missing the boat in my naivety?

    I just wondered to myself if the follow tag even needs to be included; it's probably the default action for a searchbot?

    Anyway, thanks AskApache for making some interesting and thought-provoking content available on your site.

  • John H Gohde

    Quoting Matt Cutts:

    "Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results."

    If that is not proof that Google does anything that it feels like doing, then what is?
    Yeah, right, sure ... Google respects your robots.txt directives not to crawl a given webpage. Google is going to return its content in the search results!!!
    With double-talk like this, I am going to depend on using my own brain and not read more drivel from Matt Cutts.

  • John H. Gohde

    Upon a second reading, I am now able to comprehend Matt Cutts's quote from my last comment. Matt was referring to the concept of "Google bombing," where totally erroneous webpages can show up in the SERPs because of the off-page factor of having other websites pointing to them with anchor text. Still, I find that to be a violation of trust committed by Google. When I specify that I don't want a webpage indexed or crawled, I most certainly do NOT want that page to show up in Google's SERPs under any circumstances. Seems simple enough to me.

  • Pragmites

    This guy's a genius. He managed to explain all the terms with such simplicity. I had a doubt in my mind about NoIndex pages and whether their outgoing links could pass PageRank.

  • MetaT

    Is there a way to block the flow of PageRank by just using the robots.txt file, without the meta tags? Or do you have to use both?

