SEO with Robots.txt


robots.txt search engine optimization simply means tuning the robots.txt file for your blog, whether it runs WordPress, phpBB, or another platform. This article covers an optimized robots.txt and meta tags for WordPress, along with many real-world examples.



See the Updated WordPress robots.txt file

Google Robots.txt Info and Recommendations

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.
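For example, a minimal robots.txt that lets every crawler fetch the entire site (a safe baseline when you have nothing to block) is simply:

User-agent: *
Disallow:

An empty Disallow value means nothing is excluded.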

Googlebot and Robots.txt SEO Info

When deciding which pages to crawl, Googlebot chooses which robots.txt record to obey in this order (a short example follows the list):

  1. Googlebot obeys the first record in the robots.txt file with a User-agent starting with "Googlebot."
  2. If no "Googlebot" User-agent exists, it obeys the first record with a User-agent of "*"
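For illustration, in the sketch below (the directory names are made up) Googlebot obeys only its own record and ignores the "*" record, while every other crawler falls back to "*":

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /archives/

Here Googlebot is blocked only from /archives/, not from /private/; records are not combined.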

Google User-agents

Googlebot: crawls pages for our web index and our news index
Googlebot-Mobile: crawls pages for our mobile index
Googlebot-Image: crawls pages for our image index
Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if you show AdSense ads on your site.
Adsbot-Google: crawls pages to measure AdWords landing page quality. We only use this bot if you use Google AdWords to advertise your site. Find out more about this bot and how to block it from portions of your site.

Removing Old or Wrong Content from Google

  1. Create the new page
  2. In .htaccess (if on a Linux/Apache server) add a RedirectPermanent command (see the sketch after this list)
  3. DO NOT DELETE THE OLD FILE
  4. Update all the links on your website to point to the new page (change the link text while you're at it)
  5. Verify that no pages point to the old file (including your sitemap.xml)
  6. Add a noindex,nofollow to the old file AND Disallow in your robots.txt
  7. Submit your updated sitemap.xml to Google & Yahoo
  8. Wait a few weeks
  9. When the new page appears in Google, it's safe to delete the old one
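A minimal sketch of step 2, assuming an Apache server and hypothetical file names (old-page.html moving to /new-page/); adjust the paths for your own site:

# .htaccess: permanent (301) redirect from the old URL to the new one
RedirectPermanent /old-page.html http://www.example.com/new-page/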

Google Sponsored Robots.txt Articles

  1. Controlling how search engines access and index your website
  2. The Robots Exclusion Protocol
  3. robots.txt analysis tool
  4. Googlebot
  5. Inside Google Sitemaps: Using a robots.txt file
  6. All About Googlebot

robots.txt examples

robots.txt for WordPress 2.+

User-agent:  *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /about/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact
Disallow: /manual
Disallow: /manual/*
Disallow: /phpmanual/
Disallow: /category/
 
User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
 
# disallow all files with ? in url
Disallow: /*?*
 
# disable duggmirror
User-agent: duggmirror
Disallow: /
 
# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*
 
# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

robots.txt for phpBB

User-agent: *
Disallow: /cgi-bin/
Disallow: /phpbb/admin/
Disallow: /phpbb/cache/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/templates/
Disallow: /phpbb/faq.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/login.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/profile.php
Disallow: /phpbb/search.php
Disallow: /phpbb/viewonline.php
 
User-agent: Googlebot
# disallow files ending with these extensions
Disallow: /*.inc$
Disallow: /*.js$
Disallow: /*.css$
 
# disallow URLs containing these query parameters
Disallow: /*mark=*
Disallow: /*view=*
 
# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*
 
# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

A longer, combined example (phpBB installed in the site root alongside WordPress):

User-agent: *
Disallow: /stats
Disallow: /dh_
Disallow: /V
Disallow: /z/j/
Disallow: /z/c/
Disallow: /cgi-bin/
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /index.php?
Disallow: /posting.php
Disallow: /groupcp.php
Disallow: /search.php
Disallow: /login.php
Disallow: /post
Disallow: /member
Disallow: /profile.php
Disallow: /memberlist.php
Disallow: /faq.php
Disallow: /templates/
Disallow: /mx_
Disallow: /db/
Disallow: /admin/
Disallow: /cache/
Disallow: /images/
Disallow: /includes/
Disallow: /common.php
Disallow: /index.php
Disallow: /memberlist.php
Disallow: /modcp.php
Disallow: /privmsg.php
Disallow: /viewonline.php
Disallow: /images/
Disallow: /rss.php
 
User-agent: Googlebot
# disallow all files ending with these extensions, but allow the sitemap
Allow: /sitemap.php
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$
 
# disallow all files with ? in url
Disallow: /*?*
Disallow: /*?
 
# disallow all files in /wp- directories
Disallow: /wp-*/
 
# disallow archiving site
User-agent: ia_archiver
Disallow: /
 
# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /*.ico$
Allow: /images
Allow: /z/i/
 
# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Allow: /*

Pattern Matching with Google

Matching a sequence of characters using *

You can use an asterisk * to match a sequence of characters.

Block access to all subdirectories that begin with private:

User-Agent: Googlebot
Disallow: /private*/

Block access to all URLs that include a ?

User-agent: *
Disallow: /*?*

Matching the end characters of the URL using $

You can use the $ character to specify matching the end of the URL.

Block any URLs that end with .php

User-Agent: Googlebot
Disallow: /*.php$

You can use this pattern matching in combination with the Allow directive.

Exclude all URLs that contain a ? so that Googlebot doesn't crawl duplicate pages; URLs that end with a ? still DO get crawled:

User-agent: *
Allow: /*?$
Disallow: /*?

Disallow: /*? blocks any URL that begins with your host name, followed by any string, followed by a ?, followed by any string.

Allow: /*?$ allows any URL that begins with your host name, followed by any string, followed by a ?, with no characters after the ?.


User-Agent Discussion

Blocking a specific User-Agent

Note: Blocking Googlebot blocks all bots that begin with "Googlebot"

Block Googlebot entirely

User-agent: Googlebot
Disallow: /

Allowing a specific User-Agent

Note: Googlebot follows the line directed at it, rather than the line directed at everyone.

Block access to all bots other than "Googlebot"

User-agent: *
Disallow: /
 
User-agent: Googlebot
Disallow:

Googlebot recognizes an extension to the robots.txt standard called Allow, which is the opposite of Disallow.

Block all pages inside a subdirectory except for a single file

User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Block Googlebot but allow another bot (here, Googlebot-Mobile)

User-agent: Googlebot
Disallow: /
 
User-agent: Googlebot-Mobile
Allow: /

Removing Content From Google

It is better to use <meta name="Googlebot" content="noindex,follow"> on pages that have already been indexed if you wish Google to drop them; this is much faster than blocking them with robots.txt.
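As a minimal sketch, the tag goes inside the page's <head> element:

<head>
<!-- ask Googlebot to drop this page from the index but still follow its links -->
<meta name="Googlebot" content="noindex,follow">
</head>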

Note: removing snippets also removes cached pages.

A snippet is a text excerpt that appears below a page's title in our search results and describes the content of the page.

Prevent Google from displaying snippets for your page

<meta NAME="GOOGLEBOT" CONTENT="NOSNIPPET">

Remove an outdated "dead" link

Google updates its entire index automatically on a regular basis. When we crawl the web, we find new pages, discard dead links, and update links automatically. Links that are outdated now will most likely "fade out" of our index during our next crawl.

Note: Please ensure that you return a true 404 error even if you choose to display a more user-friendly body of the HTML page for your visitors. It won't help to return a page that says "File Not Found" if the http headers still return a status code of 200, or normal.
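On an Apache server, one way to show a friendly error page while still returning a true 404 status is an ErrorDocument directive pointing at a local path (the filename below is only an example); pointing it at a full external URL would instead trigger a redirect and lose the 404 status:

# .htaccess: serve a custom error page while keeping the 404 status code
ErrorDocument 404 /errors/not-found.html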

Remove cached pages

Google automatically takes a "snapshot" of each page it crawls and archives it. This "cached" version allows a webpage to be retrieved for your end users if the original page is ever unavailable (due to temporary failure of the page's web server). The cached page appears to users exactly as it looked when Google last crawled it, and we display a message at the top of the page to indicate that it's a cached version. Users can access the cached version by choosing the "Cached" link on the search results page.

Prevent all search engines from showing a "Cached" link for your site

<meta name="robots" content="noarchive" />

Allow other search engines to show a "Cached" link, preventing only Google

<meta name="googlebot" content="noarchive" />

Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.

Remove your entire website

If you wish to exclude your entire website from Google's index

Remove site from search engines and prevent all robots from crawling it in the future

User-agent: *
Disallow: /

Note: Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site.

To remove your site from Google only and prevent just Googlebot from crawling your site in the future

User-agent: Googlebot
Disallow: /

Allow Googlebot to index all http pages but no https pages

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols.

For your http protocol (http://yourserver.com/robots.txt)

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt)

User-agent: *
Disallow: /

Remove part of your website

Option 1: Robots.txt

Remove all pages under a particular directory (for example, lems)

User-agent: Googlebot
Disallow: /lems

Remove all files of a specific file type (for example, .gif)

User-agent: Googlebot
Disallow: /*.gif$

To remove dynamically generated pages, you'd use this robots.txt entry

User-agent: Googlebot
Disallow: /*?

Option 2: Meta tags

Another standard, which can be more convenient for page-by-page use, involves adding a META tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.

Prevent all robots from indexing a page on your site

<meta name="robots" content="noindex, nofollow" />

Allow other robots to index the page on your site, preventing only Google's robots from indexing the page

<meta name="googlebot" content="noindex, nofollow" />

Allow robots to index the page on your site but instruct them not to follow outgoing links

<meta name="robots" content="nofollow" />

Remove an image from Google's Image Search

To have Google exclude the dogs.jpg image that appears on your site at www.yoursite.com/images/dogs.jpg:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

Remove all the images on your site from our index

User-agent: Googlebot-Image
Disallow: /

Remove all files of a specific file type (for example, to include .jpg but not .gif images)

User-agent: Googlebot-Image
Disallow: /*.gif$

Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster or SEO agency must first create and place a robots.txt file on the site in question.

Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of the directories specified in your robots.txt file from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)

Remove a blog from Blog Search

Only blogs with site feeds will be included in Blog Search. If you'd like to prevent your feed from being crawled, make use of a robots.txt file or meta tags (NOINDEX or NOFOLLOW), as described above. Please note that if you have a feed that was previously included, the old posts will remain in the index even though new ones will not be added.

Remove a RSS or Atom feed

When users add your feed to their Google homepage or Google Reader, Google's Feedfetcher attempts to obtain the content of the feed in order to display it. Since Feedfetcher requests come from explicit action by human users, Feedfetcher has been designed to ignore robots.txt guidelines.

It's not possible for Google to restrict access to a publicly available feed. If your feed is provided by a blog hosting service, you should work with them to restrict access to your feed. Check those sites' help content for more information (e.g., Blogger, LiveJournal, or Typepad).

Remove transcoded pages

Google Web Search on mobile phones allows users to search all the content in the Google index for desktop web browsers. Because this content isn't written specifically for mobile phones and devices and thus might not display properly, Google automatically translates (or "transcodes") these pages by analyzing the original HTML code and converting it to a mobile-ready format. To ensure that the highest quality and most useable web page is displayed on your mobile phone or device, Google may resize, adjust, or convert images, text formatting and/or certain aspects of web page functionality.


To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do.

For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.

Tell Googlebot not to count certain external links in your ranking

Meta tags can exclude all outgoing links on a page, but you can also instruct Googlebot not to follow an individual link by adding rel="nofollow" to that hyperlink. When Google sees the attribute rel="nofollow" on a hyperlink, the link won't get any credit when websites are ranked in search results. For example, a link reading "This is a great link!" could be replaced with a nofollow link reading "I can't vouch for this link," as shown below.
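A quick illustration with a placeholder URL:

<!-- a normal link: passes ranking credit -->
<a href="http://example.com/">This is a great link!</a>
 
<!-- a nofollow link: passes no ranking credit -->
<a href="http://example.com/" rel="nofollow">I can't vouch for this link</a>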

Other Links

  1. Database of Web Robots, Overview

# robots.txt, www.nytimes.com 6/29/2006
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/
 
User-agent: Mediapartners-Google*
Disallow:
 
# robots.txt, http://dictionary.reference.com
User-agent: Googlebot
Disallow:
 
User-agent: Mediapartners-Google
Disallow:
 
User-agent: Teleport Pro
Disallow: /
 
User-agent: *
Disallow: /cgi-bin/
 
# robots.txt for www.phpbbhacks.com
User-agent: *
Disallow: /forums/viewtopic.php
Disallow: /forums/viewforum.php
Disallow: /forums/index.php?
Disallow: /forums/posting.php
Disallow: /forums/groupcp.php
Disallow: /forums/search.php
Disallow: /forums/login.php
Disallow: /forums/privmsg.php
Disallow: /forums/post
Disallow: /forums/profile.php
Disallow: /forums/memberlist.php
Disallow: /forums/faq.php
Disallow: /forums/archive
 
# robots.txt for Slashdot.org
#
# "Any empty [Disallow] value, indicates that all URLs can be retrieved.
# At least one Disallow field needs to be present in a record."
 
User-agent: Mediapartners-Google
Disallow:
 
User-agent: Googlebot
Crawl-delay: 100
Disallow: /firehose.pl
Disallow: /submit.pl
Disallow: /comments.pl
Disallow: /users.pl
Disallow: /zoo.pl
Disallow: firehose.pl
Disallow: submit.pl
Disallow: comments.pl
Disallow: users.pl
Disallow: zoo.pl
Disallow: /~
Disallow: ~
 
User-agent: Slurp
Crawl-delay: 100
Disallow:
 
User-agent: Yahoo-NewsCrawler
Disallow:
 
User-Agent: msnbot
Crawl-delay: 100
Disallow:
 
User-agent: *
Crawl-delay: 100
Disallow: /authors.pl
Disallow: /index.pl
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /firehose.pl
Disallow: /journal.pl
Disallow: /messages.pl
Disallow: /metamod.pl
Disallow: /users.pl
Disallow: /search.pl
Disallow: /submit.pl
Disallow: /pollBooth.pl
Disallow: /pubkey.pl
Disallow: /topics.pl
Disallow: /zoo.pl
Disallow: /palm
Disallow: authors.pl
Disallow: index.pl
Disallow: article.pl
Disallow: comments.pl
Disallow: firehose.pl
Disallow: journal.pl
Disallow: messages.pl
Disallow: metamod.pl
Disallow: users.pl
Disallow: search.pl
Disallow: submit.pl
Disallow: pollBooth.pl
Disallow: pubkey.pl
Disallow: topics.pl
Disallow: zoo.pl
Disallow: /~
Disallow: ~
 
# robots.txt for http://www.myspace.com
User-agent: ia_archiver
Disallow: /
 
# robots.txt for http://www.craigslist.com
User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj
 
User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-secure
Disallow: /forums
Disallow: /search
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj
 
User-Agent: OmniExplorer_Bot
Disallow: /
 
# robots.txt for http://www.alexa.com
User-agent: googlebot  # allow Google crawler
Disallow: /search
 
User-agent: gulliver  # allow Northern Light crawler
Disallow: /search
 
User-agent: slurp  # allow Inktomi crawler
Disallow: /search
 
User-agent: fast  # allow FAST crawler
Disallow: /search
 
User-agent: scooter  # allow AltaVista crawler
Disallow: /search
 
User-agent: vscooter  # allow AltaVista image crawler
Disallow: /search
 
User-agent: ia_archiver  # allow Internet Archive crawler
Disallow: /search
 
User-agent: *    # Disallow all other crawlers access
Disallow: /
 
# robots.txt for http://www.technorati.com
User-agent: NPBot
Disallow: /
 
User-agent: TurnitinBot
Disallow: /
 
User-Agent: sitecheck.internetseer.com
Disallow: /
 
User-Agent: *
Crawl-Delay: 3
Disallow: /search/
Disallow: /search.php
Disallow: /cosmos.php
 
# robots.txt for www.sitepoint.com
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /forums/report.php
Disallow: /forums/search.php
Disallow: /forums/newreply.php
Disallow: /forums/editpost.php
Disallow: /forums/memberlist.php
Disallow: /forums/profile.php
Disallow: /launch/
Disallow: /search/
Disallow: /voucher/424/
Disallow: /email/
Disallow: /feedback/
Disallow: /contact?reason=articlesuggest
Disallow: /linktothis/
Disallow: /popup/
Disallow: /forums/archive/
 
# robots.txt for http://www.w3.org/
 
# For use by search.w3.org
User-agent: W3C-gsa
Disallow: /Out-Of-Date
 
User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
Disallow: /
 
# W3C Link checker
User-agent: W3C-checklink
Disallow:
 
# exclude some access-controlled areas
User-agent: *
Disallow: /2004/ontaria/basic
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date
Disallow: /2002/02/mid
Disallow: /mid/
Disallow: /People/all/
Disallow: /RDF/Validator/ARPServlet
Disallow: /2003/03/Translations/byLanguage
Disallow: /2003/03/Translations/byTechnology
Disallow: /2005/11/Translations/Query
Disallow: /2003/glossary/subglossary/
#Disallow: /2005/06/blog/
#Disallow: /2001/07/pubrules-checker
#shouldnt get transparent proxies but will ml links of things like pubrules
Disallow: /2000/06/webdata/xslt
Disallow: /2000/09/webdata/xslt
Disallow: /2005/08/online_xslt/xslt
Disallow: /Bugs/
Disallow: /Search/Mail/Public/
Disallow: /2006/02/chartergen
 
# robots.txt for www.google-analytics.com
User-Agent: *
Disallow: /
Noindex: /
 
# robots.txt for video.google.com
User-agent: *
Disallow: /videosearch?
Disallow: /videofeed?
Disallow: /videopreview?
Disallow: /videopreviewbig?
Disallow: /videoprograminfo?
Disallow: /videorandom
Disallow: /videolineup
Disallow: /downloadgvp
 
# robots.txt for www.google.com
User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print
Disallow: /books
Disallow: /patents?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /translate?
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /reader/
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /notebook/search?
Disallow: /music
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/search?
Disallow: /base/reportbadoffer
Disallow: /base/s2
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Disallow: /finance
Disallow: /reviews/search?
 
# robots.txt for validator.w3.org
# $Id: robots.txt,v 1.3 2000/12/13 13:04:09 gerald Exp $
 
User-agent: *
Disallow: /check
 
# robots.txt for httpd.apache.org
User-agent: *
Disallow: /websrc
 
# robots.txt for www.apache.org
User-agent: *
Disallow: /websrc
Crawl-Delay: 4

#  Please, we do NOT allow nonauthorized robots.
#  http://www.webmasterworld.com/robots
#  Actual robots can always be found here for: http://www.webmasterworld.com/robots2
#  Old full robots.txt can be found here: http://www.webmasterworld.com/robots3
#  Any unauthorized bot running will result in IP's being banned.
#  Agent spoofing is considered a bot.
#  Fair warning to the clueless - honey pots are - and have been - running.
#  If you have been banned for bot running - please sticky an admin for a reinclusion request.
#  http://www.searchengineworld.com/robots/
#  This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode
 
User-agent: *
Crawl-delay: 17
 
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
# WebmasterWorld.com: robots.txt
# GNU Robots.txt Feel free to use with credit
# given to WebmasterWorld.
# Please, we do NOT allow nonauthorized robots any longer.
# http://www.searchengineworld.com/robots/
# Yes, feel free to copy and use the following.
 
User-agent: OmniExplorer_Bot
Disallow: /
 
User-agent: FreeFind
Disallow: /
 
User-agent: BecomeBot
Disallow: /
 
User-agent: Nutch
Disallow: /
 
User-agent: Jetbot/1.0
Disallow: /
 
User-agent: Jetbot
Disallow: /
 
User-agent: WebVac
Disallow: /
 
User-agent: Stanford
Disallow: /
 
User-agent: naver
Disallow: /
 
User-agent: dumbot
Disallow: /
 
User-agent: Hatena Antenna
Disallow: /
 
User-agent: grub-client
Disallow: /
 
User-agent: grub
Disallow: /
 
User-agent: looksmart
Disallow: /
 
User-agent: WebZip
Disallow: /
 
User-agent: larbin
Disallow: /
 
User-agent: b2w/0.1
Disallow: /
 
User-agent: Copernic
Disallow: /
 
User-agent: psbot
Disallow: /
 
User-agent: Python-urllib
Disallow: /
 
User-agent: Googlebot-Image
Disallow: /
 
User-agent: NetMechanic
Disallow: /
 
User-agent: URL_Spider_Pro
Disallow: /
 
User-agent: CherryPicker
Disallow: /
 
User-agent: EmailCollector
Disallow: /
 
User-agent: EmailSiphon
Disallow: /
 
User-agent: WebBandit
Disallow: /
 
User-agent: EmailWolf
Disallow: /
 
User-agent: ExtractorPro
Disallow: /
 
User-agent: CopyRightCheck
Disallow: /
 
User-agent: Crescent
Disallow: /
 
User-agent: SiteSnagger
Disallow: /
 
User-agent: ProWebWalker
Disallow: /
 
User-agent: CheeseBot
Disallow: /
 
User-agent: LNSpiderguy
Disallow: /
 
User-agent: Mozilla
Disallow: /
 
User-agent: mozilla
Disallow: /
 
User-agent: mozilla/3
Disallow: /
 
User-agent: mozilla/4
Disallow: /
 
User-agent: mozilla/5
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
Disallow: /
 
User-agent: ia_archiver
Disallow: /
 
User-agent: ia_archiver/1.6
Disallow: /
 
User-agent: Alexibot
Disallow: /
 
User-agent: Teleport
Disallow: /
 
User-agent: TeleportPro
Disallow: /
 
User-agent: Stanford Comp Sci
Disallow: /
 
User-agent: MIIxpc
Disallow: /
 
User-agent: Telesoft
Disallow: /
 
User-agent: Website Quester
Disallow: /
 
User-agent: moget/2.1
Disallow: /
 
User-agent: WebZip/4.0
Disallow: /
 
User-agent: WebStripper
Disallow: /
 
User-agent: WebSauger
Disallow: /
 
User-agent: WebCopier
Disallow: /
 
User-agent: NetAnts
Disallow: /
 
User-agent: Mister PiX
Disallow: /
 
User-agent: WebAuto
Disallow: /
 
User-agent: TheNomad
Disallow: /
 
User-agent: WWW-Collector-E
Disallow: /
 
User-agent: RMA
Disallow: /
 
User-agent: libWeb/clsHTTP
Disallow: /
 
User-agent: asterias
Disallow: /
 
User-agent: httplib
Disallow: /
 
User-agent: turingos
Disallow: /
 
User-agent: spanner
Disallow: /
 
User-agent: InfoNaviRobot
Disallow: /
 
User-agent: Harvest/1.5
Disallow: /
 
User-agent: Bullseye/1.0
Disallow: /
 
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /
 
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
 
User-agent: CherryPickerSE/1.0
Disallow: /
 
User-agent: CherryPickerElite/1.0
Disallow: /
 
User-agent: WebBandit/3.50
Disallow: /
 
User-agent: NICErsPRO
Disallow: /
 
User-agent: Microsoft URL Control - 5.01.4511
Disallow: /
 
User-agent: DittoSpyder
Disallow: /
 
User-agent: Foobot
Disallow: /
 
User-agent: WebmasterWorldForumBot
Disallow: /
 
User-agent: SpankBot
Disallow: /
 
User-agent: BotALot
Disallow: /
 
User-agent: lwp-trivial/1.34
Disallow: /
 
User-agent: lwp-trivial
Disallow: /
 
User-agent: http://www.WebmasterWorld.com bot
Disallow: /
 
User-agent: BunnySlippers
Disallow: /
 
User-agent: Microsoft URL Control - 6.00.8169
Disallow: /
 
User-agent: URLy Warning
Disallow: /
 
User-agent: Wget/1.6
Disallow: /
 
User-agent: Wget/1.5.3
Disallow: /
 
User-agent: Wget
Disallow: /
 
User-agent: LinkWalker
Disallow: /
 
User-agent: cosmos
Disallow: /
 
User-agent: moget
Disallow: /
 
User-agent: hloader
Disallow: /
 
User-agent: humanlinks
Disallow: /
 
User-agent: LinkextractorPro
Disallow: /
 
User-agent: Offline Explorer
Disallow: /
 
User-agent: Mata Hari
Disallow: /
 
User-agent: LexiBot
Disallow: /
 
User-agent: Web Image Collector
Disallow: /
 
User-agent: The Intraformant
Disallow: /
 
User-agent: True_Robot/1.0
Disallow: /
 
User-agent: True_Robot
Disallow: /
 
User-agent: BlowFish/1.0
Disallow: /
 
User-agent: http://www.SearchEngineWorld.com bot
Disallow: /
 
User-agent: http://www.WebmasterWorld.com bot
Disallow: /
 
User-agent: JennyBot
Disallow: /
 
User-agent: MIIxpc/4.2
Disallow: /
 
User-agent: BuiltBotTough
Disallow: /
 
User-agent: ProPowerBot/2.14
Disallow: /
 
User-agent: BackDoorBot/1.0
Disallow: /
 
User-agent: toCrawl/UrlDispatcher
Disallow: /
 
User-agent: WebEnhancer
Disallow: /
 
User-agent: suzuran
Disallow: /
 
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
 
User-agent: VCI
Disallow: /
 
User-agent: Szukacz/1.4
Disallow: /
 
User-agent: QueryN Metasearch
Disallow: /
 
User-agent: Openfind data gathere
Disallow: /
 
User-agent: Openfind
Disallow: /
 
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
 
User-agent: Xenu's
Disallow: /
 
User-agent: Zeus
Disallow: /
 
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
 
User-agent: RepoMonkey
Disallow: /
 
User-agent: Microsoft URL Control
Disallow: /
 
User-agent: Openbot
Disallow: /
 
User-agent: URL Control
Disallow: /
 
User-agent: Zeus Link Scout
Disallow: /
 
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
 
User-agent: Webster Pro
Disallow: /
 
User-agent: EroCrawler
Disallow: /
 
User-agent: LinkScan/8.1a Unix
Disallow: /
 
User-agent: Keyword Density/0.9
Disallow: /
 
User-agent: Kenjin Spider
Disallow: /
 
User-agent: Iron33/1.0.2
Disallow: /
 
User-agent: Bookmark search tool
Disallow: /
 
User-agent: GetRight/4.2
Disallow: /
 
User-agent: FairAd Client
Disallow: /
 
User-agent: Gaisbot
Disallow: /
 
User-agent: Aqua_Products
Disallow: /
 
User-agent: Radiation Retriever 1.1
Disallow: /
 
User-agent: WebmasterWorld Extractor
Disallow: /
 
User-agent: Flaming AttackBot
Disallow: /
 
User-agent: Oracle Ultra Search
Disallow: /
 
User-agent: MSIECrawler
Disallow: /
 
User-agent: PerMan
Disallow: /
 
User-agent: searchpreview
Disallow: /
 
User-agent: sootle
Disallow: /
 
User-agent: es
Disallow: /
 
User-agent: Enterprise_Search/1.0
Disallow: /
 
User-agent: Enterprise_Search
Disallow: /
 
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/

You don't have to block your feeds from indexing. Matt Cutts himself suggested not blocking them because there is no real reason to do so. For a blog and its /feed and similar URLs, leaving them crawlable won't make a mess of your rankings, so my suggestion would be to leave those feeds alone.

As for blocking, your main URL won't be affected if you do block /feed or whatever URLs you want; Googlebot will simply deindex them and stop crawling them.

Wget Robot Exclusion

It is extremely easy to make Wget wander aimlessly around a web site, sucking in all the available data in the process. `wget -r site', and you're set. Great? Not for the server admin.

As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the `--wait' option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by an, uh, bitchin' CGI Perl script that converts Info files to HTML on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the downloader.

To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion has been invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from the robots.

The most popular mechanism, and the de facto standard supported by all the major robots, is the "Robots Exclusion Standard" (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt' in the server root, which the robots are supposed to download and parse.

Although Wget is not a web robot in the strictest sense of the word, it can download large parts of a site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:

wget -r http://www.server.com/

the index of `www.server.com' is downloaded first. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if found, use it for further downloads. `robots.txt' is loaded only once per server.

Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft draft-koster-robots-00.txt titled "A Method for Web Robots Control". The draft, which as far as I know never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.

This manual no longer includes the text of the Robot Exclusion Standard.

The second, less-known mechanism enables the author of an individual document to specify whether they want links from the file to be followed by a robot. This is achieved using a META tag, like this:

<meta name="robots" content="nofollow">

This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to `off' in your `.wgetrc'. You can achieve the same effect from the command line using the -e switch, e.g. wget -e robots=off url
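As a quick sketch (the URL is a placeholder), either form works:

# one-off, from the command line
wget -e robots=off -r --wait=1 http://www.example.com/
 
# or permanently, in your ~/.wgetrc
robots = off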

October 20th, 2007

Comments Welcome

  • Hakob

    About meta tags, it's quite clear with nofollow and noindex. But is it possible to hide only part of the page and not the whole page?

    If anybody know about it please let me know too hhk_666(at)yahoo.co.uk

  • be4you-design

    Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols.

  • http://outings.thinusbotha.co.za Thinus

    Hi
    I've been working on my blog and in the process moved it to a WordPress install on my own server. Have been playing (and struggling) with many aspects of SEO and duplicate content. One of the main reasons I left Blogger for WordPress was that I would be able to do stuff with robots.txt and noindex,nofollow tags.

    I have used the examples above as a guide for my newest try. I'm pretty confident it will remove some duplicate content from my SERPs.

    I've also done the excerpt instead of full posts trick now for all archive-type pages.

    The only major hassle I still have to bridge is the 3-column layout that I use: the two side columns with the links to previous posts and comments load first and are positioned using CSS. This means most of my searches include text from there instead of the actual content, and it might look to robots as if a huge amount of text is duplicated at the top... Will have to get my content column loaded first...

    Anyway, thank you for your good advice on the robots.txt file and noindex,nofollow tags.
    Regards
    Thinus
    South Africa

  • http://www.bestwsiseo.com Malcolm Bradley

    Really great info and great advice - I'm off to settle down and absorb it all!
    Thanks again!

  • http://www.learn-internet-marketing-free.com Derrick Tan

    Wow... that was simply fantastic. Comprehensive article. Thanks!

  • http://oogletoogle.com Toogle

    I have learned so much about .htaccess from your site! Thanks for another great post - when's the book coming?

  • A. Michael Bussek

    How can I get rid of 404 pages in blogger.com?

  • HidupTreda

    Nice post, I was looking for this explanation to fix my robots.txt.
    I'm learning a lot from this robots.txt article.

    thanks buddy

  • Cahyo

    Nice robots.txt article :)
    Thanks

  • Travis Le

    Thanks, I will try.

  • uwiuw

    wow, this is not only a nice posting but a great one! Thanks I have learned many things. Hm at least some SEO mumbo jumbo can be cleared just by doing what you explain here :D

  • obot

    thank you for this useful tips!

  • FlemmingLeer

    Hi,
    The

    User-agent: Mediapartners-Google*

    is not recognised in Google webmaster tools.

    Only

    User-agent: Mediapartners-Google

    is recognised, but now I have them both in my robots.txt just in case :/

    Thanks for a great post :)

    Scatter Joy
    FlemmingLeer

  • http://linuxossolutions.com Rory Hotson

    Great post used most on my site thank you

  • Skillman_92

    This guide is extremely useful ;)

  • Blackwolf

    Old information but very comprehensive. Not many good examples for phpbb 3.0 on the net... if you have the time and inclination, would you mind checking my robots txt?

    Thanks again

  • http://blogdecomputacion.com/blog/ ikki

    good post, very good information

  • http://www.bdmstudios.com Berend de Meyer

    GREAT STUFF!!! A must-see website for newbies like me!!! Thanks a lot.

  • http://www.bdmstudios.com Berend de Meyer

    GREAT STUFF! A must-see website for newbies like me (WP v3.0.5 & JingJang v1.1 user since 02/09/2011)! ;-D Thanks for the work guys, CHEERS!

  • http://www.highervisibility.com Adam

    Great article! The robots.txt file is essential to let the search engines know what pages you want indexed and which ones you don't. It is important to use this to remove any instances of duplicate content, which Google can penalize you for having. Thanks again for the informative information.

  • http://dsullana.blogspot.com/ jhonny

    Very good, except that I have a blog and in some cases I get robot errors; I'll see what I do now. Thanks a million.

  • http://www.villagegreen.com Rich

    Is there any benefit to only having this in my header

    Someone suggested that I am blocking the site from the Google bot and my site will not show up in search engines?

    Thanks,

  • Pingback: Use robots.txt to Increase Search Engine Rankings by @LiewCF

  • Pingback: El archivo robots.txt - Analistas WEB y Consultores SEO

  • Pingback: SEO Robots.txt pour WordPress
