Updated robots.txt for WordPress

Implementing an effective SEO robots.txt file for WordPress will help your blog rank higher in search engines, receive higher-paying relevant ads, and increase your blog traffic. A robots.txt file also lets you look at your site from a search engine robot's point of view... Sweet! Looking for the most up-to-date robots.txt? Just look at mine, I don't slack.

Warning about robots.txt files

Keep your robots.txt file short. As a rule of thumb it should never have more than 200 Disallow lines; start with as few as possible and add to it only when needed.

Once Google drops URLs because of your robots.txt file, getting them back takes time: after you remove the Disallow, it could take up to 3 months before Google re-indexes the previously disallowed links.

Google pays serious attention to robots.txt files and treats them as an authoritative list of URLs it must not crawl. If you Disallow a URL in robots.txt, Google stops fetching it, so none of its content can show up when searching Google. At most the bare URL may still be listed if other sites link to it.

The big idea to take away is to use robots.txt only for hard disallows, for content you know you never want crawled. Search engines will not read the disallowed pages, so the links and content on those pages are not used for indexing or for ranking.

So, use the robots.txt file only for URLs that you want search engines to stay away from completely. Use the robots meta tag for everything finer-grained (index or noindex, follow or nofollow, page by page), and use the rel='nofollow' attribute of the a (link) element when the restriction is temporary or when you still want the page indexed but a particular link not followed.

WordPress robots.txt SEO

Here is the robots.txt file used with WordPress on this blog. For instance, I disallow /comment-page- links altogether in the robots.txt file below because I don't use separate comment pages, so I instruct Google to stay away from these links. See also: adding a 301 redirect using mod_rewrite or RedirectMatch can further protect against this duplicate content issue.

User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /e/
Disallow: /show-error-*
Disallow: /xmlrpc.php
Disallow: /trackback/
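# Rules match from the start of the URL path, so the wildcard also catches /post-name/comment-page-2/
Disallow: /*comment-page-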
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /


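# getting sick with the sitemaps
# (Sitemap URLs must be absolute; replace example.com with your own domain)
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/page-sitemap.xml
Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-posttype-page.xml
Sitemap: https://example.com/sitemap-posttype-post.xml
Sitemap: https://example.com/sitemap-home.xml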



#               __                          __
#   ____ ______/ /______ _____  ____ ______/ /_  ___
#  / __ `/ ___/ //_/ __ `/ __ \/ __ `/ ___/ __ \/ _ \
# / /_/ (__  ) ,< / /_/ / /_/ / /_/ / /__/ / / /  __/
# \__,_/____/_/|_|\__,_/ .___/\__,_/\___/_/ /_/\___/
#                     /_/
#
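
If you want to check how a group of Allow and Disallow rules plays out before you publish it, the precedence Google documents for robots.txt can be sketched in a few lines of Python: the longest matching pattern wins, and Allow wins a tie. This is only a rough sketch under that assumption, not Google's actual code. The helper names pattern_to_regex and is_allowed and the sample paths are made up for illustration, and crawlers that apply rules in file order can behave differently.

```python
import re

def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern: '*' matches any run of
    characters and a trailing '$' anchors the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Return True if path may be crawled under one user-agent group.
    rules is a list of (directive, pattern) tuples."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty value matches nothing
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# A few rules in the style of the "User-agent: *" group above.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/wp-admin"),
    ("Disallow", "/wp-content"),
    ("Allow", "/wp-content/uploads/"),
]

print(is_allowed(RULES, "/wp-admin/options.php"))         # False
print(is_allowed(RULES, "/wp-content/plugins/x.js"))      # False
print(is_allowed(RULES, "/wp-content/uploads/logo.png"))  # True
print(is_allowed(RULES, "/2008/03/hello-world/"))         # True

# A pattern without a wildcard only matches from the start of the path.
print(is_allowed([("Disallow", "/comment-page-")], "/my-post/comment-page-2/"))   # True
print(is_allowed([("Disallow", "/*comment-page-")], "/my-post/comment-page-2/"))  # False
```

Note the last two lines: a pattern without a wildcard only matches paths that begin with it, so per-post comment pages need the wildcard form to be caught.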

Generic Default robots.txt

For many super-geeky reasons, every single website you control must have a robots.txt file in its root directory, e.g. example.com/robots.txt. I also recommend having a favicon.ico file, at bare minimum. This ensures your site looks at least somewhat SEO-aware and alerts Google that there are rules for crawling the site. It will also save your server resources.

User-agent: *
Disallow:
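
A quick way to convince yourself that an empty Disallow really means "crawl everything" is Python's standard-library urllib.robotparser. The sketch below parses the two-line file above and, for contrast, a file that blocks the whole site. The parse_robots helper and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def parse_robots(text):
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# The generic default: an empty Disallow blocks nothing.
allow_all = parse_robots("User-agent: *\nDisallow:\n")
# For contrast: a single slash blocks the entire site.
block_all = parse_robots("User-agent: *\nDisallow: /\n")

url = "https://example.com/any/page.html"
print(allow_all.can_fetch("Googlebot", url))  # True
print(block_all.can_fetch("Googlebot", url))  # False
```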

Google Recommendations

Use robots.txt - Webmaster Guidelines

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.

Troubleshooting tips part IIb: Ad relevance and targeting continued

To follow up on our previous post about ad relevance and targeting, let's look at some other reasons why you may experience ad targeting issues on your site.

Have you blocked the AdSense crawler's access to your pages?

The AdSense crawler is an automated program that scans your web pages and tracks content for indexing. Sometimes we don't crawl pages because the AdSense crawler doesn't have access to your pages, in which case we're unable to determine their content and show relevant ads. Here are a few specific instances when our crawler can't access a site:

  • If you use a robots.txt file which regulates the crawler access to your page. In this case, you can grant the AdSense crawler access by adding these lines to the top of your robots.txt file:

User-agent: Mediapartners-Google
Disallow:
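
The effect of giving the AdSense crawler its own group can be checked with Python's standard-library urllib.robotparser. In this sketch the /members/ path, the example.com URL and the SomeOtherBot name are hypothetical, and the crawler token is written without a trailing asterisk because the standard-library parser compares agent names literally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /members/,
# but the AdSense crawler gets its own group with no restrictions.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/members/profile.html"
print(parser.can_fetch("Mediapartners-Google", url))  # True
print(parser.can_fetch("SomeOtherBot", url))          # False
```

A crawler uses the group that names it and ignores the catch-all group, which is why the AdSense crawler gets through here.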

Eliminate Duplicate Content

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Store items shown or linked via multiple distinct URLs
  • Printer-only versions of web pages

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a "regular" and "printer" version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we'll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.

Prevent page from being indexed

Pages you block with robots.txt may still be added to the Google index if other sites link to them. As a result, the URL of the page and, potentially, other publicly available information can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.

To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page does not appear in robots.txt. When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.
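
The catch described above, that a noindex tag only works when the crawler is allowed to fetch the page, can be expressed as a small check. This is a hedged sketch using Python's html.parser and urllib.robotparser; the class and function names, the sample robots.txt bodies and the URL are all assumptions, and it only looks at meta tags named robots.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def noindex_can_take_effect(robots_txt, url, html, agent="Googlebot"):
    """True only if the crawler may fetch the page AND the page
    carries a noindex directive for it to read."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    meta = RobotsMetaParser()
    meta.feed(html)
    return robots.can_fetch(agent, url) and "noindex" in meta.directives

PAGE = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
URL = "https://example.com/private/report.html"

OPEN_ROBOTS = "User-agent: *\nDisallow:\n"
BLOCKING_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(noindex_can_take_effect(OPEN_ROBOTS, URL, PAGE))      # True
print(noindex_can_take_effect(BLOCKING_ROBOTS, URL, PAGE))  # False
```

In the second case the crawler never fetches the page, so it never sees the noindex tag, and the bare URL can still end up listed.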

Prevent content being indexed or remove content from Google's index?

You can instruct us not to include content from your site in our index or to remove content from your site that is currently in our index in the following ways:

Google User-agents

  • Adsbot-Google: crawls pages to measure AdWords landing page quality
  • Googlebot: crawls pages for Google's web and news index
  • Googlebot-Image: crawls pages for the image index
  • Googlebot-Mobile: crawls pages for the mobile index
  • Mediapartners-Google: crawls pages to determine AdSense content

Robots Meta Tags and Examples

The robots meta tag is very helpful and should be preferred over modifications to robots.txt.

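Stop all robots from indexing a page on your site, but still follow the links on the page:

<meta name="robots" content="noindex,follow" />

Allow other robots to index the page on your site, preventing only Google's bots from indexing the page:

<meta name="googlebot" content="noindex" />

Allow robots to index the page on your site but not to follow outgoing links:

<meta name="robots" content="nofollow" />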
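
The three cases above differ only in which crawler is addressed and which of the index and follow switches is turned off, so the tags can be generated from those choices. Here is a minimal Python sketch; the robots_meta_tag helper is hypothetical, and it always spells out both directives even where a crawler would assume the default.

```python
def robots_meta_tag(index=True, follow=True, agent="robots"):
    """Build a robots meta tag. agent is "robots" for all crawlers,
    or a specific crawler name such as "googlebot"."""
    content = ",".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return f'<meta name="{agent}" content="{content}">'

# Keep every robot from indexing the page, but let them follow its links.
print(robots_meta_tag(index=False))
# <meta name="robots" content="noindex,follow">

# Keep only Google's crawler from indexing the page.
print(robots_meta_tag(index=False, agent="googlebot"))
# <meta name="googlebot" content="noindex,follow">

# Let robots index the page but not follow its outgoing links.
print(robots_meta_tag(follow=False))
# <meta name="robots" content="index,nofollow">
```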

header.php Trick for Conditional Robots Meta

Note: I now recommend using the Yoast WordPress SEO Plugin to do this, but here's a quick and easy way to think about it. Add this to your header.php:

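<?php /* The content values below are illustrative assumptions; pick the directives that suit your site. */ ?>
<?php /* elseif is used because category pages also count as archives in WordPress. */ ?>
<?php if(is_single() || is_page() || is_category() || is_home()) { ?>
	<meta name="robots" content="index,follow" />
<?php } elseif(is_archive()) { ?>
	<meta name="robots" content="noindex,follow" />
<?php } elseif(is_search() || is_404()) { ?>
	<meta name="robots" content="noindex,follow" />
<?php } ?>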

Robots.txt footnote

Alexa, Compete, and Quantcast are all guilty of firewalling unknown friendly search engine agents at the front gate. Sites that monitor the Internet should know better than anyone that unfriendly agents cloak themselves as humans and will come in no matter what. So the general rule of thumb is that robots.txt directives only affect the good agents anyway.

Good Robots.txt Articles

  1. How Google Crawls My Site
  2. Controlling how search engines access and index your website
  3. Controlling Access with robots.txt
  4. Removing duplicate search engine content using robots.txt - Mark Wilson
  5. Revisiting robots.txt - Twenty Steps

Robots.txt References

  1. Robots.txt optimization
  2. The Web Robots Pages
  3. W3.org - Notes on helping search engines index your Web site
  4. Wikipedia robots.txt page
  5. Inside Google Sitemaps: Using a robots.txt file

March 15th, 2008

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.