Updated robots.txt for WordPress


WordPress robots.txt
March 15th, 2008


Implementing an effective SEO robots.txt file for WordPress can help your blog rank higher in search engines, receive higher-paying relevant ads, and increase your blog traffic. Working on a robots.txt file also lets you see your site from a search engine robot's point of view… Sweet!


WordPress robots.txt SEO

AskApache.com robots.txt files

For instance, disallowing /category/ makes sense on this site because askapache.com/category/htaccess/ shows the same content as askapache.com/htaccess/, and that would be duplicate content. Adding a 301 redirect with mod_rewrite or RedirectMatch gives further protection against this duplicate content issue.
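The URL mapping that such a 301 redirect performs can be sketched in Python. This is only an illustration of the rewrite, not the Apache configuration itself; the pattern and the function name `redirect_target` are assumptions made for this example.

```python
import re

# Hypothetical pattern: match any path that starts with /category/
# and capture everything after that segment.
CATEGORY_RE = re.compile(r"^/category/(.+)$")


def redirect_target(path):
    """Return the path a 301 redirect would point to, or None if no redirect applies."""
    match = CATEGORY_RE.match(path)
    return "/" + match.group(1) if match else None


if __name__ == "__main__":
    for path in ["/category/htaccess/", "/htaccess/"]:
        print(path, "->", redirect_target(path))
```

With a mapping like this, the duplicate /category/ URL always resolves to the single canonical URL, so only one copy of the content remains reachable.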

www.AskApache.com/robots.txt

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
 
User-agent: Mediapartners-Google
Allow: /
 
User-agent: Adsbot-Google
Allow: /
 
User-agent: Googlebot-Image
Allow: /
 
User-agent: Googlebot-Mobile
Allow: /
 
User-agent: ia_archiver-web.archive.org
Disallow: /
 
Sitemap: http://www.askapache.com/sitemap.xml
 
#               __                          __
#   ____ ______/ /______ _____  ____ ______/ /_  ___
#  / __ `/ ___/ //_/ __ `/ __ \/ __ `/ ___/ __ \/ _ \
# / /_/ (__  ) ,< / /_/ / /_/ / /_/ / /__/ / / /  __/
# \__,_/____/_/|_|\__,_/ .___/\__,_/\___/_/ /_/\___/
#                   /_/
#
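To see what the `User-agent: *` group above blocks, here is a minimal sketch of plain prefix matching in Python. It assumes the simplest reading of the rules (a path is blocked when it starts with a Disallow value) and ignores the per-bot `Allow: /` groups; the sample paths are made up for illustration.

```python
# Disallow prefixes copied from the "User-agent: *" group above.
DISALLOWED_PREFIXES = [
    "/cgi-bin", "/wp-admin", "/wp-includes", "/wp-content",
    "/tag", "/author", "/wget/", "/httpd/",
    "/i/", "/f/", "/t/", "/c/", "/j/",
]


def is_blocked(path):
    """Return True if the URL path starts with any Disallow prefix."""
    return any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)


if __name__ == "__main__":
    # Hypothetical paths, chosen to show how prefix matching behaves.
    samples = ["/htaccess/", "/tag/seo/", "/tagline-ideas.html",
               "/wp-content/uploads/logo.png", "/i"]
    for path in samples:
        print(path, "blocked" if is_blocked(path) else "allowed")
```

Note that a prefix without a trailing slash, such as `/tag`, also matches unrelated paths like `/tagline-ideas.html`, while `/i/` does not match `/i`.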

z.AskApache.com/robots.txt

User-agent: *
Disallow:
Allow: /*
 
User-agent: ia_archiver
Disallow: /
 
User-agent: duggmirror
Disallow: /
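The file above relies on two ideas: an empty `Disallow:` blocks nothing, and a robot follows the group that names it before falling back to `*`. A rough Python sketch of that behavior follows, assuming simple substring matching on the user-agent token and prefix matching on paths; real crawlers may differ. The redundant `Allow: /*` line is left out, and the function names are made up for this example.

```python
# Disallow values per User-agent group, copied from the file above.
GROUPS = {
    "*": [""],            # "Disallow:" with no value
    "ia_archiver": ["/"],
    "duggmirror": ["/"],
}


def pick_group(user_agent):
    """Pick the group whose token appears in the user agent, else '*'."""
    ua = user_agent.lower()
    for token in GROUPS:
        if token != "*" and token in ua:
            return token
    return "*"


def is_blocked(user_agent, path):
    """Return True if the chosen group disallows the path."""
    rules = GROUPS[pick_group(user_agent)]
    # An empty Disallow value is skipped: it matches no URL at all.
    return any(rule and path.startswith(rule) for rule in rules)


if __name__ == "__main__":
    for agent in ["Googlebot", "ia_archiver", "duggmirror"]:
        print(agent, "blocked" if is_blocked(agent, "/index.html") else "allowed")
```

Under these assumptions, every robot except ia_archiver and duggmirror may crawl the whole subdomain.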

Google Recommendations

Use robots.txt – Webmaster Guidelines

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.

Eliminate Duplicate Content

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Store items shown or linked via multiple distinct URLs
  • Printer-only versions of web pages

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a “regular” and “printer” version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we’ll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.

Prevent page from being indexed

Pages you block with robots.txt may still be added to the Google index if other sites link to them. As a result, the URL of the page and, potentially, other publicly available information can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.

To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page does not appear in robots.txt. When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.
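The interaction between robots.txt and the noindex meta tag can be summarized as a small decision function. This is a sketch that assumes the behavior is exactly as quoted above; the function name, parameters, and outcome strings are invented for illustration.

```python
def google_index_outcome(blocked_by_robots_txt, has_noindex_meta, linked_from_elsewhere):
    """Summarize what the quoted guidelines say happens to a page."""
    if blocked_by_robots_txt:
        # The crawler never fetches the page, so it cannot see a noindex tag.
        if linked_from_elsewhere:
            return "URL may be listed, without page content"
        return "not crawled"
    if has_noindex_meta:
        # The page is crawled, the tag is read, and the URL is dropped.
        return "dropped from the index"
    return "eligible for indexing"


if __name__ == "__main__":
    print(google_index_outcome(True, True, True))
    print(google_index_outcome(False, True, True))
```

The first call shows the trap: a page that is both blocked in robots.txt and tagged noindex can still surface as a bare URL, because the tag is never read.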

Prevent content being indexed or remove content from Google’s index?

You can instruct us not to include content from your site in our index or to remove content from your site that is currently in our index in the following ways:

Google User-agents

  • Adsbot-Google: crawls pages to measure AdWords landing page quality
  • Googlebot: crawls pages for Google's web and news index
  • Googlebot-Image: crawls pages for the image index
  • Googlebot-Mobile: crawls pages for the mobile index
  • Mediapartners-Google: crawls pages to determine AdSense content

Good Robots.txt Articles

  1. How Google Crawls My Site
  2. Using the robots.txt analysis tool
  3. Controlling how search engines access and index your website
  4. Controlling Access with robots.txt
  5. Removing duplicate search engine content using robots.txt – Mark Wilson
  6. Revisiting robots.txt – Twenty Steps

Robots Meta Tags

Using the robots meta tag

Robots Meta Examples

Stop all robots from indexing a page on your site, but still follow the links on the page

<meta name="robots" content="noindex,follow" />

Allow other robots to index the page on your site, preventing only Google's bots from indexing the page

<meta name="googlebot" content="noindex,follow" />

Allow robots to index the page on your site but not to follow outgoing links

<meta name="robots" content="nofollow" />
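To check which directives a page actually sends, the robots meta tags can be pulled out with Python's standard `html.parser` module. This is a minimal sketch; the class and function names are made up for this example, and it only looks at the `robots` and `googlebot` meta names used above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Split "noindex,follow" into ["noindex", "follow"].
            self.directives[name] = [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def robots_directives(html):
    """Return a dict mapping meta name to its list of directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(robots_directives('<meta name="robots" content="noindex,follow" />'))
```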

header.php Trick for Conditional Robots Meta

Add this to your header.php

<?php
// is_archive() is also true on category pages, so the checks are chained
// with elseif to emit exactly one robots meta tag per page.
if ( is_single() || is_page() || is_category() || is_home() ) { ?>
  <meta name="robots" content="all,noodp" />
<?php } elseif ( is_archive() ) { ?>
  <meta name="robots" content="noarchive,noodp" />
<?php } elseif ( is_search() || is_404() ) { ?>
  <meta name="robots" content="noindex,noarchive" />
<?php } ?>

Robots.txt footnote
Alexa, Compete, and Quantcast are all guilty of firewalling unknown but friendly search engine agents at the front gate. Sites that monitor the Internet should know better than anyone that unfriendly agents disguise themselves as human visitors and will come in no matter what. So the general rule of thumb is that robots.txt directives only affect well-behaved agents anyway.



Reader Comments

  1. askapache.com commenter Peter says:

    Can I mention the robots.txt WordPress plugin? The default content is not the same as yours, but it’s certainly a handy way of creating and managing a robots.txt file for WordPress. Official page is at http://wordpress.org/extend/plugins/pc-robotstxt/

    Thanks,
    Peter.

  2. GearModa says:

    Great tip, actually clarified some questions I had about the robots txt prior. Thanks

  3. askapache.com commenter Prox says:

    Thank you for this post mate. I got my site indexed! :)

  4. askapache.com commenter pututik says:

    Thanks for the good points. My first robots.txt was SEO friendly, but the file was lost in an accident. This will help me optimize my WP.

  5. Bohdan says:

    Thank you for the article, the tips, and the recommendations. I am very grateful to you.

  6. Consultor says:

    Fantastic article. This article has been of great help to me.

    good luck in your project

  7. JIm Chenoweth says:

    Ouch, I wish I had seen webdiggr's post. Thanks for getting 99 percent of my site's content ‘restricted by robots.txt’

    Disallow: /*?*
    Disallow: /*?

    is the same as “get lost robots”

    Maybe you should do us all a favor and put a disallow in your robots.txt file so this kind of disinformation gets weeded out and sinks to the bottom of the sludge pile where it belongs.

  8. askapache.com commenter AskApache says:

    @ engfer

    Never heard of that, let me search google and see if mine turns up..

    Yep! Check the results of my search site:www.askapache.com Disallow: User-agent: *.

    If you want it removed, you can disallow it by adding the line below to your robots.txt. I am personally going to keep mine in the index, as people use my robots.txt as an example.

    Disallow: /robots.txt

  9. engfer says:

    Have you seen a problem with Google showing your robots.txt in its search results?

  10. askapache.com commenter RocyHua says:

    Thanks a lot, This robots.txt tutorial is Useful!

  11. askapache.com commenter Renegado says:

    I think comments do not need to be indexed. In any case, the comments are part of each post.

    The directive

    Disallow: /wp-

    goes too far in scope, because the /uploads folder lives inside that folder, so you need to be sure of what you are doing when you use it.

    Finally, we will be blocking access for Google Blog Search if we add the rule

    Disallow: /feed/$

    . WARNING…!

  12. Emil says:

    Well, I think that the above is well done; duplicated content can indeed hurt any website. Good, high-quality content is what Google is looking for, not the other way around. If you are unsure what askapache.com is trying to establish, simply take a closer look at their #1 PageRank, #2 SERP and you’ll get an immediate answer as to whether this is good or not.

    Thanks,
    Emil
    SEO Agent

  13. askapache.com commenter webdiggr says:

    Your robots.txt will block practically the whole site. Robots.txt does not take variables into consideration, so when you do a:

    Disallow: /*?*
    Disallow: /*?

    It's the same as:

    Disallow: /

    Which blocks everything. I suggest you visit http://www.robotstxt.org/faq/robotstxt.html; it explains that wildcards are not supported.

  14. askapache.com commenter AskApache says:

    @ Olivier

    Well, it's because otherwise a search for askapache on Google might list URLs like http://www.askapache.com/this-seo-post/great-url.html#comment-232497, which is what it means to have a URL indexed in a search engine.

    I have about 230 posts on this blog, all high-quality, and coincidentally I have about 240 URLs indexed by Google and major search engines. So it really makes my good pages the center of attention.

  15. Olivier says:

    May I please ask you why you put this line:
    Disallow: */comments

    I guess it is to prevent specific comment URLs from being indexed, but the format of the comment URLs isn’t like this, is it?

  16. Will says:

    I was wondering why the robots.txt file in this example is different to the one at askapache.com/robots.txt

    nice blog btw.

  17. askapache.com commenter AskApache says:

    @ mehmet

    I’m not sure what you are asking, but all the information you need is on this page.

  18. askapache.com commenter Mehmet says:

    Hi Matt,

    I am not handy with robots.txt files. Would I block whole pages from search engines if I used both your sample robots.txt file and the PHP code that you provided for WordPress?

  19. David says:

    Thanks for the post, it’s really useful!

    But could you please tell more about following strings:

    Disallow: (without any symbol after colon)
    Allow: /*
    Disallow: /

    What do they mean? Is there any difference between “Allow: /*” and “Disallow:“?

    And how I should disallow indexing of a particular directory: “Disallow: /wp-admin” or “Disallow: /wp-admin/“? (should I use slash at the end or not?)

  20. [...] See the Updated WordPress robots.txt file [...]

  21. You have to limit access to the folders whose crawling you want to restrict, but be careful not to slip into black hat territory: some people manipulate the CSS and block the search engine's access to it so that it does not notice how the H1…Hx headings have been altered…

    I suggest one like:
    http://www.vuelomania.com/robots.txt

  22. askapache.com commenter AskApache says:

    @ stacey

    Great!

    @ cosasdeviajes

    Why you are not disallowing /2007, /author and /page ???

    I fixed it so that now it disallows /2007 and /author and I’m allowing /page in my robots.txt so that bots can still follow the links on /page* but they will not be indexed. This makes sure that they don’t use up any link-juice on my site and also helps search engines find more interlinking between my main content, single pages.

    <meta name="robots" content="noindex" />

  23. stacey says:

    I fixed it. I put it in my root directory!

  24. cosasdeviajes says:

    Why you are not disallowing /2007, /author and /page ???

  25. cosasdeviajes says:

    Hello, my robots.txt is the following

    Sitemap: http://www.xxx.es/sitemap.xml
     
    User-agent: *
    Disallow: /wp-
    Disallow: /search
    Disallow: /feed
    Disallow: /comments/feed
    Disallow: /feed/$
    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$
    Disallow: /*/*/feed/$
    Disallow: /*/*/feed/rss/$
    Disallow: /*/*/trackback/$
    Disallow: /*/*/*/feed/$
    Disallow: /*/*/*/feed/rss/$
    Disallow: /*/*/*/trackback/$
    Disallow: /?s=
    Disallow: /dogs
    Disallow: /archives
    Disallow: /page
    Disallow: /author
    Disallow: /2007
    Disallow: /category
    Disallow: /2008
    Disallow: /2009
    Disallow: /?livehit=

    My issue is that my posts only rank while they are on the homepage. Is something wrong with my robots.txt? I'm using the same robots.txt on two other blogs and they rank really well.

  26. askapache.com commenter MrGroove says:

    Nice writeup. Thnx

  27. Steven Wong says:

    Thanks for the great sharing, I am studying it now and plan to implement to all my blog sites.

  28. peter says:

    great list, thanks!

  29. Sahi says:

    If I have a 10-page website, do I need to add robots index,follow on each page?

    Or do I need to add this only on the index or default page so that robots can follow all links from there?

    Also, what will happen if a robot lands on an inner page first? Does this line help redirect the robot to follow links from the index page?

  30. askapache.com commenter sam says:

    Is it good to have the tag cloud crawled, since it is a human way of categorizing content?

    I hear this is good, but should I then block my original WordPress categories?

  31. Cherife says:

    I usually use these codes:

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-*

    I think I need some change.
    Thanks 4 your post:)


Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.
This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed according to these terms. "Apache" is a trademark of The ASF.
