Updated robots.txt for WordPress
Implementing an effective SEO robots.txt file for WordPress can help your blog rank higher in search engines, receive higher-paying relevant ads, and increase your blog traffic. Using a robots.txt file gives you a search engine robot’s point of view… Sweet!
Looking for the absolute newest version? Just look at mine, I don’t slack.
For instance, I am disallowing /category/ in the robots.txt file below because askapache.com/category/htaccess/ is the same as askapache.com/htaccess/, and that would be duplicate content. Adding a 301 redirect using mod_rewrite or RedirectMatch can further protect me from this duplicate content issue.
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

Sitemap: http://www.askapache.com/sitemap.xml

# __ __
# ____ ______/ /______ _____ ____ ______/ /_ ___
# / __ `/ ___/ //_/ __ `/ __ \/ __ `/ ___/ __ \/ _ \
# / /_/ (__ ) ,< / /_/ / /_/ / /_/ / /__/ / / / __/
# \__,_/____/_/|_|\__,_/ .___/\__,_/\___/_/ /_/\___/
# /_/
#
User-agent: *
Disallow:
Allow: /*

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /
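The longer of the two files above leans on one behavior: a crawler obeys only the group that names it most specifically. That is why the `Allow: /` groups for Mediapartners-Google and the other Google agents escape the `User-agent: *` rules. Here is a minimal Python sketch of that selection, assuming the group-matching behavior Google documents for its own crawlers. `parse_groups`, `rules_for`, and the trimmed `SAMPLE` file are hypothetical names made up for illustration, not part of any library.

```python
def parse_groups(text):
    """Split robots.txt text into {lowercased user-agent: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    expecting_agents = True  # consecutive User-agent lines share one group
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not expecting_agents:
                current_agents = []  # a rule line came before, so a new group starts here
                expecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            expecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # other fields (for example Sitemap) are ignored in this sketch
    return groups


def rules_for(groups, crawler_name):
    """Return the rules of the most specific matching group, falling back to '*'."""
    name = crawler_name.lower()
    matching = [agent for agent in groups if agent != "*" and agent in name]
    if matching:
        return groups[max(matching, key=len)]
    return groups.get("*", [])


# Trimmed, hypothetical sample in the same shape as the file above.
SAMPLE = """\
User-agent: *
Disallow: /wp-admin
Disallow: /tag

User-agent: Mediapartners-Google
Allow: /
"""

groups = parse_groups(SAMPLE)
print(rules_for(groups, "Mediapartners-Google"))  # [('allow', '/')]
print(rules_for(groups, "SomeOtherBot/1.0"))      # [('disallow', '/wp-admin'), ('disallow', '/tag')]
```

Crawlers are free to implement this differently, so treat the sketch as a way to reason about the file, not as a guarantee of how any particular bot behaves.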
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Store items shown or linked via multiple distinct URLs
- Printer-only versions of web pages
However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.
Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a “regular” and “printer” version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we’ll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.
Pages you block in this way may still be added to the Google index if other sites link to them. As a result, the URL of the page and, potentially, other publicly available information can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.
To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page does not appear in robots.txt. When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.
You can instruct us not to include content from your site in our index or to remove content from your site that is currently in our index in the following ways:
- Remove your entire website or part of your website using a robots.txt file.
- Remove individual pages of your website using a robots meta tag.
- Remove cached copies of your pages using a robots meta tag.
- Remove snippets that appear below your page’s title in our search results and describe the content of your page.
- Remove outdated pages by returning the proper server response.
- Remove images from Google Image Search using a robots.txt file.
- Remove blog entries from Google Blog Search.
- Remove a feed from our user-agent Feedfetcher, which provides content to our feed readers.
- Remove transcoded versions of your pages (pages we’ve reformatted for mobile browsers).
<meta name="robots" content="noindex,follow" />
<meta name="googlebot" content="noindex,follow" />
<meta name="robots" content="nofollow" />
<?php if(is_single() || is_page() || is_category() || is_home()) { ?>
<meta name="robots" content="all,noodp" />
<?php } ?>
<?php if(is_archive()) { ?>
<meta name="robots" content="noarchive,noodp" />
<?php } ?>
<?php if(is_search() || is_404()) { ?>
<meta name="robots" content="noindex,noarchive" />
<?php } ?>
Robots.txt footnote
Alexa, Compete, and Quantcast are all guilty of firewalling unknown friendly search engine agents at the front gate. These sites that monitor the Internet should be the most in the know that unfriendly agents cloak as humans and will come in no matter what. So the general rule of thumb is that robots.txt directives are only for the good agents anyway.
Tags: robots, robots.txt, SEO, WordPress
How old is this information? Why don’t you show dates on the post or the comments?
hi ..
I have a question about this file: can we use it on a free blog on wordpress.com or blogspot.com?
Thanks for your information
Thanks Guys
@Ujjwol
If you really want archive.org to duplicate your content remove the following.
User-agent: ia_archiver-web.archive.org
Disallow: /

User-agent: ia_archiver
Disallow: /
But archive.org is a reasonably good way to prove that your site’s content existed at a certain date and was not parked with an all-ads page, or that you had the content first (in case of plagiarism).
Archive.org says it cannot crawl my website due to this robots.txt ?
How to fix this ?
Forgot to add the following from robotstxt.org. I’m going with the format of your robots.txt file for now. ( http://www.robotstxt.org/robotstxt.html )
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
I’m setting up a new blog and this time wanted to use a proper robots.txt file. So I started researching.
Over at wordpress.org, where this page is linked from ( just above the grey box : Search Engine Optimization for WordPress ) I see that wildcards are used in the robots.txt file sample.
Then I clicked the link and landed here, and this robots.txt file is not using wildcards at all.
This, and the examples over at robotstxt.org, are the first and only robots.txt files for WordPress I have seen that do not use the asterisk * sign as a wildcard.
This is I *think* a good thing, because I read over at : robotstxt.org the following :
Note the ‘*’ is a special token, meaning “any other User-agent”; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
Two common errors:
- Wildcards are _not_ supported: instead of ‘Disallow: /tmp/*’ just say ‘Disallow: /tmp/’.
- You shouldn’t put more than one path on a Disallow line (this may change in a future version of the spec)
But still I see countless other sites using wildcards in paths.
Was this changed so that wildcards are supported, and maybe the robotstxt.org site was just not updated?
One other thing, I see the lines ‘Disallow: /i/’ etc in your robots.txt file.
Do these have to do with the fact that typing the first letter of a page takes you to that page?
For example :
domain.com/c/ and domain.com/c goes to the contact page.
When I enter for instance domain.com/i/ in my browser’s address bar, I get a “No posts found” message with a “Nothing found for I” in the title.
Should I still be adding those lines?
Or should I just go for the whole alphabet ? :)
Apologies for the lengthy post.
thanks :) It seems I was really reading too much into robots.txt files; they are really just a simple old-school method to block search engine crawlers. I was all confused by WordPress robots.txt files, but they really are that simple.
Thanks and have a great day!
Jon @ IBM Core
Excellent post. I shall be using your robots.txt on my site and hopefully I’ll see good enough results to write a short post on this and link to your site. Thanks a lot.
Can I mention the robots.txt WordPress plugin? The default content is not the same as yours, but it’s certainly a handy way of creating and managing a robots.txt file for WordPress. Official page is at http://wordpress.org/extend/plugins/pc-robotstxt/
Thanks,
Peter.
Great tip, actually clarified some questions I had about the robots txt prior. Thanks
Thank you for this post mate. I got my site indexed! :)
Thanks for the good points. My first robots.txt was SEO friendly, but the file was lost in an accident. This will help me optimize my WP.
Thank you for the article, the advice, and the recommendations. I am very grateful to you.
Fantastic article. This article has been of great help to me.
good luck in your project
Ouch, I wish I had seen webdigger’s post. Thanks for getting 99 percent of my site’s content ‘restricted by robots.txt’.
Disallow: /*?*
Disallow: /*?
is the same as “get lost robots”
Maybe you should do us all a favor and put a disallow in your robots.txt file so this kind of disinformation gets weeded out and sifts to the bottom of the sludge pile where it belongs.
Have you seen a problem with Google showing your robots.txt in its search results?
Thanks a lot, This robots.txt tutorial is Useful!
I don’t think comments need to be indexed. In any case, comments are part of each post.
The directive
Disallow: /wp-
is too broad in scope, since the /uploads folder lives inside that folder, so you need to be sure of what you are doing when you use it.
Finally, we will be blocking Google Blog Search’s access if we add the directive
Disallow: /feed/$
WARNING…!
Well, I think that the above is well done; duplicated content can indeed hurt any website. Good, high-quality content is what Google is looking for, not the other way around. If you are unsure what askapache.com is trying to establish, simply take a closer look at their #1 PageRank, #2 SERP and you’ll get an immediate answer if this is good or not.
Thanks,
Emil
SEO Agent
Your robots.txt will block practically the whole site. Robots.txt does not take variables into consideration, so when you do a:
Disallow: /*?*
Disallow: /*?
It’s the same as:
Disallow: /
Which blocks everything. I suggest you visit http://www.robotstxt.org/faq/robotstxt.html it will explain how wildcards are not supported.
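Whether `Disallow: /*?` blocks everything depends on the parser. A crawler that follows only the original robotstxt.org rules treats the value as a literal prefix, so it matches almost nothing. Crawlers that support the `*` and `$` extensions (Google documents both) read it as "any URL containing a question mark", which on a blog using default `?p=123` permalinks could cover nearly every post. Neither reading is literally the same as `Disallow: /`. Below is a rough Python sketch of both interpretations; the helper names and sample URLs are hypothetical, and real crawlers may differ in the details.

```python
import re


def prefix_match(rule, path):
    """Original-style matching: the rule is a literal path prefix."""
    return rule != "" and path.startswith(rule)


def wildcard_match(rule, path):
    """Extended matching: '*' matches any run of characters, a trailing '$' anchors the end."""
    if rule == "":
        return False
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then join the pieces with '.*' where the rule had '*'.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


# Hypothetical sample URLs (path plus query string).
for path in ("/", "/htaccess/", "/?s=robots", "/htaccess/?replytocom=5"):
    print(
        path,
        "| literal '/*?':", prefix_match("/*?", path),
        "| wildcard '/*?':", wildcard_match("/*?", path),
        "| plain '/':", prefix_match("/", path),
    )
```

In this sketch only the two URLs with a query string are caught by the wildcard reading, none are caught by the literal reading, and all four are caught by `Disallow: /`.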
May I please ask you why you put this line :
Disallow: */comments
I guess it is to prevent specific comment URLs from being indexed, but the format of the comment URLs isn’t like this, is it?
I was wondering why the robots.txt file in this example is different to the one at askapache.com/robots.txt
nice blog btw.
Hi Matt,
I am not handy with robots.txt files, but would I block whole pages from search engines if I use both your sample robots.txt file and the PHP code that you provided for WordPress?
Thanks for the post, it’s really useful!
But could you please tell me more about the following lines:
Disallow: (without any symbol after colon)
Allow: /*
Disallow: /
What do they mean? Is there any difference between “Allow: /*” and “Disallow:“?
And how I should disallow indexing of a particular directory: “Disallow: /wp-admin” or “Disallow: /wp-admin/“? (should I use slash at the end or not?)
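On the trailing-slash question: under the plain prefix matching described at robotstxt.org, the value is compared against the start of the URL path, so the slash changes what gets caught. An empty `Disallow:` blocks nothing, which makes it equivalent in effect to allowing everything (`Allow` itself is an extension that not every crawler honors). A tiny Python sketch, assuming plain prefix matching and using made-up paths such as `/wp-admin-notes.html`:

```python
def is_blocked(disallow_value, path):
    """Plain prefix matching; an empty Disallow value blocks nothing."""
    return disallow_value != "" and path.startswith(disallow_value)


# Hypothetical paths chosen to show the difference the trailing slash makes.
paths = ["/wp-admin", "/wp-admin/", "/wp-admin/index.php", "/wp-admin-notes.html"]

for rule in ("/wp-admin", "/wp-admin/", ""):
    print(repr(rule), "blocks", [p for p in paths if is_blocked(rule, p)])
```

Without the slash, the rule also catches any other path that merely begins with the same characters. With the slash, only the directory's contents are caught.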
You should restrict access to the folders whose crawling you want to limit, but be careful not to stray into black-hat territory: some people manipulate their CSS and block the search engine’s access to it so that it doesn’t notice how the H1…Hx headings have been altered.
I suggest one like this:
http://www.vuelomania.com/robots.txt
I fixed it. I put it in my root directory!
Why are you not disallowing /2007, /author and /page?
Hello, my robots.txt is the following
Sitemap: http://www.xxx.es/sitemap.xml

User-agent: *
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
Disallow: /?s=
Disallow: /dogs
Disallow: /archives
Disallow: /page
Disallow: /author
Disallow: /2007
Disallow: /category
Disallow: /2008
Disallow: /2009
Disallow: /?livehit=
My issue is that my posts only rank while they are on the homepage. Is something wrong with my robots.txt? I’m using the same robots.txt on two other blogs and they rank really well.
Nice writeup. Thnx
Thanks for the great sharing, I am studying it now and plan to implement to all my blog sites.
great list, thanks!
If I have a 10-page website, do I need to add robots index,follow on each page?
Or do I need to add this only on the index or default page so that robots can follow all links from there?
Also, what will happen if a robot lands on an inner page first? Does this line help redirect the robot to follow links from the index page?
Is it good to have the tag cloud crawled, since it is a way of humanly categorizing content?
I hear this is good, but should I then block my original WordPress categories?
I usually use these codes:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-*
I think I need some change.
Thanks 4 your post:)
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.
This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed. "Apache" is a trademark of The ASF. HTTPD based on NCSA HTTPd
My previous posts are from October 1st/2nd 2009.
You shouldn’t worry about it; this information is very static.
It’s not like they are changing the WordPress directory names, nor what’s in them frequently.
If it would change frequently, the theme designers, plugin coders, blog owners, and search engines wouldn’t bother.
What’s most likely to change (frequently) are your own directory names and their content.
If you don’t update that part of the robots.txt file, then it will be old information.
Off Topic 1:
Tip : Be careful with robots.txt. It’s a simple text file, that anyone can show in their browser.
Don’t give away your whole site structure; use meaningless names for the directories that you wish to protect.
Off Topic 2:
Bloody Hell, I see that Mehmet of GabfireThemes posted in here. How could I have missed that before ?