<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Blocking Bad Bots and Scrapers with .htaccess</title>
	<atom:link href="http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html</link>
	<description>Advanced Web Development</description>
	<lastBuildDate>Wed, 18 Nov 2009 23:28:48 -0500</lastBuildDate>
	
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Michael</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-86142</link>
		<dc:creator>Michael</dc:creator>
		<pubDate>Sun, 05 Jul 2009 22:05:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-86142</guid>
		<description>Thanks for this. Some bot used up masses of my traffic allowance this week, so I&#039;ve put your .htaccess into my root.

Now I&#039;ll just sit back and pray it works.</description>
		<content:encoded><![CDATA[<p>Thanks for this. Some bot used up masses of my traffic allowance this week, so I&#8217;ve put your .htaccess into my root.</p>
<p>Now I&#8217;ll just sit back and pray it works.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ramon Fincken</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-61914</link>
		<dc:creator>Ramon Fincken</dc:creator>
		<pubDate>Wed, 04 Feb 2009 19:43:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-61914</guid>
		<description>@Spencer: no, it doesn&#039;t (as far as I&#039;ve seen in my own .htaccess files).

However, for readability you might put your normal mod_rewrite rules (SEO URLs) at the top, below the engine and PHP flags, followed by all your blocking rules.</description>
		<content:encoded><![CDATA[<p>@Spencer: no, it doesn&#8217;t (as far as I&#8217;ve seen in my own .htaccess files).</p>
<p>However, for readability you might put your normal mod_rewrite rules (SEO URLs) at the top, below the engine and PHP flags, followed by all your blocking rules.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ramon Fincken</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-45874</link>
		<dc:creator>Ramon Fincken</dc:creator>
		<pubDate>Fri, 03 Oct 2008 15:37:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-45874</guid>
		<description>@Spencer: in your public file root ( httpdocs or httproot or www or public_html ) ...</description>
		<content:encoded><![CDATA[<p>@Spencer: in your public file root ( httpdocs or httproot or www or public_html ) &#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AskApache</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-45815</link>
		<dc:creator>AskApache</dc:creator>
		<pubDate>Thu, 02 Oct 2008 20:57:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-45815</guid>
		<description>&lt;p&gt;&lt;strong&gt;@ Bernie&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are right!  It didn&#039;t work for me either, but I figured it out.  One possible cause is having the SetEnvIf code at the bottom of your .htaccess file; if so, move it to the top.  Another likely cause is that your server is using suexec, which limits environment variables to safe names.&lt;/p&gt;

&lt;p&gt;I updated the example .htaccess code above to show the correct code, including libwww-perl as well.&lt;/p&gt;

&lt;hr class=&quot;C&quot; /&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;How can/should I test whether-or-not my ‘bad_web_bot’ has been set?&lt;/li&gt;
&lt;li&gt;Can you suggest a SetEnvIfNoCase User-Agent for empty user agents, and&lt;/li&gt;
&lt;li&gt;You seem to be the only place that uses the syntax &#039;User-Agent&#039;. Shouldn’t it be &#039;http_user_agent&#039;? (My experience seems to indicate it is case-insensitive. Not so?)&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Change bad_web_bot to HTTP_SAFE_BADBOT to keep it suexec safe.  Then you can test whether it&#039;s been set by using mod_rewrite, mod_headers, mod_setenvif, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;
SetEnvIf ^User-Agent$ &quot;^$&quot; HTTP_SAFE_EMPTY_UA
deny from env=HTTP_SAFE_EMPTY_UA
&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;SetEnvIf only has 6 variables it can access, and those are specific to mod_setenvif.   What SetEnvIf is good at is parsing the HTTP REQUEST HEADERS such as the User-Agent request header.  And SetEnvIf is case-insensitive when dealing with headers.  HTTP_USER_AGENT is a variable used by mod_rewrite.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p><strong>@ Bernie</strong></p>
<p>You are right!  It didn&#8217;t work for me either, but I figured it out.  One possible cause is having the SetEnvIf code at the bottom of your .htaccess file; if so, move it to the top.  Another likely cause is that your server is using suexec, which limits environment variables to safe names.</p>
<p>I updated the example .htaccess code above to show the correct code, including libwww-perl as well.</p>
<hr class="C" />
<blockquote>
<ol>
<li>How can/should I test whether-or-not my ‘bad_web_bot’ has been set?</li>
<li>Can you suggest a SetEnvIfNoCase User-Agent for empty user agents, and</li>
<li>You seem to be the only place that uses the syntax &#8216;User-Agent&#8217;. Shouldn’t it be &#8216;http_user_agent&#8217;? (My experience seems to indicate it is case-insensitive. Not so?)</li>
</ol>
</blockquote>
<p><strong>1.</strong></p>
<p>Change bad_web_bot to HTTP_SAFE_BADBOT to keep it suexec safe.  Then you can test whether it&#8217;s been set by using mod_rewrite, mod_headers, mod_setenvif, etc.</p>
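<p>A minimal sketch of one way to check, assuming mod_headers is enabled on your server. The <code>X-Bad-Bot</code> header name and the <code>libwww-perl</code> pattern are only placeholders for this example. The response header should appear only when the variable has been set, so you can look for it with a tool such as <code>curl -I</code> while sending a matching user agent:</p>
<pre># Set the suexec-safe variable when the User-Agent matches (example pattern)
SetEnvIfNoCase User-Agent "libwww-perl" HTTP_SAFE_BADBOT
# Hypothetical debug header, added only when HTTP_SAFE_BADBOT is set
Header always set X-Bad-Bot "1" env=HTTP_SAFE_BADBOT</pre>
<p>It is probably easiest to try this before adding the deny line, and to remove the debug header once you are done testing.</p>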
<p><strong>2.</strong></p>
<pre>SetEnvIf ^User-Agent$ "^$" HTTP_SAFE_EMPTY_UA
deny from env=HTTP_SAFE_EMPTY_UA</pre>
<p><strong>3.</strong></p>
<p>SetEnvIf only has 6 variables it can access, and those are specific to mod_setenvif.   What SetEnvIf is good at is parsing the HTTP REQUEST HEADERS such as the User-Agent request header.  And SetEnvIf is case-insensitive when dealing with headers.  HTTP_USER_AGENT is a variable used by mod_rewrite.</p>
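<p>For comparison, a rough sketch of a similar block written for mod_rewrite, where the same request header is read through the <code>HTTP_USER_AGENT</code> variable. It assumes mod_rewrite is enabled, and the <code>libwww-perl</code> pattern is only an example:</p>
<pre>RewriteEngine On
# [NC] makes the match case-insensitive
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC]
# [F] answers the request with 403 Forbidden
RewriteRule .* - [F]</pre>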
]]></content:encoded>
	</item>
	<item>
		<title>By: Bernie</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-45418</link>
		<dc:creator>Bernie</dc:creator>
		<pubDate>Wed, 17 Sep 2008 20:02:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-45418</guid>
		<description>In checking into my access logs, it seems that my usage of for example ...
&lt;pre&gt;SetEnvIfNoCase User-Agent .*(libwww-perl&#124;aesop_com_spiderman).* bad_web_bot&lt;/pre&gt;
... might not be working.

For example, I still get &lt;strong&gt;libwww-perl&lt;/strong&gt; appearing in my access logs. As a test, I tried adding a portion of my own &lt;em&gt;user agent string&lt;/em&gt; to see if I could block myself ... alas ... I was still able to access my website.

Three questions:
&lt;ol&gt;
&lt;li&gt;How can/should I test whether-or-not my &#039;bad_web_bot&#039; has been set?&lt;/li&gt;
&lt;li&gt;Can you suggest a SetEnvIfNoCase User-Agent for empty user agents, and &lt;/li&gt;
&lt;li&gt;You seem to be the only place that uses the syntax &lt;code&gt;&#039;User-Agent&#039;&lt;/code&gt;. Shouldn&#039;t it be &lt;code&gt;&#039;http_user_agent&#039;&lt;/code&gt;? (My experience seems to indicate it is case-insensitive. Not so?)&lt;/li&gt;
&lt;/ol&gt;


Thanks</description>
		<content:encoded><![CDATA[<p>In checking into my access logs, it seems that my usage of for example &#8230;</p>
<pre>SetEnvIfNoCase User-Agent .*(libwww-perl|aesop_com_spiderman).* bad_web_bot</pre>
<p>&#8230; might not be working.</p>
<p>For example, I still get <strong>libwww-perl</strong> appearing in my access logs. As a test, I tried adding a portion of my own <em>user agent string</em> to see if I could block myself &#8230; alas &#8230; I was still able to access my website.</p>
<p>Three questions:</p>
<ol>
<li>How can/should I test whether-or-not my &#8216;bad_web_bot&#8217; has been set?</li>
<li>Can you suggest a SetEnvIfNoCase User-Agent for empty user agents, and </li>
<li>You seem to be the only place that uses the syntax <code>&#039;User-Agent&#039;</code>. Shouldn&#8217;t it be <code>&#039;http_user_agent&#039;</code>? (My experience seems to indicate it is case-insensitive. Not so?)</li>
</ol>
<p>Thanks</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AskApache</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-45068</link>
		<dc:creator>AskApache</dc:creator>
		<pubDate>Fri, 12 Sep 2008 00:38:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-45068</guid>
		<description>&lt;p&gt;&lt;strong&gt;@ akshay&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It will slow down the apache httpd server process, but it won&#039;t be noticeable unless you have a crazy-high traffic site.&lt;/p&gt;  

&lt;p&gt;The Rewrite Engine (if using RewriteRules to block) looks at the incoming request, performs the rewrites, and then Apache serves the response.  So by adding more RewriteRules for the RewriteEngine to process, you theoretically add more processing time to each request being rewritten, but this &quot;extra&quot; time will almost definitely go unnoticed.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;@ Tom Dawkings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, please read:  &lt;a href=&quot;http://www.askapache.com/htaccess/apache-status-code-headers-errordocument.html&quot; rel=&quot;nofollow&quot;&gt;57 HTTP Status Codes and &lt;strong&gt;Apache ErrorDocuments&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;@ Bernie&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nice one, I just updated the code.&lt;/p&gt;

&lt;p&gt;Actually, that line targets any user-agent containing &lt;code&gt;web&lt;/code&gt; immediately followed by one of the items in parentheses. It&#039;s a shortcut for typing &lt;code&gt;(webzip&#124;webemaile&#124;webenhancer)&lt;/code&gt;, so the correct line is:&lt;/p&gt;

&lt;pre&gt;
SetEnvIfNoCase ^User-Agent$ .*web(zip&#124;emaile&#124;enhancer).* bad_web_bot
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p><strong>@ akshay</strong></p>
<p>It will slow down the apache httpd server process, but it won&#8217;t be noticeable unless you have a crazy-high traffic site.</p>
<p>The Rewrite Engine (if using RewriteRules to block) looks at the incoming request, performs the rewrites, and then Apache serves the response.  So by adding more RewriteRules for the RewriteEngine to process, you theoretically add more processing time to each request being rewritten, but this &#8220;extra&#8221; time will almost definitely go unnoticed.</p>
<p><strong>@ Tom Dawkings</strong></p>
<p>Yes, please read:  <a href="http://www.askapache.com/htaccess/apache-status-code-headers-errordocument.html" rel="nofollow">57 HTTP Status Codes and <strong>Apache ErrorDocuments</strong></a></p>
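<p>A minimal sketch, assuming you have created a page at <code>/403.html</code> in your document root (the path is only an example):</p>
<pre># Serve a custom page for 403 Forbidden responses
ErrorDocument 403 /403.html</pre>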
<p><strong>@ Bernie</strong></p>
<p>Nice one, I just updated the code.</p>
<p>Actually, that line targets any user-agent containing <code>web</code> immediately followed by one of the items in parentheses. It&#8217;s a shortcut for typing <code>(webzip|webemaile|webenhancer)</code>, so the correct line is:</p>
<pre>SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer).* bad_web_bot</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bernie</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-45063</link>
		<dc:creator>Bernie</dc:creator>
		<pubDate>Thu, 11 Sep 2008 18:02:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-45063</guid>
		<description>Before I go on, just wanted to let you know about an error in the following line that causes a fatal error:
&lt;pre&gt;
SetEnvIfNoCase ^User-Agent$ .*(web(zip&#124;emaile&#124;enhancer).* bad_web_bot
&lt;/pre&gt;

Shouldn&#039;t this be

&lt;pre&gt;
SetEnvIfNoCase ^User-Agent$ .*(webzip&#124;emaile&#124;enhancer).* bad_web_bot
&lt;/pre&gt;

Now, with so many options for dealing with security, at least I&#039;ve got this one running for now. Thanks.</description>
		<content:encoded><![CDATA[<p>Before I go on, just wanted to let you know about an error in the following line that causes a fatal error:</p>
<pre>SetEnvIfNoCase ^User-Agent$ .*(web(zip|emaile|enhancer).* bad_web_bot</pre>
<p>Shouldn&#8217;t this be</p>
<pre>SetEnvIfNoCase ^User-Agent$ .*(webzip|emaile|enhancer).* bad_web_bot</pre>
<p>Now, with so many options for dealing with security, at least I&#8217;ve got this one running for now. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ranacse05</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-44984</link>
		<dc:creator>ranacse05</dc:creator>
		<pubDate>Sun, 07 Sep 2008 21:42:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-44984</guid>
		<description>Excellent :)</description>
		<content:encoded><![CDATA[<p>Excellent :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ramon Fincken</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-44550</link>
		<dc:creator>Ramon Fincken</dc:creator>
		<pubDate>Mon, 25 Aug 2008 09:22:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-44550</guid>
		<description>This list is OK, but one entry doesn&#039;t fit ... 
namely &lt;code&gt;Wget&lt;/code&gt;, which is most often used for &lt;strong&gt;cron jobs&lt;/strong&gt;. Wget will GET or POST your page and is, in most cases, pretty harmless.</description>
		<content:encoded><![CDATA[<p>This list is OK, but one entry doesn&#8217;t fit &#8230;<br />
namely <code>Wget</code>, which is most often used for <strong>cron jobs</strong>. Wget will GET or POST your page and is, in most cases, pretty harmless.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Dawkings</title>
		<link>http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html#comment-43388</link>
		<dc:creator>Tom Dawkings</dc:creator>
		<pubDate>Mon, 21 Jul 2008 13:56:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.askapache.com/?p=549#comment-43388</guid>
		<description>Stupid newbie question, but do you need to make an ErrorDocument page for &lt;code&gt;403.html&lt;/code&gt;?</description>
		<content:encoded><![CDATA[<p>Stupid newbie question, but do you need to make an ErrorDocument page for <code>403.html</code>?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
