Blocking Bad Bots and Scrapers with .htaccess

This article shows two methods of blocking an entire list of bad robots and web scrapers with .htaccess files: using SetEnvIfNoCase, or using RewriteRules with mod_rewrite.


Blocking Bad Robots and Web Scrapers with RewriteRules

ErrorDocument 403 /403.html
 
RewriteEngine On
RewriteBase /
 
# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) [NC,OR]
 
# STARTS WITH WEB
RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) [NC,OR]
 
# ANYWHERE IN UA -- GREEDY REGEX
RewriteCond %{HTTP_USER_AGENT} ^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures).*$ [NC]
 
# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]
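
Once the rules are in place, you can verify them from the command line by faking a blocked User-Agent. For example (a quick check; www.example.com is a placeholder for your own domain, and the rules are assumed to live in the site root's .htaccess):

# Should return HTTP/1.1 403 Forbidden
curl -A "HTTrack" -I http://www.example.com/

# A normal browser User-Agent should still get HTTP/1.1 200 OK
curl -A "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0" -I http://www.example.com/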

Alternate RewriteCond Rules

RewriteEngine on
 
# Block spambots (the trailing backslashes let Apache join the continued lines)
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorBot|Black.Hole|\
BlackWidow|BlowFish|BotALot|BuiltBotTough|Bullseye|BunnySlippers|Cegbfeieh|CheeseBot|\
CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|Download.Demon|\
eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Express.WebPictures|ExtractorPro|\
EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|\
Harvest|hloader|HMView|httplib|HTTrack|humanlinks|Image.Stripper|Image.Sucker|Indy.Library|\
InfoNaviRobot|InterGET|Internet.Ninja|JennyBot|JetCar|JOC.Web.Spider|Kenjin.Spider|Keyword.Density|\
larbin|LeechFTP|LexiBot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|\
Mass.Downloader|Mata.Hari|Microsoft.URL|MIDown.tool|MIIxpc|Mister.PiX|moget|\
Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net.Vampire|\
NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline.Navigator|Openfind|\
PageGrabber|Papa.Foto|pavuk|pcBrowser|Program.Shareware.1|ProPowerBot/2.14|ProWebWalker|\
psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|SpankBot|spanner|\
SuperBot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport.Pro|Telesoft|The.Intraformant|\
TheNomad|TightTwatBot|Titan|toCrawl/UrlDispatcher|True_Robot|turingos|\
Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|\
WebFetch|WebGo.IS|Web.Image.Collector|WebLeacher|WebmasterWorldForumbot|\
WebReaper|WebSauger|Website.eXtractor|Website.Quester|Webster.Pro|WebStripper|\
Web.Sucker|WebWhacker|WebZip|Wget|Widow|WWW-Collector-E|WWWOFFLE|\
Xaldon.WebSpider|Xenu's|Zeus) [NC]
RewriteRule .? - [F]
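
Whichever variant you use, it is easy to lock yourself out while testing, since a pattern like "da" or "download" can match more than you expect. One defensive sketch (203.0.113.7 is a placeholder for your own IP address) is to exempt yourself before the User-Agent checks:

# Skip the ban for my own address (RewriteConds are ANDed unless [OR] is given)
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.7$
RewriteCond %{HTTP_USER_AGENT} (httrack|webzip|blackwidow) [NC]
RewriteRule .? - [F]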

Block Bad Bots with SetEnvIfNoCase

ErrorDocument 403 /403.html
 
# IF THE UA STARTS WITH THESE
SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT
Deny from env=HTTP_SAFE_BADBOT
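
A note for newer servers: "Deny from env=" is the Apache 2.2 syntax (kept alive in 2.4 only by mod_access_compat). On Apache 2.4 the equivalent, using the same environment variable, would be roughly:

<RequireAll>
    Require all granted
    Require not env HTTP_SAFE_BADBOT
</RequireAll>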

Original Bad Bot / Web Scraper List

  1. WebBandit
  2. 2icommerce
  3. Accoona
  4. ActiveTouristBot
  5. adressendeutschland
  6. aipbot
  7. Alexibot
  8. Alligator
  9. AllSubmitter
  10. almaden
  11. anarchie
  12. Anonymous
  13. Apexoo
  14. Aqua_Products
  15. asterias
  16. ASSORT
  17. ATHENS
  18. AtHome
  19. Atomz
  20. attache
  21. autoemailspider
  22. autohttp
  23. b2w
  24. bew
  25. BackDoorBot
  26. Badass
  27. Baiduspider
  28. Baiduspider+
  29. BecomeBot
  30. berts
  31. Bitacle
  32. Biz360
  33. Black.Hole
  34. BlackWidow
  35. bladder fusion
  36. Blog Checker
  37. BlogPeople
  38. Blogshares Spiders
  39. Bloodhound
  40. BlowFish
  41. Board Bot
  42. Bookmark search tool
  43. BotALot
  44. BotRightHere
  45. Bot mailto:craftbot@yahoo.com
  46. Bropwers
  47. Browsezilla
  48. BuiltBotTough
  49. Bullseye
  50. BunnySlippers
  51. Cegbfeieh
  52. CFNetwork
  53. CheeseBot
  54. CherryPicker
  55. Crescent
  56. charlotte/
  57. ChinaClaw
  58. Convera
  59. Copernic
  60. CopyRightCheck
  61. cosmos
  62. Crescent
  63. c-spider
  64. curl
  65. Custo
  66. Cyberz
  67. DataCha0s
  68. Daum
  69. Deweb
  70. Digger
  71. Digimarc
  72. digout4uagent
  73. DIIbot
  74. DISCo
  75. DittoSpyder
  76. DnloadMage
  77. Download
  78. dragonfly
  79. DreamPassport
  80. DSurf
  81. DTS Agent
  82. dumbot
  83. DynaWeb
  84. e-collector
  85. EasyDL
  86. EBrowse
  87. eCatch
  88. ecollector
  89. edgeio
  90. efp@gmx.net
  91. EirGrabber
  92. Email Extractor
  93. EmailCollector
  94. EmailSiphon
  95. EmailWolf
  96. EmeraldShield
  97. Enterprise_Search
  98. EroCrawler
  99. ESurf
  100. Eval
  101. Everest-Vulcan
  102. Exabot
  103. Express
  104. Extractor
  105. ExtractorPro
  106. EyeNetIE
  107. FairAd
  108. fastlwspider
  109. fetch
  110. FEZhead
  111. FileHound
  112. findlinks
  113. Flaming AttackBot
  114. FlashGet
  115. FlickBot
  116. Foobot
  117. Forex
  118. Franklin Locator
  119. FreshDownload
  120. FrontPage
  121. FSurf
  122. Gaisbot
  123. Gamespy_Arcade
  124. genieBot
  125. GetBot
  126. Getleft
  127. GetRight
  128. GetWeb!
  129. Go!Zilla
  130. Go-Ahead-Got-It
  131. GOFORITBOT
  132. GrabNet
  133. Grafula
  134. grub
  135. Harvest
  136. Hatena Antenna
  137. heritrix
  138. HLoader
  139. HMView
  140. holmes
  141. HooWWWer
  142. HouxouCrawler
  143. HTTPGet
  144. httplib
  145. HTTPRetriever
  146. HTTrack
  147. humanlinks
  148. IBM_Planetwide
  149. iCCrawler
  150. ichiro
  151. iGetter
  152. Image Stripper
  153. Image Sucker
  154. imagefetch
  155. imds_monitor
  156. IncyWincy
  157. Industry Program
  158. Indy
  159. InetURL
  160. InfoNaviRobot
  161. InstallShield DigitalWizard
  162. InterGET
  163. IRLbot
  164. Iron33
  165. ISSpider
  166. IUPUI Research Bot
  167. Jakarta
  168. java/
  169. JBH Agent
  170. JennyBot
  171. JetCar
  172. jeteye
  173. jeteyebot
  174. JoBo
  175. JOC Web Spider
  176. Kapere
  177. Kenjin
  178. Keyword Density
  179. KRetrieve
  180. ksoap
  181. KWebGet
  182. LapozzBot
  183. larbin
  184. leech
  185. LeechFTP
  186. LeechGet
  187. leipzig.de
  188. LexiBot
  189. libWeb
  190. libwww-FM
  191. libwww-perl
  192. LightningDownload
  193. LinkextractorPro
  194. Linkie
  195. LinkScan
  196. linktiger
  197. LinkWalker
  198. lmcrawler
  199. LNSpiderguy
  200. LocalcomBot
  201. looksmart
  202. LWP
  203. Mac Finder
  204. Mail Sweeper
  205. mark.blonin
  206. MaSagool
  207. Mass
  208. Mata Hari
  209. MCspider
  210. MetaProducts Download Express
  211. Microsoft Data Access
  212. Microsoft URL Control
  213. MIDown
  214. MIIxpc
  215. Mirror
  216. Missauga
  217. Missouri College Browse
  218. Mister
  219. Monster
  220. mkdb
  221. moget
  222. Moreoverbot
  223. mothra/netscan
  224. MovableType
  225. Mozi!
  226. Mozilla/22
  227. Mozilla/3.0 (compatible)
  228. Mozilla/5.0 (compatible; MSIE 5.0)
  229. MSIE_6.0
  230. MSIECrawler
  231. MSProxy
  232. MVAClient
  233. MyFamilyBot
  234. MyGetRight
  235. nameprotect
  236. NASA Search
  237. Naver
  238. Navroad
  239. NearSite
  240. NetAnts
  241. netattache
  242. NetCarta
  243. NetMechanic
  244. NetResearchServer
  245. NetSpider
  246. NetZIP
  247. Net Vampire
  248. NEWT ActiveX
  249. Nextopia
  250. NICErsPRO
  251. ninja
  252. NimbleCrawler
  253. noxtrumbot
  254. NPBot
  255. Octopus
  256. Offline
  257. OK Mozilla
  258. OmniExplorer
  259. OpaL
  260. Openbot
  261. Openfind
  262. OpenTextSiteCrawler
  263. Oracle Ultra Search
  264. OutfoxBot
  265. P3P
  266. PackRat
  267. PageGrabber
  268. PagmIEDownload
  269. panscient
  270. Papa Foto
  271. pavuk
  272. pcBrowser
  273. perl
  274. PerMan
  275. PersonaPilot
  276. PHP version
  277. PlantyNet_WebRobot
  278. playstarmusic
  279. Plucker
  280. Port Huron
  281. Program Shareware
  282. Progressive Download
  283. ProPowerBot
  284. prospector
  285. ProWebWalker
  286. Prozilla
  287. psbot
  288. psycheclone
  289. puf
  290. PushSite
  291. PussyCat
  292. PuxaRapido
  293. Python-urllib
  294. QuepasaCreep
  295. QueryN
  296. Radiation
  297. RealDownload
  298. RedCarpet
  299. RedKernel
  300. ReGet
  301. relevantnoise
  302. RepoMonkey
  303. RMA
  304. Rover
  305. Rsync
  306. RTG30
  307. Rufus
  308. SAPO
  309. SBIder
  310. scooter
  311. ScoutAbout
  312. script
  313. searchpreview
  314. searchterms
  315. Seekbot
  316. Serious
  317. Shai
  318. shelob
  319. Shim-Crawler
  320. SickleBot
  321. sitecheck
  322. SiteSnagger
  323. Slurpy Verifier
  324. SlySearch
  325. SmartDownload
  326. sna-
  327. snagger
  328. Snoopy
  329. sogou
  330. sootle
  331. So-net
  332. SpankBot
  333. spanner
  334. SpeedDownload
  335. Spegla
  336. Sphere
  337. Sphider
  338. SpiderBot
  339. sproose
  340. SQ Webscanner
  341. Sqworm
  342. Stamina
  343. Stanford
  344. studybot
  345. SuperBot
  346. SuperHTTP
  347. Surfbot
  348. SurfWalker
  349. suzuran
  350. Szukacz
  351. tAkeOut
  352. TALWinHttpClient
  353. tarspider
  354. Teleport
  355. Telesoft
  356. Templeton
  357. TestBED
  358. The Intraformant
  359. TheNomad
  360. TightTwatBot
  361. Titan
  362. toCrawl/UrlDispatcher
  363. True_Robot
  364. turingos
  365. TurnitinBot
  366. Twisted PageGetter
  367. UCmore
  368. UdmSearch
  369. UMBC
  370. UniversalFeedParser
  371. URL Control
  372. URLGetFile
  373. URLy Warning
  374. URL_Spider_Pro
  375. UtilMind
  376. vayala
  377. vobsub
  378. VCI
  379. VoidEYE
  380. VoilaBot
  381. voyager
  382. w3mir
  383. Web Image Collector
  384. Web Sucker
  385. Web2WAP
  386. WebaltBot
  387. WebAuto
  388. WebBandit
  389. WebCapture
  390. webcollage
  391. WebCopier
  392. WebCopy
  393. WebEMailExtrac
  394. WebEnhancer
  395. WebFetch
  396. WebFilter
  397. WebFountain
  398. WebGo
  399. WebLeacher
  400. WebMiner
  401. WebMirror
  402. WebReaper
  403. WebSauger
  404. WebSnake
  405. Website
  406. WebStripper
  407. WebVac
  408. webwalk
  409. WebWhacker
  410. WebZIP
  411. Wells Search
  412. WEP Search 00
  413. WeRelateBot
  414. Wget
  415. WhosTalking
  416. Widow
  417. Wildsoft Surfer
  418. WinHttpRequest
  419. WinHTTrack
  420. WUMPUS
  421. WWWOFFLE
  422. wwwster
  423. WWW-Collector
  424. Xaldon
  425. Xenu's
  426. Xenus
  427. XGET
  428. Y!TunnelPro
  429. YahooYSMcm
  430. YaDirectBot
  431. Yeti
  432. Zade
  433. ZBot
  434. zerxbot
  435. Zeus
  436. ZyBorg

April 8th, 2008

Comments Welcome

  • http://barefoot-webdesign.com Spencer

    Hey, this is awesome. Does it matter where in the .htaccess file you put it all?

  • Jernej

    SetEnvIfNoCase vs RewriteRules? Which one is faster?

  • http://www.askapache.com/ AskApache

    @ Spencer

    I would put it at the bottom.

    @ Jernej

    I'm not sure, it would probably be a difference of milliseconds if it was even measurable. I prefer the SetEnvIfNoCase myself, but then again, I don't use either of these methods. I use mod_security to block bad bots instead.

  • http://uck.in akshay

    Will it not slow down the entire site if I add so many entries in Apache? I have heard that this can cause problems.

  • http://www.darkfiberla.com Dan

    thanks

  • Tom Dawkings

    Stupid newbie question, but do you need to make an ErrorDocument page for 403.html?

  • Ramon Fincken

    This list is OK, yet you have one entry that doesn't fit:
    namely Wget, which is most often used for cron jobs. Wget will GET or POST your page, and is in most cases pretty harmless.

  • http://ranacse05.wordpress.com ranacse05

    Excellent :)

  • Bernie

    Before I go on, just wanted to let you know about an error in the following line that causes a fatal error:

    SetEnvIfNoCase ^User-Agent$ .*(web(zip|emaile|enhancer).* bad_web_bot

    Shouldn't this be

    SetEnvIfNoCase ^User-Agent$ .*(webzip|emaile|enhancer).* bad_web_bot

    Now, with so many options for dealing with security, at least I've got this one running for now. Thanks.

  • http://www.askapache.com/ AskApache

    @ akshay

    It will slow down the apache httpd server process, but it won't be noticeable unless you have a crazy-high traffic site.

    The Rewrite Engine (if using RewriteRules to block) looks at the incoming request, performs the rewrites, and then Apache serves the response. So by adding more RewriteRules for the RewriteEngine to process, you theoretically add more processing and time to each request being rewritten, but this "extra" time will almost definitely go unnoticed.
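
    If you want to put a number on that overhead, a quick before/after benchmark is enough. A sketch using ab, which ships with Apache (www.example.com is a placeholder):

    ab -n 500 -c 10 http://www.example.com/

    Run it once with the block list commented out and once with it enabled, then compare the mean time per request.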

    @ Tom Dawkings

    Yes, please read: 57 HTTP Status Codes and Apache ErrorDocuments

    @ Bernie

    Nice one, just updated the code..

    Actually that line is targeting any user-agent starting with web and followed by any of the items in parentheses. It's a shortcut for typing (webzip|webemaile|webenhancer), so the correct line is:

    SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer).* bad_web_bot
  • http://for-legacies-sake.ca Bernie

    In checking into my access logs, it seems that my usage of for example ...

    SetEnvIfNoCase User-Agent .*(libwww-perl|aesop_com_spiderman).* bad_web_bot

    ... might not be working.

    For example, I still get libwww-perl appearing in my access logs. As a test I tried adding a portion of text from my own user-agent string to see if I could block myself ... alas ... I was still able to access my website.

    Three questions:

    1. How can/should I test whether or not my 'bad_web_bot' has been set?
    2. Can you suggest a SetEnvIfNoCase User-Agent rule for empty user agents, and
    3. You seem to be the only place that uses the syntax 'User-Agent'. Shouldn't it be 'http_user_agent'? (My experience seems to indicate the matching is case-insensitive. Not so?)

    Thanks

  • http://www.askapache.com/ AskApache

    @ Bernie

    You are right! It didn't work for me either, but I figured it out. One thing that could be at fault: if you have the SetEnvIf code at the bottom of your .htaccess file, put it at the top. Another likely reason is if your server is using suexec, which limits environment variables to safe names.

    I updated the example .htaccess code above to show the correct code, including libwww-perl as well.


    1. How can/should I test whether or not my 'bad_web_bot' has been set?
    2. Can you suggest a SetEnvIfNoCase User-Agent rule for empty user agents, and
    3. You seem to be the only place that uses the syntax 'User-Agent'. Shouldn't it be 'http_user_agent'? (My experience seems to indicate the matching is case-insensitive. Not so?)

    1.

    Change bad_web_bot to HTTP_SAFE_BADBOT to keep it suexec-safe. Then you can test whether it's been set by using mod_rewrite, mod_headers, mod_setenvif, etc. (see the mod_headers sketch after answer 3).

    2.

    SetEnvIf ^User-Agent$ "^$" HTTP_SAFE_EMPTY_UA
    deny from env=HTTP_SAFE_EMPTY_UA

    3.

    SetEnvIf only has 6 variables it can access, and those are specific to mod_setenvif. What SetEnvIf is good at is parsing the HTTP REQUEST HEADERS such as the User-Agent request header. And SetEnvIf is case-insensitive when dealing with headers. HTTP_USER_AGENT is a variable used by mod_rewrite.
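
    As a concrete test for number 1, one option (a sketch; requires mod_headers) is to expose the variable as a response header and look for it with curl:

    Header always set X-Badbot "matched" env=HTTP_SAFE_BADBOT

    Then curl -A "httrack" -I http://www.example.com/ (with your own domain) should show "X-Badbot: matched" along with the 403.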

  • http://www.ramonfincken.com/ Ramon Fincken

    @Spencer: in your public file root (httpdocs, httproot, www, or public_html) ...

  • http://www.ramonfincken.com Ramon Fincken

    @Spencer: no, it doesn't (from what I've seen in my own .htaccess files)

    However, for readability you might put your normal mod_rewrite rules (SEO URLs) on top, below the engine and PHP flags, followed by all your blocking rules.

  • Michael

    Thanks for this; some bot used up masses of my traffic allowance this week, so I've put your .htaccess into my root.

    Now just sit back and pray it works

  • http://www.ji-fashion.com Sara

    Thank you so much for your info... :) Of course I'm using these, in fact all of them, despite the "alternate" label, since I spotted that the lists differ. So I wonder, why aren't they completely matching?

    Love it... and IT :)

    Sara, just a silly girl in Sweden building the ultimate OScommerce site :)

  • http://none Sergej

    For tonight, *you* are my hero!

  • Donna

    I have 2 questions:

    1. Why do we need a custom 404 page?
    2. Below is my .htaccess. To save space I've removed some code. Will this work? Do I need multiple "RewriteEngine On" statements? If not, do I keep the top one and then just begin your code with "RewriteBase"? Thank you!
    # redirect non-www to the www url always
    Options +FollowSymLinks
    RewriteEngine on
    ...
    # prevent hotlinking of images
    RewriteEngine on
    ....
    # custom not found file
    ErrorDocument 404 /notfound.shtml
     
    # beginning of the blocking of bad bots
    ErrorDocument 403 /403.html
    RewriteEngine On
    RewriteBase /
     # IF THE UA STARTS WITH THESE
    RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
    .............rest of rewrites...............
    # ISSUE 403 / SERVE ERRORDOCUMENT
    RewriteRule . - [F,L]
  • http://www.computereweb.com mark

    Very good, thanks for placing this online! There are scripts that set traps for bad bots, but somehow I can't get them to block anything... at least with this a few bad bots are trapped :)

  • memoi

    Thanks for the post.
    I am looking for a way to block/deny OneNote. Is there a certain rule to deny OneNote, or any particular program?

    many thanks

  • Erika

    Where do you paste this code to keep people from downloading the whole website?

  • http://www.wpblogtips.com/ Anuj@WordPress SEO

    Your post is good... I agree that people should make the most of the built-ins before jumping to the advanced modules. That's precisely what I'm doing; however, I'm having some trouble and was hoping you might help.

  • Phoenix

    Is it normal to still see bad bots in the logs with these RewriteRules?

    And another question: which is the better solution, a redirect or a 403 Forbidden?

    RewriteRule ^(.*)$ http://www.example.com/ [R,L]
    RewriteRule . - [F,L]

    Thanks in advance for your answer, and thanks for this great post! :)

  • http://www.techgazine.com Techgazine

    Would you know how to block the WP Robot autoblogging plugin? It's scraping my other websites. Thanks

  • http://phpsnips.com Ryan

    For nginx users:

    if ($http_user_agent ~*
    "^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack)"){
            set $rule_0 1;
            return 403;
            break;
    }
    if ($http_user_agent ~*
    "^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures).*$"){
            set $rule_0 1;
            return 403;
            break;
    }
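
    A leaner nginx pattern for the same list is a single map in the http context, since long chains of if blocks are evaluated one by one on every request. A sketch with only a few of the names (extend the regex with the rest of the list):

    # http context
    map $http_user_agent $bad_bot {
        default 0;
        "~*^(httrack|webzip|blackwidow|emailcollector)" 1;
    }

    server {
        # ... rest of the server block unchanged ...
        if ($bad_bot) { return 403; }
    }
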
  • rod

    Hey, great list and methodologies here; it must have taken a lot of work and study to produce this. Sorry, but I am a pure simpleton (though I know a little C++). Has anyone ever provided a list of good bots, so you could allow only those in? The list of bad bots must grow exponentially, but good bots only linearly? Regards, rod

  • http://www.ApkAndroidGame.com sunny Tewathia

    How can I specifically block a site, or any bot used by that site, from scraping my content? It is scraping my content and publishing it as its own.
    Please help.

    Thanks.

  • Pingback: Unerwünschtes crawlen der Website durch Bots bzw. Spider verhindern | DevTec [de:ftig]

  • Minidus

    Don't use these big lists in .htaccess; they dramatically slow down your site load time, by around half a second! I tested it.

    • http://www.facebook.com/jasonpaulweber Jason Weber

      Agreed; any of these clumpy things in your .htaccess will slow your site down.

    • http://seomov.com/ SEO Singapore

      I'm also worried about the slowdown issue. Is there an up-to-date list of the most common spam/bad bots, so we can use it selectively?

    • Bagus

      What is the better option for blocking bots & scrapers, then?

  • Pingback: How to block bad bots in .htaccess – Hosting Fixes
