Blocking Bad Bots and Scrapers with .htaccess
April 8th, 2008
This article shows two methods of blocking this entire list of bad robots and web scrapers with .htaccess files: RewriteRules with mod_rewrite, or SetEnvIfNoCase with mod_setenvif.
ErrorDocument 403 /403.html
RewriteEngine On
RewriteBase /
# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) [NC,OR]
# STARTS WITH WEB
RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) [NC,OR]
# ANYWHERE IN UA -- GREEDY REGEX
RewriteCond %{HTTP_USER_AGENT} ^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures).*$ [NC]
# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]
RewriteEngine on
#Block spambots
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|\
BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|\
CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|DownloadsDemon|\
eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|ExpresssWebPictures|ExtractorPro|\
EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|\
Harvest|hloader|HMView|httplib|HTTrack|humanlinks|ImagesStripper|ImagesSucker|IndysLibrary|\
InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density|\
larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|\
Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget|\
Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire|\
NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind|\
Pagerabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker|ProWebWalker|\
psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|\
Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant|\
TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|toCrawl/UrlDispatcher|True_Robot|turingos|\
Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|\
WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot|\
WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper|\
Web\sSucker|WebWhacker|WebZip|Wget|Widow|[Ww]eb[Bb]andit|WWW-Collector-E|WWWOFFLE|\
Xaldon\sWebSpider|Xenu's|Zeus) [NC]
RewriteRule .? - [F]
ErrorDocument 403 /403.html
# IF THE UA STARTS WITH THESE
SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT
Deny from env=HTTP_SAFE_BADBOT
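The Deny from env= line is Apache 2.2 access-control syntax. On Apache 2.4 it only works when mod_access_compat is loaded, so a rough equivalent using mod_authz_core might look like the sketch below. The two-name pattern is a hypothetical stand-in for the full list, not part of the original article.

```apache
# Flag the request when the User-Agent header matches (case-insensitive).
# "(httrack|webzip)" is a shortened, illustrative pattern only.
SetEnvIfNoCase User-Agent "(httrack|webzip)" HTTP_SAFE_BADBOT

# Apache 2.4 style: allow everyone except flagged requests.
<RequireAll>
    Require all granted
    Require not env HTTP_SAFE_BADBOT
</RequireAll>
```

In a .htaccess file, the Require directives need AllowOverride AuthConfig to be in effect for the directory.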
WebBandit2icommerceAccoonaActiveTouristBotadressendeutschlandaipbotAlexibotAlligatorAllSubmitteralmadenanarchieAnonymousApexooAqua_ProductsasteriasASSORTATHENSAtHomeAtomzattacheautoemailspiderautohttpb2wbewBackDoorBotBadassBaiduspiderBaiduspider+BecomeBotbertsBitacleBiz360Black.HoleBlackWidowbladder fusionBlog CheckerBlogPeopleBlogshares SpidersBloodhoundBlowFishBoard BotBookmark search toolBotALotBotRightHereBot mailto:craftbot@yahoo.comBropwersBrowsezillaBuiltBotToughBullseyeBunnySlippersCegbfeiehCFNetworkCheeseBotCherryPickerCrescentcharlotte/ChinaClawConveraCopernicCopyRightCheckcosmosCrescentc-spidercurlCustoCyberzDataCha0sDaumDewebDiggerDigimarcdigout4uagentDIIbotDISCoDittoSpyderDnloadMageDownloaddragonflyDreamPassportDSurfDTS AgentdumbotDynaWebe-collectorEasyDLEBrowseeCatchecollectoredgeioefp@gmx.netEirGrabberEmail ExtractorEmailCollectorEmailSiphonEmailWolfEmeraldShieldEnterprise_SearchEroCrawlerESurfEvalEverest-VulcanExabotExpressExtractorExtractorProEyeNetIEFairAdfastlwspiderfetchFEZheadFileHoundfindlinksFlaming AttackBotFlashGetFlickBotFoobotForexFranklin LocatorFreshDownloadFrontPageFSurfGaisbotGamespy_ArcadegenieBotGetBotGetleftGetRightGetWeb!Go!ZillaGo-Ahead-Got-ItGOFORITBOTGrabNetGrafulagrubHarvestHatena AntennaheritrixHLoaderHMViewholmesHooWWWerHouxouCrawlerHTTPGethttplibHTTPRetrieverHTTrackhumanlinksIBM_PlanetwideiCCrawlerichiroiGetterImage StripperImage Suckerimagefetchimds_monitorIncyWincyIndustry ProgramIndyInetURLInfoNaviRobotInstallShield DigitalWizardInterGETIRLbotIron33ISSpiderIUPUI Research BotJakartajava/JBH AgentJennyBotJetCarjeteyejeteyebotJoBoJOC Web SpiderKapereKenjinKeyword DensityKRetrieveksoapKWebGetLapozzBotlarbinleechLeechFTPLeechGetleipzig.deLexiBotlibWeblibwww-FMlibwww-perlLightningDownloadLinkextractorProLinkieLinkScanlinktigerLinkWalkerlmcrawlerLNSpiderguyLocalcomBotlooksmartLWPMac FinderMail Sweepermark.bloninMaSagoolMassMata HariMCspiderMetaProducts Download ExpressMicrosoft Data AccessMicrosoft URL 
ControlMIDownMIIxpcMirrorMissaugaMissouri College BrowseMisterMonstermkdbmogetMoreoverbotmothra/netscanMovableTypeMozi!Mozilla/22Mozilla/3.0 (compatible)Mozilla/5.0 (compatible; MSIE 5.0)MSIE_6.0MSIECrawlerMSProxyMVAClientMyFamilyBotMyGetRightnameprotectNASA SearchNaverNavroadNearSiteNetAntsnetattacheNetCartaNetMechanicNetResearchServerNetSpiderNetZIPNet VampireNEWT ActiveXNextopiaNICErsPROninjaNimbleCrawlernoxtrumbotNPBotOctopusOfflineOK MozillaOmniExplorerOpaLOpenbotOpenfindOpenTextSiteCrawlerOracle Ultra SearchOutfoxBotP3PPackRatPageGrabberPagmIEDownloadpanscientPapa FotopavukpcBrowserperlPerManPersonaPilotPHP versionPlantyNet_WebRobotplaystarmusicPluckerPort HuronProgram SharewareProgressive DownloadProPowerBotprospectorProWebWalkerProzillapsbotpsycheclonepufPushSitePussyCatPuxaRapidoPython-urllibQuepasaCreepQueryNRadiationRealDownloadRedCarpetRedKernelReGetrelevantnoiseRepoMonkeyRMARoverRsyncRTG30RufusSAPOSBIderscooterScoutAboutscriptsearchpreviewsearchtermsSeekbotSeriousShaishelobShim-CrawlerSickleBotsitecheckSiteSnaggerSlurpy VerifierSlySearchSmartDownloadsna-snaggerSnoopysogousootleSo-net” bat_botSpankBot” bat_botspanner” bat_botSpeedDownloadSpeglaSphereSphiderSpiderBotsprooseSQ WebscannerSqwormStaminaStanfordstudybotSuperBotSuperHTTPSurfbotSurfWalkersuzuranSzukacztAkeOutTALWinHttpClienttarspiderTeleportTelesoftTempletonTestBEDThe IntraformantTheNomadTightTwatBotTitantoCrawl/UrlDispatcherTrue_RobotturingosTurnitinBotTwisted PageGetterUCmoreUdmSearchUMBCUniversalFeedParserURL ControlURLGetFileURLy WarningURL_Spider_ProUtilMindvayalavobsubVCIVoidEYEVoilaBotvoyagerw3mirWeb Image CollectorWeb SuckerWeb2WAPWebaltBotWebAutoWebBanditWebCapturewebcollageWebCopierWebCopyWebEMailExtracWebEnhancerWebFetchWebFilterWebFountainWebGoWebLeacherWebMinerWebMirrorWebReaperWebSaugerWebSnakeWebsiteWebStripperWebVacwebwalkWebWhackerWebZIPWells SearchWEP Search 00WeRelateBotWgetWhosTalkingWidowWildsoft 
SurferWinHttpRequestWinHTTrackWUMPUSWWWOFFLEwwwsterWWW-CollectorXaldonXenu'sXenusXGETY!TunnelProYahooYSMcmYaDirectBotYetiZadeZBotzerxbotZeusZyBorg
Tags: htaccess, mod_rewrite, Security, SetEnvIf
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.
This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed according to these terms. "Apache" is a trademark of The ASF.
@Spencer: No, it doesn’t (from what I’ve seen in my own .htaccess files).
However, for readability you might put your normal mod_rewrite rules (SEO URLs) at the top, just below the RewriteEngine line and any PHP flags, followed by all your blocking rules.
@Spencer: In your public web root (httpdocs, httproot, www, or public_html) …
@ Bernie
You are right! It didn’t work for me either, but I figured it out. One possible cause is having the SetEnvIf code at the bottom of your .htaccess file; if so, put it at the top. Another likely cause is that your server uses suexec, which limits environment variables to safe names.
I updated the example .htaccess code above to show the correct code, including libwww-perl as well.
1. Change bad_web_bot to HTTP_SAFE_BADBOT to keep it suexec-safe. Then you can test whether it has been set by using mod_rewrite, mod_headers, mod_setenvif, etc.
2.
3. Apart from request headers, SetEnvIf only has 6 variables it can access, and those are specific to mod_setenvif. What SetEnvIf is good at is parsing HTTP request headers such as the User-Agent request header, and it is case-insensitive when dealing with header names. HTTP_USER_AGENT is a variable used by mod_rewrite, not by SetEnvIf.
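A minimal sketch of how the flag could be checked from other modules, as the reply above suggests. The X-Bad-Bot header name and the single-name pattern are assumptions made for illustration; they do not come from the article.

```apache
# mod_setenvif matches on the request header name (User-Agent),
# not on mod_rewrite's HTTP_USER_AGENT variable.
SetEnvIfNoCase User-Agent "libwww-perl" HTTP_SAFE_BADBOT

# Debugging aid (hypothetical header name): echo the flag back to the client.
# Leave the Deny line out while testing, because a plain "Header set"
# is not added to error responses such as 403.
<IfModule mod_headers.c>
    Header set X-Bad-Bot "1" env=HTTP_SAFE_BADBOT
</IfModule>

# Alternative check with mod_rewrite: a variable set by SetEnvIf without
# an explicit value is given the value 1.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{ENV:HTTP_SAFE_BADBOT} =1
    RewriteRule .? - [F]
</IfModule>
```

Requesting a page with a matching user agent and looking for the extra response header (or the 403 from the rewrite rule) shows whether the variable was actually set.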
In checking my access logs, it seems that my usage of, for example …
… might not be working.
For example, I still see libwww-perl appearing in my access logs. As a test, I tried adding a portion of my own user agent string to see if I could block myself … alas … I was still able to access my website.
Three questions:
'User-Agent'. Shouldn’t it be 'http_user_agent'? (My experience seems to indicate that case doesn’t matter. Not so?)
Thanks
@ akshay
It will slow down the Apache httpd server process, but it won’t be noticeable unless you have a crazy-high traffic site.
The Rewrite Engine (if using RewriteRules to block) looks at the incoming request, performs the rewrites, and then Apache serves the response. So by adding more RewriteRules for the RewriteEngine to process, you theoretically add more processing time to each request being rewritten, but this “extra” time will almost definitely go unnoticed.
@ Tom Dawkings
Yes, please read: 57 HTTP Status Codes and Apache ErrorDocuments
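As a rough sketch of how the error page ties into the blocking rules: without an ErrorDocument line Apache sends its built-in 403 page, so a custom 403.html is optional. If you do use one, it may help to exempt it from the block, since the blocked client is otherwise also forbidden from fetching the error page itself. The two-name pattern below is a hypothetical stand-in for the full list, and the file name /403.html is taken from the article's own ErrorDocument line.

```apache
# Serve /403.html whenever Apache answers with 403 Forbidden.
ErrorDocument 403 /403.html

RewriteEngine On
# Shortened, illustrative pattern only.
RewriteCond %{HTTP_USER_AGENT} ^(httrack|webzip) [NC]
# Do not forbid the error page itself.
RewriteCond %{REQUEST_URI} !^/403\.html$
RewriteRule .? - [F]
```

If you would rather not create a file at all, ErrorDocument also accepts a quoted text message, for example ErrorDocument 403 "Forbidden".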
@ Bernie
Nice one, just updated the code..
Actually that line is targeting any user-agent starting with web and followed by any of the items in parentheses. It’s a shortcut to typing (webzip|webemaile|webenhancer), so the correct line is:
Before I go on, just wanted to let you know about an error in the following line that causes a fatal error:
Shouldn’t this be
Now, with so many options for dealing with security, at least I’ve got this one running for now. Thanks.
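To illustrate the web(...) shortcut discussed in this exchange, here is a small sketch using only three of the names. The shortened list is for illustration and is not a replacement for the full rule in the article.

```apache
RewriteEngine On

# Short form: "web" immediately followed by one of the alternatives.
RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer) [NC]
RewriteRule .? - [F]

# Long form that matches the same user agents, shown commented out:
# RewriteCond %{HTTP_USER_AGENT} ^(webzip|webemaile|webenhancer) [NC]
# RewriteRule .? - [F]
```

Both forms are anchored with ^, so they only match user agents that start with the given names.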
Excellent :)
This list is OK, yet you have one entry that doesn’t fit …
namely Wget, which is most often used for cron jobs. Wget will GET or POST your page, and is in most cases pretty harmless.
Stupid newbie question, but do you need to make an ErrorDocument page for 403.html? Thanks
Will it not slow down the entire site if I add so many entries in Apache? I have heard that this can cause problems.
@ Spencer
I would put it at the bottom.
@ Jernej
I’m not sure; it would probably be a difference of milliseconds, if it was even measurable. I prefer SetEnvIfNoCase myself, but then again, I don’t use either of these methods. I use mod_security to block bad bots instead.
SetEnvIfNoCase vs RewriteRules? Which one is faster?
Hey, this is awesome. Does it matter where in the .htaccess file you put it all?