Blocking Bad Bots and Scrapers with .htaccess
« Crazy Cache WordPress Plugin Released.Htaccess rewrites, Mod_Rewrite Tricks and Tips »
This article shows 2 methods of blocking this entire list of bad robots and web scrapers with .htaccess files using SetEnvIfNoCase or using RewriteRules with mod_rewrite
ErrorDocument 403 /403.html
RewriteEngine On
RewriteBase /
# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) [NC,OR]
# STARTS WITH WEB
RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) [NC,OR]
# ANYWHERE IN UA -- GREEDY REGEX
RewriteCond %{HTTP_USER_AGENT} ^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures).*$ [NC]
# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]
RewriteEngine on
#Block spambots
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|\
BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|\
CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|DownloadsDemon|\
eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|ExpresssWebPictures|ExtractorPro|\
EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|\
Harvest|hloader|HMView|httplib|HTTrack|humanlinks|ImagesStripper|ImagesSucker|IndysLibrary|\
InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density|\
larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|\
Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget|\
Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire|\
NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind|\
Pagerabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker|ProWebWalker|\
psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|\
Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant|\
TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|toCrawl/UrlDispatcher|True_Robot|turingos|\
Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|\
WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot|\
WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper|\
Web\sSucker|WebWhacker|WebZip|Wget|Widow|[Ww]eb[Bb]andit|WWW-Collector-E|WWWOFFLE|\
Xaldon\sWebSpider|Xenu's|Zeus) [NC]
RewriteRule .? - [F]
ErrorDocument 403 /403.html # IF THE UA STARTS WITH THESE SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT Deny from env=HTTP_SAFE_BADBOT
WebBandit2icommerceAccoonaActiveTouristBotadressendeutschlandaipbotAlexibotAlligatorAllSubmitteralmadenanarchieAnonymousApexooAqua_ProductsasteriasASSORTATHENSAtHomeAtomzattacheautoemailspiderautohttpb2wbewBackDoorBotBadassBaiduspiderBaiduspider+BecomeBotbertsBitacleBiz360Black.HoleBlackWidowbladder fusionBlog CheckerBlogPeopleBlogshares SpidersBloodhoundBlowFishBoard BotBookmark search toolBotALotBotRightHereBot mailto:craftbot@yahoo.comBropwersBrowsezillaBuiltBotToughBullseyeBunnySlippersCegbfeiehCFNetworkCheeseBotCherryPickerCrescentcharlotte/ChinaClawConveraCopernicCopyRightCheckcosmosCrescentc-spidercurlCustoCyberzDataCha0sDaumDewebDiggerDigimarcdigout4uagentDIIbotDISCoDittoSpyderDnloadMageDownloaddragonflyDreamPassportDSurfDTS AgentdumbotDynaWebe-collectorEasyDLEBrowseeCatchecollectoredgeioefp@gmx.netEirGrabberEmail ExtractorEmailCollectorEmailSiphonEmailWolfEmeraldShieldEnterprise_SearchEroCrawlerESurfEvalEverest-VulcanExabotExpressExtractorExtractorProEyeNetIEFairAdfastlwspiderfetchFEZheadFileHoundfindlinksFlaming AttackBotFlashGetFlickBotFoobotForexFranklin LocatorFreshDownloadFrontPageFSurfGaisbotGamespy_ArcadegenieBotGetBotGetleftGetRightGetWeb!Go!ZillaGo-Ahead-Got-ItGOFORITBOTGrabNetGrafulagrubHarvestHatena AntennaheritrixHLoaderHMViewholmesHooWWWerHouxouCrawlerHTTPGethttplibHTTPRetrieverHTTrackhumanlinksIBM_PlanetwideiCCrawlerichiroiGetterImage StripperImage Suckerimagefetchimds_monitorIncyWincyIndustry ProgramIndyInetURLInfoNaviRobotInstallShield DigitalWizardInterGETIRLbotIron33ISSpiderIUPUI Research BotJakartajava/JBH AgentJennyBotJetCarjeteyejeteyebotJoBoJOC Web SpiderKapereKenjinKeyword DensityKRetrieveksoapKWebGetLapozzBotlarbinleechLeechFTPLeechGetleipzig.deLexiBotlibWeblibwww-FMlibwww-perlLightningDownloadLinkextractorProLinkieLinkScanlinktigerLinkWalkerlmcrawlerLNSpiderguyLocalcomBotlooksmartLWPMac FinderMail Sweepermark.bloninMaSagoolMassMata HariMCspiderMetaProducts Download ExpressMicrosoft Data AccessMicrosoft URL ControlMIDownMIIxpcMirrorMissaugaMissouri College BrowseMisterMonstermkdbmogetMoreoverbotmothra/netscanMovableTypeMozi!Mozilla/22Mozilla/3.0 (compatible)Mozilla/5.0 (compatible; MSIE 5.0)MSIE_6.0MSIECrawlerMSProxyMVAClientMyFamilyBotMyGetRightnameprotectNASA SearchNaverNavroadNearSiteNetAntsnetattacheNetCartaNetMechanicNetResearchServerNetSpiderNetZIPNet VampireNEWT ActiveXNextopiaNICErsPROninjaNimbleCrawlernoxtrumbotNPBotOctopusOfflineOK MozillaOmniExplorerOpaLOpenbotOpenfindOpenTextSiteCrawlerOracle Ultra SearchOutfoxBotP3PPackRatPageGrabberPagmIEDownloadpanscientPapa FotopavukpcBrowserperlPerManPersonaPilotPHP versionPlantyNet_WebRobotplaystarmusicPluckerPort HuronProgram SharewareProgressive DownloadProPowerBotprospectorProWebWalkerProzillapsbotpsycheclonepufPushSitePussyCatPuxaRapidoPython-urllibQuepasaCreepQueryNRadiationRealDownloadRedCarpetRedKernelReGetrelevantnoiseRepoMonkeyRMARoverRsyncRTG30RufusSAPOSBIderscooterScoutAboutscriptsearchpreviewsearchtermsSeekbotSeriousShaishelobShim-CrawlerSickleBotsitecheckSiteSnaggerSlurpy VerifierSlySearchSmartDownloadsna-snaggerSnoopysogousootleSo-net” bat_botSpankBot” bat_botspanner” bat_botSpeedDownloadSpeglaSphereSphiderSpiderBotsprooseSQ WebscannerSqwormStaminaStanfordstudybotSuperBotSuperHTTPSurfbotSurfWalkersuzuranSzukacztAkeOutTALWinHttpClienttarspiderTeleportTelesoftTempletonTestBEDThe IntraformantTheNomadTightTwatBotTitantoCrawl/UrlDispatcherTrue_RobotturingosTurnitinBotTwisted PageGetterUCmoreUdmSearchUMBCUniversalFeedParserURL ControlURLGetFileURLy WarningURL_Spider_ProUtilMindvayalavobsubVCIVoidEYEVoilaBotvoyagerw3mirWeb Image CollectorWeb SuckerWeb2WAPWebaltBotWebAutoWebBanditWebCapturewebcollageWebCopierWebCopyWebEMailExtracWebEnhancerWebFetchWebFilterWebFountainWebGoWebLeacherWebMinerWebMirrorWebReaperWebSaugerWebSnakeWebsiteWebStripperWebVacwebwalkWebWhackerWebZIPWells SearchWEP Search 00WeRelateBotWgetWhosTalkingWidowWildsoft SurferWinHttpRequestWinHTTrackWUMPUSWWWOFFLEwwwsterWWW-CollectorXaldonXenu'sXenusXGETY!TunnelProYahooYSMcmYaDirectBotYetiZadeZBotzerxbotZeusZyBorg« Crazy Cache WordPress Plugin Released
.Htaccess rewrites, Mod_Rewrite Tricks and Tips »
Tags: htaccess, mod_rewrite, Security, SetEnvIf
Please consider donating to support active development of the free software and articles here.![]()
The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect. Tim Berners-Lee
@Spencer: no it doesnt ( for I’ve seen in my own htaccess files )
However for human reading purposes you might put your normal mod rewrite rules (SEO urls) on top below the engine and php flags followed by all your blocking rules.
@Spencer: in your public file root ( httpdocs or httproot or www or public_html ) …
In checking into my access logs, it seems that my usage of for example …
SetEnvIfNoCase User-Agent .*(libwww-perl|aesop_com_spiderman).* bad_web_bot
… might not be working.
For example, I’ve still get libwww-perl appearing in my access logs. As a test I tried adding in a portion of text appearing from my own user agent string to see if I could block myself … alas … I was able to still access my website.
Three questions:
'User-Agent'. Shouldn’t it be 'http_user_agent'. (My experience seems to indicate cases are insensitive. Not so?)Thanks
Before I go on, just wanted to let you know about an error in the following line that causes a fatal error:
SetEnvIfNoCase ^User-Agent$ .*(web(zip|emaile|enhancer).* bad_web_bot
Shouldn’t this be
SetEnvIfNoCase ^User-Agent$ .*(webzip|emaile|enhancer).* bad_web_bot
Now, with so many options for dealing with security, at least I’ve got this one running for now. Thanks.
Excellent :)
This list is OK, yet you have one entry not fitting …
namely the Wget , which is most often used for cron jobs. The Wget will GET or POST your page, and is in most cases pretty harmless..
Stupid newbie question, but do you need to make an ErrorDocument page for 403.html?
thanks
will it not slow down the entire site if i add so many entries in apache? i have heard that this can cause problems?
SetEnvIfNoCase vs RewriteRules ? Which one is faster ?
Hey, this is awesome. Does it matter where in the .htaccess file you put it all?
It's very simple - you read the protocol and write the code. -Bill Joy
HTML | DCMI | GRDDL | XOXO | XDMP | XFN | DOM | XML | XHTML 1.1 Strict | CSS 2.1 | W3C | TLDP | WAI | DISA | ICSI | GIAC | SANS RR | GHOST | DEFCON | NIST | DHS CYBER | NIST | .:: Phrack Magazine ::.
↑ TOPExcept where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.
This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed. "Apache" is a trademark of The ASF. HTTPD based on NCSA HTTPd
Thanks for this some bot used up masses of my traffic allowance this week Ive put your htaccess into my root.
Now just sit back and pray it works