Wget Trick to Download from Restrictive Sites
September 6th, 2007
I am often logged in to my servers via SSH and need to download a file, such as a WordPress plugin. I've noticed that many sites now block robots like wget from accessing their files, and most of the time they use .htaccess to do it. A permanent workaround is to have wget mimic a normal browser.
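To check whether a site really treats wget differently, you can compare the status code it returns for wget's default User-Agent against a browser-like one. This is only a rough sketch: last_status and compare_agents are names made up for this example, it assumes the block is based on the User-Agent header, and --spider may send a HEAD request, which some servers answer differently from a GET.

```shell
# Read `wget -S` output on stdin and print the last HTTP status code seen.
last_status() {
  awk '$1 ~ /^HTTP\// { code = $2 } END { if (code != "") print code }'
}

# Print the status a URL returns for wget's default User-Agent and for a
# browser-like one (the same browser string used in the rest of this post).
compare_agents() {
  local url=$1
  local ua='Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
  printf 'default: %s\n' "$(command wget --spider -S "$url" 2>&1 | last_status)"
  printf 'browser: %s\n' "$(command wget --spider -S -U "$ua" "$url" 2>&1 | last_status)"
}
```

Run it as compare_agents followed by the URL you are having trouble with. If the two lines show different codes, the site is most likely filtering on the User-Agent.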
Testing Wget Trick
To see exactly which headers wget sends, add the -d (debug) option, like this:
$ wget -O /dev/null -d http://www.askapache.com
GET / HTTP/1.1
Referer: http://www.askapache.com/
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: www.askapache.com
Connection: keep-alive
Accept-Language: en-us,en;q=0.5
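The debug output is noisy, so here is a minimal sketch for pulling out just the request headers. It assumes your wget was built with debug support and that it marks the request with "---request begin---" and "---request end---" lines. show_request is a name made up for this example.

```shell
# Print only the request wget sends, taken from its -d (debug) output.
# wget writes debug output to stderr, hence the 2>&1.
show_request() {
  command wget -d -O /dev/null "$@" 2>&1 |
    sed -n '/^---request begin---/,/^---request end---/p'
}
```

For example, show_request http://www.askapache.com prints the marker lines and the headers between them, and nothing else.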
Wget Function
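Rename the function to wget if you want it to replace the wget command. The body calls "command wget" so that a function named wget does not call itself.
function wgets()
{
  local H='--header'
  # "command" skips shell functions, so this always runs the real wget binary
  command wget $H='Accept-Language: en-us,en;q=0.5' \
    $H='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    $H='Connection: keep-alive' \
    -U 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
    --referer=http://www.askapache.com/ "$@"
}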
Wget alias
Add this to your .bash_profile or other shell startup script, or just type it at the prompt. Then run wgets from the command line the same way you would run wget, e.g. wgets -dnv http://www.askapache.com/sitemap.xml.
alias wgets='H="--header"; wget $H="Accept-Language: en-us,en;q=0.5" $H="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" $H="Connection: keep-alive" -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" --referer=http://www.askapache.com/ '
Using custom .wgetrc
Alternatively, and probably the best way, create or modify your $HOME/.wgetrc file like this, or download the sample file and rename it to .wgetrc. Now just run wget from the command line as usual, e.g. wget -dnv http://www.askapache.com/sitemap.xml.
### Sample Wget initialization file .wgetrc by http://www.askapache.com
## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
referer = http://www.askapache.com/
robots = off
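If you would rather not have every wget run pretend to be a browser, one option is to keep these settings in a separate file and point wget at it only when needed. The sketch below relies on the WGETRC environment variable, which wget checks for an alternate startup file. The file name .wgetrc-browser and the function name wgetb are made up for this example, and the file has to exist before you call the function.

```shell
# Hypothetical location of a separate profile holding the browser-like settings.
BROWSER_WGETRC="${BROWSER_WGETRC:-$HOME/.wgetrc-browser}"

# Run wget with that profile for this one invocation only.
# The assignment applies to this command alone, so plain wget is unaffected.
wgetb() {
  WGETRC="$BROWSER_WGETRC" command wget "$@"
}
```

With the sample settings saved to that file, wgetb -dnv http://www.askapache.com/sitemap.xml sends the browser headers, while plain wget keeps its defaults.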
Other command line
wget --referer="http://www.google.com" \
  --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
  --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
  --header="Accept-Language: en-us,en;q=0.5" \
  --header="Accept-Encoding: gzip,deflate" \
  --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
  --header="Keep-Alive: 300" \
  -dnv http://www.askapache.com/sitemap.xml
Wget Alternative
Once you get tired of how basic wget is, start using curl, which is 100x better.
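For comparison, here is a rough curl equivalent of the wgets function. It is a sketch rather than a tested drop-in: curls is a name made up for this example, the header values are copied from the wget examples in this post, and the Connection header is left out because curl reuses connections on its own.

```shell
# Browser-like curl wrapper: -H adds a request header, -A sets the User-Agent,
# -e sets the Referer. Any extra arguments are passed straight through to curl.
curls() {
  command curl \
    -H 'Accept-Language: en-us,en;q=0.5' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    -A 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
    -e 'http://www.askapache.com/' \
    "$@"
}
```

Something like curls -L -O http://www.askapache.com/sitemap.xml then follows redirects and saves the file under its remote name.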
Reader Comments
-
You will have to keep changing your Gecko details every 2-3 requests. The server-side bot detection notices there's something wrong with the incoming wget request and then 403's it.
Any smart solution for this?
-
Awesome !!
-
When I use this, I get an error:
--header=Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7: command not found
Any idea why that might be happening?
-
Thanks a TON for the very detailed examples. Not only was I able to get my error page (403 Forbidden page) using the command line example, but .wgetrc also worked like a charm.
-
Advanced examples have existed for years on wget wikipedia. These examples are great, please update any errors if there are any. This website is now a reference wikipedia link Advanced Examples. PLEASE DO NOT BREAK THE LINKS BY RENAMING/MOVING THIS WEB PAGE. {tjc}
-
Didn't work for me -- the "Accept" lines returned errors. *SIGH*
-
This can also be used to see if competitors' websites are feeding Google spider food. Just set your user agent to Googlebot.
-
I was looking for "user agent"; this works great for me, especially with rapidshare.
-
There's an uncommented feature "robots=off"; if all else fails, you could add that to your list.
Cheers,
Torrid
-
How could you download a page that has a 404 return code?
This also used to work using download managers that could manually set referrers, but in doing some security testing today I am seeing that is failing. Did Apache get smarter? (not complaining)