« Speed Tips: Turn On CompressionFirefox, Firebug, and yslow are REQUIRED »
Wget Trick to Download from Restrictive Sites
Before![]()
After trick
I am often logged in to my servers via SSH, and I need to download a file like a WordPress plugin. I’ve noticed many sites now employ a means of blocking robots like wget from accessing their files. Most of the time they use .htaccess to do this. So a permanent workaround has wget mimick a normal browser.
function wgets()
{
wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
--header="Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
--header="Accept-Language: en-us,en;q=0.5" \
--header="Accept-Encoding: gzip,deflate" \
--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
--header="Keep-Alive: 300" "$@"
}
Add this to your .bash_profile or other shell startup script, or just type it at the prompt. Now just run wget from the command line as usual, i.e. wget -dnv http://www.askapache.com/sitemap.xml.
alias wget='wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300"'
Alternatively, you could instead just create or modify your $HOME/.wgetrc file like this. Or download and rename to .wgetrc.wgetrc. Now just run wget from the command line as usual, i.e. wget -dnv http://www.askapache.com/sitemap.xml.
### ### Sample Wget initialization file .wgetrc by http://www.askapache.com ### ## ## Local settings (for a user to set in his $HOME/.wgetrc). It is ## *highly* undesirable to put these settings in the global file, since ## they are potentially dangerous to "normal" users. ## ## Even when setting up your own ~/.wgetrc, you should know what you ## are doing before doing so. ## header = Accept-Language: en-us,en;q=0.5 header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 header = Accept-Encoding: gzip,deflate header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 header = Keep-Alive: 300 user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 referer = http://www.google.com
wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300" -dnv http://www.askapache.com/sitemap.xml
« Speed Tips: Turn On Compression
Firefox, Firebug, and yslow are REQUIRED »
The love of liberty is the love of others; the love of power is the love of ourselves.
-- William Hazlitt
The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect. Tim Berners-Lee
This can also be used to see if competitors web sites are feeding google spider food. Just set your user agent to googlebot.
I was looking for “user agent”, works great for me, specially with rapidshare.
There’s an uncommented feature “robots=off“, if all else fails, you could add that to your list.
Cheers,
Torrid
how could you download a page what has 404 return code?
Tags: 403 Forbidden, Apache, askapache, bash, bash_profile, Blocking, Firefox, GET, Google, Htaccess, HTTP Headers, Prompt, Robot, robots, Security, server, servers, Shell, SSH, SSI, trick, Wget, WordPress,
It's very simple - you read the protocol and write the code. -Bill Joy
HTML | DCMI | GRDDL | XOXO | XDMP | XFN | DOM | XML | XHTML 1.1 Strict | CSS 2.1 | W3C
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.
This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed. "Apache" is a trademark of The ASF. NCSA HTTPd.
UNIX ® is a registered Trademark of The Open Group.
POSIX ® is a registered Trademark of The IEEE.
Didn’t work for me — the “Accept” lines returned errors. *SIGH*