Wget Trick to Download from Restrictive Sites

September 6th, 2007

[Figures: before the trick, wget returns 403 Forbidden; after the trick, wget bypasses the restriction.]

I am often logged in to my servers via SSH when I need to download a file such as a WordPress plugin. I've noticed that many sites now block robots like wget from accessing their files, most of the time with .htaccess rules. A permanent workaround is to have wget mimic a normal browser.


Update: using a shell function

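# Call wget with browser-like request headers; all arguments are passed through.
# Note: the Accept-Encoding header asks for gzip, so a server may reply with
# compressed data, which older wget releases save without decompressing.
function wgets()
{
  wget --referer="http://www.google.com" \
  --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
  --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
  --header="Accept-Language: en-us,en;q=0.5" \
  --header="Accept-Encoding: gzip,deflate" \
  --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
  --header="Keep-Alive: 300" "$@"
}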

Using alias

Add this to your .bash_profile or other shell startup script, or just type it at the prompt. Then run wget from the command line as usual, e.g. wget -dnv http://www.askapache.com/sitemap.xml.

alias wget='wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300"'

Using custom .wgetrc

Alternatively, create or modify your $HOME/.wgetrc file like this, or save the sample below as .wgetrc. Then run wget from the command line as usual, e.g. wget -dnv http://www.askapache.com/sitemap.xml.

###
### Sample Wget initialization file .wgetrc by http://www.askapache.com
###
##
## Local settings (for a user to set in his $HOME/.wgetrc).  It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##
 
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
header = Accept-Encoding: gzip,deflate
header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com
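
If you would rather not change wget's behavior for every download, one option is to keep the browser-like settings in a separate file and point wget at it through the WGETRC environment variable, which wget reads to locate its startup file. This is only a sketch: the file name .wgetrc-browser and the wrapper name wgetb are placeholders I made up, and the file holds just a subset of the settings above.

```shell
# Write a reduced set of the browser-like settings to an opt-in file.
# The path is an assumption; any readable location works.
cat > "$HOME/.wgetrc-browser" <<'EOF'
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com
EOF

# Hypothetical wrapper: only downloads started through wgetb use that file.
# WGETRC is set for this one wget invocation, not for the whole shell.
wgetb() {
  WGETRC="$HOME/.wgetrc-browser" wget "$@"
}
```

With this in place, wgetb -dnv http://www.askapache.com/sitemap.xml would send the browser-like headers, while a plain wget call keeps its normal defaults.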

From the command line

wget --referer="http://www.google.com" \
  --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
  --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
  --header="Accept-Language: en-us,en;q=0.5" \
  --header="Accept-Encoding: gzip,deflate" \
  --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
  --header="Keep-Alive: 300" \
  -dnv http://www.askapache.com/sitemap.xml
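
Because these requests advertise Accept-Encoding: gzip,deflate, a server may answer with gzip-compressed data, and depending on the wget version the file can end up on disk still compressed. Here is a minimal sketch for checking and fixing that after a download. The helper name gunzip_if_needed is hypothetical, and the sketch assumes od and gzip are installed.

```shell
# Hypothetical helper: decompress a downloaded file in place if it is gzip data.
gunzip_if_needed() {
  file=$1
  # gzip streams start with the two magic bytes 0x1f 0x8b.
  magic=$(od -An -tx1 -N2 "$file" | tr -d ' \n')
  if [ "$magic" = "1f8b" ]; then
    # Decompress to a temporary name first so a failure leaves the original intact.
    gzip -dc "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  fi
}
```

For example, after downloading sitemap.xml you could run gunzip_if_needed sitemap.xml. Files that are not gzip data are left untouched.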

Reader Comments

  1. Abhishek D ~December 20, 2011 @ 8:03 pm
    Awesome !!
  2. R Raman ~November 2, 2010 @ 10:04 pm
    Thanks a TON for the very detailed examples. Not only was I able to get my error page (403 Forbidden page) using the command line example, but .wgetrc also worked like a charm.
  3. clutkin ~August 18, 2010 @ 1:30 pm
    Advanced examples have existed for years on wget wikipedia. These examples are great, please update any errors if there are any. This website is now a reference wikipedia link Advanced Examples. PLEASE DO NOT BREAK THE LINKS BY RENAMING/MOVING THIS WEB PAGE. {tjc}
  4. Tensigh ~February 24, 2010 @ 9:37 am
    Didn't work for me -- the "Accept" lines returned errors. *SIGH*
  5. AskApache ~September 21, 2009 @ 11:11 pm
    @ lien

    Nice idea there, haven't tried that yet, but I plan on it..

  6. lien ~September 5, 2009 @ 2:03 pm
    This can also be used to see if competitors web sites are feeding google spider food. Just set your user agent to googlebot.
  7. Deserio ~May 1, 2009 @ 7:41 pm
    I was looking for "user agent", works great for me, specially with rapidshare.
  8. Torrid Luna ~February 25, 2009 @ 10:07 pm
    There's an uncommented feature "robots=off", if all else fails, you could add that to your list. Cheers, Torrid
  9. elkdbrlg ~November 30, 2007 @ 8:33 am
    how could you download a page what has 404 return code?

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License, just credit with a link.