
Wget Trick to Download from Restrictive Sites


[Screenshots: before, wget gets 403 Forbidden; after the trick, wget downloads the file normally]
I am often logged in to my servers via SSH and need to download a file, like a WordPress plugin. I've noticed many sites now block robots like wget from accessing their files, usually with .htaccess rules. So the permanent workaround is to make wget mimic a normal browser.



Testing Wget Trick

Just add the -d (debug) option to see the request headers wget sends. Like: $ wget -O /dev/null -d http://www.askapache.com

GET / HTTP/1.1
Referer: http://www.askapache.com/
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: www.askapache.com
Connection: keep-alive
Accept-Language: en-us,en;q=0.5
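
The debug output goes to stderr, so it can be piped through grep to check just the request headers; a quick sketch (the grep pattern is only illustrative):

wget -d -O /dev/null http://www.askapache.com/ 2>&1 | grep -E 'GET /|Host:|User-Agent:|Referer:|Accept|Connection:'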

Wget Function

Add this to your shell startup file; rename the function to wget if you want it to replace the wget command.

function wgets()
{
  # send browser-like headers so the request looks like Firefox
  local H='--header'
  wget "$H=Accept-Language: en-us,en;q=0.5" \
       "$H=Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
       "$H=Connection: keep-alive" \
       -U 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
       --referer=http://www.askapache.com/ "$@"
}
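
Because the function passes "$@" straight through, any normal wget options still work. For example:

# fetch the sitemap through the function, with debug and non-verbose output
wgets -dnv http://www.askapache.com/sitemap.xml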

Wget alias

Add this to your .bash_profile or other shell startup script, or just type it at the prompt. Then run wgets from the command line as you would wget, i.e. wgets -dnv http://www.askapache.com/sitemap.xml.

alias wgets='H="--header"; wget $H="Accept-Language: en-us,en;q=0.5" $H="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" $H="Connection: keep-alive" -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" --referer=http://www.askapache.com/ '

Using custom .wgetrc

Alternatively, and probably the best way, you could just create or modify your $HOME/.wgetrc file like this (or download the sample below and save it as .wgetrc). Then run wget from the command line as usual, i.e. wget -dnv http://www.askapache.com/sitemap.xml.

### Sample Wget initialization file .wgetrc by http://www.askapache.com
## Local settings (for a user to set in his $HOME/.wgetrc).  It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
referer = http://www.askapache.com/
robots = off
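
The settings in ~/.wgetrc apply to every wget run. A single setting can still be overridden per invocation with -e, which accepts a .wgetrc-style command; for example, pretending to be Googlebot for one request (the user-agent string below is illustrative, not an exact Googlebot signature):

wget -e 'user_agent=Googlebot/2.1 (+http://www.google.com/bot.html)' -dnv http://www.askapache.com/sitemap.xml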

Other command line

wget --referer="http://www.google.com" \
  --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
  --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
  --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" \
  --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300" \
  -dnv http://www.askapache.com/sitemap.xml

Wget Alternative

Once you get tired of how basic wget is, start using curl, which is 100x better.
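
A rough curl equivalent of the wget commands above (a sketch; -A sets the user agent, -e the referer, -H adds a header, -o the output file):

curl -A 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
  -e 'http://www.askapache.com/' \
  -H 'Accept-Language: en-us,en;q=0.5' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Connection: keep-alive' \
  -o sitemap.xml http://www.askapache.com/sitemap.xml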


September 6th, 2007

Comments Welcome

  • http://primforge.com/ Torrid Luna

    There's an uncommented feature, "robots=off"; if all else fails, you could add that to your list.

    Cheers,
    Torrid

  • Deserio

    I was looking for "user agent"; works great for me, especially with rapidshare.

  • lien

    This can also be used to see if competitors' web sites are feeding the Google spider food. Just set your user agent to Googlebot.

  • http://www.askapache.com/ AskApache

    @ lien

    Nice idea there, haven't tried that yet, but I plan on it..

  • Tensigh

    Didn't work for me -- the "Accept" lines returned errors. *SIGH*

  • http://en.wikipedia.org/wiki/Wget clutkin

    Advanced examples have existed for years on the Wget Wikipedia article. These examples are great; please fix any errors if there are any. This page is now linked from the Wikipedia article as a reference for advanced examples. PLEASE DO NOT BREAK THE LINK BY RENAMING/MOVING THIS WEB PAGE. {tjc}

  • R Raman

    Thanks a TON for the very detailed examples. Not only was I able to get my error page (403 Forbidden page) using the command line example, but .wgetrc also worked like a charm.

  • tim

    When I use this, I get an error:

    --header=Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7: command not found

    Any idea why that might be happening?

  • Abhishek D

    Awesome !!

  • Swagnik Mitra

    You will have to keep changing your Gecko details every 2-3 requests. The server-side bot detection notices there's something wrong with the incoming wget request and then 403's it.

    Any smart solution for this?

  • http://htmwrestling.com Darrius

    This also used to work with download managers that could manually set referrers, but while doing some security testing today I am seeing that it fails. Did Apache get smarter? (not complaining)

  • http://URL palamin

    Everyone says that wget is sooooo basic, but it can do recursive downloads. Can curl do that? It might be off-topic here, but if anyone knows how, please let me know.

  • http://URL yotam

    Your example with

    --header="Keep-Alive: 300"

    helped me download successfully from Diino(dot)com.
    Thanks!

  • Totgia

    I like the "other command line" section since it is simpler for me to apply :)
