A large client has a secure website where they assemble and create presentations consisting of a single Table of Contents page with many pdf's attached to it. They use the site to make presentations. This is a big client, a rich client, and they needed a way to guarantee they would be able to access the site. So I got this request:

Can you make an offline version of the Page that is always updated and available for download so we can download the offline version and present using that in case the website is down or more often, in case Internet Access is unavailable?

Update: I used COOKIE based authentication to secure this clients site, so that only logged in users can see anything at all, so how do I enable the curl requests to authenticate as well using the COOKIE of the requesting user in each request made by curl? Just add the users HTTP_COOKIE to the headers array used by curl like so:

"Cookie: {$_SERVER['HTTP_COOKIE']}"

That now means the scraped version of the page is an exact duplicate that the user is looking at. Very sweet!

I can GET anything

I can do anythingSo, here's what I hacked together last night, that is being used today. It's essentially 2 files.

  1. A php file that scrapes uses curl to scrape all the urls for the page (favicon, css, images, pdfs, etc..)
  2. A simple bash shell script acting as a cgi that creates a zip file of all the urls, and a self-extracting exe file for those without a winzip tool

The PHP File

This is a simple script that is given 2 parameters:

  1. The url to scrape
  2. The type of download to return



* gogeturl2() - grabs a url with curl, and saves it to disk, works for all media types, pdf, js, img, etc.
* @return
function gogeturl2($url, $saveto)
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
if ($fp = fopen($saveto, 'w')) // {curl_setopt($ch,CURLOPT_STDERR, $fp); curl_setopt($ch,CURLOPT_VERBOSE,1);}
  curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
  curl_setopt($ch, CURLOPT_MAXCONNECTS, 4);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
  curl_setopt($ch, CURLOPT_FAILONERROR, 1);
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
  curl_setopt($ch, CURLOPT_FILE, $fp);
  curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
  curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2008092417 Firefox/3.0.3",
    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Encoding: none",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive: 300",
    "Connection: Keep-Alive",
  $r = curl_exec($ch);
  $ch_info = curl_getinfo($ch);
  if (curl_errno($ch)) error_log(print_r($ch_info, 1) . print_r(curl_errno($ch), 1) . print_r(curl_error($ch), 1));
  else curl_close($ch);
* gogeturl()  returns the source of the requested url (thanks to accept-encoding: none)
* @return
function gogeturl($url)
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_MAXCONNECTS, 4);
curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
  "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 GTB6",
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language: en-us,en;q=0.5",
  "Accept-Encoding: none",
  "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
  "Keep-Alive: 115",
  "Connection: keep-alive",
$g = curl_exec($ch);
if (curl_errno($ch)) error_log(print_r(array(
'chinfo' => $ch_info,
'curl_errno' => curl_errno($ch),
'curl_error' => curl_error($ch)
), 1));
return $g;

* _mkdir() makes a directory
* @return
function _mkdir($path, $mode = 0755)
$old = umask(0);
$res = @mkdir($path, $mode);
return $res;

* rmkdir()  recursively makes a directory tree
* @return
function rmkdir($path, $mode = 0755)
return is_dir($path) || (rmkdir(dirname($path), $mode) && _mkdir($path, $mode));

The following should be in a couple functions, but I was running on a tight time schedule, and hey this $hitt aint free... wait yes it is, always.

// Ok lets get it on!
// first lets setup some variables
if (!isset($_GET['url']) || empty($_GET['url']))die();
$td = $th = $urls = array();
$FDATE = date("m-d-y-Hms");
$FTMP = '/web/askapache/sites/askapache.com/tmp';
$fetch_url = $_GET['url'];
$fu = parse_url($fetch_url);
$fd = substr($FTMP . $fu['path'], 0, - 1);
$FEXE = "{$fd}-{$FDATE}.exe";
$FZIP = "{$fd}-{$FDATE}.zip";

// now this is a shortcut to download the css file and add all the images in it to the img_urls array
$img_urls = array();
$gg = preg_match_all("/url(([^)]*?))/Ui", gogeturl('https://www.askapache.com/askapache-0128770124.css'), $th);
$imgs = array_unique($th[1]);
foreach($imgs as $img)
// only because all the links are relative
$img_urls[] = 'https://www.askapache.com' . $img;

// now fetch the main page, and assemble an array of all the external resources into the urls array
$gg = preg_match_all("/(background|href)=(["'])([^"'#]+?)(["'])/Ui", gogeturl($fetch_url), $th);
foreach($th[3] as $url)
if (strpos($url, '.js') !== false)continue;
if (strpos($url, 'wp-login.php') !== false || $url == 'https://www.askapache.com/') continue;
if (strrpos($url, '/') == strlen($url) - 1)continue;
if (strpos($url, 'https://www.askapache.com/') === false)
  if ($url[0] == '/') $urls[] = 'https://www.askapache.com' . $url;
  else continue;
else $urls[] = $url;

// now create a uniq array of urls, then download and save each of them
$urls = array_merge(array_unique($img_urls), array_unique($urls));
foreach($urls as $url)
  $pu = parse_url($url);
  rmkdir(substr($fd . $pu['path'], 0, strrpos($fd . $pu['path'], '/')));
  gogeturl2($url, $fd . $pu['path']);

// deletes dir ie. /this-page/this-page/ when it should be /this-page/index.html
if (is_dir($fd . $fu['path'])) rmdir($fd . $fu['path']);

// now save the page as index.html
gogeturl2($fetch_url, $fd . '/index.html');

// fixup to be able3 to parse
$g = file_get_contents($fd . '/index.html');
$g = str_replace('<script', '<!--<script', $g);
$g = str_replace('script>', 'script>!-->', $g);
$g = str_replace('href="https://www.askapache.com/', 'href="/', $g);
$g = str_replace('src="https://www.askapache.com/', 'src="/', $g);
$g = str_replace('href="/', 'href="', $g);
$g = str_replace('src="/', 'src="', $g);
$g = str_replace("href='https://www.askapache.com/", "href='/", $g);
$g = str_replace("src='https://www.askapache.com/", "src='/", $g);
$g = str_replace("href='/", "href='", $g);
$g = str_replace("src='/", "src='", $g);
file_put_contents($fd . '/index.html', $g);

// fixup for css file
foreach($urls as $url)
if (strpos($url, '.css') !== false)
  $fuu = parse_url($url);
  $css = file_get_contents($fd . $fuu['path']);
  $css = str_replace('url(/', 'url(../', $css);
  file_put_contents($fd . $fuu['path'], $css);

// my favorite technique, using fsockopen to initiate a shell script server-side.
// passing the args in the HTTP Headers... genius!!
// close the sucker fast with HTTP/1.0 and connection: close
$fp = fsockopen ($_SERVER['SERVER_NAME'], 80, $errno, $errstr, 5);
fwrite($fp, "GET /cgi-bin/sh/zip.sh HTTP/1.0rnHost: www.askapache.comrnX-Pad: {$fd}rnX-Allow: {$FDATE}rnConnection: Closernrn");

// loop until the file created by /cgi-bin/sh/zip.sh is found
$c = 0;
if (is_file("{$FEXE}")) continue;
while ($c < 20);

// either zip or exe
$type = $_GET['type'];
if ($type == 'zip') $file = $FZIP;
else $file = $FEXE;

// wow great debugging dude

// if the file is there, do a 302 redirect to initiate download
if (file_exists("{$file}"))
  @header("HTTP/1.1 302 Moved", 1, 302);
  @header("Status: 302 Moved", 1, 302);
  @header('Location: https://www.askapache.com/' . basename($file));




# all you need for cgi
  echo -e "Content-type: text/plainnn"

# blank that run log
  echo "" > /web/askapache/sites/askapache.com/tmp/run.log

# redirect 1 and 2 to the run log for the whole script
  exec &>/web/askapache/sites/askapache.com/tmp/run.log

# basename

# date-based

# create recursively the dir tree
  mkdir -pv $HTTP_X_PAD

# cd to the tmp
  cd /web/askapache/sites/askapache.com/tmp

# the zip version with date

# the exe version with date

# for debugging, only goes to run log
  echo "F=$F"
  echo "NN=$NN"
  echo "N=$N"

# create a relative (r is recursive) archive of the entire dir
  /usr/bin/zip -rvv $F $N

# add the self-extracting stub to the archive
  /bin/cat unzipsfx.exe $F > $NN

# fix the sfx stub
  /usr/bin/zip -A $NN

# move both the exe and zip to the web-docroot to be dl'd directly
  cp -vf $NN /web/askapache/sites/askapache.com/htdocs/
  cp -vf $F /web/askapache/sites/askapache.com/htdocs/

# sleep for 60 seconds and then rm all the files, so you better download that file fast
  sleep 60 && rm -rvf $HTTP_X_PAD $F $NN /web/askapache/sites/askapache.com/htdocs/$NN /web/askapache/sites/askapache.com/htdocs/$F

#suh suh cya
exit 0;

Creating SFX Archives

The best way is 7z, but I couldn't get p7zip's sfx module to work and didn't have time to compile it. Instead I just used the stub available here: curl -O -L ftp://ftp.info-zip.org/pub/infozip/win32/unz552xn.exe which works great but no customization like password, icons, etc.. oh well.


Simple, just create a link to the php file with the url and type parameters. /cgi-bin/php/scrapeit.php?url=http://www.askapache.com/htaccess/&type=exe or if you integrate into wordpress like I did you can add this to your header or admin_bar and use the get_permalink() for the url arg.

Lock It Down

This was used on a private site so what I did was add some code to the scrapeit.php file that just copied the HTTP_COOKIE value sent by the requesting user and sent that as part of the request in the fsockopen request. That means only logged in users can do this, and furthermore, if a user doesn't have access to a page and tries to use this to circumvent, they can't. And also htaccess was used to limit the scripts to only allow the ip's running the server to make connections.


What can't be done with linux, bash, http, php, and a little server-side finesse? My clients are very happy, and I had some fun!