A large client has a secure website where they assemble presentations, each consisting of a single Table of Contents page with many PDFs attached to it. This is a big client, a rich client, and they needed a way to guarantee they could always get at those presentations. So I got this request:

Can you make an offline version of the page that is always up to date and available for download, so we can download the offline version and present from it in case the website is down or, more often, in case Internet access is unavailable?

Update: I used cookie-based authentication to secure this client's site, so that only logged-in users can see anything at all. So how do I get the curl requests to authenticate as well, using the cookie of the requesting user in each request made by curl? Just add the user's HTTP_COOKIE to the headers array used by curl, like so:

array(
...
"Cookie: {$_SERVER['HTTP_COOKIE']}"
)

That means the scraped version of the page is an exact duplicate of the one the user is looking at. Very sweet!
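For context, here's a minimal sketch of how that headers array gets used (the URL is just an example, and the curl options are the usual ones, not necessarily the exact production settings):

$ch = curl_init('https://www.askapache.com/some-protected-page/'); // example url
curl_setopt_array($ch, array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_HTTPHEADER => array(
"Cookie: {$_SERVER['HTTP_COOKIE']}"
)
));
$html = curl_exec($ch); // same page the logged-in user sees
curl_close($ch);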

I can GET anything

So here's what I hacked together last night, and it's being used today. It's essentially 2 files.

  1. A PHP file that uses curl to scrape all the URLs for the page (favicon, css, images, pdfs, etc.)
  2. A simple bash shell script acting as a CGI that creates a zip file of all the downloaded urls, and a self-extracting exe file for those without a winzip-type tool

The PHP File

This is a simple script that is given 2 parameters (example request below):

  1. The url to scrape
  2. The type of download to return
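
So a request looks something like this (the script location and the page url are illustrative, and zip/exe for the type parameter are implied by the two download formats above rather than spelled out here):

https://www.askapache.com/scrapeit.php?url=https%3A%2F%2Fwww.askapache.com%2Fthis-page%2F&type=zip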

scrapeit.php

// gogeturl() - fetch a url with curl, forwarding the requesting user's cookie
// (the curl options here are typical assumptions, not necessarily the exact originals)
function gogeturl($url)
{
$ch = curl_init($url);
curl_setopt_array($ch, array(CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true,
CURLOPT_HTTPHEADER => array("Cookie: {$_SERVER['HTTP_COOKIE']}")));
$g = curl_exec($ch);
$ch_info = curl_getinfo($ch);
if (curl_errno($ch)) error_log(print_r(array('curl_info' => $ch_info,
'curl_errno' => curl_errno($ch), 'curl_error' => curl_error($ch)), 1));
curl_close($ch);
return $g;
}
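
gogeturl2() is used further down to save each url straight to disk. A minimal sketch, assuming it simply writes the gogeturl() response body to the given file:

// gogeturl2() - sketch only: same cookie-authenticated GET, written to $file instead of returned
function gogeturl2($url, $file)
{
$g = gogeturl($url);
if ($g !== false) file_put_contents($file, $g);
return $g;
}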



/**
* _mkdir() makes a single directory with a sane umask
*
* @return bool true on success, false on failure
*/
function _mkdir($path, $mode = 0755)
{
$old = umask(0);
$res = @mkdir($path, $mode);
umask($old);
return $res;
}

/**
* rmkdir() recursively makes a directory tree
*
* @return bool true on success, false on failure
*/
function rmkdir($path, $mode = 0755)
{
return is_dir($path) || (rmkdir(dirname($path), $mode) && _mkdir($path, $mode));
}
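
For example (the nested path here is made up), one call builds every missing level:

// creates tmp/, tmp/presentations/ and tmp/presentations/2009/ as needed - hypothetical path
rmkdir('/web/askapache/sites/askapache.com/tmp/presentations/2009');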

The following should be broken up into a couple of functions, but I was running on a tight schedule, and hey, this $hitt ain't free... wait, yes it is, always.

// Ok let's get it on!
// first let's set up some variables
if (!isset($_GET['url']) || empty($_GET['url'])) die();
$td = $th = $urls = array();
$FDATE = date("m-d-y-His");
$FTMP = '/web/askapache/sites/askapache.com/tmp';
$fetch_url = $_GET['url'];
$fu = parse_url($fetch_url);
$fd = substr($FTMP . $fu['path'], 0, - 1);
$FEXE = "{$fd}-{$FDATE}.exe";
$FZIP = "{$fd}-{$FDATE}.zip";

// now this is a shortcut to download the css file and add all the images in it to the img_urls array
$img_urls = array();
$gg = preg_match_all("/url\\(([^)]*?)\\)/Ui", gogeturl('https://www.askapache.com/askapache-0128770124.css'), $th);
$imgs = array_unique($th[1]);
foreach($imgs as $img)
{
// only because all the links are relative
$img_urls[] = 'https://www.askapache.com' . $img;
}
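
To make the regex concrete, here's a made-up CSS snippet and what it extracts:

// hypothetical stylesheet content, not the real askapache css
$css = 'body { background: url(/images/bg.png) } h1 { background: url(/images/h1.gif) }';
preg_match_all("/url\\(([^)]*?)\\)/Ui", $css, $m);
print_r($m[1]); // Array ( [0] => /images/bg.png [1] => /images/h1.gif )
// the loop above then prefixes each with https://www.askapache.com to build $img_urls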

// now fetch the main page, and assemble an array of all the external resources into the urls array
$gg = preg_match_all("/(background|href)=([\"'])([^\"'#]+?)([\"'])/Ui", gogeturl($fetch_url), $th);
foreach($th[3] as $url)
{
if (strpos($url, '.js') !== false)continue;
if (strpos($url, 'wp-login.php') !== false || $url == 'https://www.askapache.com/') continue;
if (strrpos($url, '/') == strlen($url) - 1)continue;
if (strpos($url, 'https://www.askapache.com/') === false)
{
	if ($url[0] == '/') $urls[] = 'https://www.askapache.com' . $url;
	else continue;
}
else $urls[] = $url;
}

// now create a uniq array of urls, then download and save each of them
$urls = array_merge(array_unique($img_urls), array_unique($urls));
foreach($urls as $url)
{
  $pu = parse_url($url);
  rmkdir(substr($fd . $pu['path'], 0, strrpos($fd . $pu['path'], '/')));
  gogeturl2($url, $fd . $pu['path']);
}

// delete the stray dir created by the loop, e.g. /this-page/this-page/, since the page itself is saved as /this-page/index.html
if (is_dir($fd . $fu['path'])) rmdir($fd . $fu['path']);

// now save the page as index.html
gogeturl2($fetch_url, $fd . '/index.html');

// fixup to be able to parse
$g = file_get_contents($fd . '/index.html');
$g = str_replace('