Running a Reverse Proxy in Apache

In 2003, Nick Kew released a new module that complements Apache’s
mod_proxy and is essential for reverse-proxying. Since then he has
received regular questions and requests for help on proxying with
Apache. In this article he attempts to give a comprehensive overview
of proxying and mod_proxy_html.


This article was originally published at ApacheWeek in January 2004,
and moved to ApacheTutor with minor updates in October 2006.

Web Proxies

A proxy server is a gateway for users to the Web at large. Users
configure the proxy in their browser settings, and all HTTP requests
are routed via the proxy. Proxies are typically operated by ISPs and
network administrators, and serve several purposes: for example,

  • to speed access to the Web by caching pages fetched, so that
    popular pages don’t have to be re-fetched for every user who views
    them.
  • to enable controlled access to the web for users behind a
    firewall.
  • to filter or transform web content.

Reverse Proxies

A reverse proxy is a gateway for servers, and enables one web server
to provide content from another transparently. As with a standard
proxy, a reverse proxy may serve to improve performance of the web by
caching; this is a simple way to mirror a website. But the most common
reason to run a reverse proxy is to enable controlled access from the
Web at large to servers behind a firewall.

The proxied server may be a webserver itself, or it may be an
application server using a different protocol, or an application
server with just rudimentary HTTP that needs to be shielded from
the web at large. Since 2004, reverse proxying has been the preferred
method of deploying Java/Tomcat applications on the Web, replacing
the old mod_jk (itself a special-purpose reverse proxy module).

Proxying with Apache

The standard Apache module mod_proxy supports both types of proxy
operation. Under Apache 1.x, mod_proxy only supported HTTP/1.0, but
from Apache 2.0, it supports HTTP/1.1. This distinction is
particularly important in a proxy, because one of the most significant
changes between the two protocol versions is that HTTP/1.1 introduces
rich new cache control mechanisms.

This article deals with running a reverse proxy with Apache 2. Users
of earlier versions of Apache are encouraged to upgrade and take
advantage of the altogether richer architecture and improved
application support. At the time of writing, the reason most commonly
cited for not upgrading is difficulties running PHP on Apache 2. I
cannot speak from personal experience, but several well-informed
sources tell me the difficulty lies with non-thread-safe code in PHP,
and that it works well with Apache 2 if it is built with the
non-threaded Prefork MPM.

The Apache Proxy Modules

So far, we have spoken loosely of mod_proxy. However, it’s a little
more complicated than that. In keeping with Apache’s modular
architecture, mod_proxy is itself modular, and a typical proxy server
will need to enable several modules. Those relevant to proxying and
this article include:

  • mod_proxy: The core module deals with proxy infrastructure and
    configuration and managing a proxy request.
  • mod_proxy_http: This handles fetching documents with HTTP and
    HTTPS.
  • mod_proxy_ftp: This handles fetching documents with FTP.
  • mod_proxy_connect: This handles the CONNECT method for secure
    (SSL) tunneling.
  • mod_proxy_ajp: This handles the AJP protocol for Tomcat
    and similar backend servers.
  • mod_proxy_balancer: This implements clustering and load-balancing
    over multiple backends.
  • mod_cache, mod_disk_cache, mod_mem_cache: These deal with managing
    a document cache. Enabling caching requires mod_cache and one or
    both of mod_disk_cache and mod_mem_cache.
  • mod_proxy_html: This rewrites HTML links into a proxy’s address
    space.
  • mod_headers: This modifies HTTP request and response headers.
  • mod_deflate: This negotiates compression with clients and backends.

Having mentioned the modules, I’m going to ignore caching for the
remainder of this article. You may want to add it if you are concerned
about the load on your network or origin servers, but the details are
outside the scope of this article. I’m also going to ignore all
non-HTTP protocols, and load balancing.

Building Apache for Proxying

With the exception of mod_proxy_html, the above are all included in
the core Apache distribution. They can easily be enabled in the Apache
build process. For example:

$ ./configure --enable-so --enable-mods-shared="proxy \
proxy_http proxy_ftp proxy_connect headers"
$ make
# make install

Of course, you may want other build options too, and you could just as
well build the modules as static.

If you are adding proxying to an existing installation, you should use
apxs instead:

# apxs -c -i [module-name].c

Note that mod_proxy itself is in two source files
(mod_proxy.c and proxy_util.c).

This leaves mod_proxy_html, which is not included in the core
distribution. mod_proxy_html is a third-party module, and requires a
third-party library libxml2. At the time of writing, libxml2 is
installed as standard or packaged for most operating systems. If you
don’t have it, you can download it from xmlsoft.org and install it
yourself. For the purposes of this article, we’ll assume libxml2 is
installed as /usr/lib/libxml2.so, with headers in
/usr/include/libxml2/libxml/.

  1. Check libxml2 is installed. If you have a version older than
    2.5.10, then upgrade – there’s a bug in earlier versions that can,
    in some particular cases, severely affect performance.
  2. Download mod_proxy_html.c from http://apache.webthing.com/
  3. Build mod_proxy_html with apxs:
# apxs -c -I/usr/include/libxml2 -i mod_proxy_html.c

A Reverse Proxy Scenario

Company example.com has a website at www.example.com, which has a
public IP address and DNS entry, and can be accessed from anywhere
on the Internet.

The company also has a couple of application servers which have
private IP addresses and unregistered DNS entries, and are inside the
firewall. The application servers are visible within the network,
including to the webserver, as “internal1.example.com” and
“internal2.example.com”. But because they have no public DNS entries,
anyone looking at internal1.example.com from outside the company
network will get a “no such host” error.

A decision is taken to enable Web access to the application servers.
But they should not be exposed to the Internet directly; instead they
should be integrated with the webserver, so that
http://www.example.com/app1/any-path-here is mapped internally to
http://internal1.example.com/any-path-here and
http://www.example.com/app2/other-path-here is mapped internally to
http://internal2.example.com/other-path-here. This is a typical
reverse-proxy situation.

Configuring the Proxy

As with any modules, the first thing to do is to load them in
httpd.conf (this is not necessary if we build them statically into
Apache).

LoadModule  proxy_module         modules/mod_proxy.so
LoadModule  proxy_http_module    modules/mod_proxy_http.so
#LoadModule proxy_ftp_module     modules/mod_proxy_ftp.so
#LoadModule proxy_connect_module modules/mod_proxy_connect.so
LoadModule  headers_module       modules/mod_headers.so
LoadModule  deflate_module       modules/mod_deflate.so
LoadFile    /usr/lib/libxml2.so
LoadModule  proxy_html_module    modules/mod_proxy_html.so

For Windows users this is slightly different: you’ll need to load
libxml2.dll rather than libxml2.so, and you’ll probably need to
load iconv.dll and zlib.dll as prerequisites to libxml2 (you
can download them from zlatkovic.com, the same site that
maintains Windows binaries of libxml2). The LoadFile directive is the same.
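
As a rough sketch, the Windows equivalent might look like the following. The directory C:/libxml2/bin is purely a placeholder, and the exact DLL filenames are assumptions; check both against the package you actually unpacked. The prerequisites are listed first so that they are already loaded when libxml2 needs them.

```apache
# Placeholder paths and filenames: adjust to wherever the DLLs were installed.
LoadFile    C:/libxml2/bin/zlib.dll
LoadFile    C:/libxml2/bin/iconv.dll
LoadFile    C:/libxml2/bin/libxml2.dll
LoadModule  proxy_html_module    modules/mod_proxy_html.so
```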

Of course, you may not need all the modules. Two that are not required
in our typical scenario are shown commented out above.

Having loaded the modules, we can now configure the Proxy. But before
doing so, we have an important security warning:

Do Not set “ProxyRequests On”. Setting ProxyRequests On turns your
server into an Open Proxy. There are ‘bots scanning the Web for open
proxies. When they find you, they’ll start using you to route around
blocks and filters to access questionable or illegal material. At
worst, they might be able to route email spam through your proxy. Your
legitimate traffic will be swamped, and you’ll find your server
getting blocked by things like family filters.

Of course, you may also want to run a forward proxy with
appropriate security measures, but that lies outside the scope of this
article. The author runs both forward and reverse proxies on the same
server (but under different Virtual Hosts).

The fundamental configuration directive to set up a reverse proxy is
ProxyPass. We use it to set up proxy rules for each of the application
servers:

ProxyPass       /app1/  http://internal1.example.com/
ProxyPass       /app2/  http://internal2.example.com/

Now as soon as Apache re-reads the configuration (the recommended way
to do this is with “apachectl graceful”), proxy requests will work, so
http://www.example.com/app1/some-path maps to
http://internal1.example.com/some-path as required.

However, this is not the whole story. ProxyPass just sends traffic
straight through. So when the application servers generate references
to themselves (or to other internal addresses), they will be passed
straight through to the outside world, where they won’t work.

For example, an HTTP redirection often takes place when a user (or
author) forgets a trailing slash in a URL. So the response to a
request for http://www.example.com/app1/foo proxies to
http://internal1.example.com/foo which generates a response:

HTTP/1.1 302 Found
Location: http://internal1.example.com/foo/
(etc)

But from the outside world, the net effect of this is a “No such host”
error. The proxy needs to re-map the Location header to its own
address space and return a valid URL:

HTTP/1.1 302 Found
Location: http://www.example.com/app1/foo/

The command to enable such rewrites in the HTTP Headers is
ProxyPassReverse. The Apache documentation suggests the form:

 
ProxyPassReverse /app1/ http://internal1.example.com/
ProxyPassReverse /app2/ http://internal2.example.com/

However, there is a slightly more complex alternative form that I
recommend as more robust:

<Location /app1/>
ProxyPassReverse /
</Location>
<Location /app2/>
ProxyPassReverse /
</Location>

The reason for recommending this is that a problem arises with some
application servers. Suppose for example we have a redirect:

HTTP/1.1 302 Found
Location: /some/path/to/file.html

This is a violation of the HTTP protocol as specified in RFC 2616, and
so should never happen: that specification only permits full URLs in
Location headers. However, it is also a
source of much confusion, not least because the CGI spec has a similar
Location header with different semantics where relative paths are
allowed. There are a lot of broken servers out there! In this
instance, the first form of ProxyPassReverse will return the incorrect
response

HTTP/1.1 302 Found
Location: /some/path/to/file.html

which, even allowing for error-correcting browsers, is outside the
Proxy’s address space and won’t work. The second form fixes this to

HTTP/1.1 302 Found
Location: /app2/some/path/to/file.html

which is still broken, but will at least work in error-correcting
browsers. Most browsers will deal with this.

If your backend server uses cookies, you may also need the
ProxyPassReverseCookiePath and ProxyPassReverseCookieDomain
directives. These are similar to ProxyPassReverse, but deal with the
different form of cookie headers. These require mod_proxy from
Apache 2.2 (recommended), or a patched version of 2.0.
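
As a rough sketch of how these might look for the first application server, assuming Apache 2.2's mod_proxy and assuming (the scenario does not say) that the backend sets cookies with its own hostname as the domain and / as the path:

```apache
<Location /app1/>
    # Rewrite "domain=internal1.example.com" in Set-Cookie headers
    # to the public hostname.
    ProxyPassReverseCookieDomain  internal1.example.com  www.example.com
    # Rewrite "path=/" in Set-Cookie headers to the proxied prefix.
    ProxyPassReverseCookiePath    /  /app1/
</Location>
```

Whether either directive is needed depends entirely on what the backend actually puts in its Set-Cookie headers, so inspect a real response first.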

Fixing HTML Links

As we have seen, ProxyPassReverse remaps URLs in the HTTP headers to
ensure they work from outside the company network. There is, however,
a separate problem when links appear in HTML pages served. Consider
the following cases:

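  1. A relative link such as <a href="otherfile.html">: this will be
    resolved by the browser and will work correctly.
  2. A root-relative link, <a href="/otherfile.html">: this will be
    resolved by the browser to http://www.example.com/otherfile.html,
    which is incorrect.
  3. A full link to an internal server, such as
    <a href="http://internal1.example.com/otherfile.html">: this will
    resolve to “no such host” for the browser.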

The same problem of course applies to included content such as images,
stylesheets, scripts or applets, and other contexts where URLs occur
in HTML.

To fix this requires us to parse the HTML and rewrite the links. This
is the purpose of mod_proxy_html. It works as an output filter,
parsing the HTML and rewriting links as it is served. Two
configuration directives are required to set it up:

  • SetOutputFilter proxy-html: This simply inserts the filter, to
    enable ProxyHTMLURLMap.
  • ProxyHTMLURLMap from-pattern to-pattern [flags]: In its basic form,
    this has a similar purpose and semantics to ProxyPassReverse.
    Additionally, an extended form is available to enable
    search-and-replace rewriting of URLs within Scripts and
    Stylesheets.

How it works

mod_proxy_html is based on a SAX parser: specifically the HTMLparser
module from libxml2 running in SAX mode (any other parse mode would of
course be very much slower, especially for larger documents). It has
full knowledge of all URI attributes that can occur in HTML 4 and
XHTML 1. Whenever a URL is encountered, it is matched against
applicable ProxyHTMLURLMap directives. If it starts with any
from-pattern, that will be rewritten to the to-pattern. Rules are
applied in the reverse order to their appearance in httpd.conf, and
matching stops as soon as a match is found.

Here’s how we set up a reverse proxy for HTML. Firstly, full links to
the internal servers should be rewritten regardless of where they
arise, so we have:

ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2

Note that in this instance we omitted the “trailing” slash. Since the
matching logic is starts-with, we use the minimal matching pattern. We
have now globally fixed case 3 above.

Case 2 above requires a little more care. Because the link doesn’t
include the hostname, the rewrite rule must be context-sensitive. As
with ProxyPassReverse above, we deal with that using

<Location /app1/>
ProxyHTMLURLMap / /app1/
</Location>
<Location /app2/>
ProxyHTMLURLMap / /app2/
</Location>
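
One consequence of starts-with matching is worth spelling out. A link that already points into the proxy's address space, such as /app1/page.html, also starts with /, so the context-sensitive rule would turn it into /app1/app1/page.html. A sketch of one way to guard against this, assuming rules really are tried in reverse order of appearance with matching stopping at the first hit (check the documentation for your version of mod_proxy_html, since the ordering may differ), is to add an identity rule after the general one:

```apache
<Location /app1/>
    # General rule: root-relative links from the backend gain the /app1/ prefix.
    ProxyHTMLURLMap  /      /app1/
    # Identity rule, listed last so that it is tried first: links that already
    # start with /app1 match here, are left unchanged, and matching stops.
    ProxyHTMLURLMap  /app1  /app1
</Location>
```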

Debugging your Proxy Configuration

The above is a simple case taken from mod_proxy_html version 1. With
the more complex URL mapping and rewriting enabled by Version 2, you
may need a bit of help setting up a complex ruleset, perhaps involving
a series of complex regexps, chained and blocking rules, etc. To help
with setting up and troubleshooting your rulesets, mod_proxy_html 2
provides a “debug” mode, in which all the ‘interesting’ things it does
are written to the Apache error log. To analyse and fix your rulesets,
set

ProxyHTMLLogVerbose On
# LogLevel Debug also works
LogLevel Info

Now run your testcases through your rulesets, and examine the Apache
error log for details of exactly how it was processed.

Do not leave ProxyHTMLLogVerbose On for normal use. Although the
effect is marginal, it is an overhead.

Extended URL Mapping

The previous section sets up remapping of HTML URLs, but leaves any
URL encountered in a Stylesheet or Script untouched. mod_proxy_html
doesn’t parse Javascript or CSS, so dealing with URLs in them requires
text-based search-and-replace. This is enabled by the directive
ProxyHTMLExtended On.

Because the extended mode is text-based, it can no longer guarantee to
match exact URLs. It’s up to you to devise matching rules that can
pick out URLs, just as if you were writing an old-fashioned Perl or
PHP regexp-based filter (though of course it’s still massively more
efficient than performing search-and-replace on an entire document
in-memory). To help with this, ProxyHTMLURLMap supports both simple
text-based and regular expression search-and-replace, according to the
flags. You can also use the flags to specify rules separately for HTML
links, scripting events, and embedded scripts and stylesheets.

A second key consideration with extended URL mapping is that whereas
an HTML link contains exactly one URL, a script or stylesheet may
contain many. So instead of stopping after a successful match, the
processor will apply all applicable mapping rules. This can be stopped
with the L (last) flag.
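
To make this concrete, here is a sketch of an extended ruleset for the first application server. The flag letters are given as I understand them from the mod_proxy_html documentation (R for regexp, i for case-insensitive, h to skip HTML links, e to skip scripting events); treat them as assumptions and confirm them against the documentation for your version.

```apache
<Location /app1/>
    SetOutputFilter   proxy-html
    # Enable text-based rewriting inside scripts, stylesheets and events.
    ProxyHTMLExtended On
    # Plain rule with no flags: applies to HTML links and, with extended
    # mode on, to scripts and stylesheets as well.
    ProxyHTMLURLMap   http://internal1.example.com  /app1
    # Regexp rule aimed at CSS url(...) references only: the h and e flags
    # keep it away from HTML links and scripting events.
    ProxyHTMLURLMap   url\(http://internal1\.example\.com([^\)]*)\)  url(/app1$1)  Rihe
</Location>
```

Text-based rules like the second one can match more than you intend, so it is worth running them with verbose logging enabled before relying on them.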

Dealing with multimedia content

We just set up a proxy to parse and where necessary correct HTML. But
of course, the web isn’t just HTML. Surely feeding non-HTML content
through an HTML parser is at best inefficient, if not totally broken?

Yes indeed. mod_proxy_html deals with that by checking the
Content-Type header, and removing itself from the processing chain
when a document is not HTML (text/html) or XHTML
(application/xhtml+xml). This happens in the filter initialisation
phase, before any data are processed by the filter.

But that still leaves a problem. Consider compressed HTML:

Content-Type: text/html
Content-Encoding: gzip

Feeding that into an HTML parser is clearly broken!

There are two solutions to this. One is to uncompress the incoming
data with mod_deflate, process it, and compress it again on the way
out. Compressing content radically reduces network traffic, but
uncompressing and recompressing it increases the processor load on
the proxy. It is worthwhile if and only if bandwidth between the proxy
and the backend is at a premium: this is common on the ‘net at large,
but unlikely to be the case on a company internal network.

SetOutputFilter  INFLATE;proxy-html;DEFLATE

The alternative solution is to refuse to support
compression. Stripping any Accept-Encoding request header does the
job. So invoking mod_headers, we add a directive

RequestHeader unset Accept-Encoding

This should only apply to the Proxy, so we put it inside our <Location> containers.
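
The two alternatives can be sketched side by side as follows. Using /app1/ for one and /app2/ for the other is only for illustration; in practice you would pick one approach per backend depending on where bandwidth is scarce.

```apache
# Alternative 1: let the backend compress. The proxy inflates the response,
# rewrites links, then recompresses. Requires mod_deflate to be loaded.
<Location /app1/>
    SetOutputFilter  INFLATE;proxy-html;DEFLATE
</Location>

# Alternative 2: ask the backend for uncompressed content by removing the
# Accept-Encoding request header. Requires mod_headers to be loaded.
<Location /app2/>
    RequestHeader    unset  Accept-Encoding
    SetOutputFilter  proxy-html
</Location>
```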

A similar situation arises in the case of encrypted (https) content.
But in this case, there is no such workaround: if we could decrypt the
data to process it then so could any other man-in-the-middle, and the
security would be worthless. This can only be circumvented by
installing mod_ssl and a certificate on the proxy, so that the actual
secure session is between the browser and the proxy, not the origin
server.
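
A minimal sketch of that arrangement, assuming mod_ssl is available and using placeholder certificate paths, could look like this. The secure session ends at the proxy, and the hop to the backend is plain HTTP so that proxy-html can parse the response.

```apache
LoadModule ssl_module modules/mod_ssl.so
Listen 443

<VirtualHost *:443>
    ServerName www.example.com

    # The browser's secure session terminates here, at the proxy.
    # Placeholder paths: substitute your own certificate and key.
    SSLEngine on
    SSLCertificateFile     /path/to/www.example.com.crt
    SSLCertificateKeyFile  /path/to/www.example.com.key

    # The backend is reached over plain HTTP inside the firewall.
    ProxyPass  /app1/  http://internal1.example.com/
    <Location /app1/>
        ProxyPassReverse /
        SetOutputFilter  proxy-html
        ProxyHTMLURLMap  /  /app1/
        RequestHeader    unset  Accept-Encoding
    </Location>
</VirtualHost>
```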

The Complete Configuration

We are now in a position to write a complete configuration for our
reverse proxy. Here is a bare minimum that ignores extended
URL mapping:

LoadModule proxy_module      modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule headers_module    modules/mod_headers.so
LoadFile   /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
 
ProxyRequests off
ProxyPass /app1/ http://internal1.example.com/
ProxyPass /app2/ http://internal2.example.com/
ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2
 
<Location /app1/>
ProxyPassReverse /
SetOutputFilter  proxy-html
ProxyHTMLURLMap  /      /app1/
ProxyHTMLURLMap  /app1  /app1
RequestHeader    unset  Accept-Encoding
</Location>
 
<Location /app2/>
ProxyPassReverse /
SetOutputFilter proxy-html
ProxyHTMLURLMap /       /app2/
ProxyHTMLURLMap /app2   /app2
RequestHeader   unset   Accept-Encoding
</Location>

Of course, there’s more than one way to do it. Our configuration would
actually have been simpler if we’d used Virtual Hosts for each
application server. But that takes you beyond the realm of Apache
configuration and into DNS. If you don’t fully understand that (or if
you think “why can’t I see my domain” is a webserver question), then
please don’t try using virtual hosts for this.
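
For readers who are comfortable with DNS, a virtual-host version might look roughly like this. It is a sketch under assumptions, not a tested configuration: app1.example.com is a hypothetical public hostname that would have to resolve to the proxy, and the NameVirtualHost line applies to Apache 2.0 and 2.2.

```apache
# Needed for name-based virtual hosts in Apache 2.0 and 2.2.
NameVirtualHost *:80

<VirtualHost *:80>
    # Hypothetical public name; it must resolve in public DNS to the proxy.
    ServerName app1.example.com

    ProxyPass         /  http://internal1.example.com/
    ProxyPassReverse  /  http://internal1.example.com/

    SetOutputFilter   proxy-html
    # The path space is the same on both sides, so root-relative links
    # already work; only full links naming the internal host need rewriting.
    ProxyHTMLURLMap   http://internal1.example.com  http://app1.example.com
    RequestHeader     unset  Accept-Encoding
</VirtualHost>
```

The simplification comes from the proxy and the backend sharing the same URL paths: no <Location> containers or prefix rules are needed.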

Further topics

Caching

We haven’t dealt with caching in this article. In a company-intranet
situation, the connection from the proxy to the application servers is
the local LAN, which is probably fast and has ample capacity. In such
cases, caching at the proxy will have little effect, and can probably
be omitted.

If we want to cache pages, we can of course do so with mod_cache. But
that is beyond the scope of this article.

Content Transformation

Another powerful use for a proxy is to transform the content
on-the-fly according to the user’s preferences. This author’s flagship
mod_accessibility product (from which mod_proxy_html is a spinoff)
serves to transform HTML and XHTML on-demand to enhance usability and
accessibility.

Filtering and Security

A reverse proxy is not the natural place for a “family filter”, but is
ideal for defining access controls and imposing security restrictions.
We could, for example, configure the proxy to recognise a custom
header from an origin server and block content based on it. This
delegates control to the application servers.

Questions and Answers

(Q) Where can I get the software?
(A) Most of it from the obvious place, http://httpd.apache.org/.
mod_proxy_html is available from http://apache.webthing.com/, and
libxml2 is available from http://xmlsoft.org/. Windows users should
read libxml2.dll for libxml2.so, and can obtain it together with the
prerequisites iconv.dll and zlib.dll from Igor Zlatkovic’s site.
(Q) Can I get binaries of the software?
(A) If there’s no link at the websites above, ask the provider
of your operating system or distribution. The author can
compile it on different platforms but does not provide a free
compilation service.
(Q) What is httpd.conf? My apache has different configuration files.
(A) Some distribution packagers mess about with the Apache
configuration. If this applies to you, the details should be
documented by your distributor, and have nothing to do with
Apache itself! Substitute your distribution’s choice of
configuration file for httpd.conf in the above discussion, or
create your own proxy.conf file and Include it.
(Q) You mentioned apxs and apachectl. Where do I find them?
(A) They’re part of a standard Apache installation (except on
Windows). If you don’t have them or can’t find them, that’s a
problem with your installation. The easiest solution is
probably to download a complete Apache from
httpd.apache.org.
(Q) Does mod_proxy_html deal with Javascript links?
(A) From mod_proxy_html 2.0, yes!
(Q) The proxy appears to change my HTML?

(A) It doesn’t really, but it may appear to. Here are the possible causes:

  1. Changing the FPI (the <!DOCTYPE ...> line) may affect some browsers. FIX: set the doctype explicitly if this bothers you.

  2. mod_proxy_html has the side-effect of transforming content to
    utf-8 (Unicode) encoding. This should not be a problem: utf-8
    is well-supported by browsers, and offers comprehensive
    support for internationalisation. If it appears to cause a
    problem, that’s almost certainly a bug in the application
    server, or possibly a misconfigured browser. FIX: filter
    through mod_charset_lite to your chosen charset.

  3. mod_proxy_html will perform some minor normalisations. If
    your HTML includes elements that are closed implicitly, it
    will explicitly close them. In other words:

    <body>
    <p>Hello, World!
    </body>

    will be transformed to

    <body>
    <p>Hello, World!</p>
    </body>

    If this affects the rendition in your browser, it almost
    certainly means you are using malformed HTML and relying on
    error-correction in a browser. FIX: validate your HTML! The
    online Page Valet service will both validate and show your
    markup normalised by the DTD, while a companion tool
    AccessValet will show markup normalised by the same parser
    used in the proxy, and highlight other problems. Both are
    available at http://valet.webthing.com/
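
For the first cause above, a sketch of setting the doctype explicitly, assuming mod_proxy_html 2.0 or later, whose documentation describes a ProxyHTMLDoctype directive (verify the accepted values for your version):

```apache
<Location /app1/>
    # Declare output as HTML 4.01. "XHTML" selects XHTML 1.0 instead, and
    # adding "Legacy" as a second argument selects the Transitional variant.
    ProxyHTMLDoctype HTML
</Location>
```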

(Q) I need a customised solution.
(A) The author is available for development and consultancy.


Reader Comments

  1. ice_zombie ~

    THANK YOU!!! This was a great help. Needed a bit customization but works like a charm. =)

  2. Mauricio Matias ~

    It is hard to read it :)

  3. pradyumna ~

    Hi,

    I want an architecture based on the following scenario.

    1. I will have a server that will play a role of a web server/reverse proxy.
    2. This proxy server will be placed in my DMZ
    3. My applications will be running on the MZ.
    4. When some external user will access they have to use _https_ where as internal users should access by _http_
    5. The reverse proxy must be able to provide “local” websites (PHP/HTML etc )

    Please can someone help me in designing the architecture and how i will do it.

    /Pradyumna

  4. Chris ~

    Could you possibly post the “simplified” example that you mentioned if vhosts were used?

    We use virtual hosts extensively, and I’m curious as to how this would be simplified…

    Thanks, this is a great read!

  5. Des ~

    If backend server returns 302 redirect response can this be “trapped” by proxy and redirected internally without sending response back to client?

  6. dumbo ~

    Hi,
    I have a slightly different scenario than the reverse proxy examples given. I have to redirect the incoming requests (http://) to a secured site (https://) and get the data back.
    Also there is user authentication.
    How can that be accomplished in a reverse proxy?

  7. Cd-MaN ~

    Hello. I have an issue with mod_proxy I blogged about (http://hype-free.blogspot.com/2006/09/apache-and-modproxy.html). Basically the scenario is the following:
    -one internal server with CentOS running an old(er) version of Apache (some 2.0.x) with mod proxy
    -the target server out on the web with Windows and a new Apache 2.2
    -connections from the proxy to the target server are made through SSL for security reasons
    -if keep-alive connections are enabled at every 2-3 requests the proxy says something similar to: “I received an invalid response from an upstream server”. With keep-alive connections disabled everything works fine, but the performance penalty is big, because the encryption has to be renegotiated for every query. Any ideas what could be the problem? I tried to google it but came up with no useful info.

  8. LonerVamp ~

    This is just FYI. The article says:

    “At the time of writing, the reason most commonly cited for not upgrading is difficulties running PHP on Apache 2. I cannot speak from personal experience, but several well-informed sources tell me the difficulty lies with non-thread-safe code in PHP, and that it works well with Apache 2 if it is built with the non-threaded Prefork MPM.”

    This was true in October 2006 when this was updated, but is no longer the issue with the latest Apache and PHP installs.

