Though it tends to get treated poorly, HTTP isn’t a dumb
file-transfer protocol. It allows you, to a certain degree, to specify an intention with your requests (GET/POST, with PUT and DELETE available if you wish), it has a somewhat structured method for parameter passing, and it supports content negotiation and authentication. The focus of this article, however, is
caching.
Until recently, my experiences with caching had been mostly in the
form of fighting against it. Web developers in particular have likely experienced browsers (one of them in particular) inappropriately caching pages and scripts, causing users to see stale content or to keep loading broken scripts even after they have been fixed. Typically, one needs to add some magical
headers found through a random Web search and behold, the
browser stops caching and all is well.
The reason browsers behave like this is that it’s a lot of work to constantly re-fetch all these files. Not doing so saves bandwidth and makes pages load faster. When it works, it’s a positive thing. The original vision for HTTP was for a more
static world than today’s Web, where clients and proxies could happily
cache content, dramatically reducing the load on the infrastructure.
It didn’t work out that way, since somewhere near the turn
of the century, the use of Perl and PHP to generate pages exploded, and
many Web developers started to insist on making
everything dynamic. Some of the fads that were born then
(visitor counters!) have since died out, but the idea of Web pages I
visit being tailored just for me (showing whether I’m logged in, what
I’m allowed to do), and being derived from databases rather than
static files, has become ‘the way things are done.’
This dynamic approach is relatively easy to set up, and for systems
without heavy use it works well. Only when a site is accessed heavily does the wastefulness of such an approach become apparent: you are building up a page from scratch (probably involving a bunch of database calls, some file accesses, potentially some expensive computation) for every single hit, even
though the page returned is likely to be very similar to the one you
generated 37 milliseconds ago.
One solution is to use a system to cache chunks of data
and fragments of pages on the server side. For many situations,
though, HTTP itself provides a well-specified and elegant model for
caching content.
There are two forms of caching in HTTP: the expiration model and the
validation model. In the first, the server includes a header in its
response that tells the client how long the content stays ‘fresh.’ The
client is allowed to use this content as long as it’s fresh without
checking back with the server. Using this with a long expiration period is rarely what you want, since you tend to lose control over what the user is seeing, but it does have the advantage that repeated access causes no server load at all.
The second model, validation, is more interesting. Here, the server
sends a piece of information identifying the version of the current
content, either in the form of a last-modified date, or an ‘entity
tag’ — an opaque identifying string, such as an MD5 hash of the
content. On subsequent requests, the client may send a header
indicating the version it currently has cached, and the server has the
choice of sending back a response with status code 304 (not modified)
if that version is still up-to-date. If it isn’t, you can proceed as
normal and generate the proper content. Web servers typically do this
automatically when serving static files, using the file system’s
modification times.
To use expiration caching, you simply send along a header like this:
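    Expires: Thu, 01 Jan 2026 12:00:00 GMT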
Note the convoluted date format. Your Web library hopefully has
functions for reading and writing timestamps in that format.
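In Python, for instance, the standard library can produce and parse these timestamps; a minimal sketch, not tied to any particular framework:

    from email.utils import formatdate, parsedate_to_datetime
    import time

    # An RFC 1123 timestamp one hour in the future, suitable for an Expires header.
    expires = formatdate(time.time() + 3600, usegmt=True)

    # Parsing a timestamp received in, say, an If-Modified-Since header.
    since = parsedate_to_datetime('Thu, 01 Jan 2026 12:00:00 GMT')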
Validation caching requires you to add either a Last-Modified or an ETag header to your responses, or both:
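    Last-Modified: Sat, 01 Jan 2000 00:00:00 GMT
    ETag: "a5f80baf6ce81f5e5bd4b3a66d72f4f6"
    Cache-Control: no-cache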
(The Cache-Control header tells the browser that it’s not okay to
reuse its cached version of the content without asking the server
whether it is up-to-date.)
Before responding, you determine the resource’s current last-modified
date or entity tag, and check for If-Modified-Since or If-None-Match
headers. When an If-Modified-Since header with a date no older than
the resource’s modification date is given, you immediately respond
with a 304 response, and don’t have to generate your page. The same
happens when an If-None-Match header is given that includes the
current entity tag — though in this case, you have to make sure to
resend the ETag header along with the 304 response.
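As a rough sketch of the If-Modified-Since side of this (the If-None-Match case is analogous), here is a minimal WSGI application; lookup_modified and generate_page are hypothetical stand-ins for your own logic, and the details of reading request headers will differ with your framework:

    from wsgiref.handlers import format_date_time
    from email.utils import parsedate_to_datetime

    def app(environ, start_response):
        # Cheap lookup of when the underlying data last changed, e.g. a
        # timestamp column in the database (hypothetical helper).
        modified = lookup_modified(environ)   # unix timestamp, whole seconds

        since = environ.get('HTTP_IF_MODIFIED_SINCE')
        if since:
            try:
                since_ts = parsedate_to_datetime(since).timestamp()
            except (TypeError, ValueError):
                since_ts = None
            if since_ts is not None and since_ts >= modified:
                # The client's copy is no older than ours: skip page generation.
                start_response('304 Not Modified', [])
                return []

        headers = [('Content-Type', 'text/html'),
                   ('Last-Modified', format_date_time(modified)),
                   ('Cache-Control', 'no-cache')]
        start_response('200 OK', headers)
        return [generate_page(environ)]       # hypothetical: builds the page bytes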
(For the fine print, consult the HTTP 1.1 specification, which is relatively concise and readable, and a better source of information than much of what gets written about the subject online.)
The tricky aspect of this is reliably determining when a
page should be considered modified. How this works depends entirely on
the application. For a blog it can be relatively simple; for a
complicated site full of dynamic widgets it might be impossible. If
you take caching into account while designing your site, it’s wise to avoid obvious cache-breakers, like showing the current time on the page. One useful trick is to have JavaScript take care of some dynamic aspects, such as showing the name of a logged-in user and hiding controls that they don’t have access to (though this has some accessibility ramifications).
Getting people’s browsers to cache stuff, while it can help, is
hardly a lifesaver. The beauty of the HTTP protocol is that if you do
caching properly, it’s easy to add your own proxy server in front
of your server, and have it cache requests for multiple clients. The
proxy will behave in almost the same way as a client, understanding cache-related headers and asking for modified content at the appropriate times, and it is relatively easy to ‘drop in’ when load becomes an issue.
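As an illustration of how such a proxy can be dropped in, a caching reverse proxy in front of an application server might be configured roughly like this with nginx (the path, sizes, and port here are assumptions, not recommendations):

    # nginx configuration fragment (illustrative values)
    proxy_cache_path /var/cache/nginx keys_zone=pagecache:10m max_size=100m;

    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:8000;   # your application server
            proxy_cache pagecache;              # cache responses according to their headers
        }
    }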
One likely way to screw up when using a proxy is being too liberal
with your caching. If you do render the name of the logged-in user in
your HTML, you don’t want your proxy to serve the page with Bob’s
name in it to Alice. And if you render a page showing a user’s private
info, such as a credit card number (well, you should probably never do
that, certainly not over non-secure HTTP), you clearly don’t want
that page to be cached and sent to someone else. There are a few more directives that can be given in the Cache-Control header for cases like this, and they will be respected by any half-decent proxy program.
‘private’ indicates that the response is meant only for the current
recipient, and only that recipient should cache it. ‘no-store’
can be used to tell any and all caches to never store this response on
disk. It’s a good idea to add that whenever you’re returning sensitive information that shouldn’t accidentally end up on someone’s hard disk.
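For instance, a page personalized for a single user might be served with

    Cache-Control: private, no-cache

while a response carrying data that should never be written to disk would use

    Cache-Control: no-store

(the exact combination of directives depends, of course, on what else you want caches to do with the response).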
Finally, for services that provide some kind of remote procedure call interface (XML-RPC, JSON, whatever, as long as it runs over HTTP),
determining last-modified dates or entity tags is often quite
simple, since such requests tend to be more focused than Web page
requests. You could even use a proxied HTTP service internally in your
server to access data that benefits from local caching, as an
alternative to memcached.
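A minimal sketch of deriving an entity tag for such a response; the helper name and the choice of MD5 over the serialized result are just one way to do it:

    from hashlib import md5
    import json

    def etag_for(result):
        # Serialize deterministically, then hash; identical results always
        # yield the identical tag, so clients (and proxies) holding that
        # tag can be answered with a 304.
        body = json.dumps(result, sort_keys=True).encode('utf-8')
        return '"%s"' % md5(body).hexdigest()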
This article originally appeared on WebReference.com.