I have a project over at trafficspoke.com called IndustryStats. The Perl app was basically an experiment in SEO and large free data sets such as those offered up by the US Census. The app itself isn’t terribly interesting, but it did have a problem that I didn’t have an elegant way to solve. The app makes several external requests per page. It loads a little Superpages module, an RSS feed from Yahoo! and another RSS feed from Jobster. There are thousands of pages on the site, so I needed a way to implement caching so I wouldn’t nail the external providers when Google crawled the site.

Last week, I wrote my own caching mechanism. It was really simple but did the trick. Basically, I just had a Perl subroutine named ‘getYahoo’ that went out to Yahoo! and pulled the RSS data, wrote it to an appropriately named file in a cache folder on the webserver, then printed it out to the web page. The filenames were consistent between requests, but unique to the specific page. For example, the Yahoo feed for Maricopa County, Arizona was named “yahoo_news_Maricopa_Arizona.txt”.

Subsequent calls to the getYahoo subroutine would check to see if a file with the appropriate name already existed in the cache folder. If the file existed, it would simply open the file, read it into an array, then subsequently print all the data in the file out onto the web page. If the file didn’t exist yet, it would do the routine defined above to create the file, write the stuff out to it, etc.

The whole mechanism was very simple and all fit in an IF statement of about 10 lines or so:

# Build a per-page cache filename, e.g. yahoo_news_Maricopa_Arizona.txt
$yahoofilename = "yahoo_news_" . $fcty . "_" . $friendlystate . ".txt";

if (-e "./cache/$yahoofilename") {
    # Cache hit: read the saved feed back into $yahoorss
    open(YAHOORSSFILE, "<", "./cache/$yahoofilename")
        or die "Cannot open ./cache/$yahoofilename: $!";
    @yahoorss_array = <YAHOORSSFILE>;

    foreach $yahooline (@yahoorss_array) {
        $yahoorss .= $yahooline;
    }

    close YAHOORSSFILE;

} else {
    # Do all that other stuff I talked about before if the file doesn't exist yet
}
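
The else branch is only a comment in the snippet above, so here is a rough sketch of how the whole hit-or-miss routine might fit together. It is not the code that ran on the site: the name cached_fetch and the $fetch code ref (a stand-in for the real request to Yahoo!) are made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Return the cached text if the file exists; otherwise call $fetch,
# save its result to the cache file, and return it.
# $fetch is a hypothetical code ref standing in for the real Yahoo! request.
sub cached_fetch {
    my ($cache_dir, $filename, $fetch) = @_;
    my $path = File::Spec->catfile($cache_dir, $filename);

    if (-e $path) {
        # Cache hit: slurp the whole file in one read.
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        local $/;
        my $data = <$in>;
        close $in;
        return $data;
    }

    # Cache miss: fetch, then write the result out for next time.
    my $data = $fetch->();
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $data;
    close $out or die "Cannot close $path: $!";
    return $data;
}
```

Calling it twice with the same filename should only run the fetch once; the second call reads the file written by the first.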

As you can imagine, such a simple caching mechanism has its drawbacks. It doesn’t update the cache if the content has been updated. It only caches the feed once, and all subsequent reads, forever, pull from that file (I planned to get around that with a cron job that deleted files in the cache dir periodically). It had a host of other issues, too. I knew there had to be a better way.
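
For what it’s worth, the “cached forever” problem can also be handled without a cron job by checking the file’s age before trusting it. This is just a sketch of that idea, not something I actually ran; the name cache_is_fresh and the max-age argument are assumptions for illustration.

```perl
use strict;
use warnings;

# True if $path exists and was last modified less than $max_age seconds ago.
# A missing file counts as stale, so the caller falls through to a re-fetch.
sub cache_is_fresh {
    my ($path, $max_age) = @_;
    my @st = stat($path) or return 0;    # stat returns an empty list on failure
    return (time() - $st[9]) < $max_age; # $st[9] is the modification time
}
```

Swapping the plain -e test for something like cache_is_fresh("./cache/$yahoofilename", 24*60*60) would make each cached feed expire after a day. It still doesn’t notice when the content changes sooner than that.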

I was browsing Safari tech books online and found a short book/paper on making Tag Clouds in Perl. I was reading the article and ran across a mention of HTTP::Cache::Transparent. It’s a CPAN module that handles caching of HTTP GET requests (like via LWP)…wait for it…transparently! That’s right, you just need to ‘use’ the module at the top of your script, specify a cache folder and, presto! you’ve got caching. It uses the basic If-Modified-Since and Last-Modified headers in HTTP requests to figure out if the script should use the cache or pull from the external source. You don’t get all the speed of a purely local cache (since it’ll still call out to the external server to see if the cache is up to date), but you do get the benefit of ensuring that the content you use is always the latest.
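
To make the If-Modified-Since idea concrete, here is a small sketch of a conditional GET done by hand. It shows the general pattern only, not how HTTP::Cache::Transparent is implemented internally. The names http_date and revalidate and the shape of the $cached hash are made up for illustration, and $ua is assumed to be an LWP::UserAgent (or anything with a compatible get method).

```perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date string in GMT.
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec);
}

# Ask the server whether the cached copy is still current.
# $cached is a hypothetical hash ref: { body => text, fetched => epoch }.
sub revalidate {
    my ($ua, $url, $cached) = @_;
    my $res = $ua->get($url,
        'If-Modified-Since' => http_date($cached->{fetched}));

    # 304 Not Modified: the server sent no body, so reuse ours.
    return $cached->{body} if $res->code == 304;

    # Otherwise take the fresh body if the request succeeded,
    # and fall back to the cached copy on errors.
    return $res->is_success ? $res->decoded_content : $cached->{body};
}
```

The round trip to the server still happens on every call, which is the speed trade-off mentioned above, but a 304 response carries no body, so the feed is only downloaded again when it has actually changed.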

It’s got some basic params, like how long to keep the cached file around if it’s still the most up to date (MaxAge, in hours) and how long to use the file without going to the external provider at all (NoUpdate, in seconds, in case you want to always use the cached copy if it is less than a day old), etc.

I implemented it on IndustryStats tonight and commented out my old caching code. It seems to be doing the trick so far. See what you can learn by reading docs on trendy web 2.0-ish tag clouds? Useful stuff that can be used on my very, very web 1.0 site!

Here is an example of using HTTP::Cache::Transparent in a Perl script. See, told you it’s easy! Once you include these two statements at the top of your script, all subsequent ‘get’ requests in the file will be run through the HTTP::Cache::Transparent logic to either pull directly from the cache or pull from the external server and save into the cache for future use.


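use HTTP::Cache::Transparent;

HTTP::Cache::Transparent::init( {
    BasePath => '../transcache',  # folder where cached responses are stored
    NoUpdate => 12*60*60,         # seconds: for 12 hours, answer from cache without contacting the server
    MaxAge   => 7*24,             # hours: keep cached items for 7 days after they were last requested
} );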