
Threaded/Parallel Web Crawler (or Web Server Killing Software)

January 26th, 2010

Short Version


Parallel URL Fetcher – If you want to put load on a web server by crawling it, this is what you’re looking for. No Java, no Python, just a small, fast C program.

Long Version


It’s time to re-evaluate our HTTP caching software. At present we use Apache mod_cache (disk cache) and we’ve run into some problems.
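For context, that just means mod_cache with mod_disk_cache sitting in front of the origin. A stripped-down sketch of that kind of config looks something like the following (the paths and sizes are made up for illustration, not our actual values):

    # Apache 2.2-style disk cache (illustrative paths/values, not our real config)
    LoadModule cache_module      modules/mod_cache.so
    LoadModule disk_cache_module modules/mod_disk_cache.so

    CacheEnable disk /
    CacheRoot /cache/apache        # lives on its own ZFS dataset
    CacheDirLevels 2               # directory fan-out for cached objects
    CacheDirLength 1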

Apache mod_cache + ZFS + millions of URLs and hundreds of gigs of cache files = bad

I’m not sure which of these is the culprit. What I do know is that once the ZFS dataset holding Apache’s cache reaches a certain size, disk I/O requests go through the roof. Clearing the cache (and freeing up that I/O) gives us a good 5%-10% jump in traffic, which for us is extremely significant.
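If you want to watch this happen for yourself, the stock Solaris tools are enough; “tank” below is just a placeholder pool name:

    # Pool-level I/O, sampled every 5 seconds
    zpool iostat -v tank 5

    # Per-device utilization and service times, every 5 seconds
    iostat -xn 5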

At any rate, this prompted us to start looking into alternatives to Apache. The obvious first choice is Squid in accelerator mode. So I got Squid all set up in our offline datacenter, fixed the little things, and was ready to beat the crap out of it with web requests.
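For anyone who hasn’t done it, a bare-bones accelerator-mode squid.conf looks roughly like this (hostnames, port, and cache size are placeholders, not our real config):

    # Squid 2.7/3.x reverse-proxy (accelerator) sketch - placeholders throughout
    http_port 80 accel defaultsite=www.example.com
    cache_peer origin.example.com parent 80 0 no-query originserver name=origin

    acl our_site dstdomain www.example.com
    http_access allow our_site
    cache_peer_access origin allow our_site

    # ~100 GB disk cache, 16 x 256 subdirectories
    cache_dir ufs /cache/squid 100000 16 256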

I can easily request all of our 500k+ “static” URLs, but those pesky URLs with arguments aren’t quite that easy. I needed a crawler. Something like wget --mirror, but much, much, much faster.
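The static half really is that easy with stock wget (urls.txt and the hostname below are placeholders):

    # Replay a known list of static URLs through the cache
    wget --input-file=urls.txt --quiet --output-document=/dev/null

    # Or crawl by following links and throwing the files away -
    # this works, but it is single-threaded, hence far too slow here
    wget --recursive --level=inf --no-directories --delete-after http://cache.example.com/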

After a lot of searching, I found a few Python apps that failed to build on Solaris, had old or deprecated dependencies, required a specific Python version, and so on. Python is starting to feel more and more like Java: either the developers are horrible or the interpreter is too picky to work properly (think: JRE 1.2.5 build 1482? No, no, no, you need build 1761!).

Speaking of Java, I also found a Java app (JCrawler) that looked perfect for what I needed. It certainly claimed to be “perfect.” It got further than the Python apps that wouldn’t build or run, but it didn’t actually work either: it just kept spawning threads until it ran out of memory.

I was just about at the point of writing one myself when I clicked on a link, a bright light from the heavens shone down on my monitor, and a choir started singing in the background.

I had found the Parallel URL Fetcher. It was exactly what I needed. It was like wget, but ran parallel requests. It didn’t compile on Solaris either, but adding timeradd() and timersub() macros fixed that real quick.
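For anyone else trying to build it on Solaris: the fix is just supplying the BSD struct timeval helper macros that Solaris doesn’t define in <sys/time.h>. The standard definitions (shown here as a sketch, not the fetcher’s own patch) look like this:

    /* BSD-style struct timeval helpers, missing from Solaris <sys/time.h>.
       Guarded so they are harmless on platforms that already define them. */
    #include <sys/time.h>

    #ifndef timeradd
    #define timeradd(a, b, result)                            \
        do {                                                  \
            (result)->tv_sec  = (a)->tv_sec  + (b)->tv_sec;   \
            (result)->tv_usec = (a)->tv_usec + (b)->tv_usec;  \
            if ((result)->tv_usec >= 1000000) {               \
                (result)->tv_sec++;                           \
                (result)->tv_usec -= 1000000;                 \
            }                                                 \
        } while (0)
    #endif

    #ifndef timersub
    #define timersub(a, b, result)                            \
        do {                                                  \
            (result)->tv_sec  = (a)->tv_sec  - (b)->tv_sec;   \
            (result)->tv_usec = (a)->tv_usec - (b)->tv_usec;  \
            if ((result)->tv_usec < 0) {                      \
                (result)->tv_sec--;                           \
                (result)->tv_usec += 1000000;                 \
            }                                                 \
        } while (0)
    #endif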

I don’t think it supports Keep-Alive requests either, which would have been nice, but either way it rocked through the URLs. After letting it run for a few hours, I had my Squid server maxed out at 100 gigs of cache and ready for some I/O testing.


  1. Param
    February 19th, 2010 at 09:36

    For doing something similar, I used http://freshmeat.net/projects/aget/
