Archive for January, 2010

Threaded/Parallel Web Crawler (or Web Server Killing Software)

January 26th, 2010

Short Version

Parallel URL Fetcher – If you want to put load on a webserver by crawling it, this is what you’re looking for. No java, no python, just a nice small, fast C program.

Long Version

It’s time to re-evaluate our HTTP caching software. At present we use Apache mod_cache (disk cache) and we’ve run into some problems.

Apache mod_cache + ZFS + millions of URLs and hundreds of gigs of cache files = bad

I’m not sure which of these guys is the culprit in this one. But I do know that when the ZFS dataset holding Apache’s cache gets to a certain size, disk I/O requests go through the roof. By clearing the cache (and freeing up that I/O), we see a good 5%-10% (extremely significant) jump in traffic.

At any rate, this prompted us to start looking into alternatives to Apache. The obvious first choice is Squid in accelerator mode. So I got Squid all set up in our offline datacenter, fixed the little things, and was ready the beat the crap out of it with web requests.

I can easily request all of our 500k+ “static” URLs, but those pesky URLs with arguments aren’t quite that easy. I needed a crawler. Something like wget –mirror but much, much, much faster.

After a lot of searching, I found a few python apps that failed to compile on Solaris, had deprecated/old dependencies, required specific python, etc. Python is starting to feel more and more like Java. Either the developers are horrible or the language interpreter is too picky to work properly (think…. JRE 1.2.5 build 1482???? no no no, you need build 1761!!!).

Speaking of Java, I also found a Java app (JCrawler) that looked perfect for what I needed. It certainly claimed to be “perfect.” It actually worked better than the Python apps that failed to build/run properly, but it didn’t actually work. It just kept spawning threads until it ran out of memory.

I was almost to the point where I thought I would have to write one myself, until I clicked on a link and a bright light from the heavens shone down on my monitor and a choir started singing in the background.

I had found the Parallel URL Fetcher. It was exactly what I needed. It was like wget, but ran parallel requests. It didn’t compile on Solaris either, but adding timeradd() and timersub() macros fixed that real quick.

I don’t think it supports Keep-Alive requests either, which would have been nice, but either way it rocked through some URLs. After letting it run for a few hours, I had my Squid server maxed out at 100Gigs of cache and ready for some I/O testing.


PCM Audio | Part 3: Basic Audio Effects – Volume Control

January 12th, 2010

So now we know what data is stored in a PCM stream, let’s look at some real waveform examples. The easiest is a simple sine wave:

sine wave

Now if we “amplify” that wave by 5, we’d get a much louder sound, represented by a wave that looked like this:

sine wave times 10

So if you want to increase the volume of your PCM stream, just multiply every PCM value by some number. If we had 2048 bytes of audio (remember… that’s 1024 samples since each sample is two bytes), we could amplify the stream with this type of code:

int16_t pcm[1024] = read in some pcm data;
for (ctr = 0; ctr < 1024; ctr++) {
    pcm[ctr] *= 2;

Volume control is almost that simple. There's two catches.


Clipping occurs when your resulting value increases above the maximum value for a sample. So since we're dealing with signed 16 bit integers our maximum positive sample is 32767. If we have a PCM sample value of 5000 and we multiplied it by 10, the resulting value is -15536, not the expected 50000. When clipping occurs, you end up with noise in the audio. You should always check to see if the result of your multiplication would cause clipping, and if so, set the value to 32767 (or -32768) instead.

So our code above becomes:

int16_t pcm[1024] = read in some pcm data;
int32_t pcmval;
for (ctr = 0; ctr < 1024; ctr++) {
    pcmval = pcm[ctr] * 2;
    if (pcmval < 32767 && pcmval > -32768) {
        pcm[ctr] = pcmval
    } else if (pcmval > 32767) {
        pcm[ctr] = 32767;
    } else if (pcmval < -32768) {
        pcm[ctr] = -32768;

Volume Is Logarithmic

The other catch is that volume as perceived by humans (measured in decibels) is logarithmic, not linear. Your first instinct would be to think "Well if I wanted to double the volume, I should just multiply the samples by 2." Unfortunately, it's not quite that easy.

Multiplying a value by 1 will obviously give you no amplification. So to decrease volume, you would multiply by a value less than 1 and greater than 0. To increase volume, multiply by a number greater than one. Unfortunately, I didn't pay enough attention to logarithms in school, so I don't have a clever answer as to how to implement a proper volume control, but I've found that this function works pretty well:

int some_level;
float multiplier = tan(some_level/100.0);

If some_level is set to a value between 0 and 148 or so, this will give you a rather linear sounding multiplier. 79 is almost a multiplier of 1 (no amplification). It is far -- really far -- from perfect, but it worked well enough for my needs of implementing a volume slider. Graphing that function from 0 to 148 gives you this:

volume multiplier

So to set an appropriate level, now we have a volume slider at 39 (roughly half volume):

int16_t pcm[1024] = read in some pcm data;
int32_t pcmval;
uint8_t level = 39; // half as loud
// uint8_t level = 118 // twice as loud (79 * 1.5)
float multiplier = tan(level/100.0);
for (ctr = 0; ctr < 1024; ctr++) {
    pcmval = pcm[ctr] * multiplier;
    if (pcmval < 32767 && pcmval > -32768) {
        pcm[ctr] = pcmval
    } else if (pcmval > 32767) {
        pcm[ctr] = 32767;
    } else if (pcmval < -32768) {
        pcm[ctr] = -32768;

I wasn't able to find a simple logarithmic slider example, so if you have one, please post in the comments. I'd love to replace my hack.

Using some simple algorithms and that function above, you could easily implement a fade-in/out effect on PCM data by stepping through all 148 possible values over a period of time. And don't worry, we'll get to "time" later in the series.

That's pretty much all there is to know about volume, in the next part of the series, we're going to discuss mixing two streams together to create one stream.

General , ,

PCM Audio | Part 2: What does a PCM stream look like?

January 9th, 2010

In Part 1, we looked at how a PCM stream is described. Once you know all of the parameters for your PCM stream, we can examine the data and put it in memory as useful data.

So, let’s assume we have a file that contains signed 16-bit little endian mono PCM. That means that data in the file is just a collection of 16 bit integers. Each integer represents one sample. So the first 9 samples in the file could be:

|  500 |  300 | -100 | -20  | -300 |  900 | -200 |  -50 |  250 |      

Each of those integers is stored in the file as 2 bytes (16-bit), so the 9 samples above take up 18 bytes of space. The value of each sample, obviously, can range from -32768 to 32767. If you take those samples and plot them on a graph, you’ll end up with a visualization of the waveform for the audio that you see in your music player.

If we wanted to read that into an array in C, we would do something like this (obviously this is pseudo-code):

FILE *pcmfile
int16_t *pcmdata;
pcmfile = fopen(your pcm data file);
pcmdata = malloc(size of the file);
fread(pcmdata, sizeof(int16_t), size of file / sizeof(int16_t), pcmfile);

Of course, if you’re dealing with large files, you probably shouldn’t read the whole thing into memory. You should buffer the data and read it in chunks at a time.

If you take that data and send it to your sound card, you’ll hear the sample being played. However, the sound card will require you to know the sample rate. If you have an 8kHz stream and tell the sound card to play it at 16kHz, it’s like playing a 33.3 RPM record at 45 RPM. For the younger crowd out there, that means it will be too fast and it’ll be high pitched… think Alvin and the Chipmunks here.

Since this is a description of the waveform, a stream of all zeros would be silence (a flat line if you graphed it).

I haven’t really explained what those samples actually MEAN though… just what they are. It will be incredibly obvious what those samples mean starting in the next post, when we get to the fun stuff: basic audio effects processing (don’t get scared… it’s actually really easy).

General , ,

PCM Audio | Part 1: What is PCM?

January 8th, 2010

It’s been a long time since I posted anything. Most of my free time has been spent working on my ventrilo client for linux project. Of course, that project adds tons of things to discuss, such as how PCM audio works. I’m going to make this a multi-part series, because there is so much information to discuss.

When I first started working on that project, I knew nothing about how audio worked. I knew a little bit about encoders and decoders, but not really the inner workings. What are they encoding/decoding? It turns out, that the answer is PCM (pulse control modulation) audio. After messing with PCM for a few months, there are a lot of things that are painfully obvious now that were confusing. This guide is meant to be an introduction to at least give you the working knowledge you’ll need to ask proper questions and perform simple tasks. So let’s get started…

If you’ve ever used a computer MP3 player, you’ve probably seen those options to display the waveform of the audio or the little bars that pop up and down showing you treble and bass levels. What those are measuring is the PCM audio as it plays it. So what does all that crap mean?

Let’s start with the basics. There’s five terms that are important to know for PCM:

Sample Rate

Real actual audio (like someone talking to you in person) is transmitted as a wave. PCM is a digital representation of that audio wave at a specified sample rate. The sample rate is measured in Hz (cycles per second) and more often in kilohertz. So when you hear someone talk about about 128kHz vs. 160kHz audio, what they’re talking about is the sample rate. If you’ve ever done integrals in calculus, it’s a lot like that. The higher the sample rate, the better your quality (at the cost of size). There is no guessing here. You need to know what the sample rate is.


Whether the data is signed or unsigned. It is almost always signed. Treating a signed PCM stream as unsigned will hurt your ears… painfully… (I speak from experience here).

Sample Size

This determines how many bits make up one sample. 16-bit seems to be the most common.

Byte Ordering

Byte ordering refers to little-endian vs. big-endian data. If you don’t know what endian-ness means, you can probably assume little endian. If you have the option to choose endian for your data, you should always choose little-endian.

Number of channels

I’m mostly going to cover mono (1 channel), but multichannel PCM is usually handled by interleaving the PCM samples. Don’t worry about this for now. Once you understand mono, stereo is easy.

Add those five things together and you’ll come up with a description of a PCM stream. For example: signed 16-bit little-endian mono @ 44.1kHz. In order to actually play audio, you’ll need to know those 5 things.

Various sound devices support various types of streams, but there’s usually a set list of sign, sample size, and endian-ness options. Different APIs use different constants to specify, but usually you’ll see them as something like S16LE (signed 16-bit little-endian) or S32BE (signed 32-bit big-endian) and so on.

In my next post, I’ll go over how those are represented in a PCM stream.

General , ,