Archive

Posts Tagged ‘programming’

Sync A Large Directory Structure to S3

October 23rd, 2012

There are a handful of command-line tools for working with S3. The most popular (I think) is s3tools’ s3cmd. However, we have a filesystem that we would like to keep in sync with S3 while we work on migrating. s3cmd has a sync command that works really well for filesystems with a small to medium number of files (that’s total file count, not total file size). Our filesystem contains many millions of files, which can be problematic for programs like s3cmd (even rsync has issues with this many files). The problem (or feature) is that they tend to calculate the changes for everything recursively all at once, and only then start performing operations.

If you do not need this feature, it takes a lot less memory to calculate all the changes on a directory by directory basis. Of course, if you’re syncing a single directory with millions of files, you have bigger problems anyway and this won’t help. Luckily, we tend to split up the files into categorized directories.

So, I wrote this very simple little PHP script that keeps S3 in sync with a local directory structure. It shouldn’t be too hard to rewrite this in just about any language. It’s not complicated at all.

IMPORTANT NOTES:

  • This WILL dereference symlinks. So make sure you do not have recursive symlinks in your directory structure. For example: “ln -s . recurseme” would be bad
  • The local filesystem is always authoritative. If it doesn’t exist locally, it will get deleted from S3
  • It does not compare MD5 sums (even though you can see that I thought about it in the code)
  • It does not update the S3 side timestamp with the local timestamp and will only sync if the file size is different or the local timestamp is later than the S3 timestamp
#!/usr/bin/php
<?php
require_once('AWSSDKforPHP/sdk.class.php');

$s3 = new AmazonS3();
$basepath = '/path/to/sync';
$bucket = 'your-bucket-name';

function getDirectoryList($localdir) {
    global $directoryList;

    /*
    // this is useful for testing
    if (substr_count($localdir, '/') > 2) {
        return;
    }
    */
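    $d = opendir($localdir);
    if (! $d) {
        return;
    }
    // compare against false so an entry named "0" does not end the loop
    while (($ent = readdir($d)) !== false) {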
        if ($ent == '.' || $ent == '..') {
            continue;
        }
        if (is_dir($localdir . '/' . $ent)) {
            $directoryList[] = $localdir . '/' . $ent;
            getDirectoryList($localdir . '/' . $ent);
        }
    }
    closedir($d);
}

function syncDirectory($basepath, $localdir) {
    global $s3;

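    // quote the base path so characters like "." are not treated as regex syntax
    $remotedir = preg_replace('%^' . preg_quote($basepath, '%') . '/?%', '', $localdir);
    echo "getting s3 file list for $remotedir\n";
    $s3filelist = getRemoteDirectory($remotedir);
    echo "getting local file list for $localdir\n";
    $localfilelist = getLocalDirectory($basepath, $localdir);
    if ($s3filelist === false || $localfilelist === false) {
        // never sync or delete based on an incomplete listing
        echo "error: skipping $localdir\n";
        return;
    }
    echo "calculating differences\n";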
    foreach ($localfilelist as $key => $linfo) {
        if (! array_key_exists($key, $s3filelist)) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
        $rinfo = $s3filelist[$key];
        if ($linfo['lastmodified'] > $rinfo['lastmodified']) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
        if ($linfo['size'] != $rinfo['size']) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
    }
    foreach ($s3filelist as $key => $rinfo) {
        if (! array_key_exists($key, $localfilelist)) {
            deleteFile($key);
            continue;
        }
    }
}

function getRemoteDirectory($remotedir) {
    global $s3, $bucket;

    $s3filelist = array();
    $args = array('delimiter' => '/');
    do {
        if (strlen($remotedir)) {
            $args['prefix'] = $remotedir . '/';
        }
        if (isset($lastkey)) {
            $args['marker'] = $lastkey;
        }
        $response = $s3->list_objects($bucket, $args);
        if (! $response->isOK()) {
            echo "error: failed to get S3 object list for $remotedir\n";
            return false;
        }
        foreach ($response->body->Contents as $s3object) {
            $s3filelist[(string)$s3object->Key] = array(
                    'md5' => preg_replace('/^\"(.*)\"$/', '$1',
                        (string)$s3object->ETag),
                    'size' => (string)$s3object->Size,
                    'lastmodified' => strtotime((string)$s3object->LastModified),
                    );
            $lastkey = (string)$s3object->Key;
        }
        $isTruncated = (string)$response->body->IsTruncated;
        unset($response);
    } while ($isTruncated == 'true');
    return $s3filelist;
}

function getLocalDirectory($basepath, $localdir) {
    $d = opendir($localdir);
    if (! $d) {
        return false;
    }
    $localfilelist = array();
    while (($ent = readdir($d)) !== false) {
        if ($ent == '.' || $ent == '..') {
            continue;
        }
        if (is_dir($localdir . '/' . $ent)) {
            continue;
        }
        $localfile = $localdir . '/' . $ent;
        $key = preg_replace('%^' . preg_quote($basepath, '%') . '/?%', '', $localfile);
        $localfilelist[$key] = array(
                'md5' => !empty($GLOBALS['checkmd5']) ? md5_file($localfile) : null,
                'size' => filesize($localfile),
                'lastmodified' => filemtime($localfile),
                );
    }
    closedir($d);
    return $localfilelist;
}

function syncFile($localfile, $remotefile) {
    global $s3, $bucket;

    echo "     sync  : $localfile -> s3://$bucket/$remotefile\n";
    try {
        $response = $s3->create_object($bucket, $remotefile,
                array('fileUpload' => $localfile));
        if (! $response->isOK()) {
            echo "error: failed to sync $localfile\n";
            echo $response->body->Code . ": " . $response->body->Message . "\n";
        }
    } catch (Exception $e) {
        echo "error: failed to sync $localfile\n";
        echo $e->getMessage() . "\n";
    }
}

function deleteFile($remotefile) {
    global $s3, $bucket;

    echo "     delete: s3://$bucket/$remotefile\n";
    try {
        $response = $s3->delete_object($bucket, $remotefile);
        if (! $response->isOK()) {
            echo "error: failed to delete s3://$bucket/$remotefile:\n";
            echo $response->body->Code . ": " . $response->body->Message . "\n";
        }
    } catch (Exception $e) {
        echo "error: failed to delete s3://$bucket/$remotefile\n";
        echo $e->getMessage() . "\n";
    }
}

$directoryList = array();
getDirectoryList($basepath);
foreach ($directoryList as $localdir) {
    syncDirectory($basepath, $localdir);
}

?>

General

Patch for the VastHTML WordPress Forum Server

March 3rd, 2010

So, I’ve made a number of fixes to the VastHTML WordPress forum server plugin. It has some pretty big bugs, and I don’t know if the project is being maintained anymore or not. At any rate, the fixes I’ve made should have been considered critical and should have been fixed long ago by whoever is maintaining it, but I digress…

I’m not going to support people trying to apply this patch. If you don’t know what a diff is and you don’t know what the patch command does, you’re probably out of luck. If you want me to fix all of the problems in this code and release it, pay me a bunch of money…

Also, the security problems in their code make babies cry… but that’s for another day.

Lastly, to make the search actually work, you need to connect to your wordpress mysql database and issue this SQL statement:

alter table wp_forum_posts add fulltext key `text` (`text`);

Here's the patch: vasthtml-forum-server.diff

Here's what it fixes (in no particular order):

  • RSS feeds now contain the username of the poster instead of "feeds@r.us"
  • All &amp; characters in the links have been properly changed to & as they should be
  • Page 2+ of your forums will work
  • Page 2+ of posts will work
  • The number of replies shown in the topic list is properly set to number of posts - 1
  • The title delimiter is changed from » to "|" (don't remember why I did this, but there ya go)
  • The search form/box uses HTTP GET instead of POST so your back button works without complaining about having to resubmit your request
  • You can press enter in the search box to submit
  • A $ followed by a number doesn't get filtered out
  • Apostrophes in posts/titles get their slashes properly stripped

I may have fixed other things in this patch and forgotten about them. This works for me... your mileage may vary.

General, PHP

PCM Audio | Part 3: Basic Audio Effects – Volume Control

January 12th, 2010

So now we know what data is stored in a PCM stream, let’s look at some real waveform examples. The easiest is a simple sine wave:

[Image: sine wave]

Now if we “amplify” that wave by 5, we’d get a much louder sound, represented by a wave that looked like this:

[Image: sine wave times 10]

So if you want to increase the volume of your PCM stream, just multiply every PCM value by some number. If we had 2048 bytes of audio (remember… that’s 1024 samples since each sample is two bytes), we could amplify the stream with this type of code:

int16_t pcm[1024]; /* filled with pcm data read from somewhere */
int ctr;
for (ctr = 0; ctr < 1024; ctr++) {
    pcm[ctr] *= 2;
}

Volume control is almost that simple. There are two catches.

Clipping

Clipping occurs when your resulting value goes beyond the maximum value for a sample. Since we're dealing with signed 16-bit integers, our maximum positive sample is 32767. If we have a PCM sample value of 5000 and multiply it by 10, the value stored back into the 16-bit sample wraps around to -15536, not the expected 50000. When that happens, you end up with noise in the audio. You should always check whether the result of your multiplication would cause clipping, and if so, set the value to 32767 (or -32768) instead.

So our code above becomes:

int16_t pcm[1024]; /* filled with pcm data read from somewhere */
int32_t pcmval;
int ctr;
for (ctr = 0; ctr < 1024; ctr++) {
    pcmval = pcm[ctr] * 2; /* do the math in 32 bits so it cannot wrap */
    if (pcmval > 32767) {
        pcm[ctr] = 32767;
    } else if (pcmval < -32768) {
        pcm[ctr] = -32768;
    } else {
        pcm[ctr] = pcmval; /* in range, including exactly 32767 and -32768 */
    }
}

Volume Is Logarithmic

The other catch is that volume as perceived by humans (measured in decibels) is logarithmic, not linear. Your first instinct would be to think "Well if I wanted to double the volume, I should just multiply the samples by 2." Unfortunately, it's not quite that easy.

Multiplying a value by 1 will obviously give you no amplification. So to decrease volume, you would multiply by a value less than 1 and greater than 0. To increase volume, multiply by a number greater than one. Unfortunately, I didn't pay enough attention to logarithms in school, so I don't have a clever answer as to how to implement a proper volume control, but I've found that this function works pretty well:

int some_level;
float multiplier = tan(some_level/100.0);

If some_level is set to a value between 0 and 148 or so, this will give you a rather linear sounding multiplier. 79 is almost a multiplier of 1 (no amplification). It is far -- really far -- from perfect, but it worked well enough for my needs of implementing a volume slider. Graphing that function from 0 to 148 gives you this:

[Image: volume multiplier]

So, with a volume slider set to 39 (roughly half volume), the code becomes:

int16_t pcm[1024]; /* filled with pcm data read from somewhere */
int32_t pcmval;
int ctr;
uint8_t level = 39; // half as loud
// uint8_t level = 118; // twice as loud (79 * 1.5)
float multiplier = tan(level/100.0); // tan() needs <math.h>
for (ctr = 0; ctr < 1024; ctr++) {
    pcmval = pcm[ctr] * multiplier;
    if (pcmval > 32767) {
        pcm[ctr] = 32767;
    } else if (pcmval < -32768) {
        pcm[ctr] = -32768;
    } else {
        pcm[ctr] = pcmval;
    }
}

I wasn't able to find a simple logarithmic slider example, so if you have one, please post in the comments. I'd love to replace my hack.
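One common way to get a logarithmic control is to think of the slider in decibels and convert that to a multiplier: every 20 dB is a factor of 10 in amplitude, so a single 1 dB step is a factor of 10^(1/20), or roughly 1.122. The sketch below only illustrates that idea and is not the tan() hack above; the name db_to_multiplier and the whole-decibel steps are assumptions made for the example.

```c
/* Amplitude ratio of a single 1 dB step: 10^(1/20). */
#define ONE_DB 1.1220184543019633

/* Hypothetical helper: turn a whole-number decibel setting into a
 * linear multiplier. 0 dB leaves the audio unchanged, negative
 * values make it quieter, and positive values make it louder. */
double db_to_multiplier(int db)
{
    double multiplier = 1.0;
    int i;
    for (i = 0; i < db; i++) {
        multiplier *= ONE_DB;   /* one step louder */
    }
    for (i = 0; i > db; i--) {
        multiplier /= ONE_DB;   /* one step quieter */
    }
    return multiplier;
}
```

A slider could then run from, say, -40 (very quiet) through 0 (unchanged) up to +12, and db_to_multiplier(-6) comes out at roughly 0.5. The multiplied samples still have to go through the clipping check.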

Using some simple algorithms and that function above, you could easily implement a fade-in/out effect on PCM data by stepping through all 148 possible values over a period of time. And don't worry, we'll get to "time" later in the series.
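As a rough sketch of the fade-in idea, the function below ramps the multiplier from 0 up to 1 across a buffer. To keep it short it uses a plain linear ramp instead of stepping through the tan() levels, and the name fade_in is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fade a buffer in from silence to full volume.
 * The multiplier climbs linearly from 0.0 on the first sample to 1.0
 * on the last one. It never goes above 1.0, so no clipping check is
 * needed here. */
void fade_in(int16_t *pcm, size_t count)
{
    size_t ctr;
    if (count < 2) {
        return; /* nothing to ramp across */
    }
    for (ctr = 0; ctr < count; ctr++) {
        double multiplier = (double)ctr / (double)(count - 1);
        pcm[ctr] = (int16_t)(pcm[ctr] * multiplier);
    }
}
```

A fade-out is the same loop with the multiplier running from 1.0 down to 0.0.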

That's pretty much all there is to know about volume. In the next part of the series, we're going to discuss mixing two streams together to create one stream.

General

PCM Audio | Part 2: What does a PCM stream look like?

January 9th, 2010

In Part 1, we looked at how a PCM stream is described. Once you know all of the parameters for your PCM stream, you can examine the data and load it into memory in a useful form.

So, let’s assume we have a file that contains signed 16-bit little endian mono PCM. That means that data in the file is just a collection of 16 bit integers. Each integer represents one sample. So the first 9 samples in the file could be:

+------+------+------+------+------+------+------+------+------+
|  500 |  300 | -100 | -20  | -300 |  900 | -200 |  -50 |  250 |      
+------+------+------+------+------+------+------+------+------+

Each of those integers is stored in the file as 2 bytes (16-bit), so the 9 samples above take up 18 bytes of space. The value of each sample, obviously, can range from -32768 to 32767. If you take those samples and plot them on a graph, you’ll end up with a visualization of the waveform for the audio that you see in your music player.
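To make the 2-bytes-per-sample layout concrete, here is a small sketch that turns two raw bytes from the file into one signed sample. Little-endian means the low byte comes first. The function name is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: build one signed 16-bit sample from two bytes
 * stored in little-endian order (low byte first). */
int16_t sample_from_le_bytes(uint8_t low, uint8_t high)
{
    int32_t value = (int32_t)low | ((int32_t)high << 8); /* 0..65535 */
    if (value >= 32768) {
        value -= 65536; /* the upper half of the range is negative */
    }
    return (int16_t)value;
}
```

The first sample in the table, 500, is 0x01F4, so it sits in the file as the bytes F4 01. The third sample, -100, is stored as 9C FF.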

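If we wanted to read that into an array in C, we would do something like this (error checking is left out, and the file name is just a placeholder):

FILE *pcmfile;
int16_t *pcmdata;
long filesize;
pcmfile = fopen("audio.pcm", "rb"); /* your pcm data file */
fseek(pcmfile, 0, SEEK_END);        /* find the size of the file */
filesize = ftell(pcmfile);
rewind(pcmfile);
pcmdata = malloc(filesize);
fread(pcmdata, sizeof(int16_t), filesize / sizeof(int16_t), pcmfile);
fclose(pcmfile);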

Of course, if you’re dealing with large files, you probably shouldn’t read the whole thing into memory. You should buffer the data and read it in chunks at a time.
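A chunked version might look something like the sketch below. It reads 1024 samples at a time and just keeps track of the loudest sample it has seen, as a stand-in for whatever real processing you want to do. The function name and the chunk size are assumptions for the example, and it assumes the byte order in the file matches your machine (little-endian here).

```c
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SAMPLES 1024

/* Hypothetical helper: walk through an open PCM file one chunk at a
 * time. Returns the number of samples read and stores the largest
 * absolute sample value in *peak. */
long scan_pcm(FILE *pcmfile, int32_t *peak)
{
    int16_t chunk[CHUNK_SAMPLES];
    size_t got, ctr;
    long total = 0;

    *peak = 0;
    while ((got = fread(chunk, sizeof(int16_t), CHUNK_SAMPLES, pcmfile)) > 0) {
        for (ctr = 0; ctr < got; ctr++) {
            int32_t value = chunk[ctr]; /* widen first so -32768 is safe */
            if (value < 0) {
                value = -value;
            }
            if (value > *peak) {
                *peak = value;
            }
        }
        total += (long)got;
    }
    return total;
}
```

Only one chunk (2048 bytes) is ever held in memory, no matter how big the file is.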

If you take that data and send it to your sound card, you’ll hear the sample being played. However, the sound card will require you to know the sample rate. If you have an 8kHz stream and tell the sound card to play it at 16kHz, it’s like playing a 33.3 RPM record at 45 RPM. For the younger crowd out there, that means it will be too fast and it’ll be high pitched… think Alvin and the Chipmunks here.
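The arithmetic behind that is simple enough to sketch. The duration of a mono stream is just the number of samples divided by the sample rate, so the same data played at double the rate lasts half as long. The function name is made up for illustration.

```c
/* Hypothetical helper: how long a mono PCM stream plays, in seconds,
 * when the sound card runs at the given sample rate. */
double playback_seconds(long sample_count, long sample_rate)
{
    return (double)sample_count / (double)sample_rate;
}
```

So 80000 samples recorded at 8kHz are 10 seconds of audio, but the same 80000 samples played at 16kHz are over in 5.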

Since this is a description of the waveform, a stream of all zeros would be silence (a flat line if you graphed it).

I haven’t really explained what those samples actually MEAN though… just what they are. It will be incredibly obvious what those samples mean starting in the next post, when we get to the fun stuff: basic audio effects processing (don’t get scared… it’s actually really easy).

General

PCM Audio | Part 1: What is PCM?

January 8th, 2010

It’s been a long time since I posted anything. Most of my free time has been spent working on my Ventrilo client for Linux project. Of course, that project adds tons of things to discuss, such as how PCM audio works. I’m going to make this a multi-part series, because there is so much information to discuss.

When I first started working on that project, I knew nothing about how audio worked. I knew a little bit about encoders and decoders, but not really the inner workings. What are they encoding/decoding? It turns out that the answer is PCM (pulse-code modulation) audio. After messing with PCM for a few months, a lot of things that were confusing at first are painfully obvious now. This guide is meant to be an introduction to at least give you the working knowledge you’ll need to ask proper questions and perform simple tasks. So let’s get started…

If you’ve ever used a computer MP3 player, you’ve probably seen those options to display the waveform of the audio or the little bars that pop up and down showing you treble and bass levels. What those are measuring is the PCM audio as it plays it. So what does all that crap mean?

Let’s start with the basics. There are five terms that are important to know for PCM:

Sample Rate

Real actual audio (like someone talking to you in person) is transmitted as a wave. PCM is a digital representation of that audio wave at a specified sample rate. The sample rate is measured in Hz (samples per second) and more often in kilohertz. So when you hear someone talk about 44.1kHz vs. 48kHz audio, what they’re talking about is the sample rate. If you’ve ever done integrals in calculus, it’s a lot like that: the more slices you take, the closer you get to the real curve. The higher the sample rate, the better your quality (at the cost of size). There is no guessing here. You need to know what the sample rate is.

Sign

Whether the data is signed or unsigned. It is almost always signed. Treating a signed PCM stream as unsigned will hurt your ears… painfully… (I speak from experience here).
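The difference between the two is where silence sits. In signed 16-bit data silence is 0; in unsigned 16-bit data it is the midpoint, 32768. Converting between them is just a shift by that offset, which is why reading one as the other turns quiet audio into full-scale noise. A sketch, with made-up function names:

```c
#include <stdint.h>

/* Hypothetical helpers: convert one 16-bit sample between the
 * unsigned (0..65535, silence at 32768) and signed
 * (-32768..32767, silence at 0) representations. */
int16_t unsigned_to_signed(uint16_t sample)
{
    return (int16_t)((int32_t)sample - 32768);
}

uint16_t signed_to_unsigned(int16_t sample)
{
    return (uint16_t)((int32_t)sample + 32768);
}
```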

Sample Size

This determines how many bits make up one sample. 16-bit seems to be the most common.

Byte Ordering

Byte ordering refers to little-endian vs. big-endian data. If you don’t know what endian-ness means, you can probably assume little endian. If you have the option to choose endian for your data, you should always choose little-endian.

Number of channels

I’m mostly going to cover mono (1 channel), but multichannel PCM is usually handled by interleaving the PCM samples. Don’t worry about this for now. Once you understand mono, stereo is easy.
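For the curious, interleaving just means the samples alternate: left, right, left, right, and so on. Here is a sketch of pulling a stereo buffer apart into two mono buffers. The function name is made up, and the left-first ordering is an assumption for the example (it is the usual layout, but check what your source actually uses).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split interleaved stereo (L R L R ...) into
 * separate left and right buffers. "frames" is the number of L/R
 * pairs, so the input holds frames * 2 samples. */
void deinterleave_stereo(const int16_t *stereo, size_t frames,
                         int16_t *left, int16_t *right)
{
    size_t ctr;
    for (ctr = 0; ctr < frames; ctr++) {
        left[ctr]  = stereo[ctr * 2];
        right[ctr] = stereo[ctr * 2 + 1];
    }
}
```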

Add those five things together and you’ll come up with a description of a PCM stream. For example: signed 16-bit little-endian mono @ 44.1kHz. In order to actually play audio, you’ll need to know those 5 things.
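In code, those five things fit naturally into a small struct. This is only a sketch; the struct and field names are made up and don't come from any particular audio API.

```c
#include <stdint.h>

/* Hypothetical description of a PCM stream: the five parameters. */
struct pcm_format {
    uint32_t sample_rate;   /* samples per second, e.g. 44100 */
    int      is_signed;     /* 1 = signed, 0 = unsigned */
    uint8_t  bits;          /* sample size in bits, e.g. 16 */
    int      big_endian;    /* 1 = big-endian, 0 = little-endian */
    uint8_t  channels;      /* 1 = mono, 2 = stereo */
};

/* Bytes of audio data needed for one second of playback. */
uint32_t bytes_per_second(const struct pcm_format *fmt)
{
    return fmt->sample_rate * (fmt->bits / 8u) * fmt->channels;
}
```

For signed 16-bit little-endian mono @ 44.1kHz that works out to 44100 * 2 * 1 = 88200 bytes per second.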

Various sound devices support various types of streams, but there’s usually a set list of sign, sample size, and endian-ness options. Different APIs use different constants to specify, but usually you’ll see them as something like S16LE (signed 16-bit little-endian) or S32BE (signed 32-bit big-endian) and so on.

In my next post, I’ll go over how those are represented in a PCM stream.

General

PulseAudio: An Async Example To Get Device Lists

October 13th, 2009

I have a love/hate relationship with PulseAudio. The PulseAudio simple API is… well… simple. For 99% of the applications out there, you’ll rarely need anything more than the simple API. The documentation leaves a little to be desired, but it’s not too hard to figure out since you have the sample source code for pacat and parec.

The asynchronous API, on the other hand, is really complex. The learning curve isn’t really a curve. It’s more like a brick wall. Compounding the issue is that the documentation is atrocious. If you know exactly what you’re looking for and if you already know how it works, the documentation can be helpful.

More importantly, simple example code is nearly impossible to come by. So, since I took the time to figure it out, I figured I would document this here in the hopes that this little example will help someone else. This is not production ready code. There’s a lot of error checking that’s not being done. But this should at least give you an idea of how to use the PulseAudio asynchronous API.

Update: I spoke with the PulseAudio team and they encouraged me to put this source code on their wiki. So now you can find it at the main PulseAudio wiki: http://pulseaudio.org/wiki/SampleAsyncDeviceList

Read more…

General

Netflix Has a Developer API

July 27th, 2009

I wasn’t incredibly happy with the movie synopses I was getting from IMDB. They’re generally pretty crappy. I went looking around to see if I could scrape the Netflix synopses, and lo-and-behold, Netflix has an API!

In another open source project I’m working on, I have a need to learn GTK+. So I figured the easy way to learn GTK+ was to start with php-gtk. It’s more-or-less a replica of the gtkmm OO interface, so I set out to update my little movie categorization script with a GTK+ interface. After learning the ropes, I finally have a nice interface that queries Netflix and returns all of their data for display.

This is what I have so far (keep in mind this is all in PHP):

[Image: screenshot-yflix-movie-categorizer-and-netflix-manager-1]

When you click on a movie in the list, it queries Netflix and fills out the description pane. So far it’s really simple, but hopefully I can use this to generate something that will categorize movies specifically for a UPnP client. I can’t put any source code out yet since I’m not too sure how the Netflix API deals with publishing an app. Right now, it has my personal developer key hard coded, and I only get 5000 queries per day.

Here’s a video (and of course, you’ll need Firefox 3.5):

General

Detecting Xlib’s Keyboard Auto-repeat Functionality (and how to fix it)

June 23rd, 2009

If you’ve ever messed around with listening for XWindows keyboard events, you may have noticed something that’s odd. XWindows has a very strange way of dealing with keyboard auto-repeating. Let’s take a scenario:

Hold down the “h” key on the keyboard for a few seconds and you’ll get something that looks like this:

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Now, if you try this and watch carefully, you’ll notice a couple of things. The first thing to notice is that after the first character, there is a slight delay before it starts repeating. That delay is about half a second. After that initial delay, it repeats about 30 times per second (roughly every 33ms). These settings are configurable in X. They may be somewhere in your window manager control panel, but the xset command will tell you everything you need to know:

$ xset q
Keyboard Control:
  auto repeat:  on    key click percent:  0    LED mask:  00000000
  auto repeat delay:  500    repeat rate:  30
  auto repeating keys:  00fffffffffffbbf
                        fbdfffefffedffff
                        9fffffffffffffff
                        fff7ffffffffffff

Now, when you grab the keyboard in X using XGrabKeyboard or XGrabKey (so you’ll get the keyboard events), something strange happens with all those h’s. When holding down a key, you would at first expect to see 2 events. The KeyPress event and the KeyRelease event. But in order for auto-repeat to work, there must be more. For each auto-repeating character, you should get an additional KeyPress event. This solves the auto-repeat problem, because you can write some simple code that can easily detect auto-repeat and ignore it if you just care if the user is holding down a key.
Read more…

General

The GNU people have gone off their collective rocker

April 6th, 2009

I wanted to find out why more and more libraries are now licensed under the GPL instead of the LGPL. In my search, I found GNU’s article telling people not to use the LGPL. I thought that was odd.

Today, I had the need for an RSS reader/writer library for work. MagpieRSS did almost everything I needed. It parses pretty much any RSS feed and has fairly loose parsing methods. I thought: well that’s easy enough, I can just write a function to turn the MagpieRSS object back into RSS XML. I’d need to add <enclosure> functionality as well. I could submit the code back to the MagpieRSS people and MagpieRSS would be an RSS writer as well.

That’s when I saw the license… GPL. So, that makes MagpieRSS useless for this project. That means I have to write my own RSS reader and writer. It’s not that big of a deal to write an RSS parser. It’s not like RSS is incredibly complicated. For that matter, it’s not that big of a deal to write an RSS writer, either. But as we all know, why reinvent the wheel?

So now I have to write my own RSS parser and writer which I’ll then release under a normal license like BSD or LGPL. Now there will be competing open source projects that provide the same functionality thanks to GNU’s short-sightedness.

People are not thinking their cunning plans all the way through. By preventing commercial use of libraries, they lose the support that commercial users also provide.

In short, the GNU (and MagpieRSS) people are being annoying fanatics.

General