Archive

Posts Tagged ‘PHP’

Sync A Large Directory Structure to S3

October 23rd, 2012

There are a handful of command-line tools out there for working with S3. The most popular (I think) is s3tools' s3cmd. However, we have a filesystem that we would like to keep in sync with S3 while we work on migrating. s3cmd has a sync command that works really well for filesystems with a small to medium number of files (not total file size… total file count). Our filesystem contains many millions of files, which can be problematic for programs like s3cmd (even rsync has issues with this many files). The problem (or feature) is that they tend to calculate the changes for everything recursively all at once, and only then start performing operations.

If you do not need this feature, it takes a lot less memory to calculate all the changes on a directory-by-directory basis. Of course, if you're syncing a single directory with millions of files, you have bigger problems anyway and this won't help. Luckily, we tend to split up the files into categorized directories.
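The per-directory diff idea can be sketched on its own, away from the S3 calls. This is a hypothetical, self-contained illustration (the function and key names are mine, not from the script below): index each side's listing by relative key, then compare size and modification time, so memory use is bounded by the size of one directory rather than the whole tree.

```php
<?php
// Hypothetical sketch of the per-directory diff. A file needs uploading if it
// is missing remotely, its size differs, or the local copy is newer; a remote
// key with no local counterpart gets deleted.
function diffDirectory(array $local, array $remote) {
    $toSync = array();
    foreach ($local as $key => $linfo) {
        if (! array_key_exists($key, $remote)
                || $linfo['size'] != $remote[$key]['size']
                || $linfo['lastmodified'] > $remote[$key]['lastmodified']) {
            $toSync[] = $key;
        }
    }
    // Anything remote that no longer exists locally is slated for deletion.
    $toDelete = array_keys(array_diff_key($remote, $local));
    return array($toSync, $toDelete);
}

$local = array(
    'img/a.jpg' => array('size' => 100, 'lastmodified' => 200),
    'img/b.jpg' => array('size' => 150, 'lastmodified' => 300),
);
$remote = array(
    'img/a.jpg' => array('size' => 100, 'lastmodified' => 250), // up to date
    'img/c.jpg' => array('size' => 999, 'lastmodified' => 100), // gone locally
);
list($toSync, $toDelete) = diffDirectory($local, $remote);
print_r($toSync);   // img/b.jpg is missing remotely
print_r($toDelete); // img/c.jpg no longer exists locally
```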

So, I wrote this very simple little PHP script that keeps S3 in sync with a local directory structure. It shouldn’t be too hard to rewrite this in just about any language. It’s not complicated at all.

IMPORTANT NOTES:

  • This WILL dereference symlinks. So make sure you do not have recursive symlinks in your directory structure. For example: “ln -s . recurseme” would be bad
  • The local filesystem is always authoritative. If it doesn’t exist locally, it will get deleted from S3
  • It does not compare MD5 sums (even though you can see that I thought about it in the code)
  • It does not update the S3 side timestamp with the local timestamp and will only sync if the file size is different or the local timestamp is later than the S3 timestamp
#!/usr/bin/php
<?php
require_once('AWSSDKforPHP/sdk.class.php');

$s3 = new AmazonS3();
$basepath = '/path/to/sync';
$bucket = 'your-bucket-name';

function getDirectoryList($localdir) {
    global $directoryList;

    /*
    // this is useful for testing
    if (substr_count($localdir, '/') > 2) {
        return;
    }
    */
    $d = opendir($localdir);
    if (! $d) {
        return;
    }
    while (($ent = readdir($d)) !== false) {
        if ($ent == '.' || $ent == '..') {
            continue;
        }
        if (is_dir($localdir . '/' . $ent)) {
            $directoryList[] = $localdir . '/' . $ent;
            getDirectoryList($localdir . '/' . $ent);
        }
    }
    closedir($d);
}

function syncDirectory($basepath, $localdir) {
    global $s3;

    $remotedir = preg_replace('%^' . preg_quote($basepath, '%') . '/?%', '', $localdir);
    echo "getting s3 file list for $remotedir\n";
    $s3filelist = getRemoteDirectory($remotedir);
    echo "getting local file list for $localdir\n";
    $localfilelist = getLocalDirectory($basepath, $localdir);
    echo "calculating differences\n";
    foreach ($localfilelist as $key => $linfo) {
        if (! array_key_exists($key, $s3filelist)) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
        $rinfo = $s3filelist[$key];
        if ($linfo['lastmodified'] > $rinfo['lastmodified']) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
        if ($linfo['size'] != $rinfo['size']) {
            syncFile($basepath . '/' . $key, $key);
            continue;
        }
    }
    foreach ($s3filelist as $key => $rinfo) {
        if (! array_key_exists($key, $localfilelist)) {
            deleteFile($key);
            continue;
        }
    }
}

function getRemoteDirectory($remotedir) {
    global $s3, $bucket;

    $s3filelist = array();
    $args = array('delimiter' => '/');
    if (strlen($remotedir)) {
        $args['prefix'] = $remotedir . '/';
    }
    do {
        if (isset($lastkey)) {
            $args['marker'] = $lastkey;
        }
        $response = $s3->list_objects($bucket, $args);
        if (! $response->isOK()) {
            echo "error: failed to get S3 object list for $remotedir\n";
            return false;
        }
        foreach ($response->body->Contents as $s3object) {
            $s3filelist[(string)$s3object->Key] = array(
                    'md5' => preg_replace('/^\"(.*)\"$/', '$1',
                        (string)$s3object->ETag),
                    'size' => (string)$s3object->Size,
                    'lastmodified' => strtotime((string)$s3object->LastModified),
                    );
            $lastkey = (string)$s3object->Key;
        }
        $isTruncated = (string)$response->body->IsTruncated;
        unset($response);
    } while ($isTruncated == 'true');
    return $s3filelist;
}

function getLocalDirectory($basepath, $localdir) {
    $d = opendir($localdir);
    if (! $d) {
        return false;
    }
    $localfilelist = array();
    while (($ent = readdir($d)) !== false) {
        if ($ent == '.' || $ent == '..') {
            continue;
        }
        if (is_dir($localdir . '/' . $ent)) {
            continue;
        }
        $localfile = $localdir . '/' . $ent;
        $key = preg_replace('%^' . preg_quote($basepath, '%') . '/?%', '', $localfile);
        $localfilelist[$key] = array(
                'md5' => ! empty($GLOBALS['checkmd5']) ? md5_file($localfile) : null,
                'size' => filesize($localfile),
                'lastmodified' => filemtime($localfile),
                );
    }
    closedir($d);
    return $localfilelist;
}

function syncFile($localfile, $remotefile) {
    global $s3, $bucket;

    echo "     sync  : $localfile -> s3://$bucket/$remotefile\n";
    try {
        $response = $s3->create_object($bucket, $remotefile,
                array('fileUpload' => $localfile));
        if (! $response->isOK()) {
            echo "error: failed to sync $localfile\n";
            echo $response->body->Code . ": " . $response->body->Message . "\n";
        }
    } catch (Exception $e) {
        echo "error: failed to sync $localfile\n";
        echo $e->getMessage() . "\n";
    }
}

function deleteFile($remotefile) {
    global $s3, $bucket;

    echo "     delete: s3://$bucket/$remotefile\n";
    try {
        $response = $s3->delete_object($bucket, $remotefile);
        if (! $response->isOK()) {
            echo "error: failed to delete s3://$bucket/$remotefile:\n";
            echo $response->body->Code . ": " . $response->body->Message . "\n";
        }
    } catch (Exception $e) {
        echo "error: failed to delete s3://$bucket/$remotefile\n";
        echo $e->getMessage() . "\n";
    }
}

$directoryList = array();
getDirectoryList($basepath);
foreach ($directoryList as $localdir) {
    syncDirectory($basepath, $localdir);
}

?>
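As the notes above mention, the script stores an MD5 on both sides but never compares them. If you wanted to wire that in, a comparison helper might look like this sketch (my own function name, not part of the script). One caveat worth hedging: S3's ETag equals the body MD5 only for single-part uploads; multipart ETags are not plain MD5s, so the check only holds for single-part objects.

```php
<?php
// Hedged sketch: compare the 'md5' fields the script already collects.
// Returns true only when both sides have an MD5 and they disagree.
function md5Differs(array $linfo, array $rinfo) {
    if ($linfo['md5'] === null || $rinfo['md5'] === null) {
        return false; // nothing to compare, fall back to size/mtime checks
    }
    return $linfo['md5'] !== $rinfo['md5'];
}

var_dump(md5Differs(
    array('md5' => md5('hello')),
    array('md5' => md5('hello'))
)); // bool(false)
```

This would slot into the comparison loop in syncDirectory() as one more reason to call syncFile(), at the cost of hashing every local file.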


Netflix Has a Developer API

July 27th, 2009

I wasn’t incredibly happy with the movie synopses I was getting from IMDB. They’re generally pretty crappy. I went looking around to see if I could scrape the Netflix synopses, and lo and behold, Netflix has an API!

In another open source project I’m working on, I have a need to learn GTK+. So I figured the easy way to learn GTK+ was to start with php-gtk. It’s more-or-less a replica of the gtkmm OO interface, so I set out to update my little movie categorization script with a GTK+ interface. After learning the ropes, I finally have a nice interface that queries Netflix and returns all of their data for display.

This is what I have so far (keep in mind this is all in PHP):

[Screenshot: yFlix movie categorizer and Netflix manager]

When you click on a movie in the list, it queries Netflix and fills out the description pane. So far it’s really simple, but hopefully I can use this to generate something that will categorize movies specifically for a UPnP client. I can’t put any source code out yet since I’m not too sure how the Netflix API deals with publishing an app. Right now, it has my personal developer key hard-coded, and I only get 5000 queries per day.

Here’s a video (and of course, you’ll need Firefox 3.5): [embedded video]


Online Dictionary with Random words

March 28th, 2009

I rewrote the online dictionary section of the website to make it a little prettier and easier to use. I came across a WordNet SQL file which allows for much better manipulation of the data. Originally, I had used the WordNet software to export everything and just scrape it into the database (back in ’98 or so). This SQL import file actually keeps the relations intact from WordNet’s prolog format.
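With the relations in SQL, a "random word" lookup becomes a single join. This is a hypothetical sketch against an invented minimal schema (the `words`/`senses`/`synsets` table and column names are mine for illustration; real WordNet SQL dumps use their own schema), using an in-memory SQLite database so it runs standalone:

```php
<?php
// Hypothetical WordNet-style schema: words link to synsets through senses.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE words (wordid INTEGER PRIMARY KEY, lemma TEXT)');
$db->exec('CREATE TABLE synsets (synsetid INTEGER PRIMARY KEY, definition TEXT)');
$db->exec('CREATE TABLE senses (wordid INTEGER, synsetid INTEGER)');

$db->exec("INSERT INTO words VALUES (1, 'lexicon')");
$db->exec("INSERT INTO synsets VALUES (10, 'a reference book of words')");
$db->exec("INSERT INTO senses VALUES (1, 10)");

// Pick a random lemma and pull its definition through the sense relation.
$row = $db->query(
    'SELECT w.lemma, s.definition
       FROM words w
       JOIN senses se ON se.wordid = w.wordid
       JOIN synsets s ON s.synsetid = se.synsetid
      ORDER BY RANDOM() LIMIT 1'
)->fetch(PDO::FETCH_ASSOC);
echo $row['lemma'] . ': ' . $row['definition'] . "\n";
```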
