Skip to Main Content

Duplicate Image Detection in PHP/MySQL – Part 1: Fingerprinting

I was looking back over my old code from 2005, and I found some of the Duplicate Image Detection (DID) code that I wrote for Gallery 2, back in 2005.  I decided to clean it up and post a simplified version for people working in PHP with the GD extension.

There are many ways to do duplicate image detection, and I have no doubt that mine is sub-par.  But it works fairly well, and is pretty easy to explain.

ADVERTISEMENT:

If you are new to the world of programmatic image manipulation, you may be thinking “why not just use the MD5 (or CRC8) hash?”  This is actually a very common implementation of DID, but has serious limitations.  In most real-world scenarios, I don’t care if the images are byte-identical.  One might be a jpg, and the other a png, but if they’re the same image, I want to locate the duplicate.   Ideally, this same logic applies to different resolutions, watermarks, crops, rotations, and even some post-processing effects.  So, instead of traditional byte-based hashes, we’ll use what is sometimes called a “perceptional hash”.

In this part, we’ll handle creating what I call a “fingerprint string”.  This is a “perceptional hash” that represents what the image looks like in a minimal way.  There are many advanced perceptional hashing libraries like Libpuzzle and pHash, but I decided to work in a PHP environment, with a simple form of the hash, based on resizing and palettizing the image itself.

Here’s how I create a “Fingerprint”:

// Takes a GD image resource and an optional hash resolution
// Returns a fingerprint string, representing this image
function fingerprint_image($img, $resolution=8) {
    //This associative array is based on my own 62 color palette.
    //Keys are HTML color codes
    //Vals are a base-62 value representing each palette entry
    $palette_array = array(
             '000000' => '0', '000b0a' => '1', '00240c' => '2', '005ab5' => '3', '006234' => '4',
             '00a8a4' => '5', '02000c' => '6', '021357' => '7', '02be28' => '8', '03112a' => '9',
             '03efa2' => 'A', '0411a4' => 'B', '04f62d' => 'C', '0becf3' => 'D', '0c12f1' => 'E',
             '10000c' => 'F', '101d01' => 'G', '1a7508' => 'H', '281603' => 'I', '2b0040' => 'J',
             '31a5fc' => 'K', '415aff' => 'L', '46d300' => 'M', '46fe37' => 'N', '48ffa1' => 'O',
             '4c00a0' => 'P', '4deff7' => 'Q', '540ff5' => 'R', '581404' => 'S', '5d6900' => 'T',
             '61023c' => 'U', '98f0fd' => 'V', '9aa3fe' => 'W', '9affc4' => 'X', 'a09400' => 'Y',
             'a11b13' => 'Z', 'a8f712' => 'a', 'afff69' => 'b', 'b138fe' => 'c', 'b500d8' => 'd',
             'c9eefd' => 'e', 'cf0f75' => 'f', 'd3ffc6' => 'g', 'dba8f3' => 'h', 'e49900' => 'i',
             'e6fffb' => 'j', 'e82810' => 'k', 'ebe7fe' => 'l', 'f49486' => 'm', 'f4f413' => 'n',
             'f8f1e2' => 'o', 'f90cde' => 'p', 'f9f468' => 'q', 'fa8f37' => 'r', 'fb49df' => 's',
             'fbf3b7' => 't', 'fbfffb' => 'u', 'fd90d8' => 'v', 'fdc5d6' => 'w', 'fdf8f7' => 'x',
             'ffe3f7' => 'y', 'ffffff' => 'z'
        );
    $max_w   = $resolution;
    $max_h   = $resolution;
    //We now create an image and load the palette with the values stored as array keys
    $palette = imagecreate($max_w, $max_h);
    foreach($palette_array as $hex_color=>$val) {
        $int_color = hexdec("0x".$hex_color);
        $color = array(
                "red"   => 0xFF & ($int_color >> 0x10),
                "green" => 0xFF & ($int_color >> 0x8),
                "blue"  => 0xFF &  $int_color
            );
        imagecolorallocate($palette, $color['red'], $color['green'], $color['blue']);
    }

    $width  = imagesx($img);
    $height = imagesy($img);
    //Now we do a proportional resize to 8x8 or less
    if ($height > $width)  {   
        $ratio   = $max_h / $height;  
        $thumb_h = $max_h;
        $thumb_w = $width * $ratio;
    } else {
        $ratio   = $max_w / $width;
        $thumb_w = $max_w;  
        $thumb_h = $height * $ratio;
    }
    $thumb = imagecreate($thumb_w, $thumb_h); 
    // secret tip, set the palette before and after filling the new image
    imagepalettecopy($thumb, $palette);
    imagecopyresized($thumb, $img, 0, 0, 0, 0, $thumb_w, $thumb_h, $width, $height);
    // set the new image's palette to my special 62 color palette
    imagepalettecopy($thumb, $palette);

    $fingerprint_array = array();
    $w = imagesx($thumb);
    $h = imagesy($thumb);
    // iterate through the new image, and get the array value associated with each value.
    // this is necessary because imagepalettecopy doesn't preserve the order of the colors in the palette
    for($j=0; $j<$h; $j++) {
        $string = "";
        for($i=0; $i<$w; $i++) {
            $color = ImageColorsForIndex($thumb, imagecolorat($thumb, $i, $j));
            $red   = dechex($color['red'  ]);
            while(strlen($red) < 2) $red = '0'.$red;
            $green = dechex($color['green']);
            while(strlen($green) < 2) $green = '0'.$green;
            $blue  = dechex($color['blue' ]);
            while(strlen($blue) < 2) $blue = '0'.$blue;
            $s = $red.$green.$blue;
            $c = $palette_array[$s];
            $string .= $c;
        }
        $fingerprint_array[] = $string;
    }
    //combine this minimal representation into a string 
    $fingerprint = implode('-', $fingerprint_array);

    return($fingerprint);
}

If you increase resolution, you decrease false positives, but you also require larger storage space, and increase false negatives.  8x8px seems to be a good sweet spot.

In the simplest use, you can store this fingerprint in your database with the image, and check for duplicate fingerprints on each upload.  Then, if you find a duplicate fingerprint, you can perform some more advanced checks to decide whether to disallow the new upload, flag it for review, link both rows to the better resolution version, or ask the user what to do.

Identical fingerprints are NOT a guarantee of duplicity, so you will need additional functions to handle collisions.

ADVERTISEMENT:

In part 2, I’ll approach the advanced uses of a fingerprint, like identifying cropped or rotated images, or simply similar, but not identical images.

4 Comments

  1. Andrew Miller's profile image.

    I found false positives with 8×8 fairly high.

    The fingerprint you are giving is being byte-byte compared – so there seems little reason to keep it at it’s original size, which is your concern when using say 64 or 128px images.

    Why not simply CRC32 or MD5 the results and store that instead for comparison reasons?

    • gschoppe's profile image.

      The reason to keep the fingerprint at its full size is to take advantage of the advanced features of a perceptional fingerprint. With the proper algorithms, you can use a perceptional hash to identify duplicates, even if they have been cropped differently, rotated, or one has had a filter applied. I haven’t gotten a chance to write part 2 yet, but eventually I intend to cover these options.

      As for high numbers of false positives, can you give examples? are these false positives in a set that are extremely similar (same time/location/camera position) or are they simply common across the board?

      It is also important to note that a perceptional hash is not the end all and be all of duplicate detection. It simply quickly identifies candidates for further testing. If a database of 1000 images finds 2 fingerprint matches for an image, you can reasonably do other functions to test, even down to per-pixel comparison. However, you could never do per-pixel comparison on all 1000 photos, for every uploaded file.

Your email address will not be published.

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>