...making Linux just a little more fun!

JPEG de-duplication

Neil Youngman [ny at youngman.org.uk]


Sun, 25 Jul 2010 21:19:34 +0100

A family member has a number of directories containing photos in JPEG format. Three directories contain different versions of the same collection of photos: one is the current master, and the others are earlier snapshots of it. I believe that all the photos in the older snapshots are present in the current master, but I would like to verify that before I delete them. Also, many other directories probably contain duplicates of photos in the master collection, and I would like to clean those up.

Identifying and cleaning up byte for byte identical JPEGs in the snapshots has freed up a considerable amount of disk space. A sample of the remaining photos suggests that they are probably in the master, but the tags and position in the directory tree have changed. I don't want to go through comparing them all by hand.

Initial research suggests that ImageMagick can produce a "signature", which is a SHA256 checksum of the image data. I believe that this would be suitable for identifying identical images on which only the tags have been altered.

Are there any graphics experts in the gang who can confirm this? Alternatively suggestions of existing tools that will do the job, or better approaches, would be most welcome.

Neil




Thomas Adam [thomas at xteddy.org]


Sun, 25 Jul 2010 21:23:52 +0100

On Sun, Jul 25, 2010 at 09:19:34PM +0100, Neil Youngman wrote:

> Are there any graphics experts in the gang who can confirm this? Alternatively 
> suggestions of existing tools that will do the job, or better approaches, 
> would be most welcome.

Imagemagick has the "compare" command to do this.

-- Thomas Adam

-- 
"Deep in my heart I wish I was wrong.  But deep in my heart I know I am
not." -- Morrissey ("Girl Least Likely To" -- off of Viva Hate.)




Ben Okopnik [ben at linuxgazette.net]


Sun, 25 Jul 2010 17:10:12 -0400

On Sun, Jul 25, 2010 at 09:23:52PM +0100, Thomas Adam wrote:

> On Sun, Jul 25, 2010 at 09:19:34PM +0100, Neil Youngman wrote:
> > Are there any graphics experts in the gang who can confirm this? Alternatively 
> > suggestions of existing tools that will do the job, or better approaches, 
> > would be most welcome.
> 
> Imagemagick has the "compare" command to do this.

Y'know, I recall hearing about it, but had never tried it out. I just did, and - it doesn't seem to work.

I grabbed a random JPG and copied it, then ran 'compare' against the original and the copy (as expected, any ImageMagick util has a weird syntax...).

ben at Jotunheim:/tmp$ compare 1.jpg 2.jpg out.jpg
 @ 0,0

OK, so it produced a "comparison map" - out.jpg - that was just a very low-contrast version of the original. Presumably, though, the '@ 0,0' means that the two are the same. OK - so then, I edited one of the two files in Gimp and changed the lower right corner pixel to white (it had been quite dark, possibly black), and ran the comparison again. Still got the same '@ 0,0' - even though the two images were now different, including their filesizes - although the map was now some weird-looking thing with red streaks running riot through it (but nothing notable in the lower right corner.)

Perhaps the usual confusing-as-hell IM man page contains the secret, but I haven't been able to find it.
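
[ For a plain numeric answer, one option is compare's -metric setting. The sketch below is not from the thread. It assumes an ImageMagick install whose 'compare' accepts '-metric AE' (a count of differing pixels) and the 'null:' pseudo-file for discarding the difference image; 'differing_pixels' is a made-up helper name. The count is written to stderr, hence the redirect, and its exact formatting may vary between versions. ]

```shell
#!/bin/sh
# differing_pixels A B: print how many pixels differ between two images
# of the same dimensions (hypothetical helper, not part of ImageMagick).
differing_pixels() {
    # -metric AE counts differing pixels; 'null:' throws away the
    # difference image; the metric is reported on stderr.
    compare -metric AE "$1" "$2" null: 2>&1
}
```

Something like 'differing_pixels 1.jpg 2.jpg' should then report 0 for a straight copy. For JPEGs that have been re-saved, lossy recompression can touch many pixels, so a non-zero count does not by itself mean a different photo.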

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Paul Sephton [paul at inet.co.za]


Sun, 25 Jul 2010 23:18:32 +0200

On Sun, 2010-07-25 at 17:10 -0400, Ben Okopnik wrote:

> On Sun, Jul 25, 2010 at 09:23:52PM +0100, Thomas Adam wrote:
> > On Sun, Jul 25, 2010 at 09:19:34PM +0100, Neil Youngman wrote:
> > > Are there any graphics experts in the gang who can confirm this? Alternatively 
> > > suggestions of existing tools that will do the job, or better approaches, 
> > > would be most welcome.
> > 
> > Imagemagick has the "compare" command to do this.
> 
> Y'know, I recall hearing about it, but had never tried it out. I just
> did, and - it doesn't seem to work.

...or one might use 'sum' to produce a crc32...




Ben Okopnik [ben at linuxgazette.net]


Sun, 25 Jul 2010 17:41:22 -0400

On Sun, Jul 25, 2010 at 11:18:32PM +0200, Paul Sephton wrote:

> 
> ...or one might use 'sum' to produce a crc32...

Oooh, nice idea. Although it'll be a little slow if you've got a lot of pics.

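#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Jul 25 17:12:49 EDT 2010
use strict;
use File::Find;
$|++;

die "Usage: ", $0 =~ /([^\/]+)$/, " <dir to search> [...]\n" unless @ARGV;
for my $dir (@ARGV){
    die "'$dir' is not a directory.\n" unless -d $dir;
}

my %list;
find(\&wanted, @ARGV);

sub wanted {
    # File::Find chdir()s into each directory as it goes, so test and read
    # the bare filename in $_; $File::Find::name is kept for the report.
    return unless -f $_;
    # List-form pipe open bypasses the shell, so odd filenames are safe.
    # Note that 'sum' yields a 16-bit checksum: a match marks a candidate,
    # not a proven duplicate.
    open(my $fh, '-|', '/usr/bin/sum', '--', $_) or return;
    my $out = <$fh>;
    close $fh;
    return unless defined $out;
    my ($sum) = split ' ', $out;    # keep the checksum, drop the block count
    push @{$list{$sum}}, $File::Find::name;
}

# Report every checksum shared by two or more files
for (sort {$a<=>$b} keys %list){
    if (@{$list{$_}} > 1){
        print "\n[$_]:\n";
        print "\t$_\n" for @{$list{$_}};
    }
}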

Fire this off with your top-level pic directory as an argument; it'll return a list of all pics with identical checksums, grouped by checksum values.

Using one of the Perl checksum modules would have been a lot faster, but would require installing that module.
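
[ Another way to avoid one process per file is to hand whole batches of files to a single checksum run. This is only a sketch, not the script above: it assumes GNU findutils and coreutils (md5sum, plus uniq's -w and --all-repeated options), swaps MD5 in for 'sum', and 'list_dupes' is a made-up name. Filenames containing newlines or backslashes will confuse it. ]

```shell
#!/bin/sh
# list_dupes DIR...: print groups of byte-identical files, one path per
# line, with a blank line between groups (hypothetical helper).
list_dupes() {
    # 'find ... -exec md5sum {} +' passes many files to each md5sum run.
    # An MD5 digest is 32 hex characters, so uniq compares only those.
    find "$@" -type f -exec md5sum {} + |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Run as 'list_dupes ~/Pics /mnt/snapshot' (paths here are placeholders). Like the Perl version, this only catches byte-for-byte copies; a photo whose tags were edited gets a different checksum.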

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Sun, 25 Jul 2010 17:44:17 -0400

On Sun, Jul 25, 2010 at 05:41:22PM -0400, Benjamin Okopnik wrote:

> 
> Fire this off with your top-level pic directory as an argument

...or with a list of the directories that you want to traverse. Wrote it that way, and forgot to mention it.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Kapil Hari Paranjape [kapil at imsc.res.in]


Mon, 26 Jul 2010 06:40:33 +0530

Hello,

On Sun, 25 Jul 2010, Neil Youngman wrote:

> Initial research suggests that ImageMagick can produce a "signature", which is 
> a SHA256 checksum of the image data. I believe that this would be suitable 
> for identifying identical images, on which the tags have been altered. 

Quoting Glenn Randers-Pehrson (he of libpng fame): (ref http://studio.imagemagick.org/pipermail/magick-users/2003-March/007964.html)

Two images with identical pixel data, even if stored in different formats, will produce the same signature. It is the SHA digest computed over the pixels in a canonical form.

So "imagemagick -identify | grep signature" can indeed be used to check whether two images are pixel for pixel identical.

Regards,

Kapil. --




Ben Okopnik [ben at linuxgazette.net]


Sun, 25 Jul 2010 21:19:50 -0400

On Mon, Jul 26, 2010 at 06:40:33AM +0530, Kapil Hari Paranjape wrote:

> Hello,
> 
> On Sun, 25 Jul 2010, Neil Youngman wrote:
> > Initial research suggests that ImageMagick can produce a "signature", which is 
> > a SHA256 checksum of the image data. I believe that this would be suitable 
> > for identifying identical images, on which the tags have been altered. 
> 
> Quoting Glenn Randers-Pehrson (he of libpng fame):
> (ref http://studio.imagemagick.org/pipermail/magick-users/2003-March/007964.html)
> 
>  Two images with identical pixel data, even if stored in different
>  formats, will produce the same signature.  It is the SHA digest
>  computed over the pixels in a canonical form.
> 
> So "imagemagick -identify | grep signature" can indeed be used to
> check whether two images are pixel for pixel identical.

Is there an actual "imagemagick" program, or is it some util from the IM suite? I don't have an "imagemagick" on my system, despite having the package installed.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Mulyadi Santosa [mulyadi.santosa at gmail.com]


Mon, 26 Jul 2010 09:44:37 +0700

On Mon, Jul 26, 2010 at 08:19, Ben Okopnik <ben at linuxgazette.net> wrote:

> Is there an actual "imagemagick" program, or is it some util from the IM
> suite? I don't have an "imagemagick" on my system, despite having the
> package installed.

Hi ben, I am with you on this. AFAIK, ImageMagick is just the name of the package. The programs there have various names, but none of them are "ImageMagick".

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com




Kapil Hari Paranjape [kapil at imsc.res.in]


Mon, 26 Jul 2010 09:24:16 +0530

Hello,

On Sun, 25 Jul 2010, Ben Okopnik wrote:

> On Mon, Jul 26, 2010 at 06:40:33AM +0530, Kapil Hari Paranjape wrote:
> > So "imagemagick -identify | grep signature" can indeed be used to
> > check whether two images are pixel for pixel identical.
> 
> Is there an actual "imagemagick" program, or is it some util from the IM
> suite? I don't have an "imagemagick" on my system, despite having the
> package installed.

On Mon, 26 Jul 2010, Mulyadi Santosa wrote:

> Hi ben, I am with you on this. AFAIK, ImageMagick is just the name of
> the package. The programs there have various names, but none of them
> are "ImageMagick".

Oops. That should have been 'identify -verbose $file | grep signature' or (if you use GraphicsMagick, which is a fork of ImageMagick) 'gm identify -verbose $file | grep signature'.

This is what comes of not cutting and pasting when it is actually necessary! :-(
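
[ A sketch of how that signature could be collected for many files in one go. It is not from the thread and makes some assumptions: that 'identify' is ImageMagick's (version 7 installs may need 'magick identify'), that its '%#' format escape prints the pixel-data signature and '%i' the input filename, and that GNU sort and uniq are available. The function names are made up. ]

```shell
#!/bin/sh
# signatures FILE...: print "<signature>  <filename>" for each image.
signatures() {
    # %# is the signature computed over the decoded pixels, so metadata
    # such as comments or Exif tags does not affect it.
    identify -format '%#  %i\n' "$@"
}

# dupes_by_signature FILE...: group images whose pixel data is identical.
dupes_by_signature() {
    # A SHA256 digest is 64 hex characters; compare only that prefix.
    signatures "$@" | sort | uniq -w64 --all-repeated=separate
}
```

For example, 'dupes_by_signature master/*.jpg snapshot/*.jpg' (placeholder paths) would list retagged copies together, at the cost of decoding every image.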

Kapil. --




Paul Sephton [paul at inet.co.za]


Mon, 26 Jul 2010 07:21:32 +0200

On Sun, 2010-07-25 at 17:44 -0400, Ben Okopnik wrote:

> On Sun, Jul 25, 2010 at 05:41:22PM -0400, Benjamin Okopnik wrote:
> > 
> > Fire this off with your top-level pic directory as an argument
> 
> ...or with a list of the directories that you want to traverse. Wrote it
> that way, and forgot to mention it.
> 
Very nice proggie!




Neil Youngman [ny at youngman.org.uk]


Mon, 26 Jul 2010 08:02:15 +0100

On Sunday 25 July 2010 21:23:52 Thomas Adam wrote:

> On Sun, Jul 25, 2010 at 09:19:34PM +0100, Neil Youngman wrote:
> > Are there any graphics experts in the gang who can confirm this?
> > Alternatively suggestions of existing tools that will do the job, or
> > better approaches, would be most welcome.
>
> Imagemagick has the "compare" command to do this.

I'm not sure that compare produces a simple yes/no answer. Even if it does, it might be good for comparing 2 pictures, but when I'm looking at comparing m pictures against n pictures, 1,000 < n < m, I'd be thrashing the system to death.

Neil




Ben Okopnik [ben at linuxgazette.net]


Mon, 26 Jul 2010 19:23:12 -0400

On Mon, Jul 26, 2010 at 07:21:32AM +0200, Paul Sephton wrote:

> On Sun, 2010-07-25 at 17:44 -0400, Ben Okopnik wrote:
> > On Sun, Jul 25, 2010 at 05:41:22PM -0400, Benjamin Okopnik wrote:
> > > 
> > > Fire this off with your top-level pic directory as an argument
> > 
> > ...or with a list of the directories that you want to traverse. Wrote it
> > that way, and forgot to mention it.
> > 
> Very nice proggie!

I love doing this kind of stuff. Keeps me in practice, especially if I've been concentrating on PHP, or JS, or whatever. :)

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Mon, 26 Jul 2010 19:27:31 -0400

On Mon, Jul 26, 2010 at 08:02:15AM +0100, Neil Youngman wrote:

> On Sunday 25 July 2010 21:23:52 Thomas Adam wrote:
> > On Sun, Jul 25, 2010 at 09:19:34PM +0100, Neil Youngman wrote:
> > > Are there any graphics experts in the gang who can confirm this?
> > > Alternatively suggestions of existing tools that will do the job, or
> > > better approaches, would be most welcome.
> >
> > Imagemagick has the "compare" command to do this.
> 
> I'm not sure that compare produces a simple yes/no answer. Even if it does, it 
> might be good for comparing 2 pictures, but when I'm looking at comparing m 
> pictures against n pictures, 1,000 < n < m, I'd be thrashing the system to 
> death.

It seems like there should be a way to do this in two passes - some kind of a rough (but fast) comparison that will weed out definitely incompatible images, and a second one to do a 'signature' comparison for precise matching - which would reduce the run time. I just wish I could think of a way to do that initial weeding pass... I don't think that size is a reasonable metric.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Karl-Heinz Herrmann [kh1 at khherrmann.de]


Tue, 27 Jul 2010 20:40:38 +0200

On Mon, 26 Jul 2010 19:27:31 -0400 Ben Okopnik <ben at linuxgazette.net> wrote:

> It seems like there should be a way to do this in two passes - some
> kind of a rough (but fast) comparison that will weed out definitely
> incompatible images, and a second one to do a 'signature' comparison
> for precise matching - which would reduce the run time. I just wish I
> could think of a way to do that initial weeding pass... I don't think
> that size is a reasonable metric.

I have been comparing big trees of files myself. My files are on two different computers, not just drives. So I went for a locally run find/md5sum, writing to a local file and a perl script which later compared the two files, sorting them by md5sums and outputting duplicate/missing files.

One problem I ran into was image tags. One copy of the files was tagged in f-photo, and unless the copy was taken after the first import, the actually identical photos were binary different because of the tags. I would have to strip all headers from both versions and compare what's left, but I never got round to implementing that. The same applies to tagged mp3, ogg, ...
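
[ The find/md5sum approach described above might look roughly like this. It is a guess at the shape of such a setup, not K.-H.'s actual scripts: awk stands in for the Perl comparison step, GNU find and md5sum are assumed, and 'make_list' and 'missing_from' are made-up names. Filenames containing newlines or backslashes are not handled. ]

```shell
#!/bin/sh
# make_list DIR OUTFILE: run on each machine; writes "md5  ./path" lines
# sorted by checksum. OUTFILE should live outside DIR.
make_list() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) | sort > "$2"
}

# missing_from MASTER_LIST OTHER_LIST: print the entries of OTHER_LIST
# whose checksum appears nowhere in MASTER_LIST.
missing_from() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)' "$1" "$2"
}
```

Each machine runs make_list locally; only the small list files need to be copied across for the comparison. Files that differ solely in their tags still show up as "missing", which is exactly the limitation described above.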

K.-H.




Ben Okopnik [ben at linuxgazette.net]


Tue, 27 Jul 2010 17:41:00 -0400

On Tue, Jul 27, 2010 at 08:40:38PM +0200, Karl-Heinz Herrmann wrote:

> 
> I have been comparing big trees of files myself. My files are on two
> different computers, not just drives. So I went for a locally
> run find/md5sum, writing to a local file and a perl script which later
> compared the two files, sorting them by  md5sums and outputting
> duplicate/missing files. 
> 
> A problem I ran into were image tags. One copy of the files was tagged
> in f-photo  and unless the copy was taken after the first import, the
> actually identical photos were binary different because of the tags. I
> would have to strip all headers on both version, compare whats left --
> but never got round to implement that. Same applies to tagged mp3,
> ogg, ... 

Well, OK - I got un-lazy enough to look this up (actually, I've been struggling with a corroded alternator bolt for the past however-long, and now that I've got it out, I desperately need a break!) Yep, Perl has an interface to ImageMagick, Image::Magick (this can be installed via 'apt-get install perlmagick') - and with that, things get lots easier. All the rest of the required modules are part of the default Perl installation.

#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Jul 25 17:12:49 EDT 2010
use strict;
use File::Find;
use Image::Magick;
use Cwd 'abs_path';
 
die "Syntax: ", $0 =~ /([^\/]+)$/, " <dir to search> [...]\n"
	unless defined $ARGV[0] && -d $ARGV[0];
 
# Canonicalize the directories
map { $_ = abs_path($_) } @ARGV;
 
my %list;
find(\&wanted, @ARGV);
	
sub wanted {
	return if -d $File::Find::name || $File::Find::name !~ /\.(?:jpg|png|gif|bmp)$/i;
	my $img = Image::Magick -> new();
	$img -> Read($File::Find::name);
	my $sig = $img -> Get('signature') || return;
 
	push @{$list{$sig}}, $File::Find::name;
}
 
for (sort keys %list){
	my $a;
	if (@{$list{$_}} > 1){
		print $a++?"\t$_\n":"$_:\n" for @{$list{$_}};
	}
}
 

Now that works as advertised - except it looks like signatures only stay the same if you convert JPG <=> PNG, but not to GIF or BMP. That seems rather limited. :\

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Kapil Hari Paranjape [kapil at imsc.res.in]


Wed, 28 Jul 2010 07:18:26 +0530

Hello,

On Tue, 27 Jul 2010, Ben Okopnik wrote:

> Yep, Perl has an interface to Image::Magick (this can be installed
> via 'apt-get install perlmagick') - and with that, things get lots
> easier.  All the rest of the required modules are part of the
> default Perl installation.

Knowing your oyster hunting proclivities, I was wondering whether I should suggest that to you, but I finally decided that 'identify -verbose $file | grep signature' was simpler.

Kapil. --




Ben Okopnik [ben at linuxgazette.net]


Wed, 28 Jul 2010 09:29:32 -0400

On Wed, Jul 28, 2010 at 07:18:26AM +0530, Kapil Hari Paranjape wrote:

> Hello,
> 
> On Tue, 27 Jul 2010, Ben Okopnik wrote:
> > Yep, Perl has an interface to Image::Magick (this can be installed
> > via 'apt-get install perlmagick') - and with that, things get lots
> > easier.  All the rest of the required modules are part of the
> > default Perl installation.
> 
> Knowing your oyster hunting proclivities, I was wondering whether I
> should suggest that to you, but I finally decided that
>  'identify -verbose $file | grep signature'
> was simpler.

[laugh] You know me too well, Kapil. The only problem with the above is that I'd be spawning a shell process to run each one of these - one per image - so that would be way too slow. With that many images to process, and an O(n^2) factor in the works, I wanted at least the controllable factors working as fast as possible.

I wonder if it would make sense to pre-sort the images by header date? As I'd said, anything rejected before the comparison would make a big difference.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Wed, 28 Jul 2010 15:03:11 +0100

On 28 July 2010 14:29, Ben Okopnik <ben at linuxgazette.net> wrote:

> On Wed, Jul 28, 2010 at 07:18:26AM +0530, Kapil Hari Paranjape wrote:
>> Hello,
>>
>> On Tue, 27 Jul 2010, Ben Okopnik wrote:
>> > Yep, Perl has an interface to Image::Magick (this can be installed
>> > via 'apt-get install perlmagick') - and with that, things get lots
>> > easier.  All the rest of the required modules are part of the
>> > default Perl installation.
>>
>> Knowing your oyster hunting proclivities, I was wondering whether I
>> should suggest that to you, but I finally decided that
>>  'identify -verbose $file | grep signature'
>> was simpler.
>
> [laugh] You know me too well, Kapil. The only problem with the above is
> that I'd be spawning a shell process to run each one of these - one per
> image - so that would be way too slow. With that many images to process,
> and an O(n^2) factor in the works, I wanted at least the controllable
> factors working as fast as possible.
>
> I wonder if it would make sense to pre-sort the images by header date?
> As I'd said, anything rejected before the comparison would make a big
> difference.

You could always check some of the Exif headers - if the photos were taken with different cameras, it's extremely unlikely that they're the same photo.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.




Ben Okopnik [ben at linuxgazette.net]


Wed, 28 Jul 2010 12:32:13 -0400

On Wed, Jul 28, 2010 at 03:03:11PM +0100, Jimmy O'Regan wrote:

> On 28 July 2010 14:29, Ben Okopnik <ben at linuxgazette.net> wrote:
> >
> > I wonder if it would make sense to pre-sort the images by header date?
> > As I'd said, anything rejected before the comparison would make a big
> > difference.
> 
> You could always check some of the Exif headers - if the photos were
> taken with different cameras, it's extremely unlikely that they're the
> same photo.

I just tried a few tests, and the time that it takes to get that info is just awful - about 5.5 seconds per image with 'identify', and about 1.125 seconds with Perl and Image::Magick. Unless someone comes up with a way to pre-sort by "external" data (e.g., file size, time, etc.), or read those bits significantly faster [1], we've pretty much defined the worst-case run-time for this thing.

[1] If somebody wants to look up the JPG header info and tell me where the necessary bits are located, I can read them "raw" and save a lot of time - but calculating stuff like signatures would likely be a pain.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Wed, 28 Jul 2010 17:45:31 +0100

On 28 July 2010 17:32, Ben Okopnik <ben at linuxgazette.net> wrote:

> On Wed, Jul 28, 2010 at 03:03:11PM +0100, Jimmy O'Regan wrote:
>> On 28 July 2010 14:29, Ben Okopnik <ben at linuxgazette.net> wrote:
>> >
>> > I wonder if it would make sense to pre-sort the images by header date?
>> > As I'd said, anything rejected before the comparison would make a big
>> > difference.
>>
>> You could always check some of the Exif headers - if the photos were
>> taken with different cameras, it's extremely unlikely that they're the
>> same photo.
>
> I just tried a few tests, and the time that it takes to get that info is
> just awful - about 5.5 seconds per image with 'identify', and about
> 1.125 seconds with Perl and Image::Magick. Unless someone comes up with
> a way to pre-sort by "external" data (e.g., file size, time, etc.), or
> read those bits significantly faster [1], we've pretty much defined the
> worst-case run-time for this thing.
>
> [1] If somebody wants to look up the JPG header info and tell me where
> the necessary bits are located, I can read them "raw" and save a lot of
> time - but calculating stuff like signatures would likely be a pain.

$ time perl -MImage::ExifTool -e '$et = new Image::ExifTool;
$et->ExtractInfo("foo.jpg");
$make=$et->GetValue("Make");$model=$et->GetValue("Model");
print "$make:$model\n";'
EASTMAN KODAK COMPANY:KODAK P880 ZOOM DIGITAL CAMERA

real	0m0.369s
user	0m0.148s
sys	0m0.016s

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.




Ben Okopnik [ben at linuxgazette.net]


Wed, 28 Jul 2010 13:34:05 -0400

On Wed, Jul 28, 2010 at 05:45:31PM +0100, Jimmy O'Regan wrote:

> 
> $ time perl -MImage::ExifTool -e '$et = new Image::ExifTool;
> $et->ExtractInfo("foo.jpg");
> $make=$et->GetValue("Make");$model=$et->GetValue("Model");print
> "$make:$model\n";'
> EASTMAN KODAK COMPANY:KODAK P880 ZOOM DIGITAL CAMERA
> 
> real	0m0.369s
> user	0m0.148s
> sys	0m0.016s

That's definitely faster, thanks. Too bad it doesn't contain that signature data - we'd be able to collect all the necessary info from just this one run, and then the slowdown would just be in the calculation end... It does have a bunch of data, though.

ben at Jotunheim:~/Pics$ perl -MImage::ExifTool=:Public -wle'
$a=ImageInfo("CIMG0019.jpg");print "$_: $a->{$_}" for sort keys %$a
'
Aperture: 2.4
ApertureValue: 1.1
BitsPerSample: 8
ColorComponents: 3
Comment:  
ComponentsConfiguration: -, Cr, Cb, Y
Compression: JPEG (old-style)
Directory: .
EncodingProcess: Baseline DCT, Huffman coding
ExifByteOrder: Little-endian (Intel, II)
ExifImageHeight: 1520
ExifImageWidth: 2032
ExifToolVersion: 8.00
ExifVersion: 0220
ExposureMode: Unknown (41987)
ExposureProgram: Program AE
ExposureTime: 1/65536000
FNumber: 2.4
FileModifyDate: 2009:10:20 22:19:59-04:00
FileName: CIMG0019.jpg
FileSize: 1145 kB
FileType: JPEG
Flash: Auto, Did not fire
FocalLength: 0.4 mm
FocalLength35efl: 0.4 mm
ImageHeight: 1520
ImageSize: 2032x1520
ImageWidth: 2032
JFIFVersion: 1.01
MIMEType: image/jpeg
Make: Palm
Model: Pre
ModifyDate: 2009:08:06 19:01:35
Orientation: Horizontal (normal)
ResolutionUnit: inches
ResolutionUnit (1): inches
ShutterSpeed: 1/65536000
ThumbnailImage: SCALAR(0x8e78218)
ThumbnailLength: 34697
ThumbnailOffset: 416
Warning: Bad IFD2 directory
XResolution: 72
XResolution (1): 72
YCbCrPositioning: Centered
YCbCrSubSampling: YCbCr4:2:0 (2 2)
YResolution: 72
YResolution (1): 72

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Karl-Heinz Herrmann [kh1 at khherrmann.de]


Wed, 28 Jul 2010 21:55:46 +0200

Hi,

On Tue, 27 Jul 2010 17:41:00 -0400 Ben Okopnik <ben at linuxgazette.net> wrote:

> 	my $sig = $img -> Get('signature') || return;

that signature feature is cool. Never thought about looking for a module like that. So, finally I will have to tackle my duplicates.....

With the Exif extraction sent by Jimmy (same camera, same image size, same Modify Time, which funnily enough is the older date compared to the file modification time), that should weed out plenty of images that cannot possibly be the same. And in case of an error, a (duplicate) file survives, which is not that terrible after all.

But before deleting one I would like to be sure they are the same.... apart from some tag info -- which might even be synced for a complete tag list in the surviving pic.

I'll be having a closer look at your code; my script comparing the md5sum file lists looks somewhat clumsy right now....

K.-H.




Ben Okopnik [ben at okopnik.com]


Wed, 28 Jul 2010 16:51:08 -0400

On Wed, Jul 28, 2010 at 09:55:46PM +0200, Karl-Heinz Herrmann wrote:

> Hi,
> 
> On Tue, 27 Jul 2010 17:41:00 -0400
> Ben Okopnik <ben at linuxgazette.net> wrote:
> 
> > 	my $sig = $img -> Get('signature') || return;
> 
> that signature feature is cool. Never thought about looking for a module
> like that. So, finally I will have to tackle my duplicates.....

Thanks to Kapil for that one. Hopefully, there's a more efficient and portable (i.e., works for more image formats) way to do it, though!

> With the exif-extraction sent by Jimmy -- Camera same, image size same,
> Modify Time (thats funnily the older date compared to Filemodify time)
> same -- that should weed out plenty of images which should not possibly
> be the same.

Y'know, that might be a good point. Even if the camera make and model are absent (true for about 1/4-1/3 of my photos), that would still be useful.

> And in case of an error a (duplicate) file survives which
> is not that terrible after all. 
> 
> But before deleting one I would like to be sure they are the same....
> apart from some tag info -- which might even be synced for a
> complete tag list in the surviving pic. 
> 
> I'll be having a closer look at your code -- my script comparing the
> md5sum file lists looks somehow clumsy right now.... 

:) I do try to make my code at least somewhat elegant, even if I write it in a hurry.

-- 
                       OKOPNIK CONSULTING
        Custom Computing Solutions For Your Business
Expert-led Training | Dynamic, vital websites | Custom programming
               443-250-7895    http://okopnik.com




Kapil Hari Paranjape [kapil at imsc.res.in]


Thu, 29 Jul 2010 08:20:21 +0530

Hello,

On Wed, 28 Jul 2010, Ben Okopnik wrote:

> Hopefully, there's a more efficient and
> portable (i.e., works for more image formats) way to do it, though!

The 'signature' from ImageMagick is calculated as part of its identification module, so it is available for all image formats.

However, since that calculation is based on 'raw data' it will not be able to compare two different image formats with each other (which is what I think Ben meant).

What should be possible (but slow!) is to write something that uses the magick library to convert the image into a standard bitmap (like ppmraw) and then match signatures (or just do a bit-by-bit comparison). This would work fine for loss-less compression like png but will not be so great for lossy formats like jpeg. Moreover, there would be problems of comparison between vector and bitmap formats since the conversion to bitmap would be lossy in the former case.

The question in those cases is to define a "measure" of the difference between two bitmaps in identical format (say ppmraw for definite-ness). There appears to be a program 'pnmpsnr' that does exactly that. This too should be accessible through the netpbm library.

If the measure is small enough then the older file is the ancestor and the newer one the descendant. (Always using "internal" time-stamps where available over file-system time-stamps.)

This provides an "in principle" solution which I am too lazy to code!

Regards,

Kapil. --




Ben Okopnik [ben at linuxgazette.net]


Wed, 28 Jul 2010 23:52:05 -0400

On Thu, Jul 29, 2010 at 08:20:21AM +0530, Kapil Hari Paranjape wrote:

> 
> What should be possible (but slow!) is to write something that uses
> the magick library to convert the image into a standard bitmap (like
> ppmraw) and then match signatures (or just do a bit-by-bit
> comparison). This would work fine for loss-less compression like png
> but will not be so great for lossy formats like jpeg. Moreover, there
> would be problems of comparison between vector and bitmap formats
> since the conversion to bitmap would be lossy in the former case.

Actually, for the real-world case of comparing camera-produced images, I think we can reject any that aren't in the same format (that would be a much more complex task, I agree.) If we're just trying to eliminate actual copies, then that would be pretty simple:

1st pass: use unique file sizes as keys, lists of files with that size as values

2nd pass: any lists with 2 or more files get checked for format and camera make/model equivalence

(optional) 3rd pass: any lists that still have 2 or more entries get checked for signature equivalence.

The actual solution is left to the student. :)
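
[ One possible student's answer, as a rough sketch only. It covers the first pass with file sizes and then, instead of the format/camera and signature passes, simply checksums the survivors with md5sum, so it finds exact copies and nothing more. GNU find (-printf), xargs (-0, -r), and uniq (-w, --all-repeated) are assumed, the function names are made up, and filenames containing newlines are not handled. ]

```shell
#!/bin/sh
# same_size_files DIR...: print paths of files whose byte size is shared
# with at least one other file. Uses metadata only, so it is cheap.
same_size_files() {
    find "$@" -type f -printf '%s\t%p\n' |
        awk -F '\t' '
            { count[$1]++; size[NR] = $1
              path[NR] = substr($0, length($1) + 2) }   # text after the tab
            END { for (i = 1; i <= NR; i++)
                      if (count[size[i]] > 1) print path[i] }'
}

# dupes_by_size_then_md5 DIR...: checksum only the files that survived
# the size pass, then group identical checksums.
dupes_by_size_then_md5() {
    same_size_files "$@" |
        tr '\n' '\0' |
        xargs -0 -r md5sum |
        sort |
        uniq -w32 --all-repeated=separate
}
```

The point of the ordering is that the expensive step (reading file contents) only runs on files that already share a size. Swapping the md5sum stage for a signature comparison would need a different first pass, since retagged copies usually differ in size.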

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *

