For several weekends, I had fun playing Kaggle: Avito Duplicate Ads Detection problem. This machine learning problem includes more than 10 million images in addition to the structured data set. In this competition, many players use image hashes instead of the actual images to optimize the model creation process.
What I found interesting is – most of the implementation of the image hashing uses a standard Discrete Cosine Transformation (CDT). I used to work with images many years back and remember that Discrete Wavelet Transformation (DWT) might give better results for images. I was unable to find any Python implementation DWT based image hashing, so I implemented one and pushed to the imagehash library. The change is available in the master branch on github and in the new version of the package. In this blogpost, I will describe how it works concisely.
1. Imagehash Python library
The most simple and effective library that I found was the imagehash library from Johannes Bucher. There were several image hashes implemented in the library: aHash, pHash, dHash. All three of the approaches scale an image into a grayscale 8×8 image first. Then the library performs some calculations for each of these 64 pixels and assigns a binary 1 or 0 value. These 64 bits form the output of algorithm. The bit computation methods are different:
- aHash – average hash, for each of the pixels output 1 if the pixel is bigger or equal to the average and 0 otherwise.
- pHash – perceptive hash, does the same as aHash, but first it does a Discrete Cosine Transformation and works in the frequency domain.
- dHash – gradient hash, calculate the difference for each of the pixel and compares the difference with the average differences.
- * wHash – wavelet hashing, that I added to the library a couple days back. It works in the frequency domain as pHash but it uses DWT instead of DCT.
You can fine more detailed description of the hashes in this blogpost.
The code below shows how to use the library.
from PIL import Image
hash1 = imagehash.phash(Image.open(‘test1.jpg’))
hash2 = imagehash.phash(Image.open(‘test2.jpg’))
hash1 == hash2
hash1 – hash2
The two images from the code example are definitely not equal. The 44 bits out of 64 are different. Similar images will have a difference up to 6-8 bits.
UPDATE: A typo was found by LIN China. The hash values and the difference were changed.
2. Calculate image hash
For regular photos, frequency based methods like pHash usually give better results because the frequency domain is more stable for images transformations like:
- JPG compression
- color schema change or applying image filters
- size scaling
- and even some minor image editing: cutting part of an image, marking an image by watermark, adding text of modifying an image .
For example, let’s take a look at an image and a transformed version of the same image. This is going to be a very popular Lenna image. Many image processing researches use this picture. I remember this picture pretty well from my student days when I did some image researches more than some 10 years back.
Let’s make some basic transformations on the image and compare the hashes. First of all we will introduce size change from 512×512 to 400×400 pixels. Then we will change color schema and then compress to JPEG for the final step.
from PIL import Image
lenna = PIL.Image.open(‘lenna.png’)
lenna1 = PIL.Image.open(‘lenna1.jpg’)
h = imagehash.phash(lenna)
h1 = imagehash.phash(lenna1)
Ha… not bad! No difference in the image hashes even after compression, resizing and color changing.
Let’s apply more transformations to the lenna1.jpg image (not the original one):
- take only the central part of the picture
- add text
- compress again
lenna2 = PIL.Image.open(‘lenna2.jpg’)
h2 = imagehash.phash(lenna2)
h – h2
(h – h2)/len(h.hash)**2
All right. Now we can see the hash difference is 20, or 31.2% per hash bit. The second metric is much better because the hash size is varies for different hashes.
aHash brings different results. Even simple transformation of lenna1.jpg shows 1.6% hash difference. More aggressive lenna2.jpg gives 29.7 % difference.
a = imagehash.average_hash(lenna)
a1 = imagehash.average_hash(lenna1)
a2 = imagehash.average_hash(lenna2)
a – a1
(a – a1)/len(a.hash)**2
(a – a2)/len(a.hash)**2
3. Wavelet hash
Discrete Wavelet Transformation (DWT) is another form of frequency representation. The popular DCT and Fourier transformations use a set of sin\cos functions as a basis: sin(x), sin(2x), sin(3x), etc. In contrast, DWT uses one single function as a basis but in different forms: scaled and shifted. The basis function can be changed and this is why we can have Haar wavelet, Daubechie-4 wavelet etc. This scaling effect gives us a great “time-frequency representation” when the low frequency part looks similar to the original signal.
There is a great Python library for wavelets – pywt. I used this library to implement whash() method for the imagehash library. By default whash() computes 8×8 hash using Haar transformation. In addition, the method removes the lowest Haar frequency LL(max). The lowest frequency consists from only one data point/pixel and this point represent the contrast of the image and isn’t so useful for hashing.
wHash Python code is below:
w = imagehash.whash(lenna)
w1 = imagehash.whash(lenna1)
w2 = imagehash.whash(lenna2)
(w – w1)/len(w.hash)**2
(w – w2)/len(w.hash)**2
To make results cleaner, let’s compare the original image with another one. The expected hash difference should be 50%. Here is another standard image for comparison – barbara.jpg. Let’s calculate the hash difference between Lenna and Barbara using all hashes. The code a listed below:
barb = PIL.Image.open(‘barbara.jpg’)
w_b = imagehash.whash(barb)
h_b = imagehash.phash(barb)
a_b = imagehash.average_hash(barb)
(a – a_b)/len(a.hash)**2
(h – h_b)/len(w.hash)**2
(w – w_b)/len(w.hash)**2
Table with all results:
|lenna vs. lenna1||1.6%||0%||3.1%|
|lenna vs. lenna2||29.7%||31.3%||28.1%|
|lenna vs. barbara||50%||53.1%||43.8%|
In the new whash() method, we can play with different parameters. The most important thing in whash() is the hash size. It is 8 by default but you can change it by any power of 2 number less than input image size (minimum by an image dimensions). Also, you can avoid removing the lowest frequency by setting parameter remove_max_haar_ll to False. In addition, you can change the initial scaling of the image rom 64 (which is 8×8) to any power of 2 less than the image size.
The most interesting parameter is mode – wavelet families. By default the library use haar wavelet but the value can be change to any value from pywt library like ‘db4’. See the library page.
> [‘haar’, ‘db’, ‘sym’, ‘coif’, ‘bior’, ‘rbio’, ‘dmey’]
5. Known issues
I had an issue when processed big number of small images. It looks like pywt has a memory leak. An issue was created in github. I’ll try to contact pywt creators regarding the issue.
To mitigate the issue I just split images by directories with ~50K images each and re-run processing for each directory separately.
It is hard to say which of the methods provides better results. It depends on your application and you should focus on your application or machine learning model metrics like precision\recall or AUC. For my Kaggle score the whash() brought +0.04% to AUC metric, in addition to my current ~92.9% result.
It doesn’t look like a huge difference. However, we should remember that in the modeling code, we achieved this by a one-letter change from phash() to whash(). It is nice to have more advanced analytical tools and I hope this method will be a good addition to your analytical toolbox. In addition, I believe that wHash has a great potential for tuning by the method parameters.
Please share your experience in using the library. Any comments, suggestions, code improvements and fixes are highly appreciated.