I'm working on an image classification problem using a CNN. My image data set contains duplicated images, and when I train the CNN on this data, it overfits. Therefore, I need to remove those duplicates.
What we loosely refer to as duplicates can be difficult for algorithms to discern. Your duplicates can be either:
Nos. 1 and 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer a solution for Nos. 1 and 2. Both solutions use the excellent image hashing library imagehash: https://github.com/JohannesBuchner/imagehash
from PIL import Image
import imagehash

# image_fns: list of training image file paths
img_hashes = {}

for img_fn in sorted(image_fns):
    # Named img_hash so the built-in hash() is not shadowed
    img_hash = imagehash.average_hash(Image.open(img_fn))
    if img_hash in img_hashes:
        # An identical hash was already seen: report this file as a duplicate
        print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = img_fn
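from itertools import combinations
from PIL import Image
import imagehash

# image_fns: list of training image file paths
# epsilon is the Hamming-distance threshold. The default average_hash has
# 64 bits, so 50 is very permissive; lower it for stricter matching.
epsilon = 50

# Hash each image once instead of re-opening it for every comparison
img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}

# Compare every unordered pair of files exactly once
for img_fn1, img_fn2 in combinations(image_fns, 2):
    # Subtracting two ImageHash objects gives the number of differing bits
    if img_hashes[img_fn1] - img_hashes[img_fn2] < epsilon:
        print('{} is near duplicate of {}'.format(img_fn1, img_fn2))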
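Both snippets only print the duplicates. To actually drop them before training, one possible approach is to keep the first file seen for each group of matching hashes and collect the rest for removal. The sketch below is not part of the original answer: `split_duplicates`, `hash_fn`, `distance_fn` and `max_distance` are hypothetical names, and the hash function is passed in so the logic does not depend on any particular library. With imagehash you could pass `hash_fn=lambda p: imagehash.average_hash(Image.open(p))` and leave `distance_fn` at its default, since subtracting two ImageHash objects yields their Hamming distance. Note that it uses `<=` rather than the `<` above, so `max_distance=0` means exact hash matches only.

```python
from typing import Any, Callable, Iterable, List, Tuple


def split_duplicates(
    paths: Iterable[str],
    hash_fn: Callable[[str], Any],
    max_distance: int = 0,
    distance_fn: Callable[[Any, Any], int] = lambda a, b: a - b,
) -> Tuple[List[str], List[str]]:
    """Split paths into (kept, dropped) lists.

    A file is dropped when its hash lies within max_distance of the hash
    of a file that was already kept. Hypothetical helper for illustration.
    """
    kept: List[Tuple[str, Any]] = []  # (path, hash) of files we keep
    dropped: List[str] = []
    # Sorting makes the choice of which copy survives deterministic
    for path in sorted(paths):
        h = hash_fn(path)
        if any(distance_fn(h, kept_h) <= max_distance for _, kept_h in kept):
            dropped.append(path)
        else:
            kept.append((path, h))
    return [p for p, _ in kept], dropped
```

The `kept` list can then be used as the training set. This does a linear scan over the kept files for every new file, which should be fine for small data sets but is quadratic in the worst case.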