PHP: Determine Visually Corrupted Images (yet valid) downloaded via Curl with GD/Imagemagick

I'm using Curl via Proxies to download images with a scraper I have developed.

Unfortunately, it gets the odd image which looks like these and the last one is completely blank :/

[Images: 3/4 corrupted, corrupted dog, corrupted room, completely white]

  • When I test the images via imagemagick (using identify) it tells me they are valid images.
  • When I test the images via exif_imagetype() and imagecreatefromjpeg() again, both these functions tell me the images are valid.

Does anyone have a way to determine whether an image is mostly grey or completely blank/white, which would indicate that it is one of these corrupted images?

I have done a lot of checking with other questions on here, but I haven't had much luck with other solutions. So please take care in suggesting this is a duplicate.

Thanks


After learning about imagecolorat(), I did a search and stumbled on some code. I came up with this:

<?php

$file = dirname(__FILE__) . "/images/1.jpg";

$img = imagecreatefromjpeg($file);

$imagew = imagesx($img);
$imageh = imagesy($img);

// Only the last 5 rows of the image are scanned.
$last_height = max(0, $imageh - 5);

$foo = array();

// Valid coordinates run from 0 to width - 1 and from 0 to height - 1.
for ($x = 0; $x < $imagew; $x++)
{
    for ($y = $last_height; $y < $imageh; $y++)
    {
        $rgb = imagecolorat($img, $x, $y);

        // Only the red channel is used for the gray test below.
        $r = ($rgb >> 16) & 0xFF;

        if ($r != 0)
        {
            $foo[] = $r;
        }
    }
}

// Count how often each red value occurs.
$bar = array_count_values($foo);

$gray = (isset($bar['127']) ? $bar['127'] : 0) + (isset($bar['128']) ? $bar['128'] : 0) + (isset($bar['129']) ? $bar['129'] : 0);
$total = count($foo);
$other = $total - $gray;

if ($gray > $other)
{
    echo "image corrupted \n";
}
else
{
    echo "image not corrupted \n";
}
?>

Does anyone see potential pitfalls with this? The idea is to take the last few rows of the image and compare the number of pixels whose red value is 127, 128 or 129 (the gray fill) against the number of pixels with any other non-zero red value. If the gray pixels outnumber the rest, the image is very likely corrupted.

Opinions welcome! :)

Pennsylvanian answered 24/1, 2012 at 22:18 Comment(1)
Hmm. If all those functions say it's a valid image, they probably just check the header bytes but don't look at whether the entire file is actually there. I would expect there to be a header byte that specifies the expected width, but I don't know for sure whether such a thing exists. - Glairy
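The suspicion in this comment can be checked without decoding any pixels: a complete JPEG stream begins with the SOI marker (FF D8) and ends with the EOI marker (FF D9). The following is a minimal sketch, assuming the download was cut off part-way; hasJpegEndMarker is a hypothetical helper name, and trimming trailing padding is an assumption about how some servers pad files. A missing end marker is a strong hint of truncation, not proof, and a file that was re-encoded with a grey fill upstream will still pass.

```php
<?php
// Hypothetical helper: checks the JPEG start (FF D8) and end (FF D9) markers.
// It does not decode the image, so it only detects truncated byte streams.
function hasJpegEndMarker(string $bytes): bool
{
    if (strlen($bytes) < 4 || substr($bytes, 0, 2) !== "\xFF\xD8") {
        return false; // not a JPEG stream at all
    }
    // Assumption: tolerate trailing NUL or whitespace padding after the marker.
    $trimmed = rtrim($bytes, "\x00\r\n\t ");
    return substr($trimmed, -2) === "\xFF\xD9";
}
```

It could be called as hasJpegEndMarker(file_get_contents($file)) before any GD check is attempted.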

If the image it returns is a valid file, then I would recommend running the scrape twice (i.e. download it twice and check whether the two copies are the same).
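A rough sketch of the download-twice idea, assuming the corruption is random rather than reproducible. Here $fetch is a hypothetical callable standing in for the scraper's existing Curl code, and downloadsMatch is an illustrative name. If a proxy returns the same damaged bytes both times, this check will not catch it.

```php
<?php
// $fetch is a hypothetical callable that returns the raw bytes for a URL
// (for example a wrapper around the scraper's Curl code), or false on failure.
function downloadsMatch(callable $fetch, string $url): bool
{
    $first = $fetch($url);
    $second = $fetch($url);
    if (!is_string($first) || !is_string($second) || $first === '') {
        return false; // a failed or empty download never counts as a match
    }
    // Byte-for-byte comparison of the two downloads.
    return $first === $second;
}
```

When the two copies differ, the image would be queued for another download rather than stored.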

Another option would be to check the last few pixels of the image (i.e. the bottom-right corner) to see whether they match that shade of grey exactly. If they do, redownload. (Obviously this approach fails if you download an image that is actually supposed to be that exact colour in that corner, but checking several of the last pixels should reduce the chance of that to an acceptable level.)

Colman answered 24/1, 2012 at 22:23 Comment(3)
*This second approach assumes that your scrape tool is actually fully populating the entire image and not just choking part-way through and giving you a partial file. - Colman
I am all for checking the last few pixels of the image to see whether they are grey. I just don't know how to do this. If you do come up with a solution, please check it against the provided images. - Pennsylvanian
Many thanks for that. I had already downloaded 60,000 images that needed to be checked, and with the code snippet I devised I have now checked all of them and have no broken images. Cheers! - Pennsylvanian

I found this page while looking for a way to detect visually corrupted images like these. Here is a way to solve the problem using bash (the convert command line can easily be adapted for PHP or Python):

convert INPUTFILEPATH -gravity SouthWest -crop 20%x1%   -format %c  -depth 8  histogram:info:- | sed '/^$/d'  | sort -V | head -n 1 | grep fractal | wc -l

It crops a thin strip from the southwest (bottom-left) corner of the picture, 20% of the width by 1% of the height, then gets the histogram of that strip. If the main color of the histogram has the name "fractal" instead of an RGB value, this zone is corrupted and the pipeline prints 1; otherwise it prints 0.

Hope this helps!

Budwig answered 11/10, 2013 at 14:37 Comment(2)
Seems to work. What does 'fractal' actually mean in the histogram? - Iraidairan
Fractal is just the color name for #808080. I know this is old, but we've just run into an issue where the bottom part of the image actually is validly grey. It would be really nice to be able to specify what "default" color should be there instead of "fractal", any ideas? - Toothache
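One way to make the expected fill colour configurable is to compare hex codes instead of grepping for the colour name. The sketch below only parses histogram text that has already been captured from convert. It assumes each line looks roughly like "  990: (128,128,128) #808080 fractal", which may differ between ImageMagick versions, and both function names are hypothetical.

```php
<?php
// Returns the hex code (e.g. "#808080") of the most frequent colour in
// ImageMagick histogram text, or null if no line could be parsed.
// Assumption: each line looks like "  1234: (128,128,128) #808080 fractal".
function dominantHistogramColor(string $histogram): ?string
{
    $bestCount = -1;
    $bestHex = null;
    foreach (preg_split('/\R/', $histogram) as $line) {
        if (preg_match('/^\s*(\d+):.*?(#[0-9A-Fa-f]{6})/', $line, $m)) {
            if ((int) $m[1] > $bestCount) {
                $bestCount = (int) $m[1];
                $bestHex = strtoupper($m[2]);
            }
        }
    }
    return $bestHex;
}

// $fillColor is the colour expected in a corrupted region (default #808080).
function cornerLooksCorrupted(string $histogram, string $fillColor = '#808080'): bool
{
    return dominantHistogramColor($histogram) === strtoupper($fillColor);
}
```

This does not solve the case where the strip is legitimately the fill colour; it only lets the caller choose which colour counts as suspicious.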

I use this one. If most of the pixels in the bottom-right corner (5x5) are grey, the image is broken.

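    define('MIN_WIDTH',500);
    define('MIN_HEIGHT',200);

    // Returns true if the image looks intact, and false if it is unreadable,
    // too small, or has 12 or more #808080 pixels in its bottom-right 5x5 corner.
    function isGoodImage($fn){
        $size=getimagesize($fn);
        if($size===false) return false;
        list($w,$h)=$size;
        if($w<MIN_WIDTH || $h<MIN_HEIGHT) return false;
        $im=imagecreatefromstring(file_get_contents($fn));
        if($im===false) return false;
        $grey=0;
        for($i=0;$i<5;++$i){
            for($j=0;$j<5;++$j){
                    $x=$w-5+$i;
                    $y=$h-5+$j;
                    // imagecolorsforindex() returns red, green, blue and alpha keys.
                    $c=imagecolorsforindex($im,imagecolorat($im,$x,$y));
                    if($c['red']==128 && $c['green']==128 && $c['blue']==128)
                        ++$grey;
            }
        }
        return $grey<12;
    }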
Adams answered 13/2, 2014 at 9:28 Comment(0)

ImageMagick's identify command will detect far more corrupt images if you call it with the -verbose option. There is also a -regard-warnings option, which makes it treat warnings as errors. Try these against a bad image and see whether the result is a non-zero exit code.
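From PHP, this suggestion could be wired up roughly as follows. It is a sketch that assumes the identify binary is on the PATH; identifyCommand and identifyAccepts are hypothetical helper names. Whether the greyed-out images from the question actually produce a warning is something to verify against the sample files.

```php
<?php
// Builds the command line suggested above. The path is shell-escaped, and
// stderr is merged into stdout so that warnings end up in the captured output.
function identifyCommand(string $path): string
{
    return 'identify -verbose -regard-warnings ' . escapeshellarg($path) . ' 2>&1';
}

// Returns true when identify exits with status 0.
// Assumption: the ImageMagick "identify" binary is installed and on the PATH.
function identifyAccepts(string $path): bool
{
    exec(identifyCommand($path), $output, $status);
    return $status === 0;
}
```

A file for which identifyAccepts() returns false would be treated as a candidate for redownloading.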

Cloots answered 19/10, 2016 at 11:2 Comment(0)
