regex differentiating between ISBN-10 and ISBN-13

I have an if-else statement that checks whether a string contains an ISBN-10 or an ISBN-13 (book ID).

The problem is that the ISBN-10 check runs before the ISBN-13 check. Because the ISBN-10 pattern matches any run of 10 or more digits, it can mistake an ISBN-13 for an ISBN-10.

Here is the code:

$str = "ISBN:9780113411436";

if(preg_match("/\d{9}(?:\d|X)/", $str, $matches)){
   echo "ISBN-10 FOUND\n";  
   //isbn returned will be 9780113411
   return 0;
}

else if(preg_match("/\d{12}(?:\d|X)/", $str, $matches)){
   echo "ISBN-13 FOUND\n";
   //isbn returned will be 9780113411436
   return 1;
}

How do I make sure I avoid this problem?

Countrywoman answered 30/12, 2012 at 23:30 Comment(4)
Um... swap the order? - Cushitic
The $str variable is not a valid ISBN number, and will not match either of the regular expressions you have provided. What would you like to match? Something like $str, or an actual ISBN? - Foreplay
@WillC. What do you mean it is not a valid ISBN? This is an actual book which can be found on Amazon. - Countrywoman
My mistake, I did not read the ISBN documentation properly. You are correct: isbn.org/standards/home/isbn/us/isbnqa.asp#Q3 Be careful, though: it appears that ':' is not the only separator: isbn.org/standards/home/isbn/international/html/usm4.htm - Foreplay

You really only need one regex for this. Then do a more efficient strlen() check to see which one was matched. The following will match ISBN-10 and ISBN-13 values within a string with or without hyphens, and optionally preceded by the string ISBN:, ISBN:(space) or ISBN(space).

Finding ISBNs:

function findIsbn($str)
{
    $regex = '/\b(?:ISBN(?:: ?| ))?((?:97[89])?\d{9}[\dx])\b/i';

    if (preg_match($regex, str_replace('-', '', $str), $matches)) {
        return (10 === strlen($matches[1]))
            ? 1   // ISBN-10
            : 2;  // ISBN-13
    }
    return false; // No valid ISBN found
}

var_dump(findIsbn('ISBN:0-306-40615-2'));     // return 1
var_dump(findIsbn('0-306-40615-2'));          // return 1
var_dump(findIsbn('ISBN:0306406152'));        // return 1
var_dump(findIsbn('0306406152'));             // return 1
var_dump(findIsbn('ISBN:979-1-090-63607-1')); // return 2
var_dump(findIsbn('979-1-090-63607-1'));      // return 2
var_dump(findIsbn('ISBN:9791090636071'));     // return 2
var_dump(findIsbn('9791090636071'));          // return 2
var_dump(findIsbn('ISBN:97811'));             // return false

This searches the provided string for a possible ISBN-10 value (returns 1) or ISBN-13 value (returns 2). If neither is found, it returns false.
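If the matched number itself is needed rather than only a type code, the same pattern can hand back its capture group. The following is a minimal sketch, not part of the original answer: the helper name extractIsbn and the array shape it returns are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the matched ISBN and its length class,
// or null when nothing ISBN-shaped is found.
function extractIsbn(string $str): ?array
{
    $regex = '/\b(?:ISBN(?:: ?| ))?((?:97[89])?\d{9}[\dx])\b/i';

    // Hyphens are removed first so "0-306-40615-2" becomes "0306406152".
    if (!preg_match($regex, str_replace('-', '', $str), $matches)) {
        return null;
    }

    $isbn = strtoupper($matches[1]);

    return [
        'isbn' => $isbn,
        'type' => (10 === strlen($isbn)) ? 'ISBN-10' : 'ISBN-13',
    ];
}

print_r(extractIsbn('some text ISBN:0-306-40615-2 more text'));
print_r(extractIsbn('end. 9780113411142'));
```

Like the function it is based on, this only recognizes the shape of an ISBN; it does not verify the check digit.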



Validating ISBNs:

For strict validation the Wikipedia article for ISBN has some PHP validation functions for ISBN-10 and ISBN-13. Below are those examples copied, tidied up and modified to be used against a slightly modified version of the above function.

Change the return block to this:

    return (10 === strlen($matches[1]))
        ? isValidIsbn10($matches[1])  // ISBN-10
        : isValidIsbn13($matches[1]); // ISBN-13

Validate ISBN-10:

function isValidIsbn10($isbn)
{
    $check = 0;

    for ($i = 0; $i < 10; $i++) {
        if (9 === $i && 'x' === strtolower($isbn[$i])) { // X is only valid as the final check character
            $check += 10 * (10 - $i);
        } elseif (is_numeric($isbn[$i])) {
            $check += (int)$isbn[$i] * (10 - $i);
        } else {
            return false;
        }
    }

    return (0 === ($check % 11)) ? 1 : false;
}

Validate ISBN-13:

function isValidIsbn13($isbn)
{
    $check = 0;

    for ($i = 0; $i < 13; $i += 2) {
        $check += (int)$isbn[$i];
    }

    for ($i = 1; $i < 12; $i += 2) {
        $check += 3 * (int)$isbn[$i];
    }

    return (0 === ($check % 10)) ? 2 : false;
}
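
To make the two checksum rules concrete, here is a small sketch that prints the weighted sums for sample numbers that appear on this page. The function names isbn10Sum and isbn13Sum are hypothetical and exist only for this illustration; both assume the input has already been reduced to exactly 10 or 13 characters with hyphens removed.

```php
<?php
// ISBN-10: weights run 10, 9, ..., 1; a trailing X counts as 10.
// The number is valid when the sum is divisible by 11.
function isbn10Sum(string $isbn): int
{
    $sum = 0;
    for ($i = 0; $i < 10; $i++) {
        $value = ('X' === strtoupper($isbn[$i])) ? 10 : (int) $isbn[$i];
        $sum += $value * (10 - $i);
    }
    return $sum;
}

// ISBN-13: weights alternate 1, 3, 1, 3, ...
// The number is valid when the sum is divisible by 10.
function isbn13Sum(string $isbn): int
{
    $sum = 0;
    for ($i = 0; $i < 13; $i++) {
        $sum += (int) $isbn[$i] * ((0 === $i % 2) ? 1 : 3);
    }
    return $sum;
}

echo isbn10Sum('0306406152'), "\n";    // 132, and 132 % 11 === 0
echo isbn13Sum('9780113411436'), "\n"; // 80, and 80 % 10 === 0
echo isbn13Sum('9791090636071'), "\n"; // 130, and 130 % 10 === 0
```

Because the ISBN-13 sum is taken modulo 10, its check digit is always 0 to 9; only the ISBN-10 scheme (modulo 11) needs the extra X symbol.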


Lighten answered 31/12, 2012 at 0:24 Comment(8)
I just realised that, which is why I deleted my previous comment. I have one request: can you please remove the "ISBN:" from the regex? I just want to match an ISBN; the str I gave was an example. Other examples of strings I may get could be: end. 9780113411142 - Countrywoman
So ISBN: is optional then? So you want to match strings with and without it present? - Lighten
Yes, exactly. I just want to know whether a certain ISBN is present within strings of different lengths. - Countrywoman
@mk_89, strings of different lengths: what do you mean by that? Do you mean matching an ISBN inside, say, this string: some text ISBN:1234567890 more text? - Lighten
Yes, exactly. I would like to match an ISBN from a piece of text such as the example you have given me. - Countrywoman
Let us continue this discussion in chat. - Lighten
ISBN-13 numbers can begin with 978 or 979 (Wikipedia). The first 979 prefix was issued in France (Source). - Yelp
Updated to allow for detection and validation of 979-prefixed ISBNs as well. - Lighten

Use ^ and $ to anchor the pattern to the beginning and end of the string. With the anchors in place, the order in which you test the 10-digit and 13-digit patterns no longer matters.

10 digits

/^ISBN:(\d{9}(?:\d|X))$/

13 digits

/^ISBN:(\d{12}(?:\d|X))$/
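
The effect of the anchors can be seen by running both forms against a bare 13-digit number. This is a small illustration only; the ISBN: prefix is left out of the patterns here, an assumption made to keep the comparison minimal.

```php
<?php
$isbn13 = '9780113411436';

// Unanchored: the pattern is satisfied by the first 10 characters.
preg_match('/\d{9}(?:\d|X)/', $isbn13, $loose);
echo $loose[0], "\n"; // 9780113411

// Anchored: the whole string must be exactly 10 characters, so no match.
$matched10 = preg_match('/^\d{9}(?:\d|X)$/', $isbn13);
echo $matched10, "\n"; // 0

// The anchored 13-character pattern matches the full string.
$matched13 = preg_match('/^\d{12}(?:\d|X)$/', $isbn13);
echo $matched13, "\n"; // 1
```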

Note: According to http://en.wikipedia.org/wiki/International_Standard_Book_Number, it appears as though ISBNs can have a - in them as well. But based on the $str you're using, it looks like you've removed the hyphens before checking for 10 or 13 digits.

Additional note: Because the last character of an ISBN is a check digit computed from the preceding digits, a regular expression alone cannot confirm that an ISBN is valid. It can only check for the 10-character or 13-character format.
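
Since real-world input may carry hyphens or spaces, one option is to normalize the string before applying anchored patterns. The following is a sketch under stated assumptions: the function name classifyIsbn is hypothetical, the only separators handled are hyphens and spaces, and the 13-character pattern accepts digits only, because the ISBN-13 check digit is computed modulo 10 and so is never X.

```php
<?php
// Hypothetical helper: returns 10, 13, or null after stripping separators.
// It checks the format only, not the check digit.
function classifyIsbn(string $str): ?int
{
    // Drop an optional "ISBN" label (with optional colon), then hyphens and spaces.
    $bare = preg_replace('/^\s*ISBN:?/i', '', $str);
    $bare = str_replace(['-', ' '], '', $bare);

    if (preg_match('/^\d{9}[\dX]$/i', $bare)) {
        return 10;
    }
    if (preg_match('/^\d{13}$/', $bare)) {
        return 13;
    }
    return null;
}

var_dump(classifyIsbn('ISBN:0-306-40615-2'));     // int(10)
var_dump(classifyIsbn('ISBN 979-1-090-63607-1')); // int(13)
var_dump(classifyIsbn('ISBN:1234'));              // NULL
```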


$isbns = array(
  'ISBN:1234567890',       // 10-digit
  'ISBN:123456789X',       // 10-digit ending in X
  'ISBN:1234567890123',    // 13-digit
  'ISBN:123456789012X',    // 13-digit ending in X
  'ISBN:1234'              // invalid
);

function get_isbn($str) {
   if (preg_match('/^ISBN:(\d{9}(?:\d|X))$/', $str, $matches)) {
      echo "found 10-digit ISBN\n";
      return $matches[1];
   }
   elseif (preg_match('/^ISBN:(\d{12}(?:\d|X))$/', $str, $matches)) {
      echo "found 13-digit ISBN\n";
      return $matches[1];
   }
   else {
      echo "invalid ISBN\n";
      return null;
   }
}

foreach ($isbns as $str) {
   $isbn = get_isbn($str);
   echo $isbn."\n\n";
}

Output

found 10-digit ISBN
1234567890

found 10-digit ISBN
123456789X

found 13-digit ISBN
1234567890123

found 13-digit ISBN
123456789012X

invalid ISBN
Byline answered 30/12, 2012 at 23:32 Comment(5)
The regular expressions you've provided don't match the same strings as the regular expressions in the posted question. - Foreplay
@WillC., which is what the problem is; the OP's regular expressions are poorly written, which is why the logic is failing in the first place. - Archer
Let me clarify my statement: ISBN-10 and ISBN-13 formats can end in an 'X' character, which your regular expression cannot match. - Foreplay
@Byline Poorly written? Ouch. Your expression fails: an ISBN can be a mixture of numbers and letters, and I am not interested in matching "ISBN:" when doing the regex comparison. - Countrywoman
@mk_89, I do not intend to offend you. Wikipedia does not seem to suggest that a "mixture of numbers and letters" is valid. Please do note that checking for 10-digit and 13-digit formats is not the same thing as validating the ISBN itself. See my updated post for details. - Archer

Switch the order of the if-else block, and strip all whitespace, colons, and hyphens from your ISBN first:

//Replace all the fluff that some companies add to ISBNs
$str = preg_replace('/(\s+|:|-)/', '', $str);

if(preg_match("/^ISBN\d{12}(?:\d|X)$/", $str, $matches)){
   echo "ISBN-13 FOUND\n";
   //isbn returned will be 9780113411436
   return 1;
}

else if(preg_match("/^ISBN\d{9}(?:\d|X)$/", $str, $matches)){
   echo "ISBN-10 FOUND\n";  
   //reached only for a 10-character ISBN, e.g. 0306406152
   return 0;
}
Foreplay answered 30/12, 2012 at 23:32 Comment(3)
The order of the tests shouldn't matter. I.e., a 13-digit ISBN should never return a 10-digit one. Using that logic, we could say we have a valid Social Security Number simply because it matched a segment of digits in a Driver's License Number. - Archer
Given the OP's regular expressions, the order does matter. I agree that the regular expressions ought to be changed to match the entire string on the basis that additional ISBN matches (like ISBN-11 if it exists) would need to go in a certain order. - Foreplay
That's my point though, you're treating a symptom of the bad logic rather than just fixing the problem with well-written logical tests. - Archer

Put the ISBN-13 check before the ISBN-10 check? This assumes that you want to match them as part of any string (your example has an extra "ISBN:" at the start, so matching anywhere in a string seems to be a requirement of some sort).
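
If matching anywhere in a longer string is the goal, another option is to forbid digits on either side of the match with lookarounds, which makes the order of the two tests irrelevant. This is only a sketch: the function name detectIsbnLength is hypothetical, and it assumes hyphens have already been removed and that a check character X is uppercase.

```php
<?php
// Hypothetical helper: reports whether a 10- or 13-character ISBN-shaped
// run appears anywhere in the text. Assumes hyphens were stripped beforehand.
function detectIsbnLength(string $text): ?int
{
    // (?<!\d) and (?!\d) stop the match from being a slice of a longer digit run,
    // so a 13-digit number can never satisfy the 10-character test.
    if (preg_match('/(?<!\d)\d{9}[\dX](?!\d)/', $text)) {
        return 10;
    }
    if (preg_match('/(?<!\d)\d{13}(?!\d)/', $text)) {
        return 13;
    }
    return null;
}

var_dump(detectIsbnLength('ISBN:9780113411436'));                  // int(13)
var_dump(detectIsbnLength('some text ISBN:0306406152 more text')); // int(10)
var_dump(detectIsbnLength('no number here'));                      // NULL
```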

Premonish answered 30/12, 2012 at 23:32 Comment(0)
ISBN10_REGEX = /^(?:\d[ -]?){9}[\dX]$/i
ISBN13_REGEX = /^(?:\d[ -]?){13}$/i
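
For use from PHP, patterns of this shape can be passed to preg_match as strings. The following is a sketch, assuming each digit may be followed by at most one space or hyphen. The character classes are written without |, because inside [...] a pipe is a literal character rather than alternation. The variable names are illustrative only.

```php
<?php
// Each digit may be followed by one optional space or hyphen.
$isbn10Regex = '/^(?:\d[ -]?){9}[\dX]$/i';
$isbn13Regex = '/^(?:\d[ -]?){13}$/';

var_dump(preg_match($isbn10Regex, '0-306-40615-2'));     // int(1)
var_dump(preg_match($isbn13Regex, '979-1-090-63607-1')); // int(1)
var_dump(preg_match($isbn10Regex, '979-1-090-63607-1')); // int(0)
```

Because both patterns are anchored, a 13-digit number cannot be mistaken for a 10-digit one regardless of which test runs first.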
Osborn answered 11/8, 2016 at 7:46 Comment(1)
You should add an explanation of how this answers the question. Answers that are just code are not considered good answers on SO. - Majormajordomo
