Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .

Imagine some sentences along the following lines:

  • Hello blah blah. It's around 11 1/2" x 32".
  • The dimensions are 8 x 10-3/5!
  • Probably somewhere in the region of 22" x 17".
  • The roll is quite large: 42 1/2" x 60 yd.
  • They are all 5.76 by 8 frames.
  • Yeah, maybe it's around 84cm long.
  • I think about 13/19".
  • No, it's probably 86 cm actually.

I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:

  • 11 1/2" x 32"
  • 8 x 10-3/5
  • 22" x 17"
  • 42 1/2" x 60 yd
  • 5.76 by 8
  • 84cm
  • 13/19"
  • 86 cm

I imagine a world where the following rules apply:

  • The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
  • A dimension is always described numerically, may or may not have units following it and may or may not have a fractional or decimal part. Being made up of a fractional part on its own is allowed, e.g., 4/5".
  • Fractional parts always have a / separating the numerator / denominator, and one can assume there is no space between the parts (though if someone takes that into account that's great!).
  • Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
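
The rules above can be sketched as a small pattern builder. This is only a sketch in Python (the question does not name a regex flavour, so Python's `re` module is an assumption), and `build_dimension_regex`, `extract`, `UNITS` and `DIMENSION` are hypothetical names. It takes the unit set as a parameter and uses two branches: two or more dimensions joined by a separator, where units are optional, or a single dimension that must carry a unit.

```python
import re

def build_dimension_regex(units, separators=("x", "by")):
    """Compile a pattern for the rules above from an arbitrary unit set."""
    # Longest units first, so "yards" is tried before "yd".
    unit_alts = []
    for u in sorted(units, key=len, reverse=True):
        # Word-like units get a trailing \b so "cm" does not match inside "cmyk".
        unit_alts.append(re.escape(u) + (r"\b" if u[-1].isalnum() else ""))
    unit = r"\s?(?:" + "|".join(unit_alts) + ")"
    # A bare fraction (13/19), or an integer/decimal with an optional
    # fraction attached by a space or hyphen (11 1/2, 10-3/5).
    num = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"
    dim = num + "(?:" + unit + ")?"  # unit is optional here
    sep = r"\s*(?:" + "|".join(re.escape(s) for s in separators) + r")\s*"
    # Branch 1: two or more dimensions. Branch 2: one dimension with a unit.
    # The lookbehind stops a match from starting inside a word or a number.
    return re.compile(
        r"(?<![\w.])(?:" + dim + "(?:" + sep + dim + ")+|" + num + unit + ")"
    )

def extract(regex, sentence):
    """Return the first dimension found in the sentence, or None."""
    m = regex.search(sentence)
    return m.group(0) if m else None

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]
DIMENSION = build_dimension_regex(UNITS)

for s in ['It\'s around 11 1/2" x 32".',
          "They are all 5.76 by 8 frames.",
          "A number on its own 21."]:
    print(extract(DIMENSION, s))
```

Because a separator only counts when another dimension follows it, a dangling "x" or "by" is left out of the match, and a lone number without a unit is rejected by both branches.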

To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .

[1-9]+[/ ][x1-9]

Update (2)

You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:

  • The last but one test case is 12 yd x.
  • The last test case is 99 cm by.
  • This sentence doesn't have dimensions in it: 342 / 5553 / 222.
  • Three dimensions? 22" x 17" x 12 cm
  • This is a product code: c720 with another number 83 x better.
  • A number on its own 21.
  • A volume shouldn't match 0.332 oz.

These should result in the following (# indicates nothing should match):

  • 12 yd
  • 99 cm
  • #
  • 22" x 17" x 12 cm
  • #
  • #
  • #

I've adapted M42's answer below, to:

\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?

But while that resolves some of the new test cases, it now fails on others. It reports:

  • 11 1/2" x 32" PASS
  • (nothing) FAIL
  • 22" x 17" PASS
  • 42 1/2" x 60 yd PASS
  • (nothing) FAIL
  • 84cm PASS
  • 13/19" PASS
  • 86 cm PASS
  • 22" PASS
  • (nothing) FAIL
  • (nothing) FAIL
  • 12 yd x FAIL
  • 99 cm by FAIL
  • 22" x 17" [and also, but separately '12 cm'] FAIL
  • PASS
  • PASS
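
One hedged way to see why the adapted pattern behaves this way is to run it directly. The sketch below assumes Python's `re` module (the flavour used in the question is not stated) and copies the pattern unchanged; `ADAPTED` and `first_match` are hypothetical names. The unit group of the first dimension is mandatory, so input whose first number has no unit cannot match, while the separator group is optional on its own, so a trailing `x` is consumed even when no second dimension follows.

```python
import re

# The adapted pattern from the question, split into its three parts.
ADAPTED = re.compile(
    # First dimension: the unit is required.
    r"""\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)"""
    # Separator: optional by itself, even if nothing follows it.
    r"""(?:\s*x\s*|\s*by\s*)?"""
    # Second dimension: optional, and there is no provision for a third.
    r"""(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?"""
)

def first_match(sentence):
    """Return the first match of the adapted pattern, or None."""
    m = ADAPTED.search(sentence)
    return m.group(0) if m else None

for s in ["The dimensions are 8 x 10-3/5!",
          "The last but one test case is 12 yd x.",
          'Three dimensions? 22" x 17" x 12 cm']:
    print(repr(first_match(s)))
```

Tying the separator to a following dimension, for example `(?:separator dimension)*`, is one way to address both the dangling separator and the missing third dimension.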

Bacchanalia answered 8/12, 2011 at 16:26 Comment(2)
Could you please provide the input strings and the expected output? - Wellwisher
Sure. I have provided them in an easier format for you here: pastebin.com/txfJs8LX Thanks so much! - Bacchanalia

New version, close to the target: 2 tests still fail.

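#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;

# A dimension that must end in a unit: number, optional fraction, unit.
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
# Separator between two dimensions: "x" or "by".
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
# Same as $re1, but also accepts "frames" as a unit.
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;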
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
while(<DATA>) {
    chomp;
    # One to three dimensions; a separator only counts when a dimension follows it.
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in '.$_);
    }
    $i++;
}
done_testing;


__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.  
A number on its own 21.
A volume shouldn't match 0.332 oz.

Output:

#   Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
#   at C:\tests\perl\test6.pl line 42.
#   Failed test ' got "no match" in They are all 5.76 by 8 frames.'
#   at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 -  got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 -  got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 -  got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 -  got "no match" in This is a product code: c720 with another number 83 x better.  
ok 14 -  got "no match" in A number on its own 21.
ok 15 -  got "no match" in A volume shouldn't match 0.332 oz.
1..15

It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes a number has to match without a unit, and sometimes only when a unit follows it.

I'm sorry, I'm not able to do better.

Wellwisher answered 8/12, 2011 at 16:52 Comment(2)
This one matched everything, including things like 12 yd by 23.3. However, how would one improve it to avoid the following case? "12 yd x" is currently matched with your regex, but I guess it's preferable if in that case only 12 yd is matched. Thanks! - Bacchanalia
I tried to adapt your answer to some more general cases but failed. . . Updated question accordingly. - Bacchanalia

One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):

foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");

Will get your results :)

Explanation:

"
\d                # Match a single digit 0..9
   +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:               # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
      \           # Match the character “ ” literally
   |              # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      cm          # Match the characters “cm” literally
   |              # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
      \.          # Match the character “.” literally
   |              # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
      ""          # Match the character “"” literally (written "" inside a C# verbatim string)
   |              # Or match regular expression number 5 below (the entire group fails if this one fails to match)
      /           # Match the character “/” literally
)
[\d/""x -]        # Match a single character present in the list below
                  # A single digit 0..9
                  # One of the characters “/"x”
                  # The character “ ”
                  # The character “-”
   *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?:               # Match the regular expression below
   \b             # Assert position at a word boundary
   (?:            # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
         by       # Match the characters “by” literally
         \s       # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
            *     # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
         \d       # Match a single digit 0..9
            +     # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      |           # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         cm       # Match the characters “cm” literally
      |           # Or match regular expression number 3 below (the entire group fails if this one fails to match)
         yd       # Match the characters “yd” literally
   )
   \b             # Assert position at a word boundary
)?                # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
Faizabad answered 8/12, 2011 at 16:52 Comment(2)
Wow, thanks! It doesn't quite match all my imagined cases. For example, it doesn't match if the first dimension ends in mm, cm, yd etc. I think I can work out how to adapt it though. :-) - Bacchanalia
@Bacchanalia I used your examples, but you could extend it I guess :) - Faizabad
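
A note on this answer: `Regex.IsMatch` only reports whether the pattern occurs, so extracting the text needs the match itself. As a sketch, here is the same pattern ported to Python's `re` module (an assumption, since the answer uses .NET; the doubled `""` of the C# verbatim string becomes a single `"`). `PATTERN` and `extract` are hypothetical names.

```python
import re

# Same pattern as in the answer, with "" written as a single ".
PATTERN = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')

def extract(sentence):
    """Return the first match of the answer's pattern, or None."""
    m = PATTERN.search(sentence)
    # The character class also accepts spaces, so trim a trailing one.
    return m.group(0).strip() if m else None

print(extract('The roll is quite large: 42 1/2" x 60 yd.'))
print(extract("They are all 5.76 by 8 frames."))
print(extract("It is 22 mm x 17 mm."))
```

The last call yields only the leading number, which illustrates the limitation raised in the comment: a unit such as mm right after the first dimension stops the match early.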

This is the best I can do with a regular expression in Perl. Try to adapt it to your regex flavour:

\d.*\d(?:\s+\S+|\S+)

Explanation:

\d        # One digit.
.*        # Any number of characters.
\d        # One digit. Together, these three match everything from the first digit to the last digit.
\s+\S+    # One or more non-space characters after some whitespace. It tries to match any unit like 'cm' or 'yd'.
|         # Or. Select one of the two expressions between parentheses.
\S+       # One or more non-space characters. It tries to match double quotes, or units joined to the
          # last number.

My test:

Content of script.pl:

use warnings;
use strict;

# Print everything from the first digit to the last digit, plus a trailing unit.
while ( <DATA> ) {
        print qq[$1\n] if m/(\d.*\d(?:\s+\S+|\S+))/;
}

__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.

Running the script:

perl script.pl

Result:

11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
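
One caveat with this approach: because `.*` is greedy and nothing requires a unit, the pattern also fires on the negative cases from the question's update. A short sketch using Python's `re` module (an assumption, since the answer targets Perl; the regex itself is unchanged, and `GREEDY` is a hypothetical name) makes this visible:

```python
import re

# The pattern from the answer, unchanged.
GREEDY = re.compile(r"\d.*\d(?:\s+\S+|\S+)")

# Sentences that, per the question, should produce no match at all.
negatives = [
    "This is a product code: c720 with another number 83 x better.",
    "A number on its own 21.",
    "A volume shouldn't match 0.332 oz.",
]

for s in negatives:
    m = GREEDY.search(s)
    print(repr(m.group(0)) if m else None)
```

Each of these sentences produces a match, so the pattern works best when the input is already known to contain a dimension.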
Emanation answered 8/12, 2011 at 17:18 Comment(0)
