I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the following lines:
- Hello blah blah. It's around 11 1/2" x 32".
- The dimensions are 8 x 10-3/5!
- Probably somewhere in the region of 22" x 17".
- The roll is quite large: 42 1/2" x 60 yd.
- They are all 5.76 by 8 frames.
- Yeah, maybe it's around 84cm long.
- I think about 13/19".
- No, it's probably 86 cm actually.
I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:
- 11 1/2" x 32"
- 8 x 10-3/5
- 22" x 17"
- 42 1/2" x 60 yd
- 5.76 by 8
- 84cm
- 13/19"
- 86 cm
I imagine a world where the following rules apply:
- The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
- A dimension is always described numerically, may or may not have units following it, and may or may not have a fractional or decimal part. Being made up of a fractional part on its own is allowed, e.g., 4/5".
- Fractional parts always have a / separating the numerator and denominator, and one can assume there is no space between the parts (though if someone takes that into account, that's great!).
- Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
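To make the rules above concrete, here is a minimal Python sketch that builds the pattern from an arbitrary list of units. It is only one possible reading of the rules, not a definitive solution: the names `build_dimension_regex` and `extract_dimensions` are made up for illustration, and it assumes that a mixed fraction is joined to its whole part by a single space or hyphen and that unitless numbers are only accepted when joined by a separator.

```python
import re

def build_dimension_regex(units, separators=("x", "by")):
    """Compile a dimension-matching regex for an arbitrary set of units."""
    def unit_alt(u):
        # Alphabetic units must not be followed by more letters,
        # so "cm" does not match inside a longer word.
        return re.escape(u) + (r"(?![A-Za-z])" if u[-1].isalpha() else "")

    # Longest units first, so a longer unit wins over a shorter prefix of it.
    ordered = sorted(units, key=len, reverse=True)
    unit = "(?:" + "|".join(unit_alt(u) for u in ordered) + ")"
    # A bare fraction (13/19), or an integer/decimal with an optional
    # mixed fraction joined by a space or hyphen (11 1/2, 10-3/5).
    number = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"
    sep = r"\s*(?:" + "|".join(re.escape(s) for s in separators) + r")\s*"
    with_unit = number + r"\s*" + unit
    maybe_unit = number + r"(?:\s*" + unit + ")?"
    # Either two or more dimensions joined by separators (units optional),
    # or a single dimension that must carry a unit.
    body = "(?:" + maybe_unit + "(?:" + sep + maybe_unit + ")+|" + with_unit + ")"
    # The lookbehind stops a match from starting inside a word or number.
    return re.compile(r"(?<![\w./])" + body)

def extract_dimensions(text, regex):
    """Return every matched dimension string found in text."""
    return [m.group(0) for m in regex.finditer(text)]

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]
DIMENSION_RE = build_dimension_regex(UNITS)

for sentence in ['Hello blah blah. It\'s around 11 1/2" x 32".',
                 "The dimensions are 8 x 10-3/5!",
                 "Yeah, maybe it's around 84cm long."]:
    print(extract_dimensions(sentence, DIMENSION_RE))
```

Because a separator only counts when another number follows it, a trailing x or by is left out of the match, and a lone number without a unit is skipped.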
To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .
[1-9]+[/ ][x1-9]
Update (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:
- The last but one test case is 12 yd x.
- The last test case is 99 cm by.
- This sentence doesn't have dimensions in it: 342 / 5553 / 222.
- Three dimensions? 22" x 17" x 12 cm
- This is a product code: c720 with another number 83 x better.
- A number on its own 21.
- A volume shouldn't match 0.332 oz.
These should result in the following (# indicates nothing should match):
- 12 yd
- 99 cm
- #
- 22" x 17" x 12 cm
- #
- #
- #
I've adapted M42's answer below, to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
But while that resolves some new test cases it now fails to match the following others. It reports:
- 11 1/2" x 32" PASS
- (nothing) FAIL
- 22" x 17" PASS
- 42 1/2" x 60 yd PASS
- (nothing) FAIL
- 84cm PASS
- 13/19" PASS
- 86 cm PASS
- 22" PASS
- (nothing) FAIL
- (nothing) FAIL
- 12 yd x FAIL
- 99 cm by FAIL
- 22" x 17" [and also, but separately '12 cm'] FAIL
- (nothing) PASS
- (nothing) PASS
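For anyone who wants to reproduce a report like this, here is a small Python harness sketch. It runs the adapted pattern above, copied verbatim, over the Update (2) sentences and prints what was found next to what was expected. The names `CASES` and `run_report` are hypothetical, and other regex flavours may behave slightly differently from Python's `re` module.

```python
import re

# The adapted pattern from the question, copied verbatim.
ADAPTED = re.compile(
    r'''\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?'''
)

# (sentence, expected matches); an empty list means nothing should match.
CASES = [
    ("The last but one test case is 12 yd x.", ["12 yd"]),
    ("The last test case is 99 cm by.", ["99 cm"]),
    ("This sentence doesn't have dimensions in it: 342 / 5553 / 222.", []),
    ('Three dimensions? 22" x 17" x 12 cm', ['22" x 17" x 12 cm']),
    ("This is a product code: c720 with another number 83 x better.", []),
    ("A number on its own 21.", []),
    ("A volume shouldn't match 0.332 oz.", []),
]

def run_report(regex, cases):
    """Return (found, expected, passed) for every test sentence."""
    rows = []
    for text, expected in cases:
        found = [m.group(0) for m in regex.finditer(text)]
        rows.append((found, expected, found == expected))
    return rows

for found, expected, passed in run_report(ADAPTED, CASES):
    status = "PASS" if passed else "FAIL"
    print(f"{found or '(nothing)'} | expected {expected or '#'} | {status}")
```

The trailing separator ends up in the match because the separator group is optional by itself, independent of whether a second number follows it.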