regex to get measurements
I have these measurements in the document

5.3 x 2.5 cm
11 x 11 mm
7 mm 
13 x 12 x 14 mm
13x12cm

I need to extract measurements such as 5.3 x 2.5 cm with a regex in Python.

My code so far is below, but it does not work properly:

x = "\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?"
by = "( )?(by|x)( )?"
cm = "(mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "((" + x + " *(to|\-) *" + cm + ")" + "|(" + x + cm + "))"
xy_cm = "((" + x + cm + by + x + cm + ")" +"|(" + x + by + x + cm + ")" +"|(" + x + by + x + "))"
xyz_cm = "((" + x + cm + by + x + cm + by + x + cm + ")" + "|(" + x + by + x + by + x + cm + ")" + "|(" + x + by + x + by + x + "))"
m = "((" + xyz_cm + ")" + "|(" + xy_cm + ")" + "|(" + x_cm + "))"
a = re.compile(m)
print a.findall(text)

The output it gives:

[('13', '13', '13', '13', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('12', '12', '12', '12', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('4', '4', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('25', '25', '25', '25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''),
Glycerite answered 2/9, 2017 at 7:22 Comment(5)
Define "does not work properly": what does it do vs. what should it do? Examples would be most welcome. - Andrew
Please show and explain the difference between the output you get and the output you want. - Stulin
One thing you must do is get rid of the capturing groups. However, you should also check the final pattern after concatenation: it only returns numbers. - Volley
Look into non-capturing groups, (?:blabla); they might help you apply Wiktor's comment. - Stulin
Thank you very much @WiktorStribiżew, that was so fast and solved most of my problem. Ultimately my goal is to extract all measurements in the text, so I need to extract their units too. My plan was to extract everything as a string like "5.3 x 2.5 cm". 1) Does that make sense? 2) How can I do that? Any suggestions? - Glycerite

There are only two issues with the current regex:

  • You need to get rid of the capturing groups, because re.findall returns the captured substrings rather than the whole match when the pattern contains groups. This is not crucial, though: you could instead use re.finditer and read match.group(0).
  • The main issue is that you did not group the x pattern, so its number-format alternation leaked into the surrounding pattern and ruined the structure of the final regex.
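Both points can be reproduced on a much smaller pattern. The snippet below is only an illustrative sketch: the short `num` pattern and the sample strings are made up for the demonstration and are not the patterns from the question.

```python
import re

text = "5.3 x 2.5 cm"
pair = r"(\d+(?:\.\d+)?) x (\d+(?:\.\d+)?)"

# Issue 1: with capturing groups, findall returns the groups, not the match.
captured = re.findall(pair, text)
# finditer exposes the whole match through group(0), groups or not.
whole = [m.group(0) for m in re.finditer(pair, text)]

# Issue 2: an ungrouped alternation swallows whatever is concatenated to it.
# "a|b" + "c" means "a" OR "bc", not "(a or b) followed by c".
num = r"\.\d+|\d+"  # hypothetical, simplified number pattern
ungrouped = re.findall(num + r"cm", ".5cm 12cm")
grouped = re.findall(r"(?:" + num + r")cm", ".5cm 12cm")

print(captured)   # [('5.3', '2.5')]
print(whole)      # ['5.3 x 2.5']
print(ungrouped)  # ['.5', '12cm']
print(grouped)    # ['.5cm', '12cm']
```

In the ungrouped case the unit is attached only to the last alternative, which is why the original pattern ended up returning bare numbers.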

A quick fix will look like

x = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"
by = r"(?: )?(?:by|x)(?: )?"
cm = r"(?:mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "(?:" + x + r" *(?:to|\-) *" + cm + "|" + x + cm + ")"
xy_cm = "(?:" + x + cm + by + x + cm +"|" + x + by + x + cm +"|" + x + cm + by + x +"|" + x + by + x + ")"
xyz_cm = "(?:" + x + cm + by + x + cm + by + x + cm + "|" + x + by + x + by + x + cm + "|" + x + by + x + by + x + ")"
m = "{}|{}|{}".format(xyz_cm, xy_cm, x_cm)

The Python demo prints:

['5.3 x 2.5', '11 x 11', '13 x 12 x 14', '13x12cm']

To improve it further, think through every form that x, by and cm can take, and perhaps use str.format instead of concatenation.
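As a rough sketch of the str.format idea, the pieces could be composed as below. The names `number`, `sep` and `unit` and their simplified contents are assumptions for illustration: they drop the digit-count limits of the original x pattern, and they allow optional whitespace before the unit so that the unit is included in the match.

```python
import re

text = """5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm"""

# Hypothetical, simplified building blocks (not the original x/by/cm patterns).
number = r"(?:\d+(?:\.\d+)?|\.\d+)"
sep = r"\s*(?:by|x)\s*"
unit = r"(?:millimeters?|centimeters?|mm|cm)"

# One number, up to two further "x number" parts, optional space, then a unit.
# Literal braces in a format string are escaped by doubling: {{0,2}} -> {0,2}.
pattern = re.compile(
    r"{n}(?:{s}{n}){{0,2}}\s*{u}\b".format(n=number, s=sep, u=unit)
)

# No capturing groups, so findall returns the whole matched strings.
matches = pattern.findall(text)
print(matches)
# ['5.3 x 2.5 cm', '11 x 11 mm', '7 mm', '13 x 12 x 14 mm', '13x12cm']
```

Because every group is non-capturing, each result is the full measurement string, units included.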

Riba answered 2/9, 2017 at 7:50 Comment(0)

With regexes, you should always build up your expression slowly to get what you want. For example:

s = "5.3 x 2.5 cm"

You want to find the numbers here?

re.findall(r"\d+", s)

gives you all the integers:

["5", "3", "2", "5"]

OK, so what if your numbers can be floating point but don't have to be? Then you extend your expression with a non-capturing group that matches a dot and, optionally, some digits after it.

re.findall(r"\d+(?:\.\d*)?", s)

this gives you

["5.3", "2.5"]

Then you can add the multiplication sign, with an arbitrary amount of whitespace around it:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", s)

Putting the numbers in capturing groups now gives you a tuple:

[("5.3", "2.5")]

You can then go on with the units:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", s)

giving you the tuple you want:

[("5.3", "2.5", "cm")]

and so on.

If you build your regexes like this, you have a chance to see what breaks from one change to the next. Debugging a huge regex like the one you posted above is not worth attempting.

I wouldn't name my unit regex cm; that is quite confusing for anyone maintaining your code in the future. Apart from that, you need clear requirements on the number formats you want to allow. Maybe somebody will enter scientific notation, for example, and then your regexes will become very complicated.
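One way the "and so on" might continue, so that a lone measurement such as 7 mm and three-part measurements are covered too, is to repeat the "x number" part zero or more times. This is only a sketch: the names `NUMBER`, `UNIT`, `MEASUREMENT` and `parse_measurements` are made up here, and the unit list is assumed to be just cm and mm.

```python
import re

NUMBER = r"\d+(?:\.\d*)?"  # same number pattern as in the steps above
UNIT = r"cm|mm"            # assumed unit list; extend as needed

# A number, any further "x number" parts, optional whitespace, then the unit.
MEASUREMENT = re.compile(
    r"(?P<dims>{n}(?:\s*x\s*{n})*)\s*(?P<unit>{u})\b".format(n=NUMBER, u=UNIT)
)

def parse_measurements(text):
    """Return a list of (dimensions, unit) pairs found in text."""
    results = []
    for match in MEASUREMENT.finditer(text):
        # Pull the individual numbers back out of the matched dimension part.
        dims = tuple(float(d) for d in re.findall(NUMBER, match.group("dims")))
        results.append((dims, match.group("unit")))
    return results

print(parse_measurements("5.3 x 2.5 cm, 7 mm and 13 x 12 x 14 mm"))
# [((5.3, 2.5), 'cm'), ((7.0,), 'mm'), ((13.0, 12.0, 14.0), 'mm')]
```

Using finditer with named groups avoids the findall tuple problem entirely, and the number pattern lives in one place, so it can be tightened later without touching the rest.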

Contentious answered 2/9, 2017 at 7:42 Comment(4)
Thanks, it solved my problem! Also, extra thanks for your detailed explanation! - Glycerite
The only thing it does not find is a single measurement (7 mm), but I will figure it out. - Glycerite
@Glycerite I think that can be left as an exercise to the reader ;-). - Contentious
But still, I could not make it work :) - Glycerite
