Python Regex for hyphenated words

1

16

I'm looking for a regex to match hyphenated words in Python.

The closest I've managed to get is: '\w+-\w+[-\w+]*'

import re
text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*', text)

which returns the list ['one-hundered-and-three-', 'foo-bar'].

This is almost perfect except for the trailing hyphen after 'three'. I only want the additional hyphen if it is followed by a 'word', i.e. instead of '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I want something that matches: word, followed by hyphen, followed by word, followed by hyphen-plus-word zero or more times.

Holocrine answered 5/12, 2011 at 9:28 Comment(3)
I don't know what you plan to use this for, but have you considered cases where a trailing or prefixed hyphen is valid, like "nineteenth- and twentieth-century" or "investor-owned and -operated"? - Shaftesbury
The main problem in your own expression is the square brackets. They don't group the content together; they create a character class, which is something completely different. - Pellicle
Thanks for the input, lazyr. I have considered the cases you point out, and they will not pose a problem. Thanks for the clarification, stema. I realised that the square brackets did not group the content, but they resulted in the closest match for what I was attempting to do. - Holocrine
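
To illustrate the two points raised above, here is a minimal sketch using the sample text from the question. It shows how '[-\w+]*' behaves as a character class, and why re.findall with the capturing group '(-\w+)*' returns only the group's content rather than the whole word. The variable names are made up for illustration.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# [-\w+] is a character class: any single '-', word character, or literal '+'.
# Repeating it with * therefore also swallows a trailing hyphen.
char_class = re.findall(r'\w+-\w+[-\w+]*', text)

# With one capturing group, findall returns the group's content, not the whole match.
# A repeated group keeps only its last repetition ('' if it matched zero times).
capturing = re.findall(r'\w+-\w+(-\w+)*', text)

print(char_class)  # ['one-hundered-and-three-', 'foo-bar']
print(capturing)   # ['-three', '']
```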
31

Try this:

re.findall(r'\w+(?:-\w+)+', text)

Here we consider a hyphenated word to be:

  • a number of word chars
  • followed by one or more repetitions of:
    • a single hyphen
    • followed by word chars
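
As a rough check, the pattern can be run against the sample text from the question. This is only a sketch under the question's assumptions: "word chars" here means whatever \w matches, which in Python also includes digits and underscores.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# \w+        : one or more word characters
# (?:-\w+)+  : one or more repetitions of "hyphen followed by word characters";
#              (?:...) is non-capturing, so findall returns the whole match
hyphenated = re.findall(r'\w+(?:-\w+)+', text)

print(hyphenated)  # ['one-hundered-and-three', 'foo-bar']
```

The trailing hyphen after 'three' is dropped because a hyphen is only consumed when word characters follow it, and 'some--text' is skipped because neither hyphen sits directly between two words.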
Cassiterite answered 5/12, 2011 at 9:39 Comment(0)
