Regular expression to match hyphenated words (kebab-case)

Asked 2/9, 2011 at 6:52 Answered 2/9, 2011 at 18:0

php regex string text-extraction kebab-case

How can I extract hyphenated strings from this string line?

ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service

I just want to extract "ADW-CFS-WE" from it, but I have been unsuccessful for the past few hours. I'm stuck with this simple regex "(.*)", which selects the whole string shown above.

Finger answered 2/9, 2011 at 6:52 Comment(0)

You can probably use:

preg_match("/\w+(-\w+)+/", ...)

The \w+ will match one or more consecutive characters which may be letters, numbers or underscores (one word). The group ( ) that follows will match one or more repetitions of: a hyphen followed by a sequence of one or more characters which may be letters, numbers or underscores.

The trick with regular expressions is often specificity. Using .* will often match too much.
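
As a minimal sketch of that difference, the snippet below runs both patterns against the line from the question. The variable names are illustrative, and the group is written as non-capturing (?: ) only because the captured part is not needed here.

```php
<?php
// Sample line from the question.
$line = 'ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service';

// \w+ matches one word; (?:-\w+)+ requires at least one "-word" after it.
if (preg_match('/\w+(?:-\w+)+/', $line, $matches) === 1) {
    echo $matches[0], "\n"; // ADW-CFS-WE
}

// The greedy (.*) from the question swallows the whole line instead.
preg_match('/(.*)/', $line, $greedy);
echo $greedy[1], "\n"; // prints the entire line
```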

Mythologize answered 2/9, 2011 at 6:59 Comment(0)

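$input = "ADW-CFS-WE X-Y CI SLA Def No SLANAME CI Max Outage Service";
// Find every run of uppercase letters that contains at least one hyphen.
preg_match_all('/[A-Z]+-[A-Z-]+/', $input, $matches);
foreach ($matches[0] as $m) {
  echo $m . "\n"; // print each match, not the $matches array itself
}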

Note that this solution assumes that only uppercase A-Z can match. If that's not the case, insert the correct character class. For example, if you want to allow arbitrary letters (like a and Ä), replace [A-Z] with \p{L} and add the u modifier so the pattern and input are treated as UTF-8.
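
A rough sketch of the \p{L} variant, assuming the script file and the input are UTF-8 encoded. The sample input with Ä and Ç is made up for illustration and is not from the question.

```php
<?php
// Hypothetical input containing non-ASCII letters (assumed UTF-8).
$input = 'ÄB-ÇD ADW-CFS-WE CI SLA';

// Same pattern as above with [A-Z] swapped for \p{L} (any Unicode letter).
// The u modifier makes PCRE treat pattern and subject as UTF-8.
preg_match_all('/\p{L}+-[\p{L}-]+/u', $input, $matches);

foreach ($matches[0] as $m) {
    echo $m, "\n"; // ÄB-ÇD, then ADW-CFS-WE
}
```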

Brantbrantford answered 2/9, 2011 at 6:57 Comment(10)

Don’t write [A-Z] when \p{Lu} is available. – Aniseed 2/9, 2011 at 18:5

@Aniseed I assumed his IDs are all upper-case latin characters and that he doesn't want to match a ℝ-äe. In general, I agree, but in this case, I think [A-Z] is perfectly adequate. – Brantbrantford 2/9, 2011 at 20:13

The problem is that \p{Lu} is safe no matter what the character set, but [A-Z] breaks on most of them. – Aniseed 2/9, 2011 at 20:31

@Aniseed Could you elaborate? I was under the impression that preg_match(/[a-Z]/, $_POST['input']) matches a user input of A if the whole page uses, say, UTF-8. – Brantbrantford 2/9, 2011 at 20:37

Oh gosh. False positive: ^, _, etc. False negative: Å, É, Æ, Ñ, etc. – Aniseed 2/9, 2011 at 20:40

@Aniseed Not a single one of these inputs matches /[A-Z]/, so I'm not sure how there could be any false positives. And as far as I can test, Å does not match, which is quite intentional and may be useful if the match should capture only latin characters (for, say, an airport code). – Brantbrantford 2/9, 2011 at 20:52

Å is LATIN CAPITAL LETTER A WITH RING ABOVE. That makes it a Latin letter, you know. And you wrote [a-Z] in your comment, which I took for [A-z], which is the false positives. – Aniseed 2/9, 2011 at 22:14

@Aniseed Sorry, let me rephrase that: What if I want to match only ABCD..Z (like international airport codes) and am using [A-Z] (sic, that lower-case was a typo). Then I'm good, am I not? – Brantbrantford 2/9, 2011 at 22:20

Yes, then you are. Best to put a comment in the code about that. I still might use the intersection of \p{ASCII} and \p{Lu} myself to say "an ASCII upper case letter". It’s how I’ve come to think of patterns and data these days. – Aniseed 2/9, 2011 at 22:42

@Aniseed Thanks! Added a note to the answer mentioning ways how to match other characters than A-Z. – Brantbrantford 2/9, 2011 at 23:9
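
The false positives and false negatives debated in these comments can be checked with a short sketch. It assumes UTF-8 input, and the lookahead form is only one possible way to write "an ASCII upper-case letter" in PCRE, not necessarily what the commenter had in mind.

```php
<?php
// [A-z] spans 0x41..0x7A, which includes [ \ ] ^ _ ` between Z and a.
var_dump(preg_match('/^[A-z]$/', '^'));   // int(1): a false positive
var_dump(preg_match('/^[A-Z]$/', '^'));   // int(0)

// [A-Z] does not match Å; \p{Lu} with the u modifier does.
var_dump(preg_match('/^[A-Z]$/u', 'Å'));  // int(0)
var_dump(preg_match('/^\p{Lu}$/u', 'Å')); // int(1)

// Intersection of "upper-case letter" and "ASCII": a lookahead for \p{Lu}
// followed by a single character from the ASCII range.
$asciiUpper = '/^(?=\p{Lu})[\x00-\x7F]$/u';
var_dump(preg_match($asciiUpper, 'A'));   // int(1)
var_dump(preg_match($asciiUpper, 'Å'));   // int(0)
```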

Just match every whitespace-free word (built from [^\s]) that contains at least one '-'.

The following expression will do it:

<?php

$z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service";

$r = preg_match('#([^\s]*-[^\s]*)#', $z, $matches);
var_dump($matches);
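
One caveat worth sketching: because both sides use *, the pattern above also accepts a hyphen standing on its own. A possible stricter variant, offered here as an assumption about what is wanted rather than as part of the original answer, requires at least one non-space character on each side:

```php
<?php
$pattern = '#\S+-\S+#'; // \S is shorthand for [^\s]

preg_match($pattern, 'ADW-CFS-WE CI SLA Def', $m);
echo $m[0], "\n"; // ADW-CFS-WE

// The original [^\s]*-[^\s]* matches a lone hyphen; the stricter form does not.
var_dump(preg_match('#[^\s]*-[^\s]*#', 'a - b')); // int(1)
var_dump(preg_match($pattern, 'a - b'));          // int(0)
```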

Qualls answered 2/9, 2011 at 6:57 Comment(0)

The following pattern assumes the data is at the beginning of the string, contains only capitalized letters and may contain a hyphen before each group of one or more of those letters:

    <?php
    $str = 'ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service';
    if (preg_match('/^(?:-?[A-Z]+)+/', $str, $matches) === 1) // 1 = matched; 0 (no match) would also pass "!== false"
        var_dump($matches);

Result:

    array(1) {
      [0]=>
      string(10) "ADW-CFS-WE"
    }

Mayfly answered 2/9, 2011 at 18:0 Comment(2)

[A-Z] is always wrong. Even if it works on this dataset, it fails on 99+% of Unicode. Get into the habit of matching \p{Lu} instead. Or, if you have a real regex language, use \p{upper}, which matches more than \p{Lu} does. – Aniseed 2/9, 2011 at 18:4

To say that [A-Z] is always wrong is always wrong. It's only wrong when it's wrong. Sometimes you only want ASCII, and if you do, this regex is just fine. – Upsydaisy 29/6, 2015 at 19:13
