How do I handle contractions with regex word boundaries in javascript

I have a nodejs script that reads in a file and counts word frequencies. I currently feed each line into a function:

function getWords(line) {
    return line.match(/\b\w+\b/g);
}

This matches almost everything, except it misses contractions

getWords("I'm") -> {"I", "m"}

However, I cannot just include apostrophes, as I would want matched apostrophes to be word boundaries:

getWords("hey'there'") -> {"hey", "there"}

Is there a way to capture contractions while still treating other apostrophes as word boundaries?

Misgive answered 31/12, 2014 at 2:59 Comment(8)
How can you tell that hey'there' should be split but I'm should not? Sounds like this might require a dictionary?Reisfield
will "hey'there'" really appear like that, or will it have a space like "hey 'there'"?Mckinnon
What if the input is "I'm Steve O'Conner's 'friend'"? How would you know that O'Conner's is actually one word, not three? Or what if the matched apostrophes you mention contain a contraction with another apostrophe?Euterpe
@Euterpe my answer below seems to cover that case but it could use more testingMckinnon
It will PROBABLY also have a space, but I can't really guarantee it. The script just accepts random files, be they source files, text files, HTML, or whatever, and counts word frequencies. I need "I'm" to be considered a single word, while HTML attributes and code-like syntax continue to treat ' as a word boundary.Misgive
using just a regex, I believe you'll have to settle for hey'there' being considered the contraction hey'there if no space is provided to differentiate it. You could use a dictionary of known contractions as @Aaron Dufour alluded to. But that seems a bit much for the general use you seem to haveMckinnon
My question is, for the record, neither a joke nor rhetorical. You're going to have a hard time getting the answer you want unless you provide actual criteria for making the determination. @DelightedD0D's answer is good, but it drops the apostrophe from words like "'twas" and "'ow", which are also contractions, and it's not clear whether that's important to you.Reisfield
Ah, I didn't think of leading apostrophes. The answer below is what I'm now running with in lieu of a solution that could accommodate them; could you provide an alternative that would capture 'twas?Misgive

The closest I believe you could get with regex would be line.match(/(?!'.*')\b[\w']+\b/g) but be aware that if there is no space between a word and a ', it will get treated as a contraction.

As Aaron Dufour mentioned, there would be no way for the regex by itself to know that I'm is a contraction but hey'there isn't.
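For reference, here is a rough sketch of how that regex behaves on the inputs discussed in this thread. The wrapper name getWordsLookahead and the `|| []` fallback are my own additions for illustration, not part of the original script.

```javascript
// Hypothetical wrapper around the regex above (the name is an assumption).
function getWordsLookahead(line) {
    // (?!'.*')   fails at any position where an apostrophe is followed
    //            later on the same line by another apostrophe
    // \b[\w']+\b a run of word characters and apostrophes that starts
    //            and ends on a word boundary
    // String.prototype.match returns null when nothing matches,
    // so fall back to an empty array.
    return line.match(/(?!'.*')\b[\w']+\b/g) || [];
}

console.log(getWordsLookahead("I'm"));         // ["I'm"]
console.log(getWordsLookahead("hey 'there'")); // ["hey", "there"]
console.log(getWordsLookahead("hey'there'"));  // ["hey'there"]  (no space, so treated as a contraction)
console.log(getWordsLookahead("'twas"));       // ["twas"]       (leading apostrophe is dropped)
console.log(getWordsLookahead("I'm Steve O'Conner's 'friend'"));
// ["I'm", "Steve", "O'Conner's", "friend"]
```

A trailing apostrophe is never kept because the match has to end on \b, and there is no word boundary between an apostrophe and whitespace or the end of the line.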

See below:

[Screenshot of the regex being tested against sample input]

Mckinnon answered 31/12, 2014 at 3:45 Comment(4)
Thank you, I'm using this for now. Do note that it wasn't hey'there, it was hey'there'. I thought it could be determined by matching pairs of ', but your example of O'Conner leads me to consider double-apostrophe words that I would want matched, like O'Hares' and O'Conner's. Thanks!Misgive
Glad it helped. What I meant was just that this regex will match hey'there' as hey'there because there is no space between the y and the 'Mckinnon
Got it. What's the screenshot from?Misgive
It's regexbuddy, regexbuddy.com. It costs $40 but well worth it in my opinion.Mckinnon

You can match letters and a possible apostrophe followed by letters.

line.match(/[A-Za-z]+('[A-Za-z]+)?/g)
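
As a quick sketch of what this pattern does and does not catch (the function names and the `|| []` fallback are assumptions added for illustration): it only allows one apostrophe per word, and it ignores digits and underscores, unlike \w.

```javascript
// Hypothetical wrapper around the pattern above (the name is an assumption).
function getWordsLetters(line) {
    // One or more letters, optionally followed by a single
    // apostrophe-plus-letters group. match returns null on no match.
    return line.match(/[A-Za-z]+('[A-Za-z]+)?/g) || [];
}

// A possible variation, not from the answer: repeat the apostrophe group
// so that words with two apostrophes stay in one piece.
function getWordsLettersMulti(line) {
    return line.match(/[A-Za-z]+(?:'[A-Za-z]+)*/g) || [];
}

console.log(getWordsLetters("I'm"));              // ["I'm"]
console.log(getWordsLetters("hey 'there'"));      // ["hey", "there"]
console.log(getWordsLetters("hey'there'"));       // ["hey'there"]
console.log(getWordsLetters("O'Conner's"));       // ["O'Conner", "s"]
console.log(getWordsLettersMulti("O'Conner's"));  // ["O'Conner's"]
```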
Reynaud answered 31/12, 2014 at 6:15 Comment(0)
