I have a Node.js script that reads a file and counts word frequencies. I currently feed each line into this function:
function getWords(line) {
  // match() returns null when nothing matches, so fall back to an empty array
  return line.match(/\b\w+\b/g) || [];
}
This matches almost everything, but it splits contractions at the apostrophe:
getWords("I'm") -> {"I", "m"}
However, I cannot just include apostrophes, as I would want matched apostrophes to be word boundaries:
getWords("hey'there'") -> {"hey", "there"}
Is there a way to capture contractions while still treating other apostrophes as word boundaries?
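One possible starting point, offered as a sketch rather than a full answer: treat an apostrophe as part of a word only when it has word characters on both sides. The function name getWordsWithContractions is hypothetical and not part of the original script. Note that this keeps "I'm" whole and drops quoting apostrophes around a word, but it cannot tell hey'there' apart from a contraction, so that input comes back as the single token hey'there.

```javascript
// Hypothetical variant of getWords: an apostrophe is kept only when it sits
// between two runs of word characters.
function getWordsWithContractions(line) {
  // \w+        a run of word characters
  // (?:'\w+)*  optionally followed by apostrophe + more word characters,
  //            repeated (covers I'm, O'Conner's)
  return line.match(/\w+(?:'\w+)*/g) || [];
}

console.log(getWordsWithContractions("I'm"));           // one token: I'm
console.log(getWordsWithContractions("'hey' 'there'")); // two tokens: hey, there
console.log(getWordsWithContractions("hey'there'"));    // one token: hey'there
```

Leading apostrophes are never captured, so a word such as 'twas would come back as twas.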
I'm should be split but hey'there' should not? Sounds like this might require a dictionary? – Reisfield
"I'm Steve O'Conner's 'friend'"? How would you know that O'Conner's is actually one word, not three? Or what if the matched apostrophes you mention contain a contraction with another apostrophe? – Euterpe
hey'there' being considered the contraction hey'there if no space is provided to differentiate it. You could use a dictionary of known contractions as @Aaron Dufour alluded to. But that seems a bit much for the general use you seem to have – Mckinnon
'twas? – Misgive
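The dictionary idea raised in the comments could look roughly like the sketch below. Everything named here is an assumption for illustration: KNOWN_CONTRACTIONS is a deliberately tiny, hypothetical list, and getWordsWithDictionary is not part of the original script. Any chunk that is not a listed contraction is split at every apostrophe, so names such as O'Conner's would be broken up unless they are added to the list.

```javascript
// Hypothetical, deliberately tiny list; a real one would be much longer.
const KNOWN_CONTRACTIONS = new Set(["i'm", "don't", "it's", "'twas"]);

function getWordsWithDictionary(line) {
  // First pass: grab runs of word characters and apostrophes together.
  const chunks = line.match(/[\w']+/g) || [];
  const words = [];
  for (const chunk of chunks) {
    if (KNOWN_CONTRACTIONS.has(chunk.toLowerCase())) {
      // Known contraction: keep it as a single word.
      words.push(chunk);
    } else {
      // Otherwise every apostrophe acts as a word boundary;
      // filter(Boolean) drops the empty strings split() leaves behind.
      words.push(...chunk.split("'").filter(Boolean));
    }
  }
  return words;
}

console.log(getWordsWithDictionary("I'm"));        // one token: I'm
console.log(getWordsWithDictionary("hey'there'")); // two tokens: hey, there
console.log(getWordsWithDictionary("'twas"));      // one token: 'twas
```

A contraction wrapped in quoting apostrophes (for example 'I'm') is not in the list as written, so this sketch would still split it.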