Removing Hebrew "niqqud" using R

I have been struggling to remove niqqud (diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"

I cannot find an effective way to remove the signs below the letters.

I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", ""), and I tried gsub('[:punct:]','',sample1).

No success... :-( Any ideas?

Barbate answered 17/9, 2015 at 18:35 Comment(2)
Have a look at my gsub example, does it work for you? - Nakesha
@stribizhev - thank you very much! It worked like a charm. - Barbate

You can use the \p{M} Unicode category to match diacritics with Perl-like regex, and gsub all of them in one go like this:

sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl = TRUE)

Result: [1] "הסמק"
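For reuse, the same call can be wrapped in a small function. This is only a sketch: `strip_marks` is a made-up name, and the word from the question is spelled with `\u` escapes so the example does not depend on the encoding of the script file. It assumes R's PCRE build has Unicode property support, which is normally the case.

```r
# Hypothetical helper (not part of the original answer):
# remove every character in the Unicode "Mark" category.
strip_marks <- function(x) {
  gsub("\\p{M}", "", x, perl = TRUE)
}

# The word from the question: he, qubuts, samekh, sheva, mem, patah, qof.
pointed <- "\u05D4\u05BB\u05E1\u05B0\u05DE\u05B7\u05E7"
plain   <- strip_marks(pointed)

nchar(pointed)  # number of code points before stripping
nchar(plain)    # number of code points after stripping

# gsub() is vectorised, so a character vector works as well;
# strings without marks come back unchanged.
strip_marks(c(pointed, "abc"))
```

Because `gsub` is vectorised, the helper can be applied directly to a data frame column.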


\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
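Since \p{M} is not specific to Hebrew, it is worth knowing what else it touches. The sketch below (variable names are illustrative only) shows that a combining accent stored as its own code point is removed, while a precomposed accented letter is a single letter code point and is left alone.

```r
# "e" followed by U+0301 COMBINING ACUTE ACCENT (two code points)
decomposed <- "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (one code point, category Ll)
precomposed <- "\u00E9"

# The combining accent is a mark, so it is stripped.
stripped_decomposed <- gsub("\\p{M}", "", decomposed, perl = TRUE)

# The precomposed letter contains no separate mark, so nothing matches.
stripped_precomposed <- gsub("\\p{M}", "", precomposed, perl = TRUE)

nchar(stripped_decomposed)
nchar(stripped_precomposed)
```

For pointed Hebrew text this distinction rarely matters, because the vowel points are stored as separate combining characters, as in the example from the question.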

See more at Regular-Expressions.info, "Unicode Categories".

Natika answered 17/9, 2015 at 19:50 Comment(0)
