Removing Hebrew "niqqud" using R


Asked 17/9, 2015 at 18:35 Answered 17/9, 2015 at 19:50

I have been struggling to remove niqqud (diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"

And I cannot find an effective way to remove the signs below the letters.

I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", ""), and I tried gsub('[:punct:]','',sample1).

No success... :-( Any ideas?

Barbate answered 17/9, 2015 at 18:35 Comment(2)

Have a look at my gsub example, does it work for you? – Nakesha 17/9, 2015 at 19:24

@stribizhev - thank you very much! it worked like a charm – Barbate 17/9, 2015 at 19:49

You can use the \p{M} Unicode category to match diacritics with Perl-like regex, and gsub all of them in one go like this:

sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl = TRUE)

Result: [1] "הסמק"
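To check that only the marks were dropped, the call can be wrapped in a small helper and the code points counted before and after. This is a minimal sketch: `strip_marks` is a hypothetical name, not something from base R, and the sample word is written with `\u` escapes on the assumption that this avoids depending on the script's file encoding.

```r
# Hypothetical helper (name is an assumption): remove every Unicode
# combining mark (general category M) from a character vector.
strip_marks <- function(x) {
  gsub("\\p{M}", "", x, perl = TRUE)
}

# Same word as sample1, spelled with escapes:
# he + qubuts, samekh + sheva, mem + patah, qof.
sample1 <- "\u05D4\u05BB\u05E1\u05B0\u05DE\u05B7\u05E7"

stripped <- strip_marks(sample1)

nchar(sample1)   # 7 code points: 4 letters + 3 marks
nchar(stripped)  # 4 code points: letters only
```

Because `gsub` is vectorised, the same helper should work unchanged on a whole column of strings.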


\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

See more at Regular-Expressions.info, "Unicode Categories".
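One consequence of matching by category is that \p{M} only removes marks stored as separate code points. The sketch below illustrates this with the letter shin: the decomposed form (letter followed by a combining dot) loses its dot, while the precomposed presentation form U+FB2A is a single letter code point and is left alone. The variable names are assumptions chosen for illustration.

```r
# Shin followed by U+05C1 (combining shin dot): two code points.
decomposed <- "\u05E9\u05C1"

# U+FB2A, HEBREW LETTER SHIN WITH SHIN DOT: one precomposed code point.
precomposed <- "\uFB2A"

res_decomposed  <- gsub("\\p{M}", "", decomposed, perl = TRUE)
res_precomposed <- gsub("\\p{M}", "", precomposed, perl = TRUE)

nchar(res_decomposed)                      # 1: only the bare shin remains
identical(res_precomposed, precomposed)    # TRUE: nothing was matched
```

If the input may contain such presentation forms, normalizing to NFD first (for example with `stri_trans_nfd` from the stringi package) would be one way to expose the marks before stripping them.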

Natika answered 17/9, 2015 at 19:50 Comment(0)
