Remove Hebrew vowels (nikkud) from selected Unicode Hebrew text
P

5

7

I want to select a string of Unicode Hebrew text in a Word document and remove the Hebrew vowels (aka nikkud) without changing anything else.

I need to remove Unicode characters in a given range from the selected text. The Unicode characters I want to remove are U+0591-U+05BD, U+05BF-U+05C2, and U+05C4-U+05C7.

I found a way to remove the Hebrew vowels from a Unicode text string using the REGEXREPLACE function in Google Sheets (thank you GitHub). E.g.:

=REGEXREPLACE(B1,"[\x{0591}-\x{05BD}\x{05BF}-\x{05C2}\x{05C4}-\x{05C7}]","")

where cell B1 contains the original Hebrew text with vowels, and the function outputs the identical text with the vowels removed. The Unicode range used there permits me to leave two characters that need to remain (U+05BE and U+05C3).
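
As a cross-check outside Sheets, the same three ranges can be written as one character class in Python. This is only a sketch: the helper name `strip_nikkud` is made up for illustration, and the sample is simply a vocalized phrase spelled out with escapes. It shows that U+05BE and U+05C3 fall in the gaps between the ranges and therefore survive.

```python
import re

# The three ranges from the question; U+05BE and U+05C3 are deliberately left out.
NIKKUD = re.compile(r"[\u0591-\u05BD\u05BF-\u05C2\u05C4-\u05C7]")


def strip_nikkud(text: str) -> str:
    """Return text with the listed Hebrew marks removed (hypothetical helper)."""
    return NIKKUD.sub("", text)


# A vocalized sample phrase, written with escapes so the marks are visible in source.
sample = "\u05D0\u05B8\u05DE\u05B7\u05E8 \u05D9\u05B0\u05D4\u05D5\u05B8\u05D4"
print(strip_nikkud(sample))
```

Anything outside the three ranges, including English text and punctuation, passes through unchanged.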

Using that method, I can copy a Hebrew text string, e.g., אָמַר יְהוָה, paste it into my Google Sheet, and then copy the output, אמר יהוה, and paste it over the original text. This is much slower than a macro in Word would be (there are hundreds of these Hebrew text strings that need to be fixed). The majority of the document is in English, with snippets of Hebrew, so I don't need a solution for converting a whole document.

A bit of searching suggests to me that a similar RegEx replace function exists for Word VBA, but I don't have sufficient programming knowledge to adapt this to my own needs.

Pennyweight answered 13/6, 2018 at 2:3 Comment(3)
Word's tool is called Find/Replace, use Ctrl+H to bring up the dialog. Using wildcards you can create search conditions in a similar manner to Regex, but complex searches are also possible without wildcards. You'll find a lot of information in an Internet search on how to use Find. I recommend you ask this in Super User or Microsoft Answers as you may well be able to do this without needing VBA (or simply record a macro so that you can re-use the search criteria). You only need VBA for Find/Replace when the Find result needs to be manipulated in a way that Replace can't handle.Kristalkristan
Thanks Cindy! This was very helpful. I read a bunch of different things and figured out a way to make it work with Find/Replace, as you suggested, then recorded and tweaked the macro. I didn't originally think it would work, since Find/Replace doesn't allow Unicode codes in wildcard searches, but I just needed to use the actual characters to specify the range.Pennyweight
Glad you got it to work :-)Kristalkristan
P
1

Thanks, everyone. Building on several of these suggestions, I put together the following macro, which seems to be working perfectly. There may be a more elegant way to write this (wp78de's macro seems more consolidated, but it didn't work for me).

Sub HebrewDevocalizer()
    ' Pass 1: delete U+0591-U+05BD (ChrW takes the decimal values 1425-1469).
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "[" & ChrW(1425) & "-" & ChrW(1469) & "]"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindStop   ' stop at the end of the search range instead of wrapping
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchKashida = False
        .MatchDiacritics = False
        .MatchAlefHamza = False
        .MatchControl = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True   ' required for the [x-y] range syntax
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    ' Pass 2: delete U+05BF-U+05C2 (decimal 1471-1474); U+05BE is skipped.
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "[" & ChrW(1471) & "-" & ChrW(1474) & "]"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchKashida = False
        .MatchDiacritics = False
        .MatchAlefHamza = False
        .MatchControl = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    ' Pass 3: delete U+05C4-U+05C7 (decimal 1476-1479); U+05C3 is skipped.
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "[" & ChrW(1476) & "-" & ChrW(1479) & "]"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchKashida = False
        .MatchDiacritics = False
        .MatchAlefHamza = False
        .MatchControl = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub
Pennyweight answered 15/6, 2018 at 2:7 Comment(3)
Nice, however, this is more or less what I have suggested decomposed into three functions. meta.#307820Cowper
wp78de: right, as I mentioned, it does look very similar, and seeing yours was helpful- in particular it clued me in to the decimal form for the Unicode chars. But when I tried your macro verbatim, it generated an error. So I went ahead and figured out the find/replace stuff in the UI, then recorded a new macro. Also, I didn’t need a macro that would process the whole document (which I believe yours does, correct?), but just the selected text.Pennyweight
There are more elegant ways to do it - as shown. The funny thing is your modified version does not work for me. Maybe a difference between localized versions.Cowper
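
The decimal arguments passed to ChrW in the macro are the same code points the question lists in hexadecimal. A quick Python check, offered as a sketch (the variable names are illustrative only), makes the correspondence explicit and also builds the bracketed range strings that the wildcard search receives:

```python
# (start, end) pairs exactly as they appear in the ChrW calls of the macro.
DECIMAL_RANGES = [(1425, 1469), (1471, 1474), (1476, 1479)]

for start, end in DECIMAL_RANGES:
    # Format each code point the way the question writes it, e.g. U+0591.
    print(f"ChrW({start})-ChrW({end}) -> U+{start:04X}-U+{end:04X}")

# Python equivalent of "[" & ChrW(a) & "-" & ChrW(b) & "]" for each range.
wildcard_patterns = [f"[{chr(a)}-{chr(b)}]" for a, b in DECIMAL_RANGES]
```

Each pattern is five characters long: the two brackets, the two literal Hebrew marks, and the hyphen between them. This matches the remark above that the wildcard search needs the actual characters rather than their codes.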
K
4

You can install Notepad++, paste in your whole input, and run a find-and-replace in regular expression mode with this pattern (leave the replacement empty):

[\x{0591}-\x{05BD}\x{05BF}-\x{05C2}\x{05C4}-\x{05C7}]

Before:

[Screenshot: the text before the replacement]

After:

[Screenshot: the text after the replacement]

Then you can automate the copy/paste operation, using AutoHotkey for example.

If you want to keep the formatting information, this is not a problem either.

Just do the following operations:

  • Save your file as a Word XML Document (Save As > Save as type: Word XML Document (*.xml)).
  • Make a copy of this file and open it with Notepad++ (you have to either work on a copy or close Word, otherwise you cannot open the file in write mode).
  • Apply the find-and-replace described at the beginning of this answer and save the file.
  • Reopen the file with Word and save it as .docx, for example.
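
The find-and-replace step on the saved XML file could also be scripted. The following is a minimal Python sketch, not part of the original answer: the function name is hypothetical, and it assumes the XML file is UTF-8 encoded and stores the marks as literal characters rather than numeric character references.

```python
import re
from pathlib import Path

# Same character class as the Notepad++ pattern above.
NIKKUD = re.compile(r"[\u0591-\u05BD\u05BF-\u05C2\u05C4-\u05C7]")


def strip_nikkud_in_file(src: Path, dst: Path, encoding: str = "utf-8") -> int:
    """Write a copy of src to dst with the marks removed; return the number removed."""
    # Work on bytes so that line endings are preserved exactly.
    text = src.read_bytes().decode(encoding)
    cleaned, count = NIKKUD.subn("", text)
    dst.write_bytes(cleaned.encode(encoding))
    return count
```

It would be called as, say, `strip_nikkud_in_file(Path("copy.xml"), Path("clean.xml"))`, where both file names are placeholders. Writing to a separate file keeps the original available in case the result does not reopen cleanly in Word.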
Knawel answered 13/6, 2018 at 3:17 Comment(4)
The problem with anything like this is that all Word-specific content (formatting) is lost. RegEx can only be used with Word when plain text is involved. For anything else, Word's internal tools need to be used.Kristalkristan
@CindyMeister: Thank you for your comment, but I do not completely agree with you! It is true that with just a copy/paste operation you will lose the formatting information, but if you do this directly on a Word XML document, where the formatting information is also included in the file, there will not be any problem. Have a look at my edit ;-)Knawel
Ah, OK. I wasn't aware that this file save option created an XML file in the OPC flat file format. I had always assumed this was the old 2003 XML file format, but I see that's a separate option. Yes, I agree and good you put the edit in there :-)Kristalkristan
This approach looks helpful, but in my case, I really wanted to stay in Word both for my workflow and because there are exceptional cases where I wouldn't want to remove the vowels. Thanks for the idea.Pennyweight
C
2

You can try this Macro. Be warned, it's very slow on my end:

Sub RemoveHebrewVowels()
    Dim Word As Range
    Dim WildcardsPattern As Variant
    Dim WildcardCollection(2) As String   ' three patterns, indexes 0 to 2
    ' {1;} means "one or more". The separator follows the system list separator,
    ' so use {1,} where that separator is a comma.
    Rem [\x{0591}-\x{05BD}]
    WildcardCollection(0) = "[" & ChrW(1425) & "-" & ChrW(1469) & "]{1;}"
    Rem [\x{05BF}-\x{05C2}]
    WildcardCollection(1) = "[" & ChrW(1471) & "-" & ChrW(1474) & "]{1;}"
    Rem [\x{05C4}-\x{05C7}]
    WildcardCollection(2) = "[" & ChrW(1476) & "-" & ChrW(1479) & "]{1;}"
    'Options.DefaultHighlightColorIndex = wdYellow
    'Clear existing formatting and settings in Find
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    'Selection.Find.Replacement.Highlight = True
    'Cycle through document and find wildcards patterns, replace when found
    'Note: with wdFindContinue and wdReplaceAll each pass already covers the
    'document, so repeating it once per word is the likely cause of the slowness.
    For Each Word In ActiveDocument.Words
        For Each WildcardsPattern In WildcardCollection
            With Selection.Find
                .Text = WildcardsPattern
                .Replacement.Text = ""
                .Forward = True
                .Wrap = wdFindContinue
                .Format = False
                .MatchCase = False
                .MatchWholeWord = False
                .MatchWildcards = True
                .MatchSoundsLike = False
                .MatchAllWordForms = False
            End With
            Selection.Find.Execute Replace:=wdReplaceAll
        Next
    Next
End Sub
Cowper answered 13/6, 2018 at 6:54 Comment(5)
Thanks- it didn't work for me, but it gave me an idea for what I needed to do. In my answer below, I have the macro run three separate find/replace commands. Yours seems more consolidated, but I got an error. Also, I wanted to be able to run the macro just on a selection, not the whole doc, so I changed the "wrap" setting.Pennyweight
@JonathanPotter You are welcome. It would be appreciated if the effort put in could still be rewarded.Cowper
@Allan: thanks. Yeah, I was planning to come back, but just wanted to see if anyone would point out any major problems with my solution.Pennyweight
@Jonathan Potter: Ok I see :) then let's wait for a couple of days before closing it!Knawel
@JonathanPotter: Could you close this case if you do not have any answer before the end of the month? Thank you!Knawel
V
1

If you need to do this in a script (Python 3), you can use:

import re
# Remove the three ranges from the question; U+05BE and U+05C3 are kept.
stripped = re.sub(r'[\u0591-\u05BD\u05BF-\u05C2\u05C4-\u05C7]', '', 'אֱלֹהִים')
print(stripped)
Valora answered 13/1, 2020 at 20:16 Comment(0)
M
0

BS"D

Use "Save As" and choose another format: Hebrew DOS text.

Reload the file in Word and you will see that a question mark has replaced each nikkud mark.

Do a global replace (Ctrl+H) of '?' with nothing.

All done
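
Why the question marks appear can be reproduced in Python, assuming Word's "Hebrew DOS" option corresponds to code page 862 (an assumption; the answer does not name the code page). That code page contains the Hebrew letters but no nikkud, so the marks cannot be encoded. The function name below is hypothetical, and the sketch covers plain text only.

```python
def strip_via_cp862(text: str) -> str:
    """Imitate the save-as-DOS-text trick on a plain string (illustrative only)."""
    # Characters missing from code page 862, nikkud included, become "?".
    round_tripped = text.encode("cp862", errors="replace").decode("cp862")
    # Mirrors the Ctrl+H step: delete every "?", genuine ones included.
    return round_tripped.replace("?", "")
```

The sketch also exposes two side effects of this route: real question marks in the text are deleted, and so is any other character missing from the code page, such as U+05BE, which the question wants to keep.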

Maidstone answered 18/1, 2022 at 6:21 Comment(0)
