Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.

I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This worked smoothly until I reached the corner case of companies whose names end in s, where the replacement sometimes produces s's.

Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.

This is improper English.

I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.

            Regex apostropheReplace = new Regex("s\\'s");
            docText = apostropheReplace.Replace(docText, "s\'"); 

This does not work. It seems that Word uses a different character for an apostrophe (') than the standard one produced by the key on my keyboard in Visual Studio. If I write the find and replace using my keyboard it does not work, but if I copy and paste the apostrophe from Word it does.

            Regex apostrophyReplace = new Regex("s\\’s");
            docText = apostrophyReplace.Replace(docText, "s\'"); 

Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if there is a proper way of doing this. I tried "'" but that does not work. I just want to know whether using the copied character from Word is the proper way of doing this, and whether there is a way to make both characters work so I don't have an issue with docs that may have been created with a different program.

Shea answered 29/10, 2019 at 20:37 Comment(2)
There are several kinds of quote characters, straight quotes and smart (curly) quotes, as you discovered. The ' and ’ you found are two of them: '' ‘’ "" “” - Czar
You have received two very good answers. I just wanted to add that Word uses so-called typographical quotation marks, which are also called "curly quotes". Visual Studio uses so-called "typewriter quotes" or "straight quotes". As noted in the answer, those are different Unicode characters, meaning your regular expression does not match curly quotes when you just provide a straight one. - Perron
M
5

This happens because they are different characters.

Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
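
To see the difference concretely, here is a minimal sketch that prints the code points of both variants. The class and method names are hypothetical, and it assumes the Word text contains U+2019, which is what Word's smart quotes feature normally produces for an apostrophe.

```csharp
using System;

public static class ApostropheCodePoints
{
    // Formats each character as "U+XXXX" so look-alike glyphs can be told apart.
    public static string Describe(string text)
    {
        var parts = new string[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            parts[i] = "U+" + ((int)text[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        string typed = "s's";         // apostrophe typed on the keyboard
        string fromWord = "s\u2019s"; // assumed: curly apostrophe copied from Word

        Console.WriteLine(Describe(typed));    // U+0073 U+0027 U+0073
        Console.WriteLine(Describe(fromWord)); // U+0073 U+2019 U+0073
        Console.WriteLine(typed == fromWord);  // False
    }
}
```

Because the middle character differs, a pattern written with one apostrophe cannot match text containing the other.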

I ran into the very same issue before and used this regular expression: [\u2018\u2019\u201A\u201b\u2032']

So essentially modify your code to:

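// The bracket is left unescaped: "\\[" would match a literal '[' instead of opening a character class.
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201B\u2032']s");
docText = apostropheReplace.Replace(docText, "s'");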

These five, plus the straight apostrophe, are the most common types of single quotes and apostrophes I have come across.
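
A minimal sketch of applying such a character class, with two changes that are my own assumptions rather than part of the answer: a capture group so the replacement keeps whichever apostrophe the document used, and a trailing `\b` so only a word-final `s's` is rewritten. The class and method names are hypothetical. Note that the opening bracket is not escaped, since `\[` in a pattern matches a literal bracket.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PossessiveFix
{
    // Straight apostrophe plus the curly and prime look-alikes listed above.
    // The group captures the apostrophe; \b requires the final "s" to end a word.
    private static readonly Regex DoubledPossessive =
        new Regex("s(['\u2018\u2019\u201A\u201B\u2032])s\\b");

    // Turns "Simmons's" into "Simmons'" and keeps the original apostrophe character.
    public static string Fix(string docText)
    {
        return DoubledPossessive.Replace(docText, "s$1");
    }

    public static void Main()
    {
        Console.WriteLine(Fix("The Simmons\u2019s business is cars."));
        Console.WriteLine(Fix("The Simmons's business is cars."));
        Console.WriteLine(Fix("The Acme's business is cars.")); // left unchanged
    }
}
```

Keeping the captured character avoids mixing straight and curly apostrophes within one document. If you prefer to normalize everything to a straight apostrophe, replace with `"s'"` instead of `"s$1"`.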

And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]

Marylou answered 29/10, 2019 at 20:48 Comment(2)
Thank you! That worked great! I would upvote but this is my first SO post. - Shea
no worries - glad it helped... it is easy once you know how :-p welcome to SO! - Marylou
P
4

Answering the question:

Is there a way to do it so that both characters work?

If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:

 Regex apostropheReplace = new Regex("s['’]s");
 docText = apostropheReplace.Replace(docText, "s'");

This has the added benefit of making it clear to other developers that you are covering both apostrophe cases. That benefit bears on the other part of your question:

If using the copied character from Word is the proper way of doing this?

That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).

If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other quote-like characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:

Regex apostropheReplace =
    new Regex("s['\u2018\u2019\u201A\u201B]s");
docText = apostropheReplace.Replace(docText, "s'");

You can certainly add other characters as needed. A more complete list of quote-like characters can be found here. To add one to the Regex above, replace the "U+" prefix of its code point with "\u" and append the result inside the brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from

Regex("s['\u2018\u2019\u201A\u201B]s")

to

Regex("s['\u2018\u2019\u201A\u201B\u2032]s")

Ultimately, you are the judge of which characters are "proper" to include in your Regex, based on your use cases.
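
If the list of characters is likely to grow, the "U+" to "\u" conversion can be automated. The following is only a sketch under stated assumptions: `QuoteClassBuilder` and `BuildClass` are hypothetical names, and it handles only code points up to U+FFFF, because the `\uXXXX` escape in .NET regular expressions takes exactly four hex digits.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteClassBuilder
{
    // Builds a regex character class such as [\u0027\u2019] from "U+XXXX" notations.
    public static string BuildClass(params string[] codePoints)
    {
        var escapes = codePoints.Select(cp =>
        {
            // Strip the "U+" prefix and parse the remaining hex digits.
            int value = int.Parse(cp.Substring(2), NumberStyles.HexNumber,
                                  CultureInfo.InvariantCulture);
            if (value > 0xFFFF)
            {
                throw new ArgumentOutOfRangeException(nameof(codePoints),
                    "Only code points up to U+FFFF are supported by this sketch.");
            }
            return "\\u" + value.ToString("X4");
        });
        return "[" + string.Concat(escapes) + "]";
    }

    public static void Main()
    {
        string quoteClass = BuildClass("U+0027", "U+2018", "U+2019",
                                       "U+201A", "U+201B", "U+2032");
        Console.WriteLine(quoteClass); // [\u0027\u2018\u2019\u201A\u201B\u2032]

        var apostropheReplace = new Regex("s" + quoteClass + "s");
        // The input uses the prime symbol (U+2032) as its apostrophe.
        Console.WriteLine(apostropheReplace.Replace("The Simmons\u2032s business", "s'"));
        // The Simmons' business
    }
}
```

Here the escapes are passed to the regex engine as text rather than being resolved by the C# compiler, so the pattern stays readable when logged or inspected.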

Pulido answered 29/10, 2019 at 20:53 Comment(2)
Thank you! I would upvote but this is my first SO Post. The answer above ([\u2018\u2019\u201b\u2032']) fits a little better as I want to be ready for all use cases. - Shea
No worries - I'm glad you got your answer, @mfontaine! I've updated my answer to explicitly include the other single-quote characters that I mentioned could be added. - Pulido
