Ideas for converting straight quotes to curly quotes

Asked 4/2, 2009 at 0:57 Answered 24/8, 2022 at 1:11

Solved algorithm language-agnostic typography

I have a file that contains "straight" (normal, ASCII) quotes, and I'm trying to convert them to real quotation mark glyphs (“curly” quotes, U+2018 to U+201D). Since the transformation from two different quote characters into a single one has been lossy in the first place, obviously there is no way to automatically perform this conversion; nevertheless I suspect a few heuristics will cover most cases. So the plan is a script (in Emacs) that does something like the following: for each straight quote character,

guess which curly quote character to use, if possible
ask the user (me) to confirm, or make a choice

This question is about the first step: what would be a good algorithm (a set of heuristics, more like) to use, for normal English text (a novel, for example)? Here are some preliminary ideas, which I believe work for double-quotes (counterexamples are welcome!):

If a double-quote is at the beginning of a line, guess that it is an opening quote.
If a double-quote is at the end of a line, guess a closing quote.
If a double-quote is preceded by a space, guess an opening quote.
If a double-quote is followed by a space, guess a closing quote.
If a double-quote doesn't fit into one of the above categories, guess that it is the “opposite” of the most recently used kind of double-quote.
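
The five rules above can be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the function name `curl_double_quotes` is made up for the example, and the rules are simply tried in the order listed.

```python
OPEN_DQ, CLOSE_DQ = "\u201c", "\u201d"

def curl_double_quotes(text):
    """Apply the five double-quote rules, in order, to each straight quote."""
    out = []
    last = CLOSE_DQ  # assumption: an undecidable first quote is an opening one
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        # Treat the edges of the text as line breaks.
        prev = text[i - 1] if i > 0 else "\n"
        nxt = text[i + 1] if i + 1 < len(text) else "\n"
        if prev == "\n":        # rule 1: beginning of a line
            guess = OPEN_DQ
        elif nxt == "\n":       # rule 2: end of a line
            guess = CLOSE_DQ
        elif prev.isspace():    # rule 3: preceded by a space
            guess = OPEN_DQ
        elif nxt.isspace():     # rule 4: followed by a space
            guess = CLOSE_DQ
        else:                   # rule 5: opposite of the most recent quote
            guess = CLOSE_DQ if last == OPEN_DQ else OPEN_DQ
        out.append(guess)
        last = guess
    return "".join(out)
```

Because rule 1 is checked before anything else, a line that starts with a quote gets an opening quote even if the previous one was never closed.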

Single quotes are trickier, because a ' might be either an opening quote, closing quote, or apostrophe, and we want to leave apostrophes alone (mustn't write “mustn’t”). Some of the same rules as above apply, but 'tis possible apostrophes are at the beginning of words (or lines), although it's less common than 'twas in the past. I can't offhand think of rules that would properly handle fragments like ["I like 'That '70s show'", she said]. It might require looking at more than just neighbouring characters, and compute distances between quotes, for example…

Any more ideas? It is okay if not all possible cases are covered; the goal is to be as intelligent as possible but no further. :-)

Edit: Some more things that might be worth thinking about (or might be irrelevant, not sure):

quotes might not always be in matching pairs: For single quotes it's obvious why as above. But even for double quotes, when there is a quotation that extends for more than one paragraph, usual typographic convention (don't ask me why) is to start each paragraph with a quotation mark, even though it has not been closed in the previous one. So simply keeping a state machine that alternates between two states will not work!
Nested quotation (alluded to in the "I like 'That '70s show'" example above): this might make either kind of quote not be preceded or followed by a space.
British/American punctuation style: are commas inside the quotes or outside?
Many word processors (e.g. Microsoft Word) already do some sort of conversion like this. Although they are not perfect and can often be annoying, it might be instructive to learn how they work...

Crimp answered 4/2, 2009 at 0:57 Comment(1)

I finally did the conversion on the actual document. The first four rules covered all the double quotes. For single quotes, "immediately follows a comma or full stop" handled many of the closing quotes, and all the rest I had to handle manually. – Crimp 9/6, 2009 at 15:57

You can't parse English quotation marks with regex: regular expressions aren't sufficiently expressive to parse English quotations. You can get by in a few situations, but a general solution can't be created using regex. See the test cases for my solution.

Given:

A lexer to create lexemes from a character stream.
An emitter that publishes various types of quotation marks.
An ambiguity resolver that creates nested trees.
A set of known ambiguous and unambiguous contractions.
A circular buffer of lexemes, length 4.

Then, super-broadly, one possible algorithm follows:

Iterate over the document using the lexer.
Pass lexemes from the lexer to the emitter.
Push the lexeme into the emitter's circular buffer.
Parse 4 lexemes at a time in the emitter to categorize the curl:
- opening/closing double/single quote
- apostrophe
- straight quote
- ambiguous opening single quote
- ambiguous closing single quote
- ambiguous single quote
- ambiguous double quote
Emit the categorized quotation mark as a token to the ambiguity resolver.
Have the resolver create trees (for tracking nested quotes):
1. open a tree for opening quote tokens (single/double)
2. close the tree for closing quote tokens (single/double)
3. otherwise, track any ambiguous tokens in the current tree
After all tokens are in nested trees:
1. start at the root
2. disambiguate the tokens
3. sort the list of tokens
4. resolve the remaining tokens
5. disambiguate the tokens (yes, again)
6. relay the tokens to the document parser

Disambiguating entails replacing ambiguous quotation marks with resolvable equivalents. Basically, you need to count the number of ambiguous leading, lagging, and indeterminate single quotes. Based on whether the current level of the tree already contains some combination of leading/lagging quotes, you can ascertain whether the ambiguous quote is a: closing single quote, opening quote, or apostrophe.
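
As a very rough sketch of the lexing and categorizing steps only (this is not the KeenQuotes code, and it skips the circular buffer and the tree-based resolver entirely), something like the following could label single quotes by looking at the neighbouring lexemes. The two contraction sets are tiny hypothetical samples, and the rule that a word ending in "s" makes a trailing quote ambiguous is an assumption made for the example.

```python
import re

# Hypothetical sample sets; a real list would be much longer.
UNAMBIGUOUS = {"tis", "twas", "twere"}   # assumed to always take an apostrophe
AMBIGUOUS = {"cause", "em", "til"}       # could also be ordinary quoted words

def lex(text):
    """Split text into word, whitespace and single-punctuation lexemes."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def classify(tokens):
    """Return (index, label) for every straight single quote lexeme."""
    labels = []
    for i, tok in enumerate(tokens):
        if tok != "'":
            continue
        prev = tokens[i - 1] if i > 0 else " "
        nxt = tokens[i + 1] if i + 1 < len(tokens) else " "
        word_before, word_after = prev[0].isalnum(), nxt[0].isalnum()
        if word_before and word_after:
            label = "apostrophe"                    # mustn't
        elif word_after:
            if nxt.lower() in UNAMBIGUOUS or nxt[0].isdigit():
                label = "apostrophe"                # 'twas, '70s
            elif nxt.lower() in AMBIGUOUS:
                label = "ambiguous opening single"
            else:
                label = "opening single"
        elif word_before:
            # A trailing quote after "s" may be a plural possessive.
            if prev.lower().endswith("s"):
                label = "ambiguous closing single"
            else:
                label = "closing single"
        else:
            label = "ambiguous single"
        labels.append((i, label))
    return labels
```

The ambiguous labels are the ones that would be handed on to a resolver for counting and disambiguation.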

It's not a trivial algorithm, as it can require:

A circular buffer
A lexer (tokenizer)
A parser (emitter)
A resolver (ambiguities)
A tree
A set of contractions (ambiguous and unambiguous)

The screenshots originally shown here came from KeenQuotes, which is integrated into my text editor, KeenWrite.

Nit: It's '70s, not '70's because decades cannot possess anything.

Epiphytotic answered 24/8, 2022 at 1:11 Comment(0)

A good place to start would be with a state machine:

Starting at position 0, iterate over the characters
Upon finding a quote, enter the "Quoted" state (opening quote)
If in the "Quoted" state and you encounter a quote, return to the "Starting" state (closing quote)

You can make additional decisions at each of the state transitions.

You could attempt to normalize the single quotes by identifying known contractions, for instance, and converting their apostrophes to a different, non-text character prior to processing.
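
A minimal sketch of that idea, assuming the only normalization is protecting apostrophes that sit inside a word, and that the private-use character U+E000 never occurs in the input (both are assumptions for the example, as are the function names):

```python
import re

PLACEHOLDER = "\ue000"  # private-use code point, assumed absent from the input
PAIRS = {'"': ("\u201c", "\u201d"), "'": ("\u2018", "\u2019")}

def normalize(text):
    """Hide apostrophes inside words (mustn't) from the state machine."""
    return re.sub(r"(?<=\w)'(?=\w)", PLACEHOLDER, text)

def convert(text):
    """Two-state machine per quote type: Starting <-> Quoted."""
    quoted = {'"': False, "'": False}
    out = []
    for ch in normalize(text):
        if ch in PAIRS:
            opening, closing = PAIRS[ch]
            out.append(closing if quoted[ch] else opening)
            quoted[ch] = not quoted[ch]      # toggle the state
        elif ch == PLACEHOLDER:
            out.append("'")                  # restore the apostrophe as-is
        else:
            out.append(ch)
    return "".join(out)
```

This still assumes strictly alternating quotes, so leading apostrophes and multi-paragraph quotations would need further normalization before it gives sensible results.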

My $0.02

Exaggerated answered 4/2, 2009 at 1:13 Comment(6)

This is simply assuming that the quote characters are alternately opening quotes and closing quotes, something which is emphatically not true. – Crimp 4/2, 2009 at 2:16

That's where the normalization comes in. If you know there is a paragraph break, then you can change the rogue quote into something else. The state machine is a tool to process the normalized text. Generally, finding all of the "strange" cases is easier than accounting for all "good" cases. – Exaggerated 4/2, 2009 at 22:6

Quotes alternating is the easy case, and there are dozens of ways of handling that, including yours. I'm trying to find a larger set of heuristics (more than just "alternate") which handles as many cases as possible. The heuristics in the question already cover more cases than this answer (5) does. – Crimp 5/2, 2009 at 13:11

The point is that a state machine can be used in the implementation. I posted an example, but a more sophisticated state machine can easily handle 99% of the cases. Is this supposed to be an open discussion on the complexities of the English language, or on approaches to solving your problem? – Exaggerated 5/2, 2009 at 13:33

The actual mechanics (keeping state, backtracking, writing a recursive descent parser, whatever) are implementation details, and I think I can handle them. The question is indeed about high-level ideas based on the English language... sorry if this wasn't clear. How could I have phrased it better? – Crimp 5/2, 2009 at 22:8

Sorry, I misunderstood. I think you did a fine job describing your problem, it's just the context of the site that sent me down the implementation path. – Exaggerated 6/2, 2009 at 14:6

guess which curly quote character to use, if possible

It is not, in the general case.

The simple algorithm that most automatic converters use is just to look at the previous letter you typed before the ' or ". If it's a space, start of line, opening bracket or other opening quote, choose opening quote, else closing. The advantage of this method is that it can run as-you-type, so when it chooses the wrong one you can generally correct it.
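
As a rough sketch of that previous-character rule (the function names and the exact set of "opening" characters are assumptions for the example, not taken from any particular word processor):

```python
# Characters after which a quote is guessed to be an opening quote.
OPENERS = set(" \t\n([{\u2018\u201c")

def smart_quote(prev, ch):
    """Pick a curly quote for ch based only on the character before it.

    prev is the previous output character, or "" at the start of the text.
    """
    opening = prev == "" or prev in OPENERS
    if ch == '"':
        return "\u201c" if opening else "\u201d"
    if ch == "'":
        return "\u2018" if opening else "\u2019"
    return ch

def convert(text):
    out = []
    for ch in text:
        # Look at what was already emitted, so a quote that follows
        # another opening quote is also treated as opening.
        prev = out[-1] if out else ""
        out.append(smart_quote(prev, ch))
    return "".join(out)
```

Note that this turns the apostrophe in "don't" into a closing single quote, and gets word-initial apostrophes such as 'tis wrong, which is exactly the kind of case that needs correcting by hand.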

we want to leave apostrophes alone

I agree! But not many people do. It's normal typesetting practice to turn an apostrophe into a left-facing single quote. Personally I prefer to leave them as they are, to distinguish them from enclosing quotes, making the text easier (I find) to read, and possible to process automatically.

However, this really is just my taste; the fact that the Unicode standard defines the character as APOSTROPHE is not generally considered sufficient justification.

'tis possible apostrophes are at the beginning of words

Indeed. There is no way to tell an apostrophe from a potential open quote in cases like the classic Fish 'n' Chips, short of enormous amounts of cultural context.

(Not to mention primes, okinas, glottal stops and various other uses of the apostrophe...)

The best thing to do, of course, is install a keyboard layout that can type smart quotes directly. I have ‘’ on AltGr+[], “” on AltGr+Shift+[], –— on AltGr+[Shift]+dash, and so on.

Insure answered 4/2, 2009 at 1:29 Comment(2)

Good points! Unfortunately I'm already 3/4ths done with this file (reformatting an OCRed public-domain book) and although I tried to make some of the changes manually, I kept noticing that most of these could be automated... and that led to this question. :) – Crimp 4/2, 2009 at 1:57

Oh, been there! Yeah, I usually do it with the simple method above, but leaving apostrophes as they are when they're inside a word. It still takes a manual proofing to spot the beginning-apostrophes and plural-possessives that have been wrongly converted. – Insure 4/2, 2009 at 2:16

It looks like your initial post covers most of the ideas I was going to write here, this is what I've got left...

For the apostrophe example ("I like 'That '70s show'", she said), it's unlikely that quotes will be nested directly inside quotes of the same type. You could take advantage of that.

Best way to do this in my opinion is to make the code only handle unambiguous cases (double quotes are pretty simple). For the ones with multiple possible choices, store their position in a list and examine it when it's finished. You might find a few more easily-coded cases in there, or you might just decide to fix them manually.
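
One way this could look, as a minimal sketch: convert only the double quotes that have whitespace on exactly one side, leave apostrophes inside words alone, and record every other quote position for later review. The function name and the choice of what counts as "unambiguous" are assumptions for the example.

```python
def convert_unambiguous(text):
    """Return (converted_text, positions_of_quotes_left_for_review)."""
    chars = list(text)
    pending = []
    for i, ch in enumerate(text):
        if ch != '"' and ch != "'":
            continue
        # Treat the edges of the text as whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if ch == '"':
            if before.isspace() and not after.isspace():
                chars[i] = "\u201c"      # clearly opening
            elif after.isspace() and not before.isspace():
                chars[i] = "\u201d"      # clearly closing
            else:
                pending.append(i)
        elif before.isalnum() and after.isalnum():
            continue                     # apostrophe inside a word: leave it
        else:
            pending.append(i)            # any other single quote is ambiguous
    return "".join(chars), pending
```

The returned positions index into the original text, which stays the same length because each replacement is a single character.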

Derma answered 4/2, 2009 at 2:57 Comment(0)

The basic thing is to always try to find matching pairs. Given that every quote has a matching quote you could make your program ask for your help only where it's unsure which is the matching quote.
Opening quotes are always at the opening of a line or have a space in front of them. Closing quotes always have a space after them. If you find a colon with a following quote it's probably a closing quote.
If the letter following the quote is upper case it's probably an opening quote.
If there's a punctuation mark in front of the quote it's probably a closing quote.
Try to do it iteratively. The program should ask you first for all the quotes that it can definitely assign to a function. (Just to make sure it hasn't made any errors.)
In the second round, it should ask about all the quotes it is unsure about, such as those that could be either opening quotes or apostrophes. For every opening quote, it has to find the closing quote automatically.

Another, maybe less complex, idea could be:

Find all non-quotes by asking the user about each one that could potentially be a quote or a non-quote.
All the remaining quotes should be fairly easy to convert. Opening quotes have a space or newline in front of them, and closing quotes have one after them.

One last piece of thought:

You should break the process apart, for example by processing one paragraph at a time. If your program makes an error, which it probably will given the complexity of language, it's easier for you to correct it, and the program can start fresh with the new paragraph.
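
A small sketch of the paragraph-wise idea, assuming paragraphs are separated by blank lines and that "unsure" simply means the double quotes in a paragraph do not pair up (the function name is made up for the example):

```python
import re

def paragraphs_needing_review(text):
    """Return indexes of paragraphs whose double quotes are not in pairs."""
    flagged = []
    for n, para in enumerate(re.split(r"\n\s*\n", text)):
        # An odd count means some quote in this paragraph has no partner.
        if para.count('"') % 2 != 0:
            flagged.append(n)
    return flagged
```

Only the flagged paragraphs would then be shown to the user, and a mistake in one paragraph cannot spill over into the next.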

Prescott answered 4/2, 2009 at 1:11 Comment(0)

I hate to say it, but the best course of action might be to study what Word does, and copy it. Even if it's wrong in some cases, it represents a standard that many people have become accustomed to. One behavior to emulate is having undo (Ctrl-Z) immediately revert to the straight quote after you have substituted a curved one.

Coolidge answered 4/2, 2009 at 2:0 Comment(3)

Yes, I mentioned that in the question above. How does one study what Word does? :) – Crimp 4/2, 2009 at 2:2

Get the latest version of Word and experiment with different conditions. You've already created a good list of exceptional cases, and I'm sure you'll generate more in time. – Coolidge 4/2, 2009 at 5:30

To be more specific - generate a hypothesis of the algorithm they're using, and come up with test cases that would disprove the hypothesis. If you fail, you've probably guessed the algorithm correctly. – Coolidge 4/2, 2009 at 5:33

Here is a regular expression that might help for double-quotes:

/([^\s\(]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|\n\n)([^\s\)\.\,;]?)/gms

It will restart at each paragraph, and it will identify pairs of quotes (and will also allow you to check that the spacing is correct before and after the quotes, if that's useful).

Numbered element    Identification
  1                 non-whitespace character before the leading quote
  2                 whitespace after the leading quote
  5                 whitespace before the trailing quote
  6                 trailing quote (or double newline, i.e. start of a paragraph)
  7                 character after the trailing quote, if not whitespace or a right paren

I think it would be reasonable to extend this for your other cases (I just haven't had the need to yet.)

It's JavaScript syntax. It's pretty fast, but I haven't optimized it beyond my "good enough". It will process, say, a 400-page book in about a second. I think it would be hard to match its speed procedurally.
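
For illustration, here is a sketch of how the numbered groups might be used to do the replacement. It is a port of the same pattern to Python's `re` module (an assumption that the two engines behave alike here), the function names are made up, and it has only been checked on simple single-paragraph input.

```python
import re

# Same pattern as above; the JavaScript g/m/s flags become
# re.sub (global) plus re.M and re.S.
PAIR = re.compile(
    r'([^\s\(]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|\n\n)([^\s\)\.\,;]?)',
    re.M | re.S,
)

def _curl(m):
    # Group 6 is either the trailing quote or a paragraph break.
    closing = "\u201d" if m.group(6) == '"' else m.group(6)
    return (m.group(1) + "\u201c" + m.group(2) + m.group(3)
            + m.group(5) + closing + m.group(7))

def curl_pairs(text):
    """Replace each matched pair of straight double quotes with curly ones."""
    return PAIR.sub(_curl, text)
```

Groups 1, 2, 5 and 7 are passed through unchanged, so they remain available if you also want to flag odd spacing around the quotes.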

Spiritualty answered 4/2, 2009 at 3:16 Comment(0)

Computational linguistics anyone?

Somebody mentioned if you had a vast amount of cultural context, it might be feasible. So the overkill but most accurate automated solution to the problem is shallow parsing. This requires a corpus of whatever language and mode you're dealing with (e.g. the Brown corpus for general English).

Develop a classifier for curly quotes based on the syntactic context of the curly quotes occurring in the corpus. Finally, give your arbitrary syntactic context with a straight quote to your classifier and out pops the most probable quote character!
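
As a toy illustration of the idea only: the sketch below is a plain frequency count over character-class context, not shallow parsing, and it assumes you have training text that already contains curly quotes. All names in it are made up for the example.

```python
from collections import Counter, defaultdict

STRAIGHT = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def char_class(c):
    """Coarse class of a neighbouring character ('' means edge of text)."""
    if c == "":
        return "edge"
    if c.isspace():
        return "space"
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "punct"

def features(text, i):
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    return (char_class(prev), char_class(nxt))

def train(corpus):
    """Count which curly quote appears in each (straight quote, context)."""
    counts = defaultdict(Counter)
    for i, c in enumerate(corpus):
        if c in STRAIGHT:
            counts[(STRAIGHT[c],) + features(corpus, i)][c] += 1
    return counts

def predict(counts, text, i):
    """Most frequent curly quote for this context, or the quote unchanged."""
    seen = counts.get((text[i],) + features(text, i))
    return seen.most_common(1)[0][0] if seen else text[i]
```

A real classifier would use much richer context (words, part-of-speech tags) than the two neighbouring character classes used here.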

Goffer answered 4/2, 2009 at 5:27 Comment(1)

... and if you want to go in this direction, en.wikipedia.org/wiki/Natural_Language_Toolkit is a good place to learn about it, and find the tools to implement it. (A tutorial simultaneously in Natural Language Processing and Python.) – Spiritualty 4/2, 2009 at 5:39

["I like 'That '70s show'", she said]

I originally thought maybe using multiple passes over the text to gain context insight might help but that would not solve all instances.

The best thing you could do is build a list of possible word sets/expressions like 'twas, 'tis, '70s etc. and add them to the dictionary with auto-correction on them to convert the straights to curls and vice versa. Spell checkers run on every word anyway, don't they? (Sorry, that doesn't help with your Emacs problem.)
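
A minimal sketch of such a word list, done with a regular expression rather than a spell checker's dictionary (that substitution, the short sample list and the function name are all assumptions for the example):

```python
import re

# Hypothetical sample of words that start with an apostrophe.
KNOWN = ["tis", "twas", "70s", "em", "cause"]

# A straight quote that does not follow a word character and is
# immediately followed by one of the known words.
LEADING = re.compile(
    r"(?<!\w)'(?=(?:%s)\b)" % "|".join(map(re.escape, KNOWN)),
    re.IGNORECASE,
)

def fix_known(text):
    """Turn the leading apostrophe of known words into U+2019."""
    return LEADING.sub("\u2019", text)
```

Running this before any opening/closing logic takes the known leading apostrophes out of the picture, so they cannot be mistaken for opening quotes.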

OO ignores single-quote curling altogether, from what I can tell.

Wikipedia has a bit of info on these pesky things.

Sappy answered 4/2, 2009 at 2:37 Comment(0)

Try Shift + Ctrl + " (double quote key); this worked for me on Windows 10, using a program called Kalipso.

Goober answered 11/10, 2017 at 11:18 Comment(1)

Sorry, this does not answer the question and it is not what I was asking. I have no problem inserting the “ or ” or ‘ or ’ characters; the question was about coming up with an algorithm/heuristic/a set of rules for when to insert which character. – Crimp 11/10, 2017 at 15:19
