How do I count the letters in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch?
How do I count the letters in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch?

print(len('Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'))

Says 58

Well if it was that easy I wouldn't be asking you, now would I?!

Wikipedia says (https://en.wikipedia.org/wiki/Llanfairpwllgwyngyll#Placename_and_toponymy)

The long form of the name is the longest place name in the United Kingdom and one of the longest in the world at 58 characters (51 "letters" since "ch" and "ll" are digraphs, and are treated as single letters in the Welsh language).

So I want to count that and get the answer 51.

Okey dokey.

print(len(['Ll','a','n','f','a','i','r','p','w','ll','g','w','y','n','g','y','ll','g','o','g','e','r','y','ch','w','y','r','n','d','r','o','b','w','ll','ll','a','n','t','y','s','i','l','i','o','g','o','g','o','g','o','ch']))
51

Yeh but that's cheating, obviously I want to use the word as input, not the list.

Wikipedia also says that the digraphs in Welsh are ch, dd, ff, ng, ll, ph, rh, th

https://en.wikipedia.org/wiki/Welsh_orthography#Digraphs

So off we go. Let's add up the length and then take off the double counting.

word='Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'
count=len(word)
print('starting with count of',count)
for index in range(len(word)-1):
  substring=word[index]+word[index+1]
  if substring.lower() in ['ch','dd','ff','ng','ll','ph','rh','th']:
    print('taking off double counting of',substring)
    count=count-1
print(count)

This gets me this far

starting with count of 58
taking off double counting of Ll
taking off double counting of ll
taking off double counting of ng
taking off double counting of ll
taking off double counting of ch
taking off double counting of ll
taking off double counting of ll
taking off double counting of ll
taking off double counting of ch
49

It appears that I've subtracted too many then. I'm supposed to get 51. Now one problem is that with the llll it has found 3 lls and taken off three instead of two. So that's going to need to be fixed. (Must not overlap.)
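To make the overlap concrete, here is a small sketch comparing the overlapping count that the loop above performs with Python's built-in str.count, which counts non-overlapping occurrences. The helper name overlapping_pairs is made up for this illustration.

```python
def overlapping_pairs(text, pair):
    """Count every position where `pair` starts, overlaps included."""
    return sum(1 for i in range(len(text) - 1) if text[i:i + 2] == pair)

print(overlapping_pairs('llll', 'll'))  # 3: matches start at positions 0, 1 and 2
print('llll'.count('ll'))               # 2: str.count never reuses a character
```

So the loop sees three "ll" pairs in llll, while a left-to-right, non-overlapping reading sees two.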

And then there's another problem. The ng. Wikipedia didn't say anything about there being a letter "ng" in the name, but it's listed as one of the digraphs on the page I quoted above.

Wikipedia gives us some more clue here: "additional information may be needed to distinguish a genuine digraph from a juxtaposition of letters". And it gives the example of "llongyfarch" where the ng is just a "juxtaposition of letters", and "llong" where it is a digraph.

So it seems that 'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch' is one of those words where the -ng- is just a "juxtaposition of letters".

And obviously there's no way that the computer can know that. So I'm going to have to give it that "additional information" that Wikipedia talks about.

So anyways, I decided to look in an online dictionary http://geiriadur.ac.uk/gpc/gpc.html and you can see that if you look up llongyfarch (the example from Wikipedia that has the "juxtaposition of letters") it displays it with a vertical line between the n and the g but if you look up "llong" then it doesn't do this.

screenshot from dictionary (llongyfarch)

screenshot from dictionary (llong)

So I've decided okay what we need to do is provide the additional information by putting a | in the input string like it does in the dictionary, just so that the algorithm knows that the ng bit is really two letters. But obviously I don't want the | itself to be counted as a letter.

So now I've got these inputs:

word='llong'
ANSWER NEEDS TO BE 3 (ll o ng)

word='llon|gyfarch'
ANSWER NEEDS TO BE 9 (ll o n g y f a r ch)

word='Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
ANSWER NEEDS TO BE 51 (Ll a n f a i r p w ll g w y n g y ll g o g e r y ch w y r n d r o b w ll ll a n t y s i l i o g o g o g o ch)

and still this list of digraphs:

['ch','dd','ff','ng','ll','ph','rh','th']

and the rules are going to be:

  1. ignore case

  2. if you see a digraph then count it as 1

  3. work from left to right so that llll is ll + ll, not l + ll + l

  4. if you see a | don't count it, but you can't ignore it completely, it is there to stop ng being a digraph

and I want it to count it as 51 and to do it for the right reasons, not just fluke it.

Now I am getting 51 but it is fluking it because it is counting the | as a letter (1 too high), and then it is taking off one too many with the llll (1 too low) - ERRORS CANCEL OUT

It is getting llong right (3).

It is getting llon|gyfarch wrong (10) - counting the | again
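For reference, this sketch wraps the loop from earlier in a function (the name naive_count is just for illustration) and runs it on the three inputs. It should reproduce the numbers listed above, including the lucky 51.

```python
DIGRAPHS = ['ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th']

def naive_count(word):
    """Original approach: len() minus one for every digraph hit, overlaps included."""
    count = len(word)  # note: this also counts any | in the input
    for index in range(len(word) - 1):
        if word[index:index + 2].lower() in DIGRAPHS:
            count -= 1
    return count

print(naive_count('llong'))         # 3
print(naive_count('llon|gyfarch'))  # 10: the | is counted as a letter
# 59 characters minus 8 hits = 51, where the two errors cancel out
print(naive_count('Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'))
```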

How can I fix it the right way?

Shiest answered 21/8, 2020 at 19:10 Comment(6)
Since it is only one word that you are attempting to measure and you know the word and its length, why not just create a constant string to contain the string and a constant int to contain the length of the string and be done with it? No need to do this in code, right? - Callaway
I don't know much about python. After you do count=count-1, could you add index=index+1 to skip the next letter? - Polynices
So I don't know a ton about python but I figured they must have some concept of culture for strings? In .NET for example you would set the culture of your application and based on that it would treat certain characters differently. Unless the idea here is that you're trying to implement this from the ground up yourself, in which case ignore this comment. - Coontie
If it was C# I could offer "ch dd ff ng ll ph rh th |".Split().ToList().ForEach(a => sb.Replace(a, a == "|" ? "" : ".")); //sb is a StringBuilder - just replace each of the digraphs with a char that doesn't occur in the string and finally replace the | with nothing; the resulting length is your answer. Not a python dev, but the same process should work, replacing the doubles with a single character. - Gynecocracy
"th" and "sh" are digraphs in English, but I've never come across anyone who considers these "single letters", in the glyph sense. You're asking about counting "phonemes", which map notoriously awkwardly to languages written with alphabets. The syllable break, which you've identified, is just one ambiguity. - Orientation
It is clear that Welsh and English operate differently in this regard. An example is that if you browse words alphabetically on the dictionary link I posted (by using the "next word" and "previous word" links - note link top-right to set the interface to English) "cywystlwr" comes before the "ch" words - because the letter ch is after c in the alphabet. - Shiest
D
58

Like many problems to do with strings, this can be done in a simple way with a regex.

>>> word = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
>>> import re
>>> pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)
>>> len(pattern.findall(word))
51

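The character class [^\W\d_] matches word characters that are not digits or underscores, i.e. letters, including those with diacritics. Any character that matches neither a digraph nor this class, such as the |, is simply skipped by findall, so it is not counted but still keeps the n and g on either side of it from matching as ng.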
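As a quick check, and assuming the | convention from the question, the same pattern can be run over all three test inputs. The second pattern below (single_first, a name made up for this illustration) puts the single-letter class first, to show why the digraphs need to come before it in the alternation.

```python
import re

pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)

tests = {
    'llong': 3,
    'llon|gyfarch': 9,
    'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch': 51,
}
for word, expected in tests.items():
    letters = pattern.findall(word)
    print(word, len(letters), len(letters) == expected)

print(pattern.findall('llon|gyfarch'))
# ['ll', 'o', 'n', 'g', 'y', 'f', 'a', 'r', 'ch']

# With the single-letter class first, a digraph alternative is never reached.
single_first = re.compile(r'[^\W\d_]|ch|dd|ff|ng|ll|ph|rh|th', flags=re.IGNORECASE)
print(len(single_first.findall('llong')))  # 5
```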

Doherty answered 21/8, 2020 at 19:24 Comment(11)
Does the order of the conditions matter there? Will ll take priority over a through z since it appears first? More specifically, is that a regex-specific thing or will each language have its own implementation? - Coontie
If you want regex to handle the original input: pattern = re.compile(r'ch|dd|ff|ll|ph|rh|th|[a-z]|(ng^yf)', flags=re.IGNORECASE) - Gaggle
@MaxYoung Yes, the order of the parts is why the digraphs take priority over individual letters; that is generally true in every regex engine I have seen. In Python specifically, the docs say "As the target string is scanned, REs separated by '|' are tried from left to right", so it is the specified behaviour and safe to rely on. - Doherty
@benjessop--doesn't your pattern produce the same result without the last |(ng^yf)? - Toneme
@Doherty thanks, I was just curious. I had never looked into that before and this was the first scenario I've run into where it actually mattered. - Coontie
@JörgWMittag Fixed. - Doherty
Then there's the problem that Welsh uses several loan words / phrases from English and doesn't always change their spelling to Welsh spelling, so you can't absolutely count on the digraphs being digraphs... :-| Ah, natural languages are such fun. :-) - Misstep
@benjessop, what's that (ng^yf) about? Can it ever match anything when ^ means the start of string? - Odin
How can a regexp distinguish between a digraph and a juxtaposition? - Ina
@DirkLachowski A regexp cannot, of course, but according to the question, the input string will have a | character separating the two letters when they are not a digraph. - Doherty
@Doherty Ah! Had I read the question more carefully I wouldn't have asked a silly question. I somehow confused the | with a small L. Thanks for the clarification. - Ina
F
21

You can get the length by replacing each digraph with a . (or any other character that does not occur in the word, ? would do just fine), and then measuring the length of the resulting string, minus the number of | characters:

def get_length(name):
    name = name.lower()
    doubles = ['ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th']
    for double in doubles:
        name = name.replace(double, '.')
    return len(name) - name.count('|')

name = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
print(get_length(name))
51
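A quick sanity check of the same function against the other two inputs from the question. One assumption worth stating: the placeholder '.' must not already occur in the input, otherwise it is counted as if it were a letter.

```python
def get_length(name):
    # Same logic as the function above, repeated so this snippet runs on its own.
    name = name.lower()
    doubles = ['ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th']
    for double in doubles:
        name = name.replace(double, '.')  # str.replace is non-overlapping, so llll -> ..
    return len(name) - name.count('|')

print(get_length('llong'))         # 3
print(get_length('llon|gyfarch'))  # 9
print(get_length('a.b'))           # 3: the existing '.' is counted like a letter
```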
Flotow answered 21/8, 2020 at 19:22 Comment(2)
+1 for being extremely simple, I would have never thought just to tokenize, for lack of a better term, the characters that are conjugated. I have a feeling I'll have to apply this to an algorithm I've been working on for detecting duplicate characters in Japanese text but where the duplication is correct. The issue I run into in Japanese is that, for example, hahaha would be three of the same character back to back, but in theory the first two characters could be one word and the last character a particle. - Coontie
It works fine in this case. If you apply this method to other strings, you need to make sure that the intermediary variable doesn't contain digraphs which aren't present in the original string. - Soliloquize
B
10
  1. Step through the string letter by letter
  2. If you are at index n and s[n:n+2] is a digraph, add or increment a dictionary entry with the digraph as the key, and increment the index by 1 as well so you don't start on the second digraph character. If it's not a digraph, just add or increment the letter in the dict and go to the next letter.
  3. If you see the | character, don't count it, just skip.
  4. And don't forget to lowercase.

When you've seen all the letters, the loop ends and you add all the counts in the dict.

Here's my code, it works on your three examples:

from collections import defaultdict

digraphs=['ch','dd','ff','ng','ll','ph','rh','th']
breakchars=['|']


def welshcount(word):
    word = word.lower()
    index = 0
    counts = defaultdict(int)  # keys start at 0 if not already present
    while index < len(word):
        if word[index:index+2] in digraphs:
            counts[word[index:index+2]] += 1
            index += 1
        elif word[index] in breakchars:
            pass  # in case you want to do something here later
        else:  # plain old letter
            counts[word[index]] += 1

        index += 1

    return sum(counts.values())

word1='llong'
#ANSWER NEEDS TO BE 3 (ll o ng)

word2='llon|gyfarch'
#ANSWER NEEDS TO BE 9 (ll o n g y f a r ch)

word3='Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
#ANSWER NEEDS TO BE 51 (Ll a n f a i r p w ll g w y n g y ll g o g e r y ch w y r n d r o b w ll ll a n t y s i l i o g o g o g o ch)

print(welshcount(word1))
print(welshcount(word2))
print(welshcount(word3))
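Since the loop already keeps a per-letter tally, a small variation can return that tally instead of only the total. This is a sketch rather than part of the code above: welsh_letter_counts is a hypothetical name, and it uses collections.Counter in place of defaultdict.

```python
from collections import Counter

DIGRAPHS = {'ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th'}
BREAK_CHARS = {'|'}

def welsh_letter_counts(word):
    """Return a Counter mapping each Welsh letter (digraphs included) to its frequency."""
    word = word.lower()
    counts = Counter()
    index = 0
    while index < len(word):
        pair = word[index:index + 2]
        if pair in DIGRAPHS:
            counts[pair] += 1
            index += 2  # skip both characters of the digraph
        elif word[index] in BREAK_CHARS:
            index += 1  # separator: not a letter, but it has already split the pair
        else:
            counts[word[index]] += 1
            index += 1
    return counts

counts = welsh_letter_counts('Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch')
print(sum(counts.values()))                     # 51
print(counts['ll'], counts['ch'], counts['l'])  # 5 2 1
```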
Baryton answered 21/8, 2020 at 19:18 Comment(0)
M
1

You could use a Combining Grapheme Joiner (U+034F) character to join the two letters of each digraph, then take your character count and subtract twice the number of these joiners (each joined digraph is three characters but only one letter).

http://www.comisiynyddygymraeg.cymru/English/Part%203/10%20Locales%20alphabets%20and%20character%20sets/10.2%20Alphabets/Pages/10-2-4-Combining-Grapheme-Joiner.aspx

The Welsh Language Commissioner also addresses the issue here: http://www.comisiynyddygymraeg.cymru/English/Part%203/10%20Locales%20alphabets%20and%20character%20sets/10.2%20Alphabets/Pages/10-2-1-Character-vs--letter-counts.aspx
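A minimal sketch of that arithmetic, assuming the input has been prepared by hand with a joiner (U+034F) between the two characters of every digraph, as proposed above. The function name and the sample strings are illustrative only.

```python
CGJ = '\u034f'  # COMBINING GRAPHEME JOINER

def count_with_joiners(word):
    """Each digraph is stored as 3 code points (letter, CGJ, letter) but is 1 letter."""
    return len(word) - 2 * word.count(CGJ)

llong = 'l' + CGJ + 'lo' + 'n' + CGJ + 'g'         # ll o ng
llongyfarch = 'l' + CGJ + 'longyfarc' + CGJ + 'h'  # ll o n g y f a r ch
print(count_with_joiners(llong))        # 3
print(count_with_joiners(llongyfarch))  # 9
```

Note that this moves the work into preparing the input: something still has to decide which pairs are digraphs before the joiners are inserted.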

Misquotation answered 18/9, 2020 at 10:14 Comment(0)
