Algorithm for largest word formed from perodic table elements
Asked Answered
C

7

7

I want to write an algorithm for the following problem scenario

Taking periodic table elements' names, find largest word that can be formed? The symbols such as Na , Ne etc should be regarded as single elements.

This was asked in a reputed company's job interview. Can anybody please help me out solve this.

Crossquestion answered 15/5, 2014 at 9:7 Comment(8)
Is there a dictionary of words provided to check the output ?Cowper
@Cowper : no such validation data was given . The question was exactly what I have mentioned. This was the dilemma for me.Crossquestion
There are infinite words, how can we compare without a dictionary?Siliculose
I.N.C.O.N.S.P.I.C.U.O.U.S.N.Es.SKosaka
If you want to find the longest word, your first step would be to sort your dictionary from longest to shortest. Then try from the top and find the first one you can build with the symbols from the periodic table.Variform
@Variform Simply sorting the dictionary by length and stopping at the first match does not necessarily give the correct answer because multiletter elements count only as one symbol. Maybe the intention of the question is to see if you are trying to be to clever and optimize in a bug.Lona
find largest word that can be formed is actually ambiguous. Do they need the largest combination of elements, or the largest resulting word?Variform
Because of the second sentence I am pretty sure they ask for number of elements not letters, but there is some ambiguity left.Lona
T
1

How to express a given word as a chemical compound? Here's a dynamic programming solution. The important line is "Let progress[i] be the solution to the subproblem for word[:i]". The rest follows.

elements = "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Uut Fl Uup Lv Uus Uuo".split()

def decompose(word):
    """Express given word as chemical compound. If there are multiple solutions, return one of minimal weight."""
    progress = [False for x in range(len(word)+1)] # solution for word[:i]
    progress[0] = []

    for i in range(1, len(word)+1):
        possibles = list()
        for j in range(max(i-3,0), i):
            if progress[j] == False:
                continue
            alchemical = word[j:i].title()
            if alchemical in elements:
                possibles.append(progress[j] + [alchemical])

        if possibles:
            # choose minimal solution
            progress[i] = min(possibles, key=len)

    if progress[-1] == False:
        return False
    return "".join(progress[-1])

assert decompose("sine") == "SiNe" # rather than S-I-Ne
assert decompose("bismuth") == "BiSmUTh"
assert decompose("jam") == False

https://gist.github.com/hickford/750135db633832913703


Ran this on a whole dictionary, the longest compounds were:

  • nonrepresentationalism NoNRePReSeNTaTiONaLiSm
  • thermophosphorescence ThErMoPHOsPHoReSCeNCe
  • bronchoesophagoscopy BrONCHoEsOPHAgOsCoPY
  • hyperphosphorescence HYPErPHOsPHoReSCeNCe
  • hypsibrachycephalism HYPSiBrAcHYCePHAlISm
Terrie answered 22/5, 2014 at 10:15 Comment(0)
P
3

I think a better way is to check every word in a dictionary and see if it can be made from the names of the elements. To check every permutation of the elements would be harder to program and would be less efficient.

While I agree it is easier to produce the combinations, there are just way too many of them, and as you said tend to infinity if you don't give a limit. The production of the word with the symbols would be slightly more difficult and finicky, but I don't think it would be too difficult.

E.g. when you get a word you could search through the elements looking for elements that could make up your word, then using that set of elements try and fill in the letters from start to finish. Obviously this gets harder with elements that aren't 2 letters and words that are odd in length.

You can actually use a sort of graph. Say you have 'silicon'. You can start with the letter 'S' or 'SI' which are both in the table. From there choose 'SI' because it is closer to your solution. If 'SI' does not lead to your solution you will have to come back and see if 'S' would work.

So it works as a kind of depth first search.

Psia answered 15/5, 2014 at 9:25 Comment(13)
And how do you propose to check if a word can be made from the symbols? It's quite vital to the question and answer. Also, it is very easy to program generating of combinations IMO.Sather
Hmm... after thinking some more, you might be right, including repetition of symbols would make the possible words infinite. It might be easier to invent algorithms to check word against being possible to be built.Sather
I figured out an algorithm that checks if a prefix of word is a symbol. But again, of course you are right. For each possible hit you should make a list (a queue), and try every possible path. I think it still is fairly easy to refactor and extend my proposal with this idea to be correct. I think the basic idea is fine.Sather
I think that better wording for "kind of DFS" is building a graph of prefixes that match the word. And IMO it doesn't matter if worked like DFS or BFS.Sather
You can use DP to check whether a word consists of only element names: Let f(i) be the maximum number of elements that can form the prefix w[1..i]. Then if there is an element string E of size k and we have w[i+1..i+k] = E, we have f(i + k) <= 1 + f(i)Febrific
@NiklasB.: Ah, you found the DP for this already! One tiny thing: "<=" should be ">=", no?Erdda
@Erdda Yeah, we want to maximize, not minimize :) Also it might be worth memoizing actual prefixes (or their hashes) to eliminate common subproblems between different words. Some kind of LRU cache would probably work great here if the dictionary is sortedFebrific
@NiklasB.: Yes, I think all you need to do is to keep the previous word's f[] around, and then you can just avoid recalculating f(i) for all i <= length_of_shared_prefix. But since it will take linear time to find the length of the shared prefix anyway, this only swaps one linear time algo for another, so I guess it might not be such a huge win ;) Should be a bit faster though!Erdda
@Erdda Yes that's even better. It's asymptotically better as well because looking up a string is linear time as well and you do that much more often Omega(n) where n is the number of words in the dict.Febrific
@NiklasB.: Hmm I would say that testing whether a 1- or 2-character substring is a valid chemical element name is just O(1), not O(n) -- or did you mean something else? But certainly it should have a higher constant factor than just comparing two characters, so doing the latter (i.e. finding the length of the longest shared prefix) should be a win in practice.Erdda
@Erdda I meant looking up f(j) would be linear time if you memoize based on stringsFebrific
@NiklasB.: Oh, you meant if you made f() a 2-parameter function, with 1 param being the string? Sure, those lookups would be O(n). What I meant was that doing my "remember just the last string" trick (with 1-param f()) doesn't actually improve the asymptotic complexity over just dumbly processing each string completely independently (with 1-param f()), since you have to pay O(n) just to find the length of the shared prefix.Erdda
@Erdda Yes you're right, I missed that the transition is easily implemented in O(1). Well we should probably not clutter the comments any more. Linear time is definitely possible for this problem, that should be good enough :PFebrific
M
1

Generating all words and checking if they exist is impractical. There are 118^L words of length L, a too fast growing function. 1,643,032 three symbols words !

The other way round, as suggested by Makoto, is much more efficient. Try to reconstruct every word from a dictionary. There are on the order of 250,000 English words.

Checking a word is straightforward: see if any element symbol matches the start of the word, and continue with the remaining characters.

You must try all possible matches as a match could hide another. (I mean word ABC could be matched by symbol A and then be stuck because BC is not available, but it could be that AB and C are.) So the matching function will be recursive. I don't expect an exponential explosion in this matching process.

For maximum efficiency, you will store the symbols in a trie structure http://en.wikipedia.org/wiki/Trie.

Final hint: as you are just asked to find the longest match, try the words by decreasing lengths and stop at the first match.

Here is a simple solution in Python, without any optimization. Matching is done right to left, to allow printing the sequence of symbols in post-order:

Words= ['K', 'NaH', 'NaNaNaNa', 'HaKuNa']
Symbols= ['Na','K','H']

def Match(Word):
    # Match the end of the word with every symbol
    for S in Symbols:
        # In case of a partial match, recurse on the prefix
        if S == Word or (S == Word[-len(S):] and Match(Word[:-len(S)])):
            print S,
            return True

    # No match, report failure
    return False

for W in Words:
    if Match(W):
        print


>>> 
K
Na H
Na Na Na Na
Mandalay answered 15/5, 2014 at 10:1 Comment(7)
Sorting the dictionary and stopping at the first match may give the wrong answer because multiletter elements count only as one symbol.Lona
@Daniel: Right. This can be fixed by continuing the search until a word length equal the number of symbols of the longest word found.Mandalay
You can use DP to get a polynomial-time checkFebrific
@Niklas: how can DP help here ?Mandalay
If you just memoize Match, you're golden already, since it's only called with O(n*m) different inputs, where n is the number of words and m is the maximum word sizeFebrific
@Niklas: do you consider memoization as DP ?Mandalay
@YvesDaoust Of course. Memoization is just a means of resolving the DAG of subproblems implicitly. You can also call it top-down DP if you preferFebrific
T
1

How to express a given word as a chemical compound? Here's a dynamic programming solution. The important line is "Let progress[i] be the solution to the subproblem for word[:i]". The rest follows.

elements = "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Uut Fl Uup Lv Uus Uuo".split()

def decompose(word):
    """Express given word as chemical compound. If there are multiple solutions, return one of minimal weight."""
    progress = [False for x in range(len(word)+1)] # solution for word[:i]
    progress[0] = []

    for i in range(1, len(word)+1):
        possibles = list()
        for j in range(max(i-3,0), i):
            if progress[j] == False:
                continue
            alchemical = word[j:i].title()
            if alchemical in elements:
                possibles.append(progress[j] + [alchemical])

        if possibles:
            # choose minimal solution
            progress[i] = min(possibles, key=len)

    if progress[-1] == False:
        return False
    return "".join(progress[-1])

assert decompose("sine") == "SiNe" # rather than S-I-Ne
assert decompose("bismuth") == "BiSmUTh"
assert decompose("jam") == False

https://gist.github.com/hickford/750135db633832913703


Ran this on a whole dictionary, the longest compounds were:

  • nonrepresentationalism NoNRePReSeNTaTiONaLiSm
  • thermophosphorescence ThErMoPHOsPHoReSCeNCe
  • bronchoesophagoscopy BrONCHoEsOPHAgOsCoPY
  • hyperphosphorescence HYPErPHOsPHoReSCeNCe
  • hypsibrachycephalism HYPSiBrAcHYCePHAlISm
Terrie answered 22/5, 2014 at 10:15 Comment(0)
S
0

Generate all possible words - naive approach.

Based on Generate all permutations of all lengths:

import itertools..
symbols = ['Na','K','H']
for i in range(len(symbols)):
 for word in itertools.permutations(symbols,i+1):
  print( ''.join(word) )

You can generate every possible combination, and check against a dictionary if its an acutal word. But it is inefficient and only for if you allow no repetitions of symbols.

Check if word can be built from symbols - still not perfect.

If you allow repetitions you need to check a word against the list of symbols. I propose the following:

import itertools..
words = ['K', 'NaH', 'NaNaNaNa', 'HaKuNa']
symbols = ['Na','K','H']
for i in range(len(symbols)):
 for word in itertools.permutations(symbols,i+1):
  print( ''.join(word) )


def can_be_built(word):
  pos = 0
  ret = True
  while(pos < len(word)):
   #following loop checks if word starting form `pos`
   #can be prefixed by a symbol
   symbol_match = False
   for symbol in symbols:
    if word[pos:pos+len(symbol)] == symbol:
      symbol_match = True
      break
   if symbol_match == False:
     print('Word `{}` has no match for symbol from: {}'.format(word, word[pos:]))
     ret = False
     break
   #if it does move forward
   pos += len(symbol)

  return ret

for word in words:
   print("{} can be built? {}".format(word, can_be_built(word)))

It iteratively check if a word prefix is a symbol, then moves forward until the end of word is reached.

Output:

K can be built? True
NaH can be built? True
NaNaNaNa can be built? True
Word `HaKuNa` has no match for symbol from: aKuNa
HaKuNa can be built? False

It still is not perfect.

Correct approach

As Makoto says, the prefix-check should return a list of every possible match. The algorithm should make a queue out of those matches, and check all possible paths. It would be kind of like building a graph of prefixes that match the word. And if one builds whole word you are home.

I think it is still fairly easy to correct my 2nd example, but I am out of time for the actual code. I think it's a good place for start.

Sather answered 15/5, 2014 at 9:15 Comment(6)
Can you please elaborate your solution ?Crossquestion
So , you are opting out to add all the symbols in a peridic table to a String array. Your algo seems very much speific to a particular language.Crossquestion
I added second example that allows repetition in case batman would like to say NaNaNa. Examples are written in python.Sather
I Think the above code makes some sense. I'll try it out. Thank you !Crossquestion
I edited the code with comment to help identify the prefixing check. But I think it would be better to refactor it, and you certainle need to make it catch all possible matches, not only the 1st as it does currently.Sather
The trick to getting your test to work 100% of the time is to check that the first 1 or 2 characters are a valid element name and that the rest of the word can be built from element names, recursively. This would be exponential-time, but it can be brought back to linear time with DP -- see my answer.Erdda
E
0

[EDIT: This post is basically just a full explanation of a DP algorithm mentioned by Niklas B. in comments on other posts.]

Testing a particular word in linear time

In a word that can be formed from chemical element names, at least one of the following must be true:

  • The last letter is a single-letter chemical element name AND the prefix consisting of all letters except for the last is itself a word that can be formed from chemical element names.
  • The last two letters are a two-letter chemical element name AND the prefix consisting of all letters except for the last two is itself a word that can be formed from chemical element names.

This suggests a nice DP algorithm, where for a given word w of length k we compute, for all 1 <= i <= k, the largest number of letters (or chemical symbols; the question is ambiguous regarding exactly what we're trying to maximise!) that the length-i prefix of w can be built from. Let's call this f(i), and say that f(i) = -1 if the length-i prefix of w cannot be built from chemical element names at all.

Let X = 1 for maximising element names, or 2 for maximising letter counts.

The recursion for the DP can be written as:

  • one(i) = if (w[i] is a 1-letter element name) { f(i-1) + 1 } else { -1 }
  • two(i) = if ("w[i-1]w[i]" is a 2-letter element name) { f(i-2) + X } else { -1 }
  • f(i) = -1 if i < 0
  • f(0) = 0
  • f(i) = max(one(i), two(i)) if i > 0

Then f(k) will be what we want to know: the maximum number of (letters/elements) that w can be formed from, or -1 if this is not possible.

The max() means that if only one of the ways of handling the end of the prefix works, that way will be chosen (because the other way will have a score of -1, which will always compare less), and if neither way works, we will correctly get f(i) = -1.

Assuming constant-time testing of whether single characters or character pairs are valid chemical element names (using e.g. array or hashtable lookups), we can compute whether a given word can be represented as a sequence of chemical element names in time and space proportional to its length.

Considering all words

If we are maximising the number of letters, then it probably makes the most sense to process the dictionary in decreasing length order, since the first time we encounter a word for which f(k) is not -1, we know it must be the longest and we can stop.

This can be adapted for maximising the element name count. Although we can't stop immediately, because it might be that there is a shorter word further on that nevertheless can be formed from more element names (specifically, by using more single-character ones), we still get some useful information whenever we find a new word with a greater element count than the previously best: we can update a cutoff threshold with this element count, and stop as soon as we see a word that has fewer letters than this threshold, since a length-m word can never be formed from more than m element names.

There is different approach, which might turn out to be faster: by first sorting the dictionary in alphabetical order, we can avoid recomputing f(i) on the prefix that is shared with the previous word. E.g. if carton immediately follows cart in the dictionary, then we only need to compute f(i) on the on part.

Erdda answered 16/5, 2014 at 13:5 Comment(0)
B
0

Just want to write down my thoughts, and verified with others who professional with Algorithm, because I didn't see such question in other place.

BTW, I use Java.And here is my pseudocode:

1, Use the element symbol of Periodic Table to form a Array of String: String[] PeriTable.

2, Based on the backtracking thoughts, build a helper function called Backtracking(String result, String[] PeriTable, String[] used);

3, We iterate each element in PeriTable, and check the String it formed.

 Backtracking void(String result, String[] PeriTable, String[] used){
   for(int i=0;i<PeriTable.length;i++){
     If(!used.contain(PeriTable[i])){
      newResult=result+PeriTable[i];
     If(isEnglishWord(newResult)) {
         used.add(PeriTable[i]);
         maxlength=Math.max(maxlength,newResult.length());
         Backtracking(newResult, PeriTable, used);
         used.remove(used.length()-1);
      }
    }
    else continue;
  }
 }

However,I don't know how to build isEnglishWord(); function.

Bolection answered 3/8, 2018 at 19:17 Comment(0)
I
-1

Sorry, I am very confused by these answers. Surely the obvious answer is to sort the dictionary in alphabetical order by using nested bucket sorts up to some depth, say 8 letters, this should give you order 1 retrieval of words starting with a given sequence of letters up to 8.

Then go through the period tables doing DFS like it is a game tree. At each step you add an element and then check down the list to see if there are any words starting with that combination.

This will be relatively memory intensive, as you probably need at least 6 layers of buckets (alphabetise by the first six letters), but there are typically only 1,000,000 or so words. Since virtually every branch in the game tree will terminate after four letters, youre DFS will search the whole game tree quite fast. If you can form every word in the dictionary, it still must be the case that there are at most the same number of branches as there are words, and in practice you will only be able to get a fraction of the words int he dictionary, so this approach must be efficient.

Intertidal answered 15/5, 2014 at 10:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.