Overcoming MemoryError / Slow Runtime in Ashton String task
Asked Answered
J

3

6

In the Ashton String task, the goal is to:

Arrange all the distinct substrings of a given string in lexicographical order and concatenate them. Print the Kth character of the concatenated string. It is assured that given value of K will be valid i.e. there will be a Kth character.

The Input Format:

First line will contain a number T i.e. number of test cases. First line of each test case will contain a string containing characters (a−z) and second line will contain a number K.

The Output Format:

Print Kth character ( the string is 1 indexed )

And the Constraints are

1 ≤ T ≤ 5
1 ≤ length ≤ 105
K will be an appropriate integer.

For example, given the input:

1
dbac
3

The output would be: c

I've attempted the task with this code and it works for relatively short strings:

from itertools import chain

def zipngram(text, n=2):
    words = list(text)
    return zip(*[words[i:] for i in range(n)])

for _ in input():
    t = input()
    position = int(input())-1 # 0th indexing
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
    concatstr = ''.join(sorted([''.join(s) for s in chargrams]))
    print (concatstr[position])

But if the input file looks like this: http://pastebin.com/raw/WEi2p09H and the desired output is:

l
s
y
h
s

The interpreter will throw a MemoryError:

Traceback (most recent call last):
  File "solution.py", line 11, in <module>
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
  File "solution.py", line 11, in <listcomp>
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
  File "solution.py", line 6, in zipngram
    return zip(*[words[i:] for i in range(n)])
  File "solution.py", line 6, in <listcomp>
    return zip(*[words[i:] for i in range(n)])
MemoryError

How can the MemoryError be resolved? Is it solvable in another way using native python2 or python3?

I tried to resolve the MemoryError by pruning the stack using heapq but now it goes into an uber slow runtime pushing and popping the heap instead of taking up too much memory.

from itertools import chain
import heapq

t = int(input())
s = list(input())
k = int(input())

stack = []
for n in range(1,len(s)+1):
    for j in range(n):
        ngram = (''.join(s[j:]))
        ngram_len = len(ngram)
        # Push the ngram into the heapq.
        heapq.heappush(stack, ngram)
        pruned_stack = []
        temp_k = 0
        # Prune stack.
        while stack != [] and temp_k < k:
            x = heapq.heappop(stack)
            pruned_stack.append(x)
            temp_k+=len(x)
        # Replace stack with pruend_stack.
        stack = pruned_stack

print (''.join(chain(*pruned_stack))[k])

Is there a way to balance between not using too much memory that it leads to MemoryError and too slow runtime with heapq pushing and popping?

Jene answered 30/12, 2015 at 14:8 Comment(0)
G
4

MemoryError means that the program consumed all available memory and thus crashed.

A possible solution is using iterables (they work in Py2 too but Py3 has better support with them) that are lazy (they compute the value only on demand, not all at once).

Adapting your program to generators should need only minor changes, to index a generator without using a list (that would nullify the lazy benefit) see: Get the nth item of a generator in Python

Graze answered 30/12, 2015 at 14:36 Comment(0)
M
4

Try this code, it works for the large sample.

def ashton(string, k):
    #We need all the substrings, and they have to be sorted
    sortedSubstrings = sorted_substrings(string)
    count = 0
    currentSubstring = 0
    #Loop through the substrings, until we reach the kth character
    while (count < k):
        substringLen = len(sortedSubstrings[currentSubstring])
        #add the number of characters of the substring to our counter
        count += substringLen
        #advance the current substring by one
        currentSubstring += 1
    #We have the correct substring now, and calculate to get the right char
    diff = count - k
    #Return answer, index 1 = substring, index 2 = char in substring
    return(sortedSubstrings[currentSubstring][substringLen-diff-1])

#Determine the substrings in correct order
#Input: 'dbac', returns: a, ac, b, ba, bac, c, d, db, dba, dbac
def sorted_substrings(string):
    a = set()
    length = len(string)
    #loop through the string to get the substrings
    for i in range(length):
        for j in range(i + 1, length + 1):
            #add each substring to our set
            a.add(string[i:j]) 
    #we need the set to be sorted
    a = sorted(a)
    return a

t = int(input())
for i in range(t):
    s = input()
    k = int(input())
    print(ashton(s, k))
Marcellmarcella answered 2/1, 2016 at 22:8 Comment(7)
Could you try this input: pastebin.com/raw/WEi2p09H? Does it go to MemoryError too?Jene
@Jene please try my code, it gets no memory error and returns the right resultsMarcellmarcella
Patience, you must have. come, they will, the vote when approaching, the bounty is.Jene
Also, a little bit of explanation on sortedSubstrings = sorted(set([string[x:y] for x in range(length) for y in range(length) if string[x:y]])) by unrolling the loops will get you votes easily =)Jene
@alvas, i rewrote that complicated line to be it's own function now which makes it a lot easier to read. The sorted_substrings function puts all the substrings in the lexicographical order. So the function for 'dbac' returns a set: a, ac, b... Once we have the sorted substrings, the while loop then checks against k, incrementing as we look at each substring. So in the simple test case where k=3, we first look at 'a' which increments count by 1. Then 'ac' which increments count to be 3. Now that count is equal to k, we break from the loop.Marcellmarcella
The bounty is given to you but if you want more votes to the question, I suggest that you explain the purpose and comment the functions =) That way you can get more votes to your answer. I'll leave the question unchecked so that more people will add new answers and also you might get more votes.Jene
Thanks! I added comments to explain what I was doing and let me know if something is unclear. If you are interested in further optimizations, the sorted_substrings and the while loop are what take the majority of time.Marcellmarcella
B
0

Try this code in c, it works for the large sample.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define N 100001
void merge(int*a,int*left,int*right,int left_size, int right_size);
void sort_a(int*a,int size);
void suffixSort(int n);
void LCP(int n);
char str[N]; //input
int rank[N], pos[N], lcp[N]; //output
int cnt[N], next[N]; //internal
int bh[N], b2h[N];

int main(){
  char cc;
  int T,n,i;
  long long K,c,t,x,tt;
  double xx;
  scanf("%d",&T);
  while(T--){
    scanf("%s%lld",str,&K);
    n=strlen(str);
    suffixSort(n);
    LCP(n);
    c=0;
    for(i=0;i<n;i++){
      tt=c;
      c+=(lcp[i]+1+n-pos[i])*(long long)(n-pos[i]-lcp[i])/2;
      if(K<=c){
        xx=(-1+sqrt(4*(2*(K-tt)+lcp[i]+lcp[i]*(long long)lcp[i])))/2;
        x=(long long)xx;
        t=K-tt-(lcp[i]+1+x)*(x-lcp[i])/2;
        if(!t)
          cc=str[pos[i]+x-1];
        else
          cc=str[pos[i]+t-1];
        break;
      }
    }
    printf("%c\n",cc);
  }
  return 0;
}
void merge(int*a,int*left,int*right,int left_size, int right_size){
    int i = 0, j = 0;
    while (i < left_size|| j < right_size) {
        if (i == left_size) {
            a[i+j] = right[j];
            j++;
        } else if (j == right_size) {
            a[i+j] = left[i];
            i++;
        } else if (str[left[i]] <= str[right[j]]) {
            a[i+j] = left[i];
            i++;                
        } else {
            a[i+j] = right[j];
            j++;
        }
    }
    return;
}
void sort_a(int*a,int size){
  if (size < 2)
    return;
  int m = (size+1)/2,i;
  int *left,*right;
  left=(int*)malloc(m*sizeof(int));
  right=(int*)malloc((size-m)*sizeof(int));
  for(i=0;i<m;i++)
    left[i]=a[i];
  for(i=0;i<size-m;i++)
    right[i]=a[i+m];
  sort_a(left,m);
  sort_a(right,size-m);
  merge(a,left,right,m,size-m);
  free(left);
  free(right);
  return;
}
 
void suffixSort(int n){
  //sort suffixes according to their first characters
  int h,i,j,k;
  for (i=0; i<n; ++i){
    pos[i] = i;
  }
  sort_a(pos, n);
  //{pos contains the list of suffixes sorted by their first character}
 
  for (i=0; i<n; ++i){
    bh[i] = i == 0 || str[pos[i]] != str[pos[i-1]];
    b2h[i] = 0;
  }
 
  for (h = 1; h < n; h <<= 1){
    //{bh[i] == false if the first h characters of pos[i-1] == the first h characters of pos[i]}
    int buckets = 0;
    for (i=0; i < n; i = j){
      j = i + 1;
      while (j < n && !bh[j]) j++;
      next[i] = j;
      buckets++;
    }
    if (buckets == n) break; // We are done! Lucky bastards!
    //{suffixes are separted in buckets containing strings starting with the same h characters}
 
    for (i = 0; i < n; i = next[i]){
      cnt[i] = 0;
      for (j = i; j < next[i]; ++j){
        rank[pos[j]] = i;
      }
    }
 
    cnt[rank[n - h]]++;
    b2h[rank[n - h]] = 1;
    for (i = 0; i < n; i = next[i]){
      for (j = i; j < next[i]; ++j){
        int s = pos[j] - h;
        if (s >= 0){
          int head = rank[s];
          rank[s] = head + cnt[head]++;
          b2h[rank[s]] = 1;
        }
      }
      for (j = i; j < next[i]; ++j){
        int s = pos[j] - h;
        if (s >= 0 && b2h[rank[s]]){
          for (k = rank[s]+1; !bh[k] && b2h[k]; k++) b2h[k] = 0;
        }
      }
    }
    for (i=0; i<n; ++i){
      pos[rank[i]] = i;
      bh[i] |= b2h[i];
    }
  }
  for (i=0; i<n; ++i){
    rank[pos[i]] = i;
  }
}
// End of suffix array algorithm

void LCP(int n){
  int l=0,i,j,k;
  for(i=0;i<n;i++){
    k=rank[i];
    if(!k)
      continue;
    j=pos[k-1];
    while(str[i+l]==str[j+l])
      l++;
    lcp[k]=l;
    if(l>0)
      l--;
  }
  lcp[0]=0;
  return;
}
Briant answered 3/7, 2023 at 12:51 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Adhesive

© 2022 - 2024 — McMap. All rights reserved.