Search for cyclic strings
Asked Answered
D

2

7

I am looking for the most efficient way to store binary strings in a data structure (insert function) and then when getting a string I want to check if some cyclic string of the given string is in my structure.

I thought about storing the input strings in a Trie but then when trying to determine whether some cyclic string of the string I got now was inserted to the Trie means to do |s| searches in the Trie for all the possible cyclic strings.

Is there any way to do that more efficiently while the place complexity will be like in a Trie?

Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011 are: 0111, 1110, 1101, 1011

Declaratory answered 20/1, 2012 at 14:49 Comment(4)
If this is for an alphabet of more than those two characters, I'd say create a hash function that generates the same result regardless of the order of the character values, so you can quickly eliminate most non-matches.Purificator
@C.Evenhuis: No, I'm working only with binary strings.Declaratory
Do you do all the insertions up-front? Or do you intermix insertions and lookups?Misdoing
@templatetypedef: intermix insertions and lookupsDeclaratory
T
6

Can you come up with a canonicalizing function for cyclic strings based on the following:

  1. Find the largest run of zeroes.
  2. Rotate the string so that that run of zeroes is at the front.
  3. For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.

This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.

0101010101 is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.

You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.

EDIT:

if I have a string of a length |s| it can take a lot of time to find the least lexicographically value..how much time will it actually take?

That's why I said 010101.... is a value for which it performs badly. Let's say the string is of length n and the longest run of 1's is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".

The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs is worst case O(n / m).

Then, the time to do step 3 should be

  1. Find other long runs: one O(n) pass with O(log n) storage average case, O(n) worst case
  2. For each run: O(log n) average case, O(n) worst case
  3.   Shift and compare lexicographically: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.

This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).

Tocantins answered 20/1, 2012 at 20:49 Comment(5)
I don't understand. Suppose 1100010 is to be stored and 1001 is to be searched. How does your algorithm proceed? Can it find the substring 1100?Bb
No, it won't solve substring of a cyclic string, but I don't see anything in the OP about substrings.Tocantins
hmm, my interpretation of "check if some cyclic .. is in my structure" is different. Maybe user550413 will clarify.Bb
@Ali: No, I don't care about sub strings but I am not sure I got the idea. If I have a string like 11001001 so after doing steps 1-3 the least lexicographically value will be either 00100111 or 00111001 so which of them I will insert? If I take the first one(and there could be many other options in other stirngs) and insert it to a Trie and then I get to check a string like 01001110 by steps 1-3 I'll translate it to 00111001 (the second option) and won't find it..Declaratory
@MikeSamuel: Okay, I've missed step 3 in the comment above but if I have a string of a length |s| it can take a lot of time to find the least lexicographically value..how much time will it actually take?Declaratory
B
0

You have n strings s1..sn and given a string t you want to know whether a cyclic permutation of t is a substring of any s1..sn. And you want to store the strings as efficiently as possible. Did I understand your question correctly?

If so, here is a solution, but with a large run-time: for a given input t, let t' = concat(t,t), check t' with every s in s1..sn to see if the longest subsequence of t' and sm is at least |t| If |si| = k, |t| = l it runs in O(n.k.l) time. And you can store s1..sn in any data structure you want. Is that good enough or you?

Bb answered 20/1, 2012 at 21:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.