Can you come up with a canonicalizing function for cyclic strings based on the following:
- Find the largest run of zeroes.
- Rotate the string so that that run of zeroes is at the front.
- For each run of zeroes of equal size, see if rotating that to the front produces a lexicographically lesser string and if so use that.
This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.
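The steps above could be sketched roughly as follows. This is only an illustration, assuming the input is a Python `str` of `'0'` and `'1'` characters; the name `canonicalize` is made up for this example, and the final `min` copies each candidate rotation for simplicity rather than comparing in place.

```python
def canonicalize(s: str) -> str:
    """Return the lexicographically least rotation of a binary string."""
    n = len(s)
    if "0" not in s or "1" not in s:
        return s  # empty, all zeroes, or all ones: every rotation is identical

    # Start scanning just after a '1' so that no run of zeroes wraps around.
    start = (s.index("1") + 1) % n

    runs = []  # (offset into s, length) for each run of zeroes
    i = 0
    while i < n:
        if s[(start + i) % n] == "0":
            j = i
            while j < n and s[(start + j) % n] == "0":
                j += 1
            runs.append(((start + i) % n, j - i))
            i = j
        else:
            i += 1

    # Only rotations that begin at a longest run of zeroes can be minimal.
    longest = max(length for _, length in runs)
    candidates = [off for off, length in runs if length == longest]
    return min(s[off:] + s[:off] for off in candidates)


print([canonicalize(x) for x in ("1011", "1101", "1110", "0111")])
```

All four members of the equivalence class map to the same string, so the canonical form can be used directly as a key.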
0101010101 is a thorny instance for which this algorithm will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.
You can then hash on the canonical form, or store the canonical forms in a trie. Apart from the empty string and strings made up entirely of 1's, every canonical form starts with 0, and a single trie lookup will answer your question.
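As an illustration of hashing on the canonical form, here is a minimal sketch that groups strings by cyclic equivalence class using a dictionary. The names `least_rotation` and `group_by_cycle` are hypothetical, and `least_rotation` is a deliberately naive brute-force stand-in for the canonicalizing function, used only to keep the example self-contained.

```python
from collections import defaultdict


def least_rotation(s: str) -> str:
    """Brute-force stand-in: try every rotation and keep the smallest."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


def group_by_cycle(strings):
    """Bucket strings so that rotations of one another share a key."""
    groups = defaultdict(list)
    for s in strings:
        groups[least_rotation(s)].append(s)
    return dict(groups)


print(group_by_cycle(["1011", "1101", "0011", "0110"]))
```

Two strings are rotations of each other exactly when they land in the same bucket.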
EDIT:
If I have a string of length |s|, it can take a lot of time to find the lexicographically least value. How much time will it actually take?
That's why I said 010101... is a value for which it performs badly. Let's say the string is of length n and the longest run of 0's is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".
The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs of length r is worst case O(n / r).
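One way to realize the offset idea is to compare two rotations in place with modular indexing, so that "shifting" costs nothing and the comparison stops at the first differing character. The function name `rotation_less` is an assumption for this sketch.

```python
def rotation_less(s: str, a: int, b: int) -> bool:
    """True if the rotation of s starting at offset a sorts strictly
    before the rotation starting at offset b. No copy of s is made."""
    n = len(s)
    for k in range(n):
        ca = s[(a + k) % n]
        cb = s[(b + k) % n]
        if ca != cb:
            return ca < cb  # early exit at the first mismatch
    return False  # the two rotations are identical


print(rotation_less("1011", 1, 0))
```

For random input the loop usually exits after a few characters. For periodic input such as 0101..., the candidate rotations are identical and every comparison runs the full length of the string.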
Then, the time to do step 3 should be:
- Find the other runs of the same length: one O(n) pass with O(log n) storage average case, O(n) worst case
- Loop over those candidate runs: O(log n) iterations average case, O(n) worst case
- Shift and compare lexicographically, once per candidate: O(log n) average case since most comparisons of randomly chosen strings fail early, O(n) worst case.
This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).