Approaches for reversing sprintf/format
Asked Answered
F

1

6

I have to heuristically determine the format pattern strings by analyzing the formatted results.

For example I have these strings:

You have 3 unread messages.

You have 10 unread messages.

I'm sorry, Dave. I'm afraid I can't do that.

I'm sorry, Frank. I'm afraid I can't do that.

This statement is false.

I want to derive these format strings:

You have %s unread messages

I'm sorry, %s. I'm afraid I can't do that.

This statement is false.

Which approaches and/or algorithms could help me here?

My first thought was using machine learning stuff, but my guts tell me this could be a rather classic problem.

Some additional requirements:

  • The type of the parameter is irrelevant, i.e. I don't need the information if the parameter originally was %s or %d or if it was padded or aligned.
  • There can be more than one parameter (or none at all)
  • Typically the data consists of thousands of formatted strings, but only tens of format patterns.
Fredrickafredrickson answered 11/2, 2011 at 8:19 Comment(4)
+1, awesome problem. BTW, do you mean "You have %d unread messages"?Soloist
@templatetypedef: Yes and no. It would be great, if I could determine the original type, but it's not necessary. But if the general position of the parameter can be found, it shouldn't be that hard to determine the actual type. (Although that won't be always correct, for example Javas String.format allows int to be filled in %s)Fredrickafredrickson
@DR- Ah, so you're trying to find where the changes are. I thought you were given where they were and needed to figure out what awful combination of %s, %d, %-s, %8d, etc. were required. Still interesting, though.Soloist
Do you know how many format strings there are? Can they have multiple parameters? (Interesting problem indeed, +1.)Cuneal
C
1
  1. Cluster the strings by some metric of similarity (I'd try length of longest common subsequence, LCS). Determining the number of clusters is the hard part, if you don't know it beforehand.

  2. Within each cluster, determine the LCS of all strings in it, recording the position of the gaps that occur. Replace the gaps with %s. (You may want to build a function that returns an LCS-based format string and fold/reduce that over the cluster.)

The above is a greedy algorithm that, given {foobar, fooBaR} produces foo%sa%s. You may want to replace any pair of occurrences of %s separated by a single character (or a single non-whitespace char, etc) by a single %s, recursively.

Cuneal answered 11/2, 2011 at 9:4 Comment(2)
I've been experimenting a little bit, and it seems that finding the number of clusters is not only the hardest part, but the essential, because the number of clusters is the number of format strings (best case)Fredrickafredrickson
OK, I'm trying to approach this similar to the k-center problem using the Hochbaum Shmoys heuristic. I'll stop chosing a new center as soon as the length of the LCS reaches a certain threshold.Fredrickafredrickson

© 2022 - 2024 — McMap. All rights reserved.