I have to heuristically determine the format pattern strings by analyzing the formatted results.
For example I have these strings:
You have 3 unread messages.
You have 10 unread messages.
I'm sorry, Dave. I'm afraid I can't do that.
I'm sorry, Frank. I'm afraid I can't do that.
This statement is false.
I want to derive these format strings:
You have %s unread messages.
I'm sorry, %s. I'm afraid I can't do that.
This statement is false.
Which approaches and/or algorithms could help me here?
My first thought was to use machine learning, but my gut tells me this could be a rather classic problem.
Some additional requirements:
- The type of the parameter is irrelevant, i.e. I don't need the information if the parameter originally was %s or %d or if it was padded or aligned.
- There can be more than one parameter (or none at all)
- Typically the data consists of thousands of formatted strings, but only tens of format patterns.
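To make the goal concrete, here is a minimal sketch of a naive baseline, not a definitive solution. It assumes that every parameter value is a single word or number (no whitespace or punctuation inside it), so that strings from the same pattern always have the same number of tokens. The names infer_templates, merge, tokenize and the min_similarity threshold of 0.8 are made up for this illustration.

```python
import re

PLACEHOLDER = "%s"
# Tokens are runs of word characters, runs of whitespace, or single
# punctuation characters, so "".join(tokens) reproduces the input exactly.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def merge(template, tokens, min_similarity):
    """Merge tokens into template, or return None if they are too different."""
    if len(template) != len(tokens):
        return None
    # A position matches if the tokens are equal or the template already
    # has a placeholder there.
    same = sum(1 for a, b in zip(template, tokens) if a == b or a == PLACEHOLDER)
    if same / len(tokens) < min_similarity:
        return None
    # Keep identical tokens, turn every differing position into a placeholder.
    return [a if a == b else PLACEHOLDER for a, b in zip(template, tokens)]


def infer_templates(lines, min_similarity=0.8):
    """Greedily assign each line to the first template it is similar to."""
    templates = []
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            continue
        for i, template in enumerate(templates):
            merged = merge(template, tokens, min_similarity)
            if merged is not None:
                templates[i] = merged
                break
        else:
            # No existing template is close enough: start a new one.
            templates.append(tokens)
    return ["".join(t) for t in templates]


samples = [
    "You have 3 unread messages.",
    "You have 10 unread messages.",
    "I'm sorry, Dave. I'm afraid I can't do that.",
    "I'm sorry, Frank. I'm afraid I can't do that.",
    "This statement is false.",
]
for pattern in infer_templates(samples):
    print(pattern)
```

On the five sample strings this prints the three desired patterns. It breaks as soon as a parameter spans several tokens (e.g. a full name or a decimal number), and the threshold is arbitrary, which is why I am looking for a more principled approach.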
String.format allows int to be filled in %s) – Fredrickafredrickson