I have a huge set of arbitrary natural-language strings. For my tool to analyze them, I need to convert each string to a unique color value (RGB or other). The color contrast should depend on string similarity: the more two strings differ, the more their respective colors should differ. Ideally, the same string would always map to the same color value.
Any advice on how to approach this problem?
Update on distance between strings
I probably need "similarity" defined as a Levenshtein-like distance. No natural language parsing is required.
That is:
"I am going to the store" and
"We are going to the store"
Similar.
"I am going to the store" and
"I am going to the store today"
Similar as well (but slightly less).
"I am going to the store" and
"J bn hpjoh up uif tupsf"
Not similar at all.
(Thanks, Welbog!)
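To make the examples above concrete, here is a minimal Python sketch of plain Levenshtein (edit) distance applied to them. The function name `levenshtein` is my own, and plain edit distance is only an assumption about what "Levenshtein-like" should mean here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


base = "I am going to the store"
others = [
    "We are going to the store",
    "I am going to the store today",
    "J bn hpjoh up uif tupsf",
]
for other in others:
    print(levenshtein(base, other), repr(other))
```

With this definition the first pair is 4 edits apart and the second is 6, which matches the "similar" and "slightly less similar" ordering; the third pair comes out much larger.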
I'll probably know exactly which distance function I need only once I see the program's output. So let's start with simpler things.
Update on task simplification
I've removed my own suggestion to split the task into two parts (absolute distance calculation and color distribution). This would not work well, because we would first reduce the information to a single dimension and then try to expand it back into three dimensions.
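A small Python sketch of why a single number loses too much. It assumes the "absolute distance" would be the edit distance from each string to one fixed reference string, which is only my reading of the removed suggestion; the strings and the name `levenshtein` are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


reference = "abc"      # hypothetical fixed reference string
x, y = "abd", "xbc"    # two different strings

d_x = levenshtein(reference, x)  # scalar assigned to x
d_y = levenshtein(reference, y)  # scalar assigned to y
d_xy = levenshtein(x, y)         # how different x and y really are

print(d_x, d_y, d_xy)
```

Here `x` and `y` get the same scalar (both are 1 edit from the reference) even though they are 2 edits apart from each other, so any color derived from that scalar alone would be identical for two distinct strings.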