Mapping arbitrary strings to RGB values
Asked Answered
M

8

9

I have a huge set of arbitrary natural language strings. For my tool to analyze them I need to convert each string to unique color value (RGB or other). I need color contrast to depend on string similarity (the more string is different from other, the more their respective colors should be different). Would be perfect if I would always get same color value for the same string.

Any advice on how to approach this problem?

Update on distance between strings

I probably need "similarity" defined as a Levenstein-like distance. No natural language parsing is required.

That is:

"I am going to the store" and 
"We are going to the store"

Similar.

"I am going to the store" and 
"I am going to the store today"

Similar as well (but slightly less).

"I am going to the store" and 
"J bn hpjoh up uif tupsf"

Quite not similar.

(Thanks, Welbog!)

I probably would know exactly what distance function I need only when I'll see program output. So lets start from simpler things.

Update on task simplification

I've removed my own suggestion to split task into two — absolute distance calculation and color distribution. This would not work well as at first we're reducing dimensional information to a single dimension, and then trying to synthesize it up to three dimensions.

Manado answered 30/1, 2009 at 14:27 Comment(1)
Hue isn't really three dimensional. You're probably going to want to vary hue more than you vary the other values, as when you get low values or high saturations, you can't really tell the colours apart. My main suggestion is to work in HSV space rather than RGB, working mostly with hue.Conception
C
4

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings

 "I am going to the store" and 
"We are going to the store" 

considered similar? What about the strings

 "I am going to the store" and 
"J bn hpjoh up uif tupsf" 

(all of the letters in the original +1), or

 "I am going to the store" and 
"I am going to the store today"

? Based on what you mean by "similar", you might consider different functions.

If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighing characters by their position in the string.

If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonents, or words to syllables.

There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

Conception answered 30/1, 2009 at 14:42 Comment(0)
M
2

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:

char1 + char2 + char3 + ... + charN % MAX_COLOUR_VALUE

would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.

Monzonite answered 30/1, 2009 at 14:47 Comment(2)
Nope - MD5 intentionally breaks locality. If S1 and s2 are similar, color(s1) and color(s2) should be similar to. MD5(s1) and MD5(s2) are not similar.Medicare
Fair enough... but I didn't suggest MD5, this just makes it clearer as to why :)Monzonite
C
1

You can use something like MinHash or some other LSH method and define similarity as intersection between sets of shingles measured by Jaccard coefficient. There is a good description in Mining of Massive data sets, Ch.3 by Rajaraman and Ullman.

Cacie answered 30/1, 2009 at 14:27 Comment(0)
O
1

Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):

You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:

Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}

You can do this:

  1. First assign each string a random color (r, g, b)
  2. Repeat until you see fit (eg. error is adjusted less than \epsilon = 0.0001):
    1. Pick a random node
    2. Adjust it's position (r, g, b) such that the error is minimized
  3. Scale the coordinate system such that each nodes coordinates are in the range [0., 1.) or [0, 256]
Orangy answered 30/1, 2009 at 14:27 Comment(0)
I
1

How important is it that you never end up with two dissimilar strings having the same colour?

If it's not that important then maybe this could work?

You could pick a 1 dimensional color space that is "homotopic" to the circle: Say the color function c(x) is defined for x between 0 and 1. Then you'd want c(0) == c(1).

Now you take the sum of all character values modulo some scaling factor and wrap this back to the color space:

c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )

This might work even better if you defined a "wrapping" color space of higher dimensions and for each dimension pick different SumOfCharValues function; someone suggested alternating sum and length.

Just a thought... HTH

Introrse answered 30/1, 2009 at 14:27 Comment(1)
Oh I see other people suggested this as well, but in other words.. Srry!Introrse
E
1

First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".

Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.

Evvoia answered 30/1, 2009 at 14:47 Comment(0)
I
0

Sorry, but you can't do what you're looking for with levenshtein distance or similar. RGB and HSV are 3-dimensional geometric spaces, but levenshtein distance describes a metric space - a much looser set of contstraints with no fixed number of dimensions. There's no way to map a metric space into a fixed number of dimensions while always preserving locality.

As far as approximations go, though, for single terms you could use a modification of an algorithm like soundex or metaphone to pick a color; for multiple terms, you could, for example, apply soundex or metaphone to each word individually, then sum them up (with overflow).

Insomnia answered 30/1, 2009 at 14:27 Comment(0)
W
0

I would maybe define some delta between two strings. I don't know what you define as the difference (or "unequality") of two strings, but the most obvious thing I could think about would be string length and the number of occurences of particular letters (and their index in the string). It should not be tricky to implement it such that it returns the same color code in equal strings (if you do an equal first, and return before further comparison).

When it comes to the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use the RGB. I don't know if every string would fit into them (as that may be language specific?).

Woodberry answered 30/1, 2009 at 14:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.