How can I train a Genetic Programming algorithm on a variable-length sequence of descriptors?
I am currently trying to design a Genetic Programming algorithm that analyses a sequence of characters and assigns a value to those characters. Below I have made up an example set. Every line represents a data point; the target values are real-valued. Example: for the word ABCDE the algorithm should return 1.0.

Example dataset:

ABCDE : 1

ABCDEF : 10

ABCDEGH : 3

ABCDELKA : 50

AASD : 3

The dataset could be as large as needed, since this is all just made up. Let's assume the rule that the GP should figure out is not too complicated and that it is explained by the data.

What I would like the algorithm to do is to approximate the values from my dataset when given the input sequence. My problem is that each sequence can consist of a different number of characters. I would prefer not to have to write some fancy descriptors myself, if possible.

How can I train my GP (preferably using tinyGP or Python) to build this model?

Since there was so much discussion here, a diagram says a thousand words: [schematic: data point → unknown function → result]. What I want to do is take a data point and put it into a function; the value I get back is my result. Unfortunately I do not know this function, I just have a dataset with some examples (say 1000 examples, just as an example). Now I use the Genetic Programming algorithm to find an algorithm that is able to convert my data point into a result. This is my model. The problem I have in this case is that the data points are of differing lengths. For a fixed length I could just specify each character in the string as an input parameter, but it beats me what to do with a varying number of input parameters.

Disclaimer: I have run into this problem multiple times during my studies, but we could never come up with a solution that worked well (such as using a window, descriptors, etc.). I would like to use GP because I like the technique and would like to try it out, but at university we also tried this with ANNs and the like, to no avail. The problem of the variable input size remains.

Naidanaiditch answered 15/5, 2013 at 13:51 Comment(13)
Are you sure you don't want classification?Freshwater
It is not clear from your question either a) what the input is (a single sequence of characters that can be broken up into "words"? a list of "words"?) or b) what the goal is (assigning values to individual characters? to "words"?), and thus how one would go about computing one from the other is all but impossible to discern.Citarella
Genetic algorithms are optimisation algorithms. They're mostly useful for finding a near-optimal approximate solution to a combinatorial optimisation problem where simply brute-forcing every combination isn't feasible. It's not clear what you define a solution as; is it a string? If it's a string, then how are the scores calculated on the string? What would be an optimal score? ANNs are designed more for classification, whilst GAs are designed more for optimisation. They're designed to solve different classes of problem, so you can't interchange them because you simply "like" them.Milson
String is input, real is output. The GP algorithm should create a model that converts one into the other. An ANN would likewise form a model. ANN and GP share that they both build models from an input and apply these models to the input data. Theoretically a GP algorithm can do classification just fine.Naidanaiditch
If you can't describe a fitness function for a solution you can't use a GA.Milson
I think what makes a difference here is that I am not talking about a GA, I am talking about GP, which is a totally different principle.Naidanaiditch
A GP is just an implementation of a GA for a specific problem so they're not different principles at all. Even with a GP you still need to be able to define a fitness function, i.e. what determines the fitness of a solution.Milson
You should clarify whether you actually want to design a genetic algorithm (GA) or a genetic programming algorithm (GP), as your question implies that both concepts are interchangeable, while they are not (paragraphs 2 and 4 mention GA).Aquiculture
Sorry, that was a typo. I want a GP that is able to cope with the variable length of the input. The GP should have the string as input and the value as output. The table represents a training set that the GP should use to infer a rule for how input and output are connected.Naidanaiditch
So, not very familiar with GP, GA, or anything, but I have an idea for a fitness function which may get you off the ground. Take your algorithm and compare the sample input and outputs with the outputs of the algorithm. The fitness value is the absolute value of the sum of the differences between the true real output and the algorithm's output. Lower is better.Leicester
You are right about having to have a fitness function for the solution. I would do that by letting the GP program compute over the set of data and then comparing the results given by the GP program with the solutions in the database. But I am looking for strategies to cope with the variable-sized input.Naidanaiditch
I was trying to implement a solution to this and the only thing that I was hung up on was generating coherent random functions to use on the data. Dealing with variable-sized input doesn't seem to be a problem... A simple for char in input should do the trick.Leicester
I don't understand the "generating coherent random functions". Isn't the GP fitting a deterministic function?Naidanaiditch

Traditional genetic programming is not suited for variable-length input.

It occurs to me that some model of evaluation is presupposed in the question.

Consider, for example, that you encode your variable-length input as a single arbitrary-precision value; for example, for an alphabet of 10 symbols:

ABCD = 1234; ABCDEF = 123456

or

ABCD = 0.1234; ABCDEF = 0.123456
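
For concreteness, here is a minimal Python sketch of both encodings, assuming each symbol maps to a single decimal digit (A=1, B=2, ... I=9); the function names are my own:

ALPHABET = "ABCDEFGHI"

def encode_int(word):
    """'ABCD' -> 1234, the integer encoding above."""
    value = 0
    for ch in word:
        # shift previous digits left and append this symbol's digit
        value = value * 10 + ALPHABET.index(ch) + 1
    return value

def encode_frac(word):
    """'ABCD' -> 0.1234, the fractional encoding above."""
    return encode_int(word) / 10 ** len(word)

print(encode_int("ABCDEF"))   # 123456
print(encode_frac("ABCDEF"))  # 0.123456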

However, if this encoding is not natural for the problem domain, it will be quite hard to evolve a program that deals with such input well.

You could also suppose that the problem can be adequately represented by a genetically derived finite state machine:

F(F(F(F(init(), A), B), C), D) = 1234

That's a separate field of study from genetic programming; search around, read research papers, and perhaps you can find a package that does what you want for you.
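
As a rough illustration, the fold above could be written in Python, with F standing in for the evolved transition function (the base-10 digit rule here is only a placeholder, not an evolved result):

from functools import reduce

def init():
    return 0  # initial state; in a real GP run this would be evolved too

def F(state, symbol):
    # placeholder transition: treat symbols as base-10 digits (A=1, B=2, ...)
    return state * 10 + "ABCDEFGHI".index(symbol) + 1

print(reduce(F, "ABCD", init()))  # F(F(F(F(init(), A), B), C), D) = 1234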

Then again, your problem may be best represented by yet another transformation, e.g. frequency of bigrams -- such a transform has fixed length:

# bigrams
# ABCDE => 1
"AA": 0
"AB": 0.25
"AC": 0
"AD": 0
"AE": 0
"BA": 0
"BC": 0.25
"BD": 0
#... up to end of alphabet ...

(0, 0.25, 0, 0, 0, 0, 0.25, 0, ...., 0, ...) => 1      # ABCDE
(0, 0.20, 0, 0, 0, 0, 0.20, 0, ...., 0.20, ...) => 10  # ABCDEF
# input length N^2

# trigrams
(0, 0.33, 0, 0, ..., 0, ...) => 1      # ABCDE
(0, 0.25, 0, 0, ..., 0.25, ...) => 10  # ABCDEF
# input length N^3
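
A minimal Python sketch of this transform, assuming an uppercase alphabet; it turns any string into the fixed-length n-gram frequency vector shown above:

from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def ngram_frequencies(word, n=2):
    # relative frequency of every possible n-gram, in a fixed order,
    # so the output always has length len(ALPHABET) ** n
    total = max(len(word) - n + 1, 1)
    counts = {}
    for i in range(len(word) - n + 1):
        counts[word[i:i + n]] = counts.get(word[i:i + n], 0) + 1
    return [counts.get("".join(g), 0) / total
            for g in product(ALPHABET, repeat=n)]

features = ngram_frequencies("ABCDE")  # length 26**2 = 676
print(features[0 * 26 + 1])            # the "AB" slot -> 0.25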

Bigrams, trigrams, etc. are surprisingly good predictors:

  • capture Markov information ("ab" vs "ac")
  • capture relative position ("ab" && "bc" vs "ed" && "bc")
  • capture non-linear semantics ("abab" != "ab" * 2)
  • resistant to shuffled input ("buy new spam" vs "buy spam it's new")

These are often used in natural language problems, such as text topic detection, author detection, and spam protection, and in biotech, such as DNA and RNA sequences, etc.

However, there is no guarantee this approach is applicable to your problem. It truly depends on your problem domain; for example, consider an alphabet of the 10 digits plus '+' in an arithmetic domain, where the following two inputs become indistinguishable under bigram frequencies, yet yield different results:

10000+10000 = 20000
1000+100000 = 101000

In this case you need something like a register machine:

init: tmp = 0; res = 0
"0": tmp *= 10
"1": tmp *= 10; tmp += 1
"+": res += tmp; tmp = 0
end: res += tmp
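
A minimal Python sketch of this register machine, generalizing the "0"/"1" rules above to all digits; in a GP setting the per-symbol rules would be the evolved part:

def run(expr):
    tmp, res = 0, 0                   # init: tmp = 0; res = 0
    for ch in expr:
        if ch == "+":
            res, tmp = res + tmp, 0   # "+": res += tmp; tmp = 0
        else:
            tmp = tmp * 10 + int(ch)  # digit d: tmp *= 10; tmp += d
    return res + tmp                  # end: res += tmp

print(run("10000+10000"))  # 20000
print(run("1000+100000"))  # 101000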
Glazer answered 19/5, 2013 at 13:29 Comment(3)
Can you cite some research papers that cope with variable-sized input / variable numbers of descriptors?Naidanaiditch
for example, cs.ubbcluj.ro/~moltean/lsgp.pdf liquid state machines (neural nets with decay) coupled with genetic programming. en.wikipedia.org/wiki/Recurrent_neural_network is also interesting, while not genetic programming per se. cs.utsa.edu/~bylander/pubs/AppliedSoftComputing.pdf decomposes time series into high-order components.Glazer
The old link for the Liquid State Genetic Programming is not working anymore. You may read it now from: researchgate.net/publication/…Holeandcorner

Since you do not have a fitness function, you will need to treat the genetic algorithm as if it were a classifier. So you will need to come up with a way to evaluate a single chromosome. As others have suggested, this is a pure classification problem, not an optimization one, but if you still want to go ahead with a GA, here are some steps for an initial approach:

You will need:

  1. Description of (how to encode) a valid chromosome

To work with genetic algorithms, all the solutions must have the same length (there are more advanced approaches with variable-length encoding, but I won't go into them here). So, given that, you will need to find a good encoding method. Knowing that your input is a variable-length string, you can encode your chromosome as a lookup table (a dictionary in Python) for your alphabet. However, a dictionary will give you some problems when you try to apply crossover or mutation operations, so it is better to keep the alphabet and the chromosome encoding separate. Borrowing from language models, you can look at n-grams, and your chromosome will have the same length as your alphabet:

Unigrams:

alphabet = "ABCDE"
chromosome1 = [1, 2, 3, 4, 5]
chromosome2 = [1, 1, 2, 1, 0]

Bigrams:

alphabet = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE"]
chromosome = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Trigrams:

alphabet = ["ABC", "ABD", "ABE"...]
chromosome = as above, a value for each combination

2. Decode a chromosome to evaluate a single input

Your chromosome will represent an integer value for each element in your alphabet. So if you want to know the value of one of your inputs (a variable-length string) given a chromosome, you will need to try some evaluation functions; the simplest one is the sum of each letter's value.

alphabet = "ABC"
chromosome = [1, 2, 1]
input = "ABBBC"

# acc = accumulated value
value = reduce(lambda acc, x: acc + chromosme[alphabet.index(x)], input, 0)
# Will return ABBBC = 1+2+2+2+1 = 8

3. Fitness function

Your fitness function is just a simple error function. You can use a simple sum of errors, squared error... A simple evaluation function for a single chromosome:

from functools import reduce

def fitnessFunction(inputs, results, alphabet, chromosome):
    error = 0

    for i in range(len(inputs)):
        # value of this input string under the chromosome (sum of letter values)
        value = reduce(lambda acc, x: acc + chromosome[alphabet.index(x)], inputs[i], 0)
        diff = abs(results[i] - value)
        error += diff  # or diff**2 if you want squared error

    return error

# A simple call -> INPUTS, EXPECTED RESULTS, ALPHABET, CURRENT CHROMOSOME
fitnessFunction(["ABC", "ABB", "ABBC"], [1, 2, 3], "ABC", [1, 1, 0])
# returned error will be:
# A+B+C   = 1 + 1 + 0     -- expected value = 1 --> error += 1
# A+B+B   = 1 + 1 + 1     -- expected value = 2 --> error += 1
# A+B+B+C = 1 + 1 + 1 + 0 -- expected value = 3 --> error += 0
# This chromosome has an error of 2

Now, using any crossover and mutation operators you want (e.g. one-point crossover and bit-flip mutation), find the chromosome that minimizes that error; a sketch of such operators follows.
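
For illustration, a minimal sketch of one-point crossover and a random-reset mutation (a bit flip generalized to integer genes) for the fixed-length chromosomes above; function names and parameters are my own:

import random

def one_point_crossover(parent_a, parent_b):
    # cut both parents at the same random point and swap the tails
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, rate=0.1, low=0, high=10):
    # with probability `rate`, replace a gene with a random value
    return [random.randint(low, high) if random.random() < rate else gene
            for gene in chromosome]

child_a, child_b = one_point_crossover([1, 1, 0], [2, 0, 3])
print(child_a, child_b)   # e.g. [1, 0, 3] [2, 1, 0]
print(mutate([1, 1, 0]))  # e.g. [1, 4, 0]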

Things you can try to improve the algorithm model:

  • Using bigrams or trigrams
  • Change the evaluation method (currently it is a sum of lookup-table values; it could be a product or something more complex)
  • Try using real values in chromosomes, instead of just integers
Suitable answered 20/5, 2013 at 19:27 Comment(6)
Good, solid answer @iluengo. Nice. +1Harlin
I like the amount of work you put into your answer, but the advanced approach with variable-length encoding, which you would not go into, is kinda the point of this question. If the input were fixed-length, I could just use a simple GP.Naidanaiditch
I'll split the answer into 2 comments, it doesn't fit in 1. Well, I didn't go into it because it is not the core of the problem. I mean, a variable-length chromosome is just another way to encode the same problem. E.g.: [3, 2] means [1,1,1,0,0] and [1,2,1] means [1,0,0,1] (run lengths of 1s and 0s). What I'm trying to say is that a variable-length chromosome just adds another encode-decode step to your GA; it doesn't solve your main problem, which is that you don't know what the fitness function is and you don't know what a chromosome means.Suitable
Like I said at the beginning, this is a classification problem. A classifier gives you a "black box function" to predict a value for a given entry... but in a GA there is no black box; you need to know exactly the function you are trying to optimize. That's why I gave you an initial approach to try to guess the fitness function. But without a way to encode and evaluate a chromosome, it is hard to work with a GA. Here you have an example of a variable-length chromosome: #10707086 - it is just another way to encode a chromosome.Suitable
You do understand that we are not talking about a GA, which would be a genetic algorithm, but about GP, which is genetic programming? Those are quite different. The fitness function has to be evolved from the data; that's why it cannot be part of the input. The fitness function would then give f('aaa')=1. For values that are not described by the data, this has to be extrapolated by the algorithm - that is the result of the GP.Naidanaiditch
Ok, all clear now. It seems I just mixed up the concepts and thought GA and GP were the same. That's why I tried to give you a classic GA approach. Sorry if my answer confused you, and sorry for the time I made you lose. I've learned something new at least =PSuitable
