evolutionary-algorithmdata-representationinternal-representation

Trying to find a good data representation for a genetic (evolutionary) algorithm, but I just can't imagine one


Disclaimer

First of all, this is for homework, so don't ask why it's so contrived, it is this way and it can only be this way. (I get a lot of "how about if you change something"), sorry ... I can't.

Also I must use evolutionary algorithms, that means parents have children, they can mutate / recombine, form new generations and eventually lead to a solution.

/Disclaimer

I have n*2 words of length n. I have to make a n^2 matrix containing all these words. The words can be gibberish, but they have to be able to fit in this matrix (it's a requirement on the user's part).

Thus AGE,AGO,BEG,CAB,CAD,DOG will give me this result (1 out of at least 2 possible):

C A B
A G E
D O G

I have to use an evolutionary algorithm. Thus I need to find a way to code my information into a chromosome.

What I did come up with:

Each word has to appear, has a start position in the matrix, and an orientation (left-right or up-down). Thus I have [Word][Orientation][StartPosition] where start position is [0][0] / [0][1] / [1][0] etc (left column and top line). But it has restrictions, I need to validate that the orientation fits the start position.

Problems:

A chromosome has to be a possible solution whereas this is only part of a solution.

Since my solution has to be a matrix containing all the words in such a way as to "fit" the chromosome also has to somehow represent the entire matrix. But that hits several problems. I can only have one word from one starting position in one orientation (except the first two words, they share the same start position with different orientations). I can't see this working as a valid way of attempting an evolutionary algorithm. I just can't see any of the phases working, especially mutation / recombination.

Am I thinking of it entirely wrong ? If so ... why ? and how could I attempt to code my data in such a way as to let me go trough all the phases (reproduction, mutation / recombination, natural selection ... be able to calculate a fitness and start a new generation) without having tons of garbage data (having a word appear twice, lose a word, have a word on a wrong orientation compared to it's start position) ?

EDIT

I shall be using this representation to implement many other nature-inspired algorithms, thus I need a "good" data representation. Nothing makeshift that could hurt me later.

I honestly can't think of a good way though. Because I have a lot of limitation (maybe I have been thinking on this too long and I can't get past them, and they maybe aren't really there). I would really like a binary representation, but that seems impossible.


Solution

  • You seem to be having a problem with Winston Ewert's answer. I'm making this a separate answer for reasons of length and formatting, but what Winston is suggesting is a reasonable approach.

    You ask about the "array of characters" and what your parents and offspring are, how you'd do crossover, mutation, etc. I think maybe you're confused by that array of characters bit. Don't think of it as an array of characters -- it's an array of words. This is important, because you don't want to do crossover between word boundaries or change letters inside a word via mutation. You want to shuffle the words around. So instead of dealing with the words, let's take a slightly different (but equivalent) approach.

    Let's number each of your 2n words from 1 to 2n. To generate a candidate solution, we just pick n random numbers between 1 and 2n (without replacement). The randomly selected words become the rows to your matrix. So if the words are AGE,AGO,BEG,CAB,CAD,DOG, we pick three at random and end up with, for instance, 2, 3, and 5, (giving a chromosome of 235) or AGO,BEG,CAD.

    This yields a matrix of

    AGO
    BEG
    CAD
    

    Now we count how many of the 2n total input words appear in the matrix. In this case, it's only three, as none of the columns make valid words. Your fitness for the input 235 is 3. The input 416 works though -- its fitness is 6.

    To generate a population, you just generate N random sets of three numbers (without repetition) between 1 and 2n. With a population size of 4, you might get

    142
    354
    624
    241
    

    These give you four different potential solutions:

    142 = AGE
          CAB
          AGO
    
    354 = BEG
          CAD
          CAB
    
    624 = DOG
          AGO
          CAB
    
    241 = AGO
          CAB
          AGE
    

    To do crossover, you'd need to design a method that avoided duplicated numbers in the offspring. I've in the past adapted the Cycle Crossover method designed for permutations to handle situations similar to this, but you can design any method that seems reasonable. For instance, you could do a simple uniform crossover and then just fix any duplicate values by changing one of them to something else.

    Mutation is simply changing one number into another or swapping the positions of two numbers in the encoding.

    You say you must use a GA, so you could stop here and try something like that. It's not likely to work extremely well, for a couple of reasons. First, there are going to be a lot of plateaus in the fitness function. Almost every possible answer is equally wrong, so there's not much of a gradient for the GA to follow. It's a needle in a haystack search problem. You could try to modify the way you score the fitness values to alleviate this somewhat. For instance, I know that only one word starts with D in your example set, so any solution with DOG in the first row is automatically wrong. I could award points to solutions that gave more "plausible" wrong answers. However, in general, this is going to be a tough problem for a GA to solve efficiently.

    Second, constraint satisfaction problems like this one can be much more efficiently solved with specialized techniques, no matter how well you build your GA. You could easily solve this with backtracking.