Tuesday, March 14, 2006

variants on current experimental set-ups

Evolving to remember and distinguish a feature within a continuum is truly hard for abstract dynamical systems, even if irreversible, at least in comparison with its discrete counterpart. Below I detail several variants that I am experimenting with in order to get something that works.

First of all, by abstract I mean disembodied and not situated in an environment, so that the system itself cannot vary the way the feature is presented, at least not directly.

[1] Gradual increment from discrete to continuum during evolution: The ‘zen’ approach would be to pick parents and individuals randomly from a uniform distribution every time, from the beginning of evolutionary time until the end. But it is not easy to be ‘zen’ (particularly when you are not making as much progress as you would like), so we put our engineering hats on. One could think that it would be easier for evolution if the task gradually became ‘more complex’ as the population gets better. This is probably the case most of the time, but I’m not so sure it applies in this particular scenario. As the title suggests, the idea is to evolve agents on the 2 most different parents first and, as the agents get good at this, to present more closely related parents. One way to do this is to draw parents randomly from a distribution that gradually shifts from the binary 2-parent case towards the complete continuum. The Beta distribution is quite ideal for this: it has two parameters, and changing them from 0.1 to 1 causes the distribution to change from ‘almost’ discrete to ‘almost’ uniform.

This figure shows the histogram of random samples taken from a Beta distribution for parameters 0.01, 0.11, 0.21, 0.31, 0.41, 0.51, 0.61, 0.71, 0.81. As you can guess, when the parameter (alpha) reaches 1, the distribution is effectively uniform across the continuum. The motivation for this is to encourage the agent to distinguish between the more different individuals first and to learn the smaller differences later on. This distribution can be used to pick both the parent and the test individual.
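A minimal sketch of how such a shifting distribution could be sampled. It assumes a symmetric Beta (both parameters equal) and a linear schedule over generations; the schedule and the names `beta_parameter` and `draw_feature` are illustrative assumptions, not the set-up actually used here.

```python
import random

def beta_parameter(generation, total_generations, start=0.1, end=1.0):
    """Shared Beta parameter for this generation (linear schedule is an assumption)."""
    if total_generations <= 1:
        return end
    frac = min(max(generation / (total_generations - 1), 0.0), 1.0)
    return start + frac * (end - start)

def draw_feature(generation, total_generations, rng=random):
    """Draw one value in [0, 1], usable for a parent or a test individual."""
    a = beta_parameter(generation, total_generations)
    # Symmetric Beta: small `a` piles samples up near 0 and 1,
    # a == 1 gives the uniform distribution on [0, 1].
    return rng.betavariate(a, a)
```

Early in a run most draws land close to 0 or 1, and by the last generation they are spread evenly over the continuum.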

I’ve used this for a while and have run comparisons with and without this ‘incremental’ approach. I have not made a thorough comparison (yet), nor have I performed sufficiently big tests (basically I compared 10 runs with the dynamic Beta distribution and 10 with the simple uniform distribution). Nevertheless (and interestingly enough), it does not seem to improve evolvability in any way. It actually made things worse as far as I can tell. Again, I have not studied this in depth but hope to do so at some point.

[2] Gradual increment from noise-less to noise-full: For the behaviour to be interesting it has to be able to cope with several forms of noise: [a] random initialisation of the activation of the nodes, [b] random delays at the beginning of the run and between presentation of individuals and [c] inter-trial variability. There are also other types of noise that I am not considering at the moment (e.g. [d] inter-node, sensory and motor node noise).

[3] Gaussian-weighted Evaluation of agent’s output: This idea is taken directly from (Phattanasri et al., 2002, 2006). The agent is evaluated for 10 units of time after the presentation of a test individual. Instead of simply taking the area under the output node’s activation, a Gaussian-weighted area is taken. This takes importance away from the beginning and end of the evaluation period and concentrates it on the middle region. It helped them; since mine is not evolving, I figured I’d use it until it works, and then run tests to see whether this in particular helped or not.
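A rough sketch of a Gaussian-weighted area over a sampled output trace. The centring on the middle of the window, the default width (`sigma` set to one sixth of the window) and the normalisation are assumptions made for illustration; the weighting used in the cited papers may differ.

```python
import math

def gaussian_weighted_score(outputs, dt=0.1, sigma=None):
    """Weighted mean of the output samples, with weights peaking mid-window."""
    n = len(outputs)
    if n == 0:
        raise ValueError("need at least one output sample")
    duration = n * dt
    centre = duration / 2.0
    if sigma is None:
        sigma = duration / 6.0  # assumed width, not taken from the papers
    # Sample times are taken at the middle of each integration step.
    times = [(i + 0.5) * dt for i in range(n)]
    weights = [math.exp(-0.5 * ((t - centre) / sigma) ** 2) for t in times]
    total = sum(weights)
    # Normalising by the total weight keeps the score in the output's own range.
    return sum(w * y for w, y in zip(weights, outputs)) / total
```

With 10 units of time and a step of 0.1 the trace has 100 samples, and activity in the middle of the window counts far more than the same activity at either end.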

Finally, there are several parameters we know (from common sense or experience) to be crucial:

[4] Genotype-phenotype mapping:
[i] Weights: mapped linearly from [0,1] to something like [-6, 6]. I’m quite happy with this.
[ii] Time-parameters: mapped exponentially from [0,1] to [e^0, e^3]. The important thing here is that the smallest possible value is around 10 times bigger than the time-step of integration (usually 0.1), and the largest depends on how long a trial is (definitely not longer than that). The exponential mapping simply provides more precision for the small time-scales (where more precision might be needed), and less as the time-parameter becomes bigger. I’m quite happy with this as well.
[iii] Biases: generally I map it linearly from [0,1] to something like [-10,10] but I’m not happy with this one at all. One idea (which could be very fruitful for CTRNN evolution in general) is to map the genotype value always relative to the centre of activation of that node (which would depend on its incoming weights). Furthermore, this could be extended to include an exponential mapping so that there’s more precision around the centre-crossing region and less towards the outer parts.
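The three mappings above can be sketched as follows. The ranges are the ones quoted in the list; the function names are made up for illustration, and the bias mapping shown is the plain linear one rather than the relative variant.

```python
import math

def map_weight(g, lo=-6.0, hi=6.0):
    """Linear map from a gene in [0, 1] to a weight in [lo, hi]."""
    return lo + g * (hi - lo)

def map_time_constant(g, min_exp=0.0, max_exp=3.0):
    """Exponential map from a gene in [0, 1] to [e^min_exp, e^max_exp]."""
    return math.exp(min_exp + g * (max_exp - min_exp))

def map_bias(g, lo=-10.0, hi=10.0):
    """Plain linear bias map from a gene in [0, 1] to [lo, hi]."""
    return lo + g * (hi - lo)
```

Because the time-parameter map is exponential, the lower half of the gene range covers only the interval from 1 to e^1.5 (about 4.5), which is where the extra precision for small time-scales comes from.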

[5] Number of trials per fitness (this is related to the inter-trial variability): Each agent is tested 200 times every time it is chosen during selection. I’ve done 100 until now so this is an experiment to see whether there’s any difference.

[6] Evolutionary operators: mutation and recombination: For mutation I used the common Gaussian vector mutation, where basically each value in the genotype gets perturbed with a random number drawn from a Normal distribution around 0 with very small standard deviation. I’m happy with this. For the recombination I take x/N genes from the winner of the tournament and (N-x)/N genes from the loser. I’m less happy about this method. It makes a lot of sense for the discrete (e.g. binary) genotypes but less so for continuous ones. Ideally the new individual should be made from a combination between winner and loser that is not constrained to the gene dimension. One way would be to take a new point in genotype space using regular Euclidean n-space that is an x proportion away from the winner and a (1-x) proportion away from the loser. Not sure though.
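A sketch of the operators described above, under a few assumptions: the gene-wise recombination is read as "x of the N genes come from the winner, chosen at random", and the alternative is read as a straight-line blend between the two genotypes. The function names and the default `sigma` are illustrative.

```python
import random

def gaussian_mutation(genotype, sigma=0.01, rng=random):
    """Perturb every gene with zero-mean Gaussian noise of small std `sigma`."""
    return [g + rng.gauss(0.0, sigma) for g in genotype]

def genewise_recombination(winner, loser, x, rng=random):
    """Take x randomly chosen genes from the winner and the rest from the loser."""
    n = len(winner)
    from_winner = set(rng.sample(range(n), x))
    return [winner[i] if i in from_winner else loser[i] for i in range(n)]

def blend_recombination(winner, loser, x):
    """Point on the line between the parents: x of the way from winner to loser."""
    return [w + x * (l - w) for w, l in zip(winner, loser)]
```

The blended child lies at distance x times the parent separation from the winner and (1-x) times that separation from the loser, so it is not restricted to gene-aligned corners of the box spanned by the two parents.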

[7] Finally, the very obvious, number of nodes in the CTRNN: I have played around with circuits of between 3 and 10 nodes. As there are 5 node CTRNNs whose nodes are all active but which are not doing the full* task, I will be running experiments with 10 node CTRNNs and, when it works, I’ll see how many (if any) are saturated on/off and take it from there.


At 12:57 am, Blogger eldan said...

Thanks for posting this - it's useful to keep up with what you're doing, especially as you seem to be encountering some similar issues to those I'm finding with my work.

A couple of comments on specific parts:

[1] - I've had some success with an analogous (though simpler) attempt at shaping. However, I've found it very sensitive to exactly where evolution starts. Specifically, I'm also doing experiments where an agent has to distinguish two signals (*), and getting a reasonable success rate (about 1 per 5 searches, though this is drawn from a very small sample...) by starting the search with signals that don't overlap but also have no gap between them. In other words, signal A can be between 0 and +1, and signal B can be between 0 and -1. From there, gradually increasing the amount of overlap seems to work, but starting with either overlap or a gap doesn't.

I haven't tried very small gaps yet, and I'm sure that really there must be a continuous distribution of probabilities of success, based on what the starting conditions are. However, I'm mentioning this because the failure of the 'incremental' experiments you describe might turn out to be nothing more than parameter sensitivity.

(*) actually I'm not convinced my agents need to do anything more than recognise one of the signals and ignore the other, so the analogy to your work may be even closer than I intended.

[4][iii] what worries you about the biases? I must admit it's not something I've given a great deal of thought to, but I'm interested in what you see as the drawback of the very simple way of doing the mapping.

At 3:10 pm, Blogger eduardo said...

Thank you for reading it. Knowing that other people might actually read this definitely makes me put more effort into both the content and the form of it.

Regarding your shaping techniques (although I still need to read more of your work to understand the whole picture), I have looked at your protocol and it definitely seems to be helpful for your scenario, distinguishing between two signals. Now, in the case of wanting an agent that can remember a feature that is on a continuum, driving the population towards the discrete case (i.e. where it can distinguish between two different non-overlapping inputs) may not necessarily move the population closer to where it needs to be in order to generalise.

Regarding the biases, this is a worry that arises from something very particular in my evolutionary technique. I constrain the range of possible circuit parameters to be within certain values. For example, I constrain weights to be within [-6, 6] and biases to be within [-10, 10]. Mutations that drive a parameter out of these regions are reflected back into that space (why do I do this? I won’t give a thorough explanation here, but the argument is around the idea of preventing parameters from incrementally running away).
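One way the reflection could be implemented is sketched below. The helper `reflect` is hypothetical, and whether the reflection is applied in parameter space (as here) or in the [0,1] genotype space is an assumption.

```python
def reflect(value, lo, hi):
    """Fold `value` back into [lo, hi] as if both bounds were mirrors."""
    span = hi - lo
    if span <= 0:
        raise ValueError("hi must be greater than lo")
    # Position within one full back-and-forth period of length 2 * span.
    offset = (value - lo) % (2 * span)
    if offset > span:
        offset = 2 * span - offset  # on the way back from the upper bound
    return lo + offset
```

For example, a weight mutated to 6.5 with bounds [-6, 6] comes back as 5.5, and a value that overshoots by more than the whole range is folded as many times as needed.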

But, as you know, the centre of activation of a node is proportional to the sum of its incoming weights. So, for example, if we have a 5 node circuit and one node whose incoming connections all have the maximum possible positive weight, say 6, then for this node to be near its centre-crossing region its bias would have to be -15. In the constrained case described before, the possibility of this node being in its centre-crossing region is lost completely.
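The arithmetic behind the -15 can be checked with a small sketch. It assumes the usual CTRNN form in which the bias is added inside a logistic sigmoid, so that the centre-crossing bias is minus half the sum of the incoming weights (self-connection included); the function names are illustrative.

```python
def centre_crossing_bias(incoming_weights):
    """Bias that puts a node's centre of activation at the middle of its input range."""
    return -sum(incoming_weights) / 2.0

def is_reachable(bias, lo=-10.0, hi=10.0):
    """True if the bias lies inside the constrained parameter range."""
    return lo <= bias <= hi

# Five incoming connections, all at the maximum weight of 6.
theta = centre_crossing_bias([6.0] * 5)
print(theta, is_reachable(theta))
```

With weights limited to [-6, 6] in a 5 node circuit, the centre-crossing bias can fall anywhere in [-15, 15], so a bias range of [-10, 10] cuts off part of it.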

One possible way to avoid this (while keeping the constrained parameter space) is to map the biases relative to their centre of activation, so that the centre-crossing region is always in the dead centre of the genotype/parameter space (as opposed to being off to one side or outside completely). This is something that I am thinking about currently. The mapping depends directly on the number of nodes in the circuit and the incoming weights.

On top of the relative mapping, which would ensure that the centre-crossing region is always in the middle of the genotype space, there is the idea of a non-linear mapping. I usually map biases linearly but, as I understand it, with the centre-crossing region in the middle it is generally the case that different phase-portraits are more dense around the centre-crossing space and become increasingly sparse as you get away from it. For this reason, there is an argument in favour of having more precision in the denser area and less as you go out. Whether this empirically works out to be better or not I have not tested yet. So a mapping that is exponential around the centre-crossing bias could also be useful…
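One possible shape for such a mapping is sketched below. Everything about its form is an assumption: the gene value 0.5 is pinned to the centre-crossing bias (taken as minus half the sum of the incoming weights), the offset from that centre grows exponentially with distance from 0.5, and `half_range` and `sharpness` are made-up tuning parameters.

```python
import math

def map_bias_relative(g, incoming_weights, half_range=10.0, sharpness=3.0):
    """Map a gene in [0, 1] to a bias centred on the node's centre-crossing bias."""
    centre = -sum(incoming_weights) / 2.0
    u = 2.0 * g - 1.0  # signed distance from the middle of the gene range, in [-1, 1]
    # Grows from 0 at the centre to 1 at the edges, slowly at first and then
    # quickly, so equal gene steps mean finer bias steps near the centre.
    magnitude = math.expm1(sharpness * abs(u)) / math.expm1(sharpness)
    return centre + math.copysign(half_range * magnitude, u)
```

For the node with five incoming weights of 6, a gene of 0.5 gives a bias of -15, and the two ends of the gene range give -25 and -5. Whether the extra resolution near the centre actually helps evolution is exactly the untested part.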

As this is very much still under construction in my head, I will try to make my intuitions clearer with a picture. To be more exact, it is a raw modification of a picture from Randy’s parameter structure paper (and it is not a nice modification at that). Randy’s figure depicts the asymptotic region approximations of a particular 3 node circuit; I’ve added the genotype next to one of the axes. It goes from 0 to 1 because that’s the range that the genes in my genotypes can take. There are two things to note: (1st) the 0.5 genotype region is aligned with the centre-crossing region, and the boundaries (0 and 1) encompass all of the interesting space, and (2nd) the black lines along that same plane would correspond to the ‘exponential’ distribution around the centre-crossing bias, allowing for ‘hard-to-get-to-phase-portraits’ to be more easily reachable while quickly moving through the outer regions.

Sorry if all of this is a bit convoluted at the moment, but I definitely think there is something there – and would be happy to discuss it further.

