bias in statistics

Mon Dec 16 10:30:26 EST 2002

I figured out what is going on, and I'll write it down for the sake of posterity.
It has to do with the fact that I am using a noisy objective function
with the steady state GA with a high replacement rate (0.9).
I'll explain using an example:
Assume I am optimizing a population of size 20 with a 0.9 replacement rate.
The steady state GA creates the next generation like this:

- Take the worst 18 from the last generation and apply genetic operators
  to create a new population.
- Add the new 18 to the old population, resulting in a population of size
  38.
- Evaluate the population.
- Remove the worst 18 from the population for the next generation.
- Update statistics for the population, now of size 20.
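
The steps above can be sketched as a toy simulation. This is not GAlib code: `Individual`, `noisyEvaluate` and `steadyStateStep` are hypothetical names, the genetic operators are modeled as plain copies, the noise is assumed to be Gaussian, and survivors are assumed to keep the score they already have instead of being evaluated again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for a genome: a hidden true fitness plus the last
// noisy score it received (the value the statistics would be computed from).
struct Individual {
    double trueFitness;
    double cachedScore;
};

// One noisy evaluation: true fitness plus Gaussian noise (assumed noise model).
double noisyEvaluate(const Individual& ind, double sigma, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, sigma);
    return ind.trueFitness + noise(rng);
}

// One steady-state step following the list above. Returns the population
// average computed from the cached scores of the survivors.
double steadyStateStep(std::vector<Individual>& pop, std::size_t nReplace,
                       double sigma, std::mt19937& rng) {
    assert(!pop.empty() && nReplace <= pop.size());
    const std::size_t popSize = pop.size();
    auto better = [](const Individual& a, const Individual& b) {
        return a.cachedScore > b.cachedScore;
    };
    std::sort(pop.begin(), pop.end(), better);  // best first, worst last

    // Children derived from the worst nReplace individuals. The genetic
    // operators are modeled as copies, so the true fitness is unchanged.
    std::vector<Individual> children(
        pop.end() - static_cast<std::ptrdiff_t>(nReplace), pop.end());
    for (Individual& child : children) {
        // Only the new individuals receive a fresh noisy score.
        child.cachedScore = noisyEvaluate(child, sigma, rng);
    }
    pop.insert(pop.end(), children.begin(), children.end());

    // Drop the worst nReplace of the enlarged population.
    std::sort(pop.begin(), pop.end(), better);
    pop.resize(popSize);

    double sum = 0.0;
    for (const Individual& ind : pop) sum += ind.cachedScore;
    return sum / static_cast<double>(popSize);
}
```

Calling `steadyStateStep(pop, 18, 1.0, rng)` repeatedly on 20 individuals that all have a true fitness of 0 should return an average well above 0, because only lucky scores survive the truncation.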

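If the population has already converged, the average logged by the statistics
object is not the true average of the population.  The statistics are computed
from the same scores that were used to select the survivors, so they are
biased upward.  Sampling 38 i.i.d. random variables and then averaging
the top 20 results has a higher expectation than taking those same 20
and resampling them.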
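
The order-statistics claim can be checked numerically. The sketch below assumes a fully converged population (every individual has the same true fitness, taken as 0) and standard normal noise. `BiasResult` and `simulateSelectionBias` are hypothetical names introduced for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

struct BiasResult {
    double cachedMean;     // mean of the top `keep` scores out of `total`
    double resampledMean;  // mean of fresh scores for the same `keep` individuals
};

// Draw `total` i.i.d. noisy scores, keep the best `keep`, and compare their
// average with the average of freshly drawn scores. Averaged over `trials`.
BiasResult simulateSelectionBias(int total, int keep, int trials, unsigned seed) {
    assert(0 < keep && keep <= total && trials > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> noisyScore(0.0, 1.0);  // true fitness 0, noise sd 1

    double cachedSum = 0.0;
    double resampledSum = 0.0;
    std::vector<double> scores(static_cast<std::size_t>(total));
    for (int t = 0; t < trials; ++t) {
        for (double& s : scores) s = noisyScore(rng);
        // Move the `keep` highest scores to the front.
        std::partial_sort(scores.begin(), scores.begin() + keep, scores.end(),
                          std::greater<double>());
        cachedSum += std::accumulate(scores.begin(), scores.begin() + keep, 0.0);
        // All individuals are identical, so resampling the survivors is just
        // drawing `keep` new scores from the same distribution.
        for (int i = 0; i < keep; ++i) resampledSum += noisyScore(rng);
    }
    const double n = static_cast<double>(keep) * trials;
    return {cachedSum / n, resampledSum / n};
}
```

With `simulateSelectionBias(38, 20, 2000, 12345u)`, `cachedMean` should come out clearly positive while `resampledMean` stays close to 0, which is the gap described above.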

The moral of the story is that if you are using the steady state GA with a
noisy objective function, you should do the statistics computations yourself.
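
One way to do the statistics yourself is to call the objective function again on every member of the final population and average the fresh results. The sketch below is library-agnostic: `resampledScore` and `resampledPopulationMean` are hypothetical helpers, and `objective` is any callable that returns one noisy score per call. With GAlib, that callable would presumably wrap a direct call such as `objective(*genome)` from the original message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average of `samples` fresh evaluations of one agent.
template <class Agent, class Objective>
double resampledScore(const Agent& agent, Objective objective, int samples) {
    assert(samples > 0);
    double sum = 0.0;
    for (int i = 0; i < samples; ++i) sum += objective(agent);
    return sum / samples;
}

// Population average built only from fresh evaluations, never from the
// scores that were used for selection.
template <class Agent, class Objective>
double resampledPopulationMean(const std::vector<Agent>& pop,
                               Objective objective, int samples) {
    assert(!pop.empty());
    double sum = 0.0;
    for (const Agent& agent : pop) sum += resampledScore(agent, objective, samples);
    return sum / static_cast<double>(pop.size());
}
```

The number of samples per agent trades run time against the variance of the estimate. The right value depends on how noisy the objective is.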

Yossi Mossel

----Original Message Follows----
From: "Yossi Mossel" <ymossel at hotmail.com>
Reply-To: galib at mit.edu
To: galib at mit.edu
Subject: bias in statistics
Date: Sun, 15 Dec 2002 19:37:34 +0200

Hi All,
  My problem takes a while to describe; I hope you can bear with me.

  I am using a steady state GA to optimize a population of agents which
perform a task in a stochastic environment.  Due to the nature of the
environment my objective function is noisy.  Because of this I can't just
pick the best agent at the end of my evolutionary runs but have to sample
each member of the population several times to determine which is the best
agent on average.
The best agent returned by the statistics object won't do because it is
simply the agent which got the highest score in one of its runs.  It might
not be the best on average.
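
A minimal sketch of picking the best agent on average, not the one with the single highest score. `bestOnAverage` is a hypothetical helper, not part of GAlib, and `objective` stands for any callable that returns one noisy score per call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the agent with the highest mean score over
// `samples` independent evaluations.
template <class Agent, class Objective>
std::size_t bestOnAverage(const std::vector<Agent>& agents,
                          Objective objective, int samples) {
    assert(!agents.empty() && samples > 0);
    std::size_t best = 0;
    double bestMean = 0.0;
    for (std::size_t i = 0; i < agents.size(); ++i) {
        double sum = 0.0;
        for (int s = 0; s < samples; ++s) sum += objective(agents[i]);
        const double mean = sum / samples;
        if (i == 0 || mean > bestMean) {
            best = i;
            bestMean = mean;
        }
    }
    return best;
}
```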

My problem is that the agent which I sample as best has a lower average
score than the average score of the last generation as given by the
statistics object.
Further testing showed that if I sample the entire population
after evolution ends, this average is consistently lower than the
population average given by the statistics object.

I tested to see whether this is consistent throughout the evolutionary run. 
I used the following to sample each genome in my population after each 
generation:
//using my objective function directly
float obj = objective(*genome);
//calling the genome's evaluator
float eval = genome->evaluate(gaTrue);
//getting the genome's score
float score = genome->score();

And then averaged over these results.

If I compare these values with the value logged by the statistics object,
I find that the first two averages (from obj and eval) are consistently
lower.  The average obtained by score() is equal to the average logged by
the statistics object (not surprising, as score() returns the score given to
each genome in the last generation).

My best theory so far was that the statistics object logs the scaled rather 
than the raw objective scores.  However, by looking at the code I saw that 
this is not the case.

I should add that the genetic algorithm is solving the problem, but I would
still like to know the source of this apparent discrepancy.

Thank you for your attention,
Yossi Mossel
Computational Neuro-Science
Tel-Aviv University


