I can train my robot in simulation using ARS, and the rewards climb until it's doing a nice quadruped walk of about 8 metres per 1000-time-step episode.
But when I stop the program and reload the most recently saved policy file, the progress seems to reset.
So here are rewards before quitting:
Step: 228 Reward: 7.3702018170620045
Step: 229 Reward: 7.714461611598418
Step: 230 Reward: 7.561359695748974
Step: 231 Reward: 7.821410289581096
Step: 232 Reward: 8.063394689120104
Then I quit and start over with the policy from this run, and now the rewards are:
Step: 0 Reward: -0.857725762127679
Step: 1 Reward: 0.4743116543915239
Step: 2 Reward: 2.847940117990215
Step: 3 Reward: -0.8539553638529137
Step: 4 Reward: 3.0954765964267392
Step: 5 Reward: -0.5506416870964888
Step: 6 Reward: -0.626510343527105
Step: 7 Reward: 1.7169761539347284
Step: 8 Reward: 1.4009849267252874
Step: 9 Reward: 5.664102735951084
Step: 10 Reward: 0.46594051311167095
For ARS, theta is the matrix of perceptron weights with dimensions
nb_inputs = env.observation_space.shape[0]
nb_outputs = env.action_space.shape[0]
For my robot, with 4 legs, there are 4 actions (-1 to 1), one per motor.
The input is a 16-number observation:
(4 motor angles, 4 motor velocities, 4 motor torques, and the base orientation as a quaternion)
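To make the setup concrete, here's a minimal sketch of the linear policy as I understand it (names like `evaluate` and the zero-initialised `theta` are illustrative, not my exact code):

```python
import numpy as np

# Dimensions matching my robot.
nb_inputs = 16    # 4 angles, 4 velocities, 4 torques, base quaternion
nb_outputs = 4    # one action per motor

# theta is the (outputs x inputs) weight matrix of the linear policy.
theta = np.zeros((nb_outputs, nb_inputs))

def evaluate(theta, state):
    # Deterministic rollout action: plain linear map, clipped to the motor range.
    return np.clip(theta.dot(state), -1.0, 1.0)

state = np.zeros(nb_inputs)
action = evaluate(theta, state)   # shape (4,), each value in [-1, 1]
```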
Anyway, am I missing something obvious?
What might be different between carrying on from step 232 to step 233, versus starting again from step 0?
I've verified that theta being saved is the same as the theta being loaded.
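This is roughly how I checked the round trip (the filename `policy.npy` is illustrative):

```python
import numpy as np

theta = np.random.randn(4, 16)        # stand-in for my trained weights
np.save("policy.npy", theta)          # what happens on exit
loaded = np.load("policy.npy")        # what happens on restart
ok = np.array_equal(theta, loaded)    # True: theta itself round-trips exactly
```

So the weights themselves survive the save/load cycle bit-for-bit.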