Monday, February 20, 2017

Andrew DiNunzio PPJ 2/14 - 2/20

This week I spent a lot of time trying to learn how we could get our AI to "learn" how to use the gravity whip to swing the ball. I emailed a Udemy instructor who teaches machine learning courses, described our problem, and asked for his thoughts. He said that it "definitely sounds like a Reinforcement Learning problem". With that, the idea went from likely infeasible to at least somewhat feasible, so I decided to dive in and try to make it happen.

-------------------------- Describing my RL studying below --------------------------

I enrolled in a Reinforcement Learning course on Udemy and started working my way through it. So far, I've learned the basics of RL and Markov Decision Processes, and I've started learning about the Bellman equation, which relates a state's value to the values of the states that can follow it and is what you solve to find the value function.
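Writing it down so I don't forget it, the Bellman equation for the state-value function under a fixed policy π is roughly (in the notation the course uses):

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]

i.e. the value of a state is the expected immediate reward plus the discounted value of wherever you end up next.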

I learned that there are a few ways to solve MDPs, and that finding the optimal policy typically involves two alternating steps:
1) policy evaluation
2) policy improvement
The former updates the value function for the current policy, and the latter updates the policy by greedily choosing, in each state, the action that leads to the best value in the next state. You can go back and forth between these two steps until they converge, and you end up with an optimal policy along with an up-to-date value function. Actually, the two steps can even be combined into something called "value iteration".
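To check my own understanding, here is a rough value-iteration sketch in Python on a tiny made-up MDP (two fake states, nothing to do with our actual game yet):

    # Rough sketch of value iteration on a tiny made-up MDP (not our game).
    # P[s][a] is a list of (probability, next_state, reward) tuples.
    GAMMA = 0.9
    STATES = ["far", "near"]
    ACTIONS = ["wait", "whip"]
    P = {
        "far":  {"wait": [(1.0, "near", 0.0)], "whip": [(1.0, "far", -1.0)]},
        "near": {"wait": [(1.0, "far", 0.0)],  "whip": [(1.0, "far", 10.0)]},
    }

    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # Back up the best action's expected return, bootstrapping off V itself.
            best = max(
                sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in ACTIONS
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < 1e-6:
            break

    # Greedy policy with respect to the converged value function.
    policy = {
        s: max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
        for s in STATES
    }
    print(V, policy)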

I studied Dynamic Programming, which was interesting, but it requires having the full model of the environment, i.e. knowing the dynamics p(s', r | s, a) for every state and action (along with the policy π(a|s) being evaluated). I learned that DP makes use of "bootstrapping": it updates the value function using its own current estimates of the next states' values, instead of having to wait for the end of an episode.

I then looked ahead a bit and considered Monte Carlo as a solution to the problem, which is better than DP in that it doesn't require having a full model ahead of time (it actually learns from experience). This solution, however, is not fully online, since you have to wait until the end of an episode before you can use the averaged returns to update the value estimates.
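Just as a sketch of the idea, first-visit Monte Carlo prediction looks something like the following (generate_episode here is an assumed helper that runs the policy to the end of an episode and returns the (state, action, reward) sequence):

    from collections import defaultdict

    # Sketch of first-visit Monte Carlo prediction (toy code, not hooked up to our game).
    def mc_prediction(generate_episode, policy, num_episodes, gamma=0.9):
        V = defaultdict(float)
        visit_counts = defaultdict(int)
        for _ in range(num_episodes):
            episode = generate_episode(policy)  # list of (state, action, reward)
            G = 0.0
            # Walk the episode backwards, accumulating the return G at each step.
            for t in reversed(range(len(episode))):
                state, _, reward = episode[t]
                G = reward + gamma * G
                if all(s != state for s, _, _ in episode[:t]):  # first visit only
                    visit_counts[state] += 1
                    # Incremental average of the returns observed from this state.
                    V[state] += (G - V[state]) / visit_counts[state]
        return V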

Then I looked into the idea of Temporal Difference (TD) Learning, which combines the good parts of DP and Monte Carlo: like Monte Carlo it learns from experience without needing a model, and like DP it bootstraps, so it is fully online.
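The TD(0) update itself is tiny, and the key point is that it happens after every single step instead of at the end of the episode. A rough sketch (env and policy are stand-in names, with env assumed to have reset() -> state and step(action) -> (next_state, reward, done)):

    # Sketch of tabular TD(0) prediction (toy code, not our game).
    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
        V = {}
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                next_state, reward, done = env.step(policy(state))
                v_next = 0.0 if done else V.get(next_state, 0.0)
                # Bootstrapped target: immediate reward plus the discounted current
                # estimate of the next state's value.
                td_target = reward + gamma * v_next
                V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))
                state = next_state
        return V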

All of these still require "storing" the value function somehow; I've been using a dictionary for that, mapping states to values (or state-action pairs to values when using the action-value function Q).

I learned about approximation methods that avoid maintaining this massive dictionary (while also avoiding having to collect data about every possible scenario): you can use linear regression, neural nets, or deep neural nets to approximate the value function instead.
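With a linear approximator, for instance, the dictionary gets replaced by a weight vector plus a feature function, and the update becomes a gradient step. Roughly (featurize is an assumed helper that turns a state into a fixed-length feature vector, and env has the same assumed interface as above):

    import numpy as np

    # Sketch of semi-gradient TD(0) with a linear approximator (toy code, not our game).
    def linear_td0(env, policy, featurize, num_features, num_episodes, alpha=0.01, gamma=0.9):
        w = np.zeros(num_features)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                next_state, reward, done = env.step(policy(state))
                x = featurize(state)
                v = w.dot(x)  # V(s) is approximated by w . x(s)
                v_next = 0.0 if done else w.dot(featurize(next_state))
                # Gradient step toward the bootstrapped target instead of a table update.
                w += alpha * (reward + gamma * v_next - v) * x
                state = next_state
        return w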

I learned that the two common value functions are V and Q. V maps states to values, and Q maps state-action pairs to values.

Q has the advantage that you don't need a model to "look ahead" one step when choosing actions (which isn't possible for our game), but it has the disadvantage that it needs many more samples in order to learn.

Lastly, I took a glimpse at Q-learning, which is an "off-policy" method of policy optimization. Apparently simple approximation methods don't work very well with it, but later I'll see how Deep Q Learning addresses that.
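The tabular version of the update is simple enough to sketch (again a toy, with env and the discrete action list as stand-ins):

    import random

    # Sketch of tabular Q-learning with epsilon-greedy exploration (toy code, not our game).
    def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = {}  # maps (state, action) -> estimated value

        def greedy(state):
            return max(actions, key=lambda a: Q.get((state, a), 0.0))

        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = random.choice(actions) if random.random() < epsilon else greedy(state)
                next_state, reward, done = env.step(action)
                # Off-policy target: the value of the *greedy* next action, even though
                # the behavior policy sometimes explores randomly.
                target = reward + (0.0 if done else gamma * Q.get((next_state, greedy(next_state)), 0.0))
                Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (target - Q.get((state, action), 0.0))
                state = next_state
        return Q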



This is all prerequisite material for the next Udemy course I'll be taking. Right now, everything I've done assumes I can hold information for the entire state space, which isn't really feasible. Also, I haven't done anything with continuous spaces yet (only discrete ones).

The next course I'll be taking will demonstrate how to solve the inverted pendulum problem with RL (I think it will make use of TensorFlow or Theano), which will be extremely helpful for our purposes, since it will cover continuous (infinite) action spaces, among other things.

If it is possible to solve our problem this way, we'd probably need a way to get our program to interface with Python, which I think is doable. This could also have a really neat side effect of letting people code up their own AI if they want to.
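One rough idea for that interface (nothing here exists in our code yet, and the JSON "state"/"action" message format is just an assumption) is a small socket server on the Python side that the game connects to, sending the current state and getting an action back:

    import json
    import socket

    # Rough sketch of how the game could talk to a Python agent over a local socket.
    HOST, PORT = "127.0.0.1", 5005

    def serve(choose_action):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind((HOST, PORT))
            server.listen(1)
            conn, _ = server.accept()
            with conn, conn.makefile("rw") as stream:
                for line in stream:  # one JSON message per line from the game
                    state = json.loads(line)["state"]
                    reply = {"action": choose_action(state)}
                    stream.write(json.dumps(reply) + "\n")
                    stream.flush()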


I'm still not sure if things will work out, so I still plan on working on the AI as before.
------------------------------------------------------------------------------

I also worked on getting the AI to use two splines now instead of one, depending on which side the ball is coming from (left or right).

Finally, I spent time making the gravity wells "ionizable": when ionized, a well throws the ball towards the enemy's goal, and the force is stronger the more "charge" the well has.
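Conceptually the throw is just an impulse from the well toward the enemy goal, scaled by the well's accumulated charge; something like this pseudocode (the names are made up, and the real version lives in the game code, not Python):

    # Pseudocode-level sketch of the ionization throw (illustration only).
    def ionization_impulse(well_pos, enemy_goal_pos, charge, base_strength=1.0):
        # Direction from the well toward the enemy goal, normalized.
        dx = enemy_goal_pos[0] - well_pos[0]
        dy = enemy_goal_pos[1] - well_pos[1]
        length = (dx * dx + dy * dy) ** 0.5 or 1.0
        # More accumulated charge means a stronger throw toward the goal.
        strength = base_strength * charge
        return (dx / length * strength, dy / length * strength)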



Time spent: Total: ~30 hours

  • 20 hours - RL studies
  • 5 hours - Allowing gravity wells to be ionized
  • 5 hours - Get AI to swing both ways horizontally


Pros:
  • I'm learning about reinforcement learning, which is exciting, and this project would be a great application for it.
  • Gravity well ionization works decently well, though it will need some minor tweaks in the future.
Cons:
  • Gravity wells need some tweaking, since it looks strange when the ball is thrown at the inside of a gravity well and "bounces off" towards the goal.
  • There isn't really enough time to both study RL and work on the project, so I feel like I have to commit to one of them rather than both.
