This is me brainstorming. It’s probably not going to make much sense.

So. I’m able to use attitude/belief/method, and I have reason to believe that this approach will work in arbitrary environments. In some cases, due to properties of the game (zero-sum payoffs, identical payoffs, non-interacting actions), learning can’t take place; but in all such cases attitude is either irrelevant (zero-sum or identical payoffs) or learnable (non-interacting actions, in which case belief and method aren’t learnable).

Now, the question is: given this ability to learn, how do you use it? The environment we’re trying to deal with is extremely complex. It is, essentially, an indefinitely repeated sequence of normal-form games.
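To make that environment concrete, here’s a minimal sketch of the interaction loop. All names, signatures, and the game-drawing rule are my own illustrative assumptions, not anything fixed yet: each round a normal-form game is drawn, both agents act simultaneously (seeing the current game and the history so far), and payoffs are dealt out.

```python
# Minimal sketch of the environment: an indefinitely repeated sequence of
# (possibly different) normal-form games. All names here are illustrative.

import random

def play_repeated_games(agent_row, agent_col, draw_game, rounds):
    """Each round, draw a normal-form game (payoff matrices A for the row
    player, B for the column player); both agents act simultaneously,
    seeing the current game and the full interaction history."""
    history = []
    for _ in range(rounds):
        A, B = draw_game()
        i = agent_row(A, B, history)   # row player's action index
        j = agent_col(A, B, history)   # column player's action index
        history.append((A, B, i, j, A[i][j], B[i][j]))
    return history

# Usage: two uniformly random agents on a fixed prisoner's dilemma.
pd = ([[3, 0], [5, 1]], [[3, 5], [0, 1]])
hist = play_repeated_games(
    lambda A, B, h: random.randrange(len(A)),
    lambda A, B, h: random.randrange(len(A[0])),
    lambda: pd,
    rounds=10)
print(len(hist))  # → 10
```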

Given this efficient learner, it is tempting to use it to predict opponent actions, and then respond to those predictions in whatever manner seems most appropriate. This doesn’t work because it’s too short-sighted. The goal is not simply to optimize your action in this particular game, but also to alter the state of the opponent so as to let you achieve your objectives in future interactions.

The obvious problem here is that you don’t know the internal state of the opponent, how it will be affected by your actions, or how it will affect their actions. Given the number of possible strategies, and the fact that you don’t have the opportunity to repeat observations, there is no way to learn arbitrary strategies (obviously). So the question is: what assumptions are reasonable to make in this situation?

Rationality immediately leaps to mind. Unfortunately, the folk theorem, combined with an indefinite number of repetitions, implies that any course of action at all can be rational, depending on the beliefs an agent holds about its opponent’s behavior.

I will begin with the assumption that an agent’s actions in each individual game are rational, in that they attempt to maximize some quantity in the presence of an opponent attempting to maximize another quantity, but the quantities maximized are not necessarily the agents’ own payoffs. Specifically, I will assume that an agent attempts to maximize a linear combination of its own payoff and its opponent’s payoff.
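A sketch of what that assumption buys you, under illustrative assumptions of my own (the attitude weights, the 2×2 game, and restriction to pure strategies are all just for the example): transform each player’s payoff matrix by their attitude weights, then find the Nash equilibria of the transformed game by best-response checks.

```python
# Hypothetical sketch: each agent maximizes a linear combination of its own
# payoff and the opponent's payoff. Weights and game are illustrative only.

from itertools import product

def effective_payoff(own, other, w_self, w_other):
    """The quantity the agent is assumed to maximize."""
    return w_self * own + w_other * other

def pure_nash_equilibria(A, B, att_row, att_col):
    """Enumerate pure-strategy Nash equilibria of the attitude-transformed game.

    A[i][j], B[i][j]: raw payoffs to the row/column player for actions (i, j).
    att_row, att_col: (w_self, w_other) attitude weights for each player.
    """
    rows, cols = len(A), len(A[0])
    # Transformed payoffs that each player actually maximizes.
    U = [[effective_payoff(A[i][j], B[i][j], *att_row) for j in range(cols)]
         for i in range(rows)]
    V = [[effective_payoff(B[i][j], A[i][j], *att_col) for j in range(cols)]
         for i in range(rows)]
    eqs = []
    for i, j in product(range(rows), range(cols)):
        row_best = all(U[i][j] >= U[k][j] for k in range(rows))
        col_best = all(V[i][j] >= V[i][k] for k in range(cols))
        if row_best and col_best:
            eqs.append((i, j))
    return eqs

# Prisoner's dilemma (action 0 = cooperate, 1 = defect). Pure self-interest
# yields mutual defection; putting weight on the opponent's payoff shifts
# the equilibrium of the transformed game to mutual cooperation.
A = [[3, 0], [5, 1]]          # row player's payoffs
B = [[3, 5], [0, 1]]          # column player's payoffs
print(pure_nash_equilibria(A, B, (1, 0), (1, 0)))  # → [(1, 1)]
print(pure_nash_equilibria(A, B, (1, 1), (1, 1)))  # → [(0, 0)]
```

The point of the example is that attitude changes which outcomes count as equilibria at all, before any selection question arises.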

Note that this does not completely solve the problem, because there may be multiple Nash equilibria. In some senses, the Nash equilibrium selection problem is intractable (indeed, the failure of game theory to provide a means of distinguishing a single equilibrium supports that proposition). The simplest way to deal with the problem is to assume that the opponent uses a fixed method to select an equilibrium, and to attempt to learn that method. There are some vulnerabilities (the opponent deliberately selecting favorable equilibria) and some inaccuracies (in self-play, the opponent will also be learning about you), but this is the approach I will use for now.
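One way to make "learn the opponent’s fixed selection method" concrete is hypothesis testing over a small set of candidate rules. Everything here is an illustrative assumption of mine: the two candidate rules, the scoring scheme, and the coordination game in the usage example.

```python
# Hypothetical sketch of learning an opponent's equilibrium-selection method:
# keep a set of candidate selection rules and score how often each rule's
# prediction matches the equilibrium the opponent actually played.
# The candidate rules below are illustrative assumptions.

def select_max_own(eqs, U, V):
    """Pick the equilibrium maximizing the opponent's (transformed) payoff V."""
    return max(eqs, key=lambda ij: V[ij[0]][ij[1]])

def select_max_joint(eqs, U, V):
    """Pick the equilibrium maximizing the joint (transformed) payoff."""
    return max(eqs, key=lambda ij: U[ij[0]][ij[1]] + V[ij[0]][ij[1]])

CANDIDATE_RULES = {"max_own": select_max_own, "max_joint": select_max_joint}

class SelectionLearner:
    def __init__(self):
        self.scores = {name: 0 for name in CANDIDATE_RULES}

    def observe(self, eqs, U, V, played):
        """Update rule scores from one observed equilibrium choice."""
        for name, rule in CANDIDATE_RULES.items():
            if rule(eqs, U, V) == played:
                self.scores[name] += 1

    def best_rule(self):
        return max(self.scores, key=self.scores.get)

# Usage: a coordination game with two pure equilibria; the opponent keeps
# choosing the equilibrium that favors itself, so "max_own" wins out.
eqs = [(0, 0), (1, 1)]
U = [[2, 0], [0, 1]]   # row player's transformed payoffs
V = [[1, 0], [0, 2]]   # column player's transformed payoffs
learner = SelectionLearner()
for _ in range(3):
    learner.observe(eqs, U, V, played=(1, 1))
print(learner.best_rule())  # → max_own
```

This also shows the inaccuracy noted above: in self-play the "opponent" is running the same learner, so its selection method is not actually fixed.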

So, to recap, the assumptions so far are:

1 – problems of Nash equilibrium selection are to be dealt with by learning and adopting the equilibrium selection method used by the opponent. (Not strictly accurate in a co-learning situation, but it will still result in convergence.)

2 – each game will be treated by both players as an isolated rational decision in which each player attempts to maximize a quantity linearly dependent on the payoffs of both agents.

The question of how to behave under these assumptions is still open, but I’ve written enough for now.