This post summarizes the paper "Actor-Attention-Critic for Multi-Agent Reinforcement Learning".

Paper: Actor-Attention-Critic for Multi-Agent Reinforcement Learning (arxiv.org)

KEYWORDS

MARL
Attention

MAIN IDEA

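The main idea is to learn a centralized critic with an attention mechanism.
- Attention is added to an actor-critic algorithm so that each agent selectively uses the observations of the other agents.
Advantages: learning is more effective and scales to larger numbers of agents.
- The input space grows only linearly as the number of agents increases.
- It applies to cooperative, competitive, and mixed environments alike.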

DETAILS

Introduction

Existing MARL algorithms

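The simplest approach to MARL is to train each agent independently, with each agent maximizing its own individual reward.
- However, because the other agents' policies keep changing during training, the environment is non-stationary from each agent's point of view, so the Markov property no longer holds.
One way to overcome this is to model all agents collectively as a single agent.
- A joint action space is constructed, and one policy decides the actions of all agents.
- The drawbacks are that the joint action space grows exponentially with the number of agents, and communication among all agents must be guaranteed.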
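To make the exponential growth concrete, here is a small calculation assuming a hypothetical 5 discrete actions per agent. A single "super agent" must pick one action for every agent at once, so it faces |A|^N joint actions, whereas N independent policies each output only |A| actions. The function names are illustrative, not from the paper.

```python
def joint_action_space_size(actions_per_agent: int, n_agents: int) -> int:
    # One "super agent" chooses an action for every agent simultaneously,
    # so the number of joint actions is |A| ** N.
    return actions_per_agent ** n_agents


def per_agent_outputs(actions_per_agent: int, n_agents: int) -> int:
    # N separate policies, each choosing among its own |A| actions: N * |A| outputs in total.
    return actions_per_agent * n_agents


for n in (2, 4, 8):
    print(n, joint_action_space_size(5, n), per_agent_outputs(5, n))
# 2 25 10
# 4 625 20
# 8 390625 40
```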
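More recently, the field has moved toward centralized training with decentralized execution (CTDE), in which the critic is trained in a centralized way while each actor acts in a decentralized way.
- However, these methods still have trouble scaling to many agents and do not generalize well: they tend to be limited to particular types of scenario (cooperative, competitive, or mixed).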
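The CTDE split can be illustrated by comparing input sizes. This is a minimal sketch with hypothetical dimensions; `AgentSpec`, `actor_input_dim`, and `concat_critic_input_dim` are illustrative names, not from the paper. The critic here follows the simple approach of concatenating every agent's observation and action (as MADDPG does), which is the kind of centralized critic MAAC replaces with attention.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AgentSpec:
    obs_dim: int  # size of o_i
    act_dim: int  # size of a_i (e.g. one-hot over discrete actions)


def actor_input_dim(agents: Sequence[AgentSpec], i: int) -> int:
    # Decentralized execution: agent i's actor sees only its own observation o_i.
    return agents[i].obs_dim


def concat_critic_input_dim(agents: Sequence[AgentSpec]) -> int:
    # Centralized training with a concatenating critic: every agent's (o_j, a_j)
    # is stacked into one input vector, so the size grows with each added agent.
    return sum(a.obs_dim + a.act_dim for a in agents)


agents = [AgentSpec(obs_dim=10, act_dim=5) for _ in range(4)]
print(actor_input_dim(agents, 0))       # 10
print(concat_critic_input_dim(agents))  # 60
```

Only the actors are needed at execution time, which is why their inputs can stay local while the critic is free to use everything during training.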

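Proposed algorithm (MAAC)

Main idea: learn a centralized critic with an attention mechanism.
The paper uses soccer as an example: a defender needs to pay attention to the attackers and the ball near them and does not need to track the movement of the opposing goalkeeper, whereas an attacker does need to pay attention to the opposing goalkeeper.
In the same way, each agent should decide which other agents to focus on and concentrate on those agents' actions.
The attention mechanism is used for this.
Advantages of the proposed algorithm
- The input space grows linearly with respect to the number of agents.
- It can be applied to cooperative, competitive, and mixed environments.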

MAAC

💡 The main idea behind this multi-agent learning approach is to learn the critic for each agent by selectively paying attention to information from other agents.

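Each agent's observation (o) and action (a) are fed into an MLP encoder, and the resulting embeddings (e) are passed to the attention heads. For a given agent, the heads weight every other agent by how relevant it is and output a weighted summary (x) of the other agents' information.
This summary (x) is then fed, together with the agent's own embedding (e), into another MLP that outputs the agent's Q-value.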

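Written as equations, with $e_j = g_j(o_j, a_j)$:
$Q_i^\psi(o, a) = f_i\big(g_i(o_i, a_i),\, x_i\big)$
$x_i = \sum_{j \ne i} \alpha_j v_j = \sum_{j \ne i} \alpha_j\, h\big(V g_j(o_j, a_j)\big)$
$\alpha_j \propto \exp\big(e_j^\top W_k^\top W_q e_i\big)$
Here f is the MLP in the upper left of the figure (it outputs the Q-value), g is the MLP encoder in the lower left, W_q, W_k, and V are the query, key, and value projections, and h is an element-wise nonlinearity (leaky ReLU in the paper) applied to the value projections in the attention part of the network (yellow).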
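The critic described above can be sketched in plain Python as a single-head attention critic. This is a simplified illustration, not the authors' implementation: it assumes one attention head instead of several, a one-layer encoder for g, a linear output head for f, leaky ReLU for h, and random weights. `AttentionCritic` and its methods are hypothetical names. The query, key, and value projections are shared across agents, while each agent keeps its own g_i and f_i.

```python
import math
import random


def matvec(M, v):
    # Matrix (list of rows) times vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(v, slope=0.01):
    # Element-wise nonlinearity; plays the role of h (and is reused in the encoder).
    return [x if x > 0 else slope * x for x in v]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class AttentionCritic:
    """Hypothetical single-head attention critic sketch; needs at least 2 agents."""

    def __init__(self, n_agents, in_dim, embed_dim=8, attend_dim=4, seed=0):
        rng = random.Random(seed)
        # g_i: one-layer encoder per agent for the concatenated (o_i, a_i)
        self.G = [rand_matrix(rng, embed_dim, in_dim) for _ in range(n_agents)]
        # W_q, W_k, V: attention projections shared by all agents' critics
        self.W_q = rand_matrix(rng, attend_dim, embed_dim)
        self.W_k = rand_matrix(rng, attend_dim, embed_dim)
        self.V = rand_matrix(rng, attend_dim, embed_dim)
        # f_i: per-agent output head on [e_i, x_i] (linear here for brevity)
        self.F = [rand_matrix(rng, 1, embed_dim + attend_dim)[0] for _ in range(n_agents)]

    def encode(self, obs, acts):
        # e_j = g_j(o_j, a_j) for every agent j
        return [leaky_relu(matvec(G, o + a)) for G, o, a in zip(self.G, obs, acts)]

    def attend(self, i, e):
        others = [j for j in range(len(e)) if j != i]
        query = matvec(self.W_q, e[i])
        # alpha_j is proportional to exp(e_j^T W_k^T W_q e_i), over all j != i
        alpha = softmax([dot(matvec(self.W_k, e[j]), query) for j in others])
        # x_i = sum_j alpha_j * h(V e_j)
        x_i = [0.0] * len(query)
        for a_j, j in zip(alpha, others):
            v_j = leaky_relu(matvec(self.V, e[j]))
            x_i = [x + a_j * v for x, v in zip(x_i, v_j)]
        return x_i, dict(zip(others, alpha))

    def q_values(self, obs, acts):
        # Q_i(o, a) = f_i(e_i, x_i): own embedding concatenated with the attention output
        e = self.encode(obs, acts)
        return [dot(self.F[i], e[i] + self.attend(i, e)[0]) for i in range(len(e))]


obs = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.9, 0.1, 0.2]]
acts = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # one-hot actions
critic = AttentionCritic(n_agents=3, in_dim=5)
print(critic.q_values(obs, acts))  # one Q-value per agent
print(critic.attend(0, critic.encode(obs, acts))[1])  # weights agent 0 puts on agents 1 and 2
```

Because x_i is a weighted sum, its length stays at `attend_dim` however many agents there are; each added agent contributes one more encoder call and one more term in the sum.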

Learning with Attentive Critics

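Because all critics share parameters, they are updated together by minimizing a single joint regression loss: the sum over agents of each critic's squared error against its target value.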
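A sketch of this joint loss, assuming the soft (entropy-regularized) target of Soft Actor-Critic, which MAAC builds on: y_i = r_i + γ(Q̄_i(o', a') − α log π̄_i(a'_i | o'_i)), where Q̄ and π̄ are target networks and a' is sampled from the target policies. The function names and the default γ and α values are placeholders for illustration, and the Q-values and log-probabilities are passed in as plain numbers rather than computed by networks.

```python
def soft_targets(rewards, next_q, next_log_pi, gamma=0.99, alpha=0.01, done=False):
    # Per agent i: y_i = r_i + gamma * (Q_target_i(o', a') - alpha * log pi_i(a'_i | o'_i)).
    # The bootstrap term is dropped at the end of an episode (standard practice).
    mask = 0.0 if done else 1.0
    return [r + gamma * mask * (q - alpha * lp) for r, q, lp in zip(rewards, next_q, next_log_pi)]


def joint_critic_loss(batch_q, batch_y):
    # Sum over agents of the mean squared error between Q_i(o, a) and y_i.
    # With shared critic parameters, this one scalar is what each gradient step minimizes.
    n_agents = len(batch_q[0])
    total = 0.0
    for i in range(n_agents):
        total += sum((q[i] - y[i]) ** 2 for q, y in zip(batch_q, batch_y)) / len(batch_q)
    return total


# One transition, two agents; alpha=0 turns off the entropy term for easy arithmetic.
y = soft_targets([1.0, 0.0], [2.0, 4.0], [-1.0, -1.0], gamma=0.5, alpha=0.0)
print(y)  # [2.0, 2.0]
print(joint_critic_loss([[1.5, 2.0]], [y]))  # 0.25
```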

Experiments

The experiments were run in environments built on the multi-agent particle environment framework.

GitHub - openai/multiagent-particle-envs (github.com): Code for a multi-agent particle environment used in the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"

In the experiments, MAAC achieved higher performance (higher mean episode rewards) than the other algorithms it was compared against.

To see how performance changes with the number of agents, MAAC was compared with MADDPG-SAC. MAAC (blue) keeps its performance steady as the number of agents increases, whereas MADDPG-SAC (red) performs worse as more agents are added.

To examine the effect of attention, the attention weights were inspected. For example, when Rover 1 is paired with Tower 3, its attention weights concentrate on Tower 3.

※ This post was written by the author while studying and may contain inaccuracies. Please use it for reference only; if you leave a comment about any errors, I will correct them. Thank you.

License: Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)


우린 배울 게 넘 많아

[23-2] MAAC: Actor-Attention-Critic for Multi-Agent Reinforcement Learning paper summary
