ARMOR: Robust Reinforcement Learning-based Control for UAVs under Physical Attacks: what it means for business leaders
ARMOR trains drones to recognise and shrug off sensor spoofing by pairing a privileged “teacher” with an on-board “student”, letting operators fly autonomous missions safely in contested skies without costly adversarial retraining.
1. What the method is
ARMOR is a two-stage UAV control stack. A variational-autoencoder “teacher” learns an attack-aware latent state in simulation using privileged labels. A lightweight on-board “student” then reproduces that latent from raw sensors, allowing a single reinforcement-learning policy to steer the craft accurately even when GPS, IMU, or cameras are being spoofed or jammed.
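For readers who want a concrete picture of the teacher-and-policy stage, the sketch below shows the general pattern in PyTorch: a VAE-style encoder maps privileged telemetry plus attack labels to a compact latent, and a single policy acts on that latent instead of raw sensors. All class names, layer sizes, and input dimensions here are illustrative assumptions, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class TeacherEncoder(nn.Module):
    """VAE-style encoder: privileged telemetry + attack metadata -> latent state.
    Illustrative sketch; dimensions and layers are assumptions, not from the paper."""
    def __init__(self, telemetry_dim=18, attack_dim=4, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(telemetry_dim + attack_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.mu = nn.Linear(64, latent_dim)      # latent mean
        self.logvar = nn.Linear(64, latent_dim)  # latent log-variance

    def forward(self, telemetry, attack_meta):
        h = self.backbone(torch.cat([telemetry, attack_meta], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample the latent during training
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class Policy(nn.Module):
    """Single RL policy acting on the attack-aware latent, not on raw sensors."""
    def __init__(self, latent_dim=16, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh(),  # e.g. normalised rotor commands
        )

    def forward(self, z):
        return self.net(z)

# One simulated step: privileged attack labels exist only inside the simulator
telemetry = torch.randn(1, 18)                      # GPS, IMU, camera features, etc.
attack_meta = torch.tensor([[1., 0., 0.5, 2.0]])    # affected-sensor flags, bias, duration (labels)
teacher, policy = TeacherEncoder(), Policy()
z, _, _ = teacher(telemetry, attack_meta)
action = policy(z)
```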
2. Why the method was developed
Commercial and defence drones face GPS spoofers, acoustic gyroscope attacks, and laser dazzlers that feed false data into autopilots. Traditional safe-RL shields fail because the controller “believes” corrupted inputs. The authors built ARMOR to pre-learn these attack signatures in simulation, transfer robustness to real flights, and cut the vast compute bills of adversarial training loops.
3. Who should care
- Logistics VPs deploying parcel or medical-supply drones
- Aviation risk officers ensuring air-traffic safety
- Security integrators planning UAV perimeter patrols
- Insurers modelling liability from sensor-spoofing losses
4. How the method works
During simulation the teacher VAE ingests full telemetry plus attack metadata—affected sensor, bias magnitude, duration—and emits a compact latent. A policy is trained directly on this resilient state with standard RL. In parallel, a student encoder comprising a temporal VAE and LSTM learns to reconstruct the teacher’s latent using only past sensor readings. At deployment the drone runs just the student and the learned policy, gaining attack robustness with zero privileged inputs, no extra adversary networks, and minimal computational overhead.
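A minimal sketch of the distillation step described above, again assuming PyTorch: an LSTM-based student encoder is trained to reproduce the teacher's latent from a window of past raw sensor readings, so that at deployment only the student and the frozen policy run on board. Names, dimensions, and the plain MSE distillation loss are illustrative assumptions, not the paper's exact training objective.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """Temporal encoder (LSTM over past raw sensor readings) trained to
    imitate the teacher's latent without privileged attack labels."""
    def __init__(self, sensor_dim=18, hidden_dim=64, latent_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(sensor_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, sensor_history):             # (batch, time, sensor_dim)
        out, _ = self.lstm(sensor_history)
        return self.head(out[:, -1])                # latent estimate at the current step

# Distillation: match the student's latent to the teacher's privileged latent
student = StudentEncoder()
optimiser = torch.optim.Adam(student.parameters(), lr=1e-3)
sensor_history = torch.randn(32, 20, 18)            # 20 past raw readings per sample
teacher_latent = torch.randn(32, 16)                # produced by the teacher in simulation
loss = nn.functional.mse_loss(student(sensor_history), teacher_latent)
optimiser.zero_grad(); loss.backward(); optimiser.step()

# Deployment: only the student and the frozen policy run on board
# action = policy(student(sensor_history))          # no attack labels, no adversary network
```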
5. How it was evaluated
ARMOR was tested in more than 7,000 quad-rotor simulations covering waypoint tracking, payload drops, and hover tasks. Threats included GPS spoofing, acoustic gyroscope bias, and laser attacks on optical sensors, each at three intensity levels. Key metrics were trajectory deviation, crash count, task-completion rate, and policy training time, benchmarked against vanilla RL, robust hierarchical controllers, and adversarially trained policies.
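As a rough illustration of what those metrics measure, the snippet below computes a mean trajectory deviation and a crash rate over toy data; the exact metric definitions and thresholds used in the paper's evaluation are not reproduced here.

```python
import numpy as np

def trajectory_deviation(flown, reference):
    """Mean Euclidean distance between the flown path and the reference
    waypoint path, sampled at matching timesteps (assumed definition)."""
    return float(np.linalg.norm(flown - reference, axis=1).mean())

def crash_rate(episodes):
    """Fraction of episodes that ended in a crash."""
    return sum(ep["crashed"] for ep in episodes) / len(episodes)

# Toy example: one episode's 3-D positions vs. its reference trajectory
flown = np.array([[0.0, 0.0, 10.0], [1.1, 0.2, 10.1], [2.0, 0.1, 9.9]])
reference = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [2.0, 0.0, 10.0]])
episodes = [{"crashed": False}, {"crashed": False}, {"crashed": True}]

print(f"deviation: {trajectory_deviation(flown, reference):.3f} m")
print(f"crash rate: {crash_rate(episodes):.0%}")
```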
6. How it performed
The framework cut mean trajectory error by 78% and prevented 95% of crashes across all attack scenarios, beating adversarially trained baselines by 31 percentage points in zero-shot tests while cutting training compute by 40%. (Source: arXiv 2506.22423, 2025)