Demajh, Inc.

FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning: what it means for business leaders

FedMultiEmo shows how automotive fleets can sense driver emotions without exporting video or biometrics: edge devices train multimodal models locally and still deliver near-lab accuracy while staying within tight latency and memory budgets.

1. What the method is

The framework combines a 60-k-parameter CNN for \(48\times48\) face crops with a random-forest classifier on heart-rate, electrodermal activity and skin-temperature streams. Each vehicle trains both models locally, uploads only weight updates through FedAvg, and receives an aggregated global network. At runtime, the two modality scores are fused by majority vote, yielding cabin-level emotion labels in under 50 ms on a Raspberry Pi 4.
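For readers who want a concrete picture of the fusion step, the sketch below shows one way two per-modality predictions could be combined by majority vote. The class names and the confidence-based tie-break are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of decision-level fusion by majority vote, assuming each
# modality returns a discrete label plus a confidence score.
from collections import Counter

EMOTIONS = ["happy", "neutral", "angry", "stressed"]

def fuse_majority(vision_label: str, vision_conf: float,
                  bio_label: str, bio_conf: float) -> str:
    """Fuse two per-modality predictions by majority vote.

    With only two voters a tie is possible; here the higher-confidence
    modality wins the tie (an assumption for illustration).
    """
    votes = Counter([vision_label, bio_label])
    label, count = votes.most_common(1)[0]
    if count == 2:                      # both modalities agree
        return label
    # disagreement: fall back to the more confident modality
    return vision_label if vision_conf >= bio_conf else bio_label

# Example: CNN says "neutral" (0.62), random forest says "stressed" (0.81)
print(fuse_majority("neutral", 0.62, "stressed", 0.81))  # -> "stressed"
```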

2. Why the method was developed

Vision-only driver monitoring struggles with sunglasses, low light and head turns, while cloud uploads of raw video or physiology clash with GDPR and consume bandwidth. The authors sought a privacy-preserving, multimodal approach that remains robust to occlusions, scales to millions of cars without GPU servers, and complies with emerging data-sovereignty rules.

3. Who should care

Automotive OEMs chasing Euro NCAP driver-state ratings, Tier-1 suppliers building cockpit platforms, usage-based insurers evaluating risk, ride-hail operators seeking safer trips, and wellness-app developers aiming to personalize in-car experiences can all deploy FedMultiEmo to detect fatigue, stress or distraction without handling sensitive raw data.

4. How the method works

Edge clients capture synchronized video frames and biosignals, apply local augmentation, and train in 120-second epochs. Flower orchestrates FedAvg with client-size weighting; a lightweight secure-aggregation layer hides individual updates. After each round, the server redistributes the global weights. Decision-level fusion on device outputs the final emotion class (happy, neutral, angry, stressed), which can trigger human-machine-interface (HMI) or safety actions.
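The weighting rule at the heart of this loop is simple enough to show directly. The sketch below implements client-size-weighted FedAvg in plain NumPy, the same averaging rule Flower's FedAvg strategy applies on the server; the data structures and example numbers are hypothetical, and the real system exchanges weights over Flower's communication layer rather than in-process lists.

```python
# Illustrative sketch of FedAvg with client-size weighting.
import numpy as np

def fedavg(client_updates):
    """client_updates: list of (weights, num_examples) tuples, where
    `weights` is a list of ndarrays (one per model layer)."""
    total_examples = sum(n for _, n in client_updates)
    num_layers = len(client_updates[0][0])
    aggregated = []
    for layer in range(num_layers):
        # weighted average of this layer across all clients
        layer_sum = sum(w[layer] * n for w, n in client_updates)
        aggregated.append(layer_sum / total_examples)
    return aggregated

# Example: three vehicles with differently sized local datasets
updates = [([np.full((2, 2), 1.0)], 100),
           ([np.full((2, 2), 2.0)], 300),
           ([np.full((2, 2), 4.0)], 100)]
global_weights = fedavg(updates)
print(global_weights[0])  # each entry = (1*100 + 2*300 + 4*100) / 500 = 2.2
```

Weighting by local dataset size keeps vehicles that contribute more driving data from being drowned out by sparsely used ones, while every client still influences the global model.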

5. How it was evaluated

Five Raspberry Pi-based clients trained on FER2013 images and a ten-driver physiological dataset covering daylight, night and sunglasses scenarios. Benchmarks recorded accuracy, convergence rounds, per-round runtime, peak RAM and network load. Comparisons included centralized multimodal training, a unimodal federated CNN, and a unimodal biosignal random forest.
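As a rough illustration of how such per-round budgets can be measured on a Linux-based client like the Raspberry Pi, the sketch below times one local training round and reads peak memory from the operating system; the train_one_round hook and the logging format are hypothetical stand-ins, not the paper's actual benchmarking harness.

```python
# Hedged sketch: log per-round runtime and peak RAM on a Linux client.
import time
import resource

def benchmark_round(train_one_round, round_idx):
    start = time.perf_counter()
    train_one_round()                      # local training for this FL round
    runtime_s = time.perf_counter() - start
    # ru_maxrss is reported in kilobytes on Linux (including Raspberry Pi OS)
    peak_ram_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"round {round_idx}: {runtime_s:.1f} s, peak RAM {peak_ram_mb:.0f} MB")
    return runtime_s, peak_ram_mb

# Example with a dummy training step
benchmark_round(lambda: time.sleep(0.1), round_idx=1)
```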

6. How it performed

FedMultiEmo reached 87 % accuracy, matching centralized baselines, while keeping all raw data on board. Global convergence required 18 rounds (≈ 36 min total); each client used < 200 MB RAM and < 7 MB network traffic. Accuracy dipped only 2 % with sunglasses and held steady in low light. (Source: arXiv 2507.15470, 2025)

← Back to dossier index