Demajh, Inc.

Can SGD Handle Heavy-Tailed Noise?: what it means for business leaders

This paper shows standard SGD still converges under heavy-tailed gradient noise, with near-optimal sample complexity, practical step-size schedules, and guidance for mini-batching—translating into simpler stacks and predictable training under spiky data.

1. What the method is

The work develops a rigorous framework showing when and how plain Stochastic Gradient Descent (SGD) converges even if gradient noise is heavy-tailed—meaning only a bounded p-th moment exists for p in (1, 2]. It delivers sharp, often minimax-optimal, sample-complexity bounds for convex, strongly convex, and Hölder-smooth non-convex objectives and extends the results to mini-batch SGD. The analysis also clarifies which iterate you should return (averaged, random, or last) in different regimes and provides matching lower bounds, including an SGD-specific one for non-convex problems. In business terms: you can keep a simple, well-supported optimizer and still have principled guarantees under messy, outlier-prone data—if you follow the recommended step-size schedules and evaluation choices.
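The noise model behind these guarantees can be made concrete with a short simulation. The snippet below is an illustrative sketch, not from the paper: it uses a symmetrized Pareto distribution whose moments of order below 1.5 are finite (so a bounded p-th moment exists for p < 1.5) but whose variance is infinite, which is exactly the "heavy-tailed" regime the analysis covers.

```python
import numpy as np

# Illustrative heavy-tailed gradient-noise model (assumption, not the
# paper's construction): a symmetrized Pareto/Lomax distribution with
# tail index 1.5 has finite moments of order p < 1.5 but infinite variance.
rng = np.random.default_rng(0)
tail_index = 1.5
n = 100_000
signs = rng.choice([-1.0, 1.0], size=n)
noise = signs * rng.pareto(tail_index, size=n)

# The p-th absolute moment (p = 1.2 < 1.5) is finite and stable...
p_moment = np.mean(np.abs(noise) ** 1.2)
# ...while the empirical variance is dominated by a few extreme samples
# and does not stabilize as n grows.
emp_var = noise.var()
print("empirical p-th moment (p=1.2):", p_moment)
print("empirical variance (unstable):", emp_var)
```

Because the signs are symmetric, the noise is mean-zero, matching the unbiased-gradient assumption in Section 4.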

2. Why the method was developed

A widespread assumption says heavy-tailed noise breaks vanilla SGD, pushing teams toward clipped or highly adaptive optimizers that add tuning burden, code complexity, and sometimes licensing cost. That conventional wisdom can be costly and unnecessary. The authors revisit fundamentals and show the pessimism is overstated: under broad, realistic assumptions, unmodified SGD still converges with optimal rates. By tightening upper and lower bounds and exposing the limits—such as where high-probability guarantees remain difficult without adaptivity—the work helps practitioners pick the simplest tool that works, reduce optimizer churn, and set realistic expectations for time-to-accuracy when data are spiky, non-stationary, or sourced from volatile environments like ads, finance, or reinforcement learning.

3. Who should care

CTOs and ML platform owners standardizing training stacks; research leads comparing optimizers; teams fine-tuning models in recommendation, ads, finance, RL, or risk where outliers abound; and compliance leaders evaluating claims that “you must switch optimizers.” Edge-AI and robotics groups that need minimal memory footprints and straightforward deployments benefit because SGD remains easy to implement and audit. Data-science managers planning experimentation roadmaps can also use these results to budget runs and decide when to average iterates, when a random iterate is enough, and where mini-batching offers headroom without abandoning a plain-SGD baseline.

4. How the method works

The analysis assumes unbiased stochastic gradients with bounded p-th moments (potentially infinite variance). It proves convergence in expectation using tailored step-size schedules and output selection. In convex problems, simple or weighted averaging of iterates suffices with schedules like η_t ∝ t^(−1/p). In strongly convex problems, a classic schedule η_t = 2/(μt) is optimal and does not require knowing the tail index p. For Hölder-smooth non-convex objectives, a polynomially decaying schedule tied to smoothness constants ensures convergence to a stationary point. The paper further analyzes mini-batch SGD under bounded central moments, showing comparable complexity and explaining when batching improves constants without changing the overall rate picture.
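A minimal sketch of the strongly convex recipe above, on a hypothetical 1-D quadratic f(x) = (μ/2)x² with symmetric heavy-tailed noise. The test problem, noise distribution, and running average are illustrative assumptions; the paper's exact output rule per regime may differ.

```python
import numpy as np

# Sketch of plain SGD with the eta_t = 2/(mu*t) schedule from the text,
# on an assumed 1-D strongly convex quadratic with mean-zero,
# infinite-variance (symmetrized Pareto) gradient noise.
rng = np.random.default_rng(1)
mu = 1.0          # strong-convexity constant of f(x) = (mu/2) * x**2
T = 20_000        # number of stochastic gradient steps
x = 10.0          # deliberately far-off initial point
avg = 0.0         # running average of iterates (one plausible output choice)

for t in range(1, T + 1):
    noise = rng.choice([-1.0, 1.0]) * rng.pareto(1.5)  # heavy-tailed, mean zero
    grad = mu * x + noise          # unbiased stochastic gradient of f
    eta = 2.0 / (mu * t)           # schedule needs mu but NOT the tail index p
    x -= eta * grad
    avg += (x - avg) / t           # incremental mean of x_1, ..., x_t

print("final iterate:", x)
print("averaged iterate:", avg)
```

Note the practical point from the text: the schedule depends on μ but never on p, so no tail-index estimation step is needed before training.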

5. How it was evaluated

Evidence comes in two parts. First, formal proofs establish tight upper bounds and complementary lower bounds across convex, strongly convex, and non-convex settings under p-bounded-moment and bounded-central-moment assumptions. Second, controlled experiments illustrate predicted behaviors: sensitivity to step-size schedules, which iterate to output under heavy-tailed noise, and mini-batch effects, alongside comparisons to popular adaptive or clipped baselines in strongly convex cases. The goal is decision support—turning the theory into guidance on training plans, monitoring criteria, and stop conditions—rather than leaderboard chasing, so teams can apply the insights directly in production roadmaps.

6. How it performed

In convex problems, SGD matches the minimax-optimal sample complexity on the order of ε^(−p/(p−1)). In strongly convex problems, it achieves an optimal ε^(−p/[2(p−1)]) rate using the simple 2/(μt) schedule—no tail-index knowledge required. For Hölder-smooth non-convex objectives, SGD converges to a stationary point with complexity on the order of ε^(−2p/(p−1)), with a new algorithm-specific lower bound showing the rate is essentially tight; a mini-batch variant attains a comparable rate under standard smoothness and central-moment assumptions. Practically: many teams can keep vanilla SGD, add batching where sensible, and still enjoy principled guarantees under outlier-heavy data. (Source: arXiv 2508.04860, 2025)
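The order-level rates above can double as a rough run-budget calculator. The helper below is a planning heuristic only: it drops all constants and problem-dependent factors, so the numbers indicate scaling, not actual sample counts.

```python
# Back-of-envelope sample-complexity planner based on the order-level
# rates quoted above (constants ignored; function name is ours).
def sgd_sample_complexity(eps: float, p: float, setting: str) -> float:
    """Order-of-magnitude samples to reach accuracy eps with tail index p."""
    if not 1.0 < p <= 2.0:
        raise ValueError("tail index p must lie in (1, 2]")
    if setting == "convex":
        return eps ** (-p / (p - 1))
    if setting == "strongly_convex":
        return eps ** (-p / (2 * (p - 1)))
    if setting == "nonconvex":  # Hölder-smooth, convergence to stationarity
        return eps ** (-2 * p / (p - 1))
    raise ValueError(f"unknown setting: {setting}")

# Heavier tails (p near 1) are dramatically more expensive than p = 2:
print(sgd_sample_complexity(0.01, 2.0, "convex"))   # eps^-2  -> 1e4
print(sgd_sample_complexity(0.01, 1.2, "convex"))   # eps^-6  -> 1e12
```

The gap between the two printed values is the budgeting takeaway: time-to-accuracy degrades sharply as tails get heavier, even though plain SGD still converges.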
