AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification: What It Means for Business Leaders
AutoGeTS turns scarce, imbalanced support tickets into training signal by selecting seed examples systematically, auto-generating targeted synthetic tickets with LLMs, and logging what works, delivering measurable recall and F1 lifts without rearchitecting production classifiers.
1. What the method is
AutoGeTS is an automated pipeline that strengthens text classifiers using targeted synthetic data produced by large language models. Instead of prompting at random, it selects real examples expected to yield the most useful synthetics for a chosen objective. Three complementary selectors—Sliding Window, Hierarchical Sliding Window, and a Genetic Algorithm—propose seed sets; an LLM generates variations conditioned on those seeds. The classifier is retrained and re-evaluated, and results are written to a knowledge map that records which selector works best for each class and metric. The loop is budgeted and repeatable, making synthetic augmentation operational rather than ad hoc.
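For readers who want to see the loop as code, here is a minimal sketch under stated assumptions: every name (KnowledgeMap, select_seeds, generate_synthetics, train_and_score) is a hypothetical stand-in rather than the authors' implementation, and only the control flow, trying a selector, generating from seeds, retraining, and recording the result, mirrors the description above.

```python
# Minimal sketch of an AutoGeTS-style augmentation loop (illustrative only).
# All names (select_seeds, generate_synthetics, train_and_score, KnowledgeMap)
# are hypothetical stand-ins for the paper's components, not its actual code.
from dataclasses import dataclass, field


@dataclass
class KnowledgeMap:
    """Records which seed-selection strategy worked best per (class, metric)."""
    best: dict = field(default_factory=dict)  # (class, metric) -> (strategy, score)

    def update(self, target_class, metric, strategy, score):
        key = (target_class, metric)
        if key not in self.best or score > self.best[key][1]:
            self.best[key] = (strategy, score)

    def suggest(self, target_class, metric, default):
        # Start future runs with the strategy that won in this context before.
        return self.best.get((target_class, metric), (default, None))[0]


def autogets_round(train_data, target_class, metric, strategies,
                   select_seeds, generate_synthetics, train_and_score,
                   knowledge_map, budget_per_strategy=3):
    """One budgeted round: try each selector, keep the best augmentation."""
    best_score, best_aug = None, None
    for strategy in strategies:                    # e.g. "SW", "HSW", "GA"
        for _ in range(budget_per_strategy):       # compute/time cap per strategy
            seeds = select_seeds(train_data, target_class, strategy)
            synthetic = generate_synthetics(seeds)  # LLM conditioned on the seeds
            score = train_and_score(train_data + synthetic, metric)
            knowledge_map.update(target_class, metric, strategy, score)
            if best_score is None or score > best_score:
                best_score, best_aug = score, synthetic
    return best_aug, best_score
```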
2. Why the method was developed
Support and IT ticketing data are chronically imbalanced and shift as categories evolve, causing recall to sag on low-volume classes. Gathering labels is slow and expensive; generic augmentation often adds volume without improving decision boundaries. The authors built AutoGeTS to convert limited data into targeted signal that directly moves business metrics, prioritizing classes where misrouting is costly while preserving overall quality. By learning which seed-selection strategy to apply per class and objective, the approach reduces trial-and-error and lowers the operational burden of keeping classifiers healthy under drift.
3. Who should care
Leaders running ITSM/CRM queues, enterprise service desks, and BPO operations; product owners responsible for automated ticket triage; data and ML platform teams tasked with improving recall and balanced accuracy under tight budgets; and risk or compliance managers who need auditable, metric-driven improvements without rearchitecting production models or pausing releases for large labeling campaigns.
4. How the method works
Pick a target class and objective (e.g., class recall or balanced accuracy). Generate candidate seed sets using sliding windows over feature space, a hierarchical drill-down of promising regions, or an evolutionary search. For each candidate, prompt an LLM to synthesize realistic tickets; merge them with training data; retrain a fixed architecture; and log outcomes. The knowledge map aggregates these trials to guide the next phase, so future runs start with strategies proven in similar contexts. Compute and time caps bound cost, and the system tracks spillover effects on non-target classes to avoid harming overall metrics.
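The sketch below illustrates one plausible reading of the sliding-window selector, not the paper's exact algorithm: it projects one class's tickets onto a single embedding dimension and slides a fixed window over the sorted projection to produce candidate seed sets. The function name, window sizes, TF-IDF/SVD embedding, and toy tickets are assumptions made for illustration.

```python
# Illustrative sliding-window seed selection over an embedding space.
# A plausible sketch of the "SW" idea, not the paper's exact procedure: project
# one class's tickets onto a single embedding dimension, sort them, and slide
# a fixed-size window over the ordering to yield candidate seed sets.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


def sliding_window_seed_sets(tickets, window_size=8, stride=4):
    """Yield candidate seed sets (lists of ticket texts) from one class."""
    vecs = TfidfVectorizer().fit_transform(tickets)
    # 1-D projection so "neighbouring" tickets land in the same window.
    axis = TruncatedSVD(n_components=1, random_state=0).fit_transform(vecs).ravel()
    order = np.argsort(axis)
    for start in range(0, max(1, len(order) - window_size + 1), stride):
        idx = order[start:start + window_size]
        yield [tickets[i] for i in idx]


# Example: candidate seed sets for a hypothetical low-volume "VPN access" class.
tickets = ["vpn token expired", "cannot connect to vpn", "vpn login loops",
           "mfa prompt missing on vpn", "vpn drops every hour",
           "new starter needs vpn access", "vpn client update fails",
           "vpn blocked on guest wifi", "reset vpn certificate"]
for seeds in sliding_window_seed_sets(tickets, window_size=4, stride=2):
    print(seeds)   # each list would be passed to the LLM as prompt seeds
```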
5. How it was evaluated
Experiments used 39,100 real tickets across 15 imbalanced classes with 60/20/20 splits for training, optimization testing, and holdout testing. A stable baseline classifier was retrained after each synthetic augmentation to isolate its impact. The study ran 180 configurations covering three selectors and four objectives across all classes, then compared against classic augmentation. Generalization checks included public datasets (TREC-6 and Amazon Reviews 2023). Reporting emphasized class-level recall, class balanced accuracy, overall balanced accuracy, and overall F1, along with analysis of cross-class spillovers and cost budgets.
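As a sanity check on the reported grid, the snippet below reproduces the configuration count (3 selectors, 4 objectives, 15 classes) and shows how the cited metrics can be computed with standard scikit-learn calls on toy labels; the toy predictions and the macro averaging for F1 are assumptions, and only the grid arithmetic comes from the study.

```python
# Back-of-envelope check of the experiment grid, plus the reported metrics
# computed with standard scikit-learn calls. The labels below are invented;
# only the 3 x 4 x 15 = 180 arithmetic reflects the study's setup.
from itertools import product
from sklearn.metrics import recall_score, balanced_accuracy_score, f1_score

selectors = ["SW", "HSW", "GA"]                      # three seed selectors
objectives = ["class_recall", "class_bal_acc",
              "overall_bal_acc", "overall_f1"]       # four optimization targets
classes = range(15)                                  # fifteen ticket classes
configs = list(product(selectors, objectives, classes))
print(len(configs))                                  # 180 configurations

# Per-class recall, overall balanced accuracy, and F1 on toy predictions.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
print(recall_score(y_true, y_pred, average=None))    # recall for each class
print(balanced_accuracy_score(y_true, y_pred))       # overall balanced accuracy
print(f1_score(y_true, y_pred, average="macro"))     # overall F1 (macro assumed)
```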
6. How it performed
Targeted synthetics reliably raised the intended metric, especially for small, hard classes, while overall balanced accuracy and F1 typically improved as well. No single selector dominated; using the knowledge map to choose per-class strategies beat any one method. Compared with generic augmentation, example-aware LLM synthesis delivered larger, more focused gains without altering the production model, offering a practical path to sustain ticket classifiers under drift. (Source: arXiv 2508.10000, 2025)