When GenAI Models Fail, Ensemble Models Win | The Applied Machine Learning Collective of the Rockies

Overview

Your domain-specific model achieves 85.3% accuracy, but production requires 95%+. What do you do? This talk explores how ensemble models and multi-source data synthesis can bridge the gap when single GenAI models fall short.

Through a real-world case study of classifying customer support tickets for a building products company, discover how the "judge pattern" achieved 100% accuracy while maintaining explainability and cost-effectiveness. Learn when to use ensemble approaches, how to synthesize multiple data sources for greater value, and practical deployment strategies including shadow mode validation.

Key Takeaways:

When your domain-specific model hits 85% accuracy but production demands 95%+, ensemble models can bridge that critical gap
The "judge pattern" emerged as the winner by combining a fast NLP pipeline with an LLM judge that only reviews uncertain cases, achieving 100% accuracy while staying cost-effective
The real breakthrough wasn't just ensembling models; it was ensembling data sources across manufacturing silos (BOMs, spec sheets, work instructions) to create richer context
Real-world impact: 2x improvement in customer satisfaction and 1,400% ROI, proving that thoughtful ensemble design scales in production
Start simple with shadow mode deployment, monitor everything obsessively, and remember that your domain expertise is what makes these systems work
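
As a rough illustration of the shadow mode idea in the last takeaway, the sketch below serves the existing classifier's answer while silently running a candidate alongside it and logging both. The class name `ShadowRunner`, its fields, and the agreement metric are assumptions made for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A classifier is assumed to be any callable mapping ticket text to a category label.
Classifier = Callable[[str], str]


@dataclass
class ShadowRunner:
    """Serve the production model and run a candidate silently next to it."""
    production: Classifier
    candidate: Classifier
    log: List[Dict[str, str]] = field(default_factory=list)

    def classify(self, ticket: str) -> str:
        served = self.production(ticket)
        try:
            shadow = self.candidate(ticket)
        except Exception as exc:
            # A failure in the shadow path must never affect the served answer.
            shadow = f"ERROR: {exc}"
        self.log.append({"ticket": ticket, "served": served, "shadow": shadow})
        return served  # only the production answer reaches the user

    def agreement_rate(self) -> float:
        """Fraction of logged tickets where both systems chose the same label."""
        if not self.log:
            return 0.0
        agree = sum(1 for row in self.log if row["served"] == row["shadow"])
        return agree / len(self.log)
```

Disagreements in the log are the cases worth reviewing by hand before the candidate is allowed to serve traffic.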

Presentation Slides

Download the complete slide deck from this presentation to dive deeper into the technical details, case studies, and implementation strategies.

Download Slides (PDF)

Talk Highlights

The Challenge

Client: A building products company
Problem: Classify customer support tickets into categories

Warranty
Technical Support
Order Status
Installation
Product Information
Troubleshooting

Why GenAI Struggled

Two Approaches Tested:

NLP Pipeline: 85.3% accuracy
Few-Shot Prompting: 46.7% accuracy (wrong more often than right, though above the roughly 16.7% expected from random guessing across six categories)

Challenges:

Domain-specific terminology not in training data
Nuanced category differences
Required manufacturing process knowledge
Need for contextual business understanding

The Ensemble Solution

"When one model isn't enough, use multiple models working together"

Three Strategies Tested:

Voting Ensemble - Multiple models vote, majority wins
Judge Pattern - NLP pipeline classifies first; an LLM judge reviews only the uncertain cases
Hybrid - Voting + judge arbitration
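
The voting strategy can be sketched in a few lines. This is a minimal illustration, assuming each model is a callable that returns a category label; the function name `majority_vote` and the tie-breaking rule are assumptions, not details from the talk.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(ticket: str, models: Sequence[Callable[[str], str]]) -> Tuple[str, float]:
    """Return the most common label and the share of models that voted for it."""
    if not models:
        raise ValueError("at least one model is required")
    votes = [model(ticket) for model in models]
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the label voted by the earliest model in the list.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)
```

The vote share doubles as a crude confidence signal: a low share is one plausible trigger for handing the ticket to a judge, which is roughly what the hybrid strategy describes.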

Result: All three achieved 100% accuracy! Judge pattern provides explainability + accuracy.

Why Judge Pattern Won

Advantages:

Explainable (judge provides reasoning)
Cost-effective (only invokes when needed)
Easy to improve (add few-shot examples)
Accuracy on par with voting (all three strategies reached 100% in this case study)

When Judge Runs:

Pipeline confidence < 90%
~30% of tickets need judge review
Cost: $0.005 (pipeline) + $0.03 (judge when needed)
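
The routing rule and cost figures above can be put together in a short sketch. The 90% threshold, the roughly 30% judge rate, and the two dollar amounts come from the talk; treating the costs as per-ticket figures, and the shapes of the `pipeline` and `judge` callables, are assumptions made for illustration.

```python
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.90  # from the talk: judge runs when pipeline confidence < 90%
PIPELINE_COST = 0.005        # USD, from the talk (assumed to be per ticket)
JUDGE_COST = 0.03            # USD, from the talk (assumed to be per judged ticket)

# Hypothetical interfaces: the pipeline returns (label, confidence);
# the judge takes (ticket, proposed_label) and returns (label, reasoning).
Pipeline = Callable[[str], Tuple[str, float]]
Judge = Callable[[str, str], Tuple[str, str]]


def classify_with_judge(
    ticket: str,
    pipeline: Pipeline,
    judge: Judge,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[str, str, float]:
    """Return (label, explanation, cost), calling the judge only on uncertain tickets."""
    label, confidence = pipeline(ticket)
    if confidence >= threshold:
        return label, f"pipeline confidence {confidence:.2f}", PIPELINE_COST
    judged_label, reasoning = judge(ticket, label)
    return judged_label, reasoning, PIPELINE_COST + JUDGE_COST


def expected_cost_per_ticket(judge_rate: float) -> float:
    """Average cost when a fraction `judge_rate` of tickets is escalated to the judge."""
    return PIPELINE_COST + judge_rate * JUDGE_COST
```

Under these assumptions, a 30% judge rate works out to about $0.014 per ticket on average (0.005 + 0.30 × 0.03), compared with $0.035 if the judge reviewed every ticket.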

The Power of Synthesis

"Don't just ensemble models. Ensemble data sources."

Manufacturing companies have data silos:

BOMs (Bills of Materials) - Component relationships
Spec Sheets - Technical specifications
Work Instructions - Assembly procedures
Support Tickets - Historical issues

Key Insight: Synthesized answers provide specific recommendations, technical justification, part compatibility, installation guidance, risk mitigation, and source citations. The combined value is greater than the sum of the parts.
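
One way to picture the synthesis step is as collecting matching snippets from each silo and presenting them to a model with citation tags. The sketch below is a simplified illustration: the `Snippet` type, the substring-based lookup, and the prompt wording are all assumptions, and a real system would likely use a proper retrieval layer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snippet:
    source: str  # silo name, e.g. "BOM" or "Spec Sheet"
    doc_id: str  # hypothetical document identifier
    text: str


def gather_context(part_number: str, silos: Dict[str, List[Snippet]]) -> List[Snippet]:
    """Collect snippets from every silo that mention the part number."""
    needle = part_number.lower()
    hits: List[Snippet] = []
    for snippets in silos.values():
        hits.extend(s for s in snippets if needle in s.text.lower())
    return hits


def build_prompt(question: str, context: List[Snippet]) -> str:
    """Number each snippet so the answer can cite its sources by tag."""
    lines = ["Answer using only the sources below and cite them by tag.", ""]
    for i, snippet in enumerate(context, start=1):
        lines.append(f"[{i}] ({snippet.source} {snippet.doc_id}) {snippet.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Keeping the source name and document identifier attached to each snippet is what makes the source citations mentioned above possible.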

When to Use Ensembles

Use Ensembles When:

Single model accuracy is 85-90% (good but not great)
Domain is highly specialized
Errors have meaningful business cost
You have budget for multi-model inference
You need explainability

Don't Use Ensembles When:

Single model already achieves 95%+
Domain is well-covered by training data
Errors are low-cost
Latency or cost is critical
Simple rules work well

About the Speaker

Rachael Roland (she/her)

Founder, Applied Industrials

Building AI systems for industrial companies with a focus on domain-specific NLP challenges. Rachael specializes in solving the hard problems that arise when off-the-shelf AI models aren't enough for specialized industrial applications.

Connect with Rachael on LinkedIn