December 2025

When GenAI Models Fail, Ensemble Models Win

AMLC of the Rockies - December 2025

Overview

Your domain-specific model achieves 85.3% accuracy, but production requires 95%+. What do you do? This talk explores how ensemble models and multi-source data synthesis can bridge the gap when single GenAI models fall short.

Through a real-world case study of classifying customer support tickets for a building products company, discover how the "judge pattern" achieved 100% accuracy while maintaining explainability and cost-effectiveness. Learn when to use ensemble approaches, how to synthesize multiple data sources for greater value, and practical deployment strategies including shadow mode validation.

Key Takeaways:

  • When your domain-specific model hits 85% accuracy but production demands 95%+, ensemble models can bridge that critical gap
  • The "judge pattern" emerged as the winner by combining a fast NLP pipeline with an LLM judge that only reviews uncertain cases, achieving 100% accuracy while staying cost-effective
  • The real breakthrough wasn't just ensembling models; it was ensembling data sources across manufacturing silos (BOMs, spec sheets, work instructions) to create richer context
  • Real-world impact: 2x improvement in customer satisfaction and 1,400% ROI, proving that thoughtful ensemble design scales in production
  • Start simple with shadow mode deployment, monitor everything obsessively, and remember that your domain expertise is what makes these systems work

Presentation Slides

Download the complete slide deck from this presentation to dive deeper into the technical details, case studies, and implementation strategies.

Download Slides (PDF)

Talk Highlights

The Challenge

Client: A building products company
Problem: Classify customer support tickets into categories

  • Warranty
  • Technical Support
  • Order Status
  • Installation
  • Product Information
  • Troubleshooting

Why GenAI Struggled

Two Approaches Tested:

  • NLP Pipeline: 85.3% accuracy
  • Few-Shot Prompting: 46.7% accuracy (worse than a coin flip, though with six categories uniform random guessing would score only about 16.7%)

Challenges:

  • Domain-specific terminology not in training data
  • Nuanced category differences
  • Required manufacturing process knowledge
  • Need for contextual business understanding

The Ensemble Solution

"When one model isn't enough, use multiple models working together"

Three Strategies Tested:

  • Voting Ensemble - Multiple models vote, majority wins
  • Judge Pattern - NLP pipeline classifies every ticket; an LLM judge reviews the low-confidence cases
  • Hybrid - Voting + judge arbitration

Result: All three achieved 100% accuracy! Judge pattern provides explainability + accuracy.
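
As a rough illustration of the voting strategy listed above, the sketch below takes one predicted label per model and returns the most common one. The function name `majority_vote` and the tie-breaking rule (first label seen wins) are assumptions made for this example; the talk does not say how ties were handled or which models voted.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models.

    Ties go to the label that appears earliest in `predictions`
    (an assumption for this sketch, not something stated in the talk).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    # Walk the votes in their original order so tie-breaking is deterministic.
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical votes from three models on a single ticket.
votes = ["Warranty", "Installation", "Warranty"]
winner = majority_vote(votes)
```

Every model runs on every ticket under this scheme, which is the cost trade-off the judge pattern avoids.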

Why Judge Pattern Won

Advantages:

  • Explainable (judge provides reasoning)
  • Cost-effective (only invokes when needed)
  • Easy to improve (add few-shot examples)
  • Accuracy on par with voting (all three strategies reached 100% in this test)

When Judge Runs:

  • Pipeline confidence < 90%
  • ~30% of tickets need judge review
  • Cost: $0.005 (pipeline) + $0.03 (judge when needed)
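
The routing rule and cost figures above can be sketched as follows. This is a minimal sketch, not the speaker's implementation: `classify`, `Decision`, and `expected_cost` are hypothetical names, the `pipeline` and `judge` callables are stand-ins for the real components, and treating the two cost figures as per-ticket amounts is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.90  # from the talk: judge runs when confidence < 90%
PIPELINE_COST = 0.005        # from the talk; assumed to be per ticket
JUDGE_COST = 0.03            # from the talk; assumed per ticket, only when the judge runs


@dataclass
class Decision:
    label: str
    used_judge: bool
    reasoning: Optional[str]  # judge's explanation, if it ran
    cost: float


def classify(
    ticket: str,
    pipeline: Callable[[str], Tuple[str, float]],
    judge: Callable[[str, str], Tuple[str, str]],
) -> Decision:
    """Run the pipeline first; call the judge only on low-confidence tickets."""
    label, confidence = pipeline(ticket)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision(label, False, None, PIPELINE_COST)
    # The judge sees the ticket plus the pipeline's guess and returns
    # a final label together with its reasoning.
    judged_label, reasoning = judge(ticket, label)
    return Decision(judged_label, True, reasoning, PIPELINE_COST + JUDGE_COST)


def expected_cost(judge_rate: float) -> float:
    """Average cost when `judge_rate` of tickets are escalated to the judge."""
    return PIPELINE_COST + judge_rate * JUDGE_COST


# With ~30% of tickets escalated: 0.005 + 0.30 * 0.03, roughly $0.014 on average.
blended = expected_cost(0.30)
```

Under these assumptions the blended cost is well under half of what it would be if the judge reviewed every ticket (0.005 + 0.03 = 0.035), which is the sense in which the pattern stays cost-effective.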

The Power of Synthesis

"Don't just ensemble models. Ensemble data sources."

Manufacturing companies have data silos:

  • BOMs (Bill of Materials) - Component relationships
  • Spec Sheets - Technical specifications
  • Work Instructions - Assembly procedures
  • Support Tickets - Historical issues

Key Insight: Synthesized answers provide specific recommendations, technical justification, part compatibility, installation guidance, risk mitigation, and source citations. Value > Sum of Parts.
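
One way to picture data-source synthesis is to gather every record about a part from each silo and tag it with where it came from, so a downstream model can cite its sources. The sketch below assumes each silo can be looked up by a shared part ID; `build_context`, the part ID `P-100`, and all record text are invented placeholders, not data from the case study.

```python
def build_context(part_id, sources):
    """Collect every record about `part_id`, prefixed with its source name."""
    lines = []
    for source_name, records in sources.items():
        for record in records.get(part_id, []):
            # Keeping the source tag is what makes citations possible later.
            lines.append(f"[{source_name}] {record}")
    return "\n".join(lines)


# Placeholder records for illustration only.
sources = {
    "BOM": {"P-100": ["uses fastener F-7"]},
    "Spec Sheet": {"P-100": ["max load 40 kg"]},
    "Work Instructions": {"P-100": ["torque fasteners before sealing"]},
    "Support Tickets": {"P-100": ["past issue: seal failure after overtightening"]},
}

context = build_context("P-100", sources)
```

The resulting text block could then be passed to a model alongside the ticket, giving it the component, specification, procedure, and history together instead of one silo at a time.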

When to Use Ensembles

Use Ensembles When:

  • Single model accuracy is 85-90% (good but not great)
  • Domain is highly specialized
  • Errors have meaningful business cost
  • You have budget for multi-model inference
  • You need explainability

Don't Use Ensembles When:

  • Single model already achieves 95%+
  • Domain is well-covered by training data
  • Errors are low-cost
  • Latency/cost is critical
  • Simple rules work well

About the Speaker

Rachael Roland (she/her)

Founder, Applied Industrials

Building AI systems for industrial companies with a focus on domain-specific NLP challenges. Rachael specializes in solving the hard problems that arise when off-the-shelf AI models aren't enough for specialized industrial applications.

Connect with Rachael on LinkedIn