Beyond Prompt Engineering: How Re-Engineering an AI System Improved Performance, Consistency, and Cost Efficiency

Marco Ornelas | May 25, 2026 | 4 min read

AI-powered applications often perform impressively during initial development. The real challenge begins when those systems move into production environments where speed, reliability, consistency, and operational costs become critical.

At Blue Trail Software, we recently faced this challenge while working on an AI-driven recruitment platform designed to evaluate candidate-job compatibility. The platform initially delivered promising results, but as usage increased, inconsistencies and performance limitations began affecting both user experience and operational efficiency.

The issue was not a complete system failure. Instead, it was a gradual breakdown caused by variability, latency, and growing costs. This project became a lesson in an increasingly important principle for modern AI systems:

Prompt engineering alone is not enough. Sustainable AI performance requires software architecture, orchestration, and deterministic design.

The Limits of Monolithic Prompt Engineering

Our platform used Large Language Models (LLMs) to evaluate candidate profiles against job descriptions and generate compatibility scores.

Initially, the implementation relied heavily on a large, instruction-rich prompt designed to manage the entire evaluation process.

This approach created several critical issues.

Inconsistent AI Scoring and Output Variability

The most significant problem was inconsistency. The exact same candidate-job combination could produce dramatically different results:

First evaluation: 60
Second evaluation: 10
Third evaluation: 95

No underlying data changed between evaluations.

The system was producing unpredictable outputs, which severely impacted trust and usability. For recruitment workflows, this level of variability creates major concerns:

Reduced confidence in recommendations
Poor user experience
Difficulty explaining results
Increased operational risk

Consistency is essential for production AI systems.

Latency and Performance Bottlenecks

The original architecture required the model to process:

Hundreds of instructions
Entire profile files
Large contextual datasets
Complex scoring logic

This created significant delays.

Average evaluation time: 18 seconds per request

For real-world users, this created friction and negatively affected platform responsiveness.

High Token Consumption and Operational Costs

Large prompts and unrestricted text outputs generated unnecessary token usage. The result:

Increased API expenses
Inefficient scaling
Reduced profitability

As AI systems grow, token efficiency becomes a business concern as much as a technical one.

Re-Engineering the Architecture: From Prompt to AI Orchestration

Rather than continuing to optimize a single massive prompt, we redesigned the system architecture entirely.

The new approach introduced a sub-agent orchestration model.

Instead of asking one AI model to perform everything, we distributed responsibilities across specialized components.

Step 1: Introducing an Orchestrator Agent

A central orchestration layer was created to manage the evaluation process. The orchestrator:

Receives candidate and job information
Retrieves contextual information through ChromaDB
Extracts key evaluation categories
Routes information to specialized components

Key categories included:

Work experience
Technical skills
Education
Industry alignment
Required competencies

This transformed a single AI task into smaller focused evaluations.

Parallel AI Processing for Faster Performance

Each category was sent to its own specialized sub-agent, and all sub-agents ran simultaneously rather than one after another:

Old process: Candidate → Single prompt → Evaluation

New process: Candidate → Orchestrator → Multiple specialized agents → Aggregated results

Parallel execution reduced processing time significantly. Results:

Previous latency: 18 seconds
New latency: 8 seconds

This represented a reduction of roughly 56% in waiting time (10 of the original 18 seconds).
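
The fan-out pattern described above can be sketched with Python's asyncio. This is a minimal illustration, not the platform's actual code: the function names, the category identifiers, and the simulated 0.1-second delay standing in for a model call are all assumptions.

```python
import asyncio

# Category identifiers are hypothetical, derived from the list in this article.
CATEGORIES = [
    "work_experience",
    "technical_skills",
    "education",
    "industry_alignment",
    "required_competencies",
]


async def evaluate_category(category: str, candidate: dict, job: dict) -> dict:
    """Stand-in for one sub-agent; a real version would call an LLM here."""
    await asyncio.sleep(0.1)  # simulates the latency of a model call
    return {"category": category, "score": 0}  # placeholder score


async def orchestrate(candidate: dict, job: dict) -> list[dict]:
    """Start one evaluation per category and wait for all of them together."""
    tasks = [evaluate_category(c, candidate, job) for c in CATEGORIES]
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*tasks)
```

Calling asyncio.run(orchestrate(candidate, job)) returns one result per category. Because the simulated calls overlap, the batch takes about as long as the slowest single call rather than the sum of all five.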

Moving Business Logic Out of the LLM

Another major improvement involved separating AI reasoning from deterministic application logic.

Instead of allowing the LLM to perform calculations and return open-ended responses, we made three changes:

JSON output became mandatory
Mathematical operations moved into the application API
Rating calculations became deterministic

This produced several advantages:

Reduced hallucinations

The LLM no longer performed calculations, so it could no longer produce inconsistent arithmetic.

Lower token consumption

Structured outputs required fewer tokens.

Greater transparency

Administrators could now view:

Individual category scores
Detailed evaluation criteria
Final weighting logic

The system became explainable rather than opaque.
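
A minimal sketch of this separation follows, assuming each sub-agent replies with JSON of the form {"category": ..., "score": ...} and that the final rating is a weighted average. The reply shape, the weights, and the function names are hypothetical; the article does not publish the real schema or weighting.

```python
import json

# Hypothetical weights; the real weighting logic is not given in the article.
WEIGHTS = {
    "work_experience": 0.30,
    "technical_skills": 0.30,
    "education": 0.10,
    "industry_alignment": 0.10,
    "required_competencies": 0.20,
}


def parse_category_result(raw: str) -> tuple[str, int]:
    """Validate one sub-agent reply and return (category, score)."""
    data = json.loads(raw)
    category, score = data["category"], data["score"]
    if category not in WEIGHTS:
        raise ValueError(f"unknown category: {category!r}")
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError(f"score must be an integer from 0 to 100, got {score!r}")
    return category, score


def final_score(raw_results: list[str]) -> float:
    """Compute the weighted average in application code, not in the model."""
    scores = dict(parse_category_result(r) for r in raw_results)
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)
```

With the same five category scores as input, final_score always returns the same number, and a malformed or out-of-range reply raises an error instead of silently skewing the rating.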

Improving AI Consistency Through Data Preprocessing

During iterative testing with tools like Claude Code, we found that the issue was not primarily flawed rules but incomplete contextual understanding. When information lacked clarity, the model improvised, and that improvisation caused major scoring variability.

NLP Preprocessing with Python

Before data reached the LLM, we introduced traditional Natural Language Processing techniques. The preprocessing layer applied:

Lemmatization
Stemming
Skill normalization

Examples:

"Engineer"
"Engineering"
"Software Engineer"

These terms became semantically aligned before evaluation. Benefits included:

Improved skill matching
Reduced ambiguity
More consistent scoring

Traditional NLP significantly strengthened AI performance.
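
The idea can be illustrated with a standard-library-only sketch. It is a deliberately crude stand-in: a production pipeline would more likely use an NLP library's lemmatizer or stemmer, and the alias table, suffix list, and function names below are assumptions made for the example.

```python
import re

# Hypothetical alias table mapping common variants to one canonical label.
ALIASES = {
    "js": "javascript",
    "postgres": "postgresql",
}

# Deliberately tiny suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "s")


def crude_stem(token: str) -> str:
    """Strip one known suffix, keeping at least four characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, resolve aliases, then stem each token."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return [crude_stem(ALIASES.get(t, t)) for t in tokens]
```

Here normalize("Engineer") and normalize("Engineering") both return ["engineer"], and normalize("Software Engineer") returns ["software", "engineer"], so all three share a common token before the text reaches the model.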

Introducing Deterministic Decision Structures

Prompt structures were redesigned using more explicit logic. New patterns included:

Numbered rules
Decision tables
Stop conditions
If-else style evaluation paths

Example:

If a mandatory requirement is missing: Return result → Stop evaluation

This reduced unnecessary model interpretation and increased predictability.
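
One way to keep such rules explicit is to store them as data and render them into the prompt as a numbered list. The sketch below is illustrative only: the helper name is hypothetical, and apart from the stop condition quoted above, the rule wording is an assumption.

```python
# Only the first rule's stop condition comes from the article;
# the remaining wording is illustrative.
RULES = [
    "If a mandatory requirement is missing, return the result and stop the evaluation.",
    "Otherwise, score the category from 0 to 100.",
    "Respond with JSON only.",
]


def render_rules(rules: list[str]) -> str:
    """Number each rule so the model sees an explicit, ordered checklist."""
    return "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
```

Keeping the rules in a list makes their order reviewable and testable, instead of burying them in a long block of free text.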

Prioritizing Critical Skills and Requirements

Not every job requirement should contribute equally to candidate evaluation. We introduced weighted critical keywords to ensure essential qualifications received stronger influence.

Examples:

Critical requirements:

Required programming languages
Certifications
Years of experience
Mandatory domain expertise

This produced more realistic ranking behavior and improved recommendation quality.
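
A small sketch shows how weighting changes the outcome. It assumes each requirement carries a numeric weight and that matching is an exact lookup in the candidate's skill set. The weights, skill names, and function name are hypothetical.

```python
def weighted_match(requirements: dict[str, float], candidate_skills: set[str]) -> float:
    """Return the share of total requirement weight the candidate covers (0 to 100)."""
    total = sum(requirements.values())
    if total <= 0:
        raise ValueError("requirements must have a positive total weight")
    matched = sum(w for skill, w in requirements.items() if skill in candidate_skills)
    return round(100 * matched / total, 1)


# Hypothetical job: two critical requirements (weight 3) and two optional ones (weight 1).
JOB_REQUIREMENTS = {
    "python": 3.0,
    "aws certification": 3.0,
    "docker": 1.0,
    "graphql": 1.0,
}
```

A candidate with "python" and "docker" scores 50.0, while one with "docker" and "graphql" scores 25.0. Both match two of the four requirements, but the first covers a critical one.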

Reducing AI Costs with Database Caching

One of the most impactful optimizations involved introducing a caching layer. Each candidate-job pair was reduced to a hash of its underlying data, which served as the cache key.

The system checks: Has this evaluation already been performed?

If no underlying data changes:

No LLM request executes
Cached results are returned instantly

In effect, recurring evaluations incur no LLM API cost.

Benefits included:

Lower API expenses
Faster response times
Reduced infrastructure load
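
The lookup can be sketched as follows, assuming the key is a SHA-256 hash of the canonical JSON of both records. An in-memory dictionary stands in for the database table, and the function names are hypothetical.

```python
import hashlib
import json

_cache: dict[str, dict] = {}  # in-memory stand-in for the database table


def pair_key(candidate: dict, job: dict) -> str:
    """Hash the canonical JSON of both records so any data change yields a new key."""
    payload = json.dumps(
        {"candidate": candidate, "job": job},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluate_with_cache(candidate: dict, job: dict, evaluate) -> dict:
    """Return the cached result when present; otherwise call evaluate and store it."""
    key = pair_key(candidate, job)
    if key not in _cache:
        _cache[key] = evaluate(candidate, job)  # the only path that reaches the LLM
    return _cache[key]
```

Sorting the keys before hashing means two records with the same content but different field order produce the same key, while any edit to the candidate or the job produces a new key and triggers a fresh evaluation.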

Results: Performance, Stability, and Cost Improvements

The architectural redesign generated measurable improvements across multiple dimensions.

Token Consumption

Previous: 5,000 tokens
Current: 3,500 tokens
Improvement: 25% reduction reported (the rounded figures above work out to 30%)

AI Output Stability

Previous average score variance: 37 points
Current average score variance: 5 points
Improvement: approximately 86% less variation between runs

API Cost Reduction

Combined improvements from:

Token optimization
Structured responses
Database caching

Result: 30% reduction in overall API costs

Lessons Learned: Prompt Engineering Alone Is Not Enough

One of the most important lessons from this project is that prompt engineering is not a final destination. Production AI systems require a broader engineering approach involving:

Orchestration
Deterministic logic
NLP preprocessing
Structured outputs
Caching strategies
Continuous optimization

The most successful AI systems are not necessarily the ones with the most sophisticated prompts. They are the systems supported by strong software architecture.

Final Thoughts

AI systems increasingly require the same principles as any mature software platform:

Scalability
Reliability
Predictability
Observability
Cost efficiency

Through this redesign we achieved:

25% lower token usage
30% lower operational costs
More than 50% faster evaluations
Dramatically improved consistency

But the most valuable outcome was trust.

Users no longer depend on the randomness of a prompt. They depend on a reliable system designed for production.

The future of AI applications may not be defined by better prompts alone. It will increasingly be defined by better engineering.