May 20, 2025 in GenAI
A Framework for Operation Evaluation of Generative AI Solutions
https://doi.org/10.1287/LYTX.2025.02.10
How do we know if the investment in generative artificial intelligence (GenAI) is moving the needle? It is a question heard almost daily across finance, healthcare and retail sectors. And honestly, it is the right question to ask. On top of that, should we continue to invest in AI at the same rate or optimize? How can a stakeholder show that solution usage has a positive or negative impact on their operation?
Survey results indicate that up to 85% of AI initiatives eventually fail to deliver on their promises [1]. Organizations using GenAI want to understand the clear impact of such solutions. So, let’s define a strategic decision-making framework that answers these business questions in an operational setting, balancing business value against AI integration. We will also address two questions that organizations commonly ask about such solutions: How can a stakeholder measure the impact of a GenAI solution in their operation? What’s the baseline usage of these services in a particular industry?
When you use a model, you are charged based on tokens. Tokens are chunks of text, such as words or parts of words, that the model processes. The number of tokens used includes both the input (your prompt) and the output (the model’s response). Pricing varies by model and by the number of tokens consumed [2]: GPT-4o (about $4.38 per 1 million tokens), Gemini 2.0 Flash (about $0.17 per 1 million tokens) and Llama 4 Maverick (about $0.49 per 1 million tokens). If your business has multiple use cases (AI assistant, intent recognition, production optimization), the added cost quickly becomes difficult to justify without showing a considerable positive impact on operations.
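To make the cost arithmetic concrete, here is a minimal sketch. The three price points are the approximate per-million-token figures quoted above, treated as single blended rates for input and output. The monthly token volumes per use case are hypothetical numbers chosen only for illustration.

```python
# The three approximate price points (USD per 1 million tokens) quoted above.
# Assumption: each is a single blended rate covering input and output tokens.
PRICE_POINTS_USD = (0.17, 0.49, 4.38)

# Hypothetical monthly token volumes (input + output) per use case.
MONTHLY_TOKENS = {
    "ai_assistant": 400_000_000,
    "intent_recognition": 150_000_000,
    "production_optimization": 50_000_000,
}


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of processing `tokens` tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


def total_monthly_cost(price_per_million: float) -> float:
    """Total monthly cost in USD across all (hypothetical) use cases."""
    return sum(cost_usd(t, price_per_million) for t in MONTHLY_TOKENS.values())


if __name__ == "__main__":
    for price in PRICE_POINTS_USD:
        print(f"${price:.2f} per 1M tokens: ${total_monthly_cost(price):,.2f} per month")
```

Running the same volumes through each price point shows how strongly the model choice drives the bill that the measured operational impact has to justify.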
Prerequisites: Design and Architecture
A centralized architecture for measuring GenAI solutions provides a unified, scalable framework that captures performance, usage and business impact across the organization. At its core, this architecture integrates data pipelines that collect real-time data from various touchpoints, such as user interactions, task completion, latency, model outputs and survey feedback. By consolidating structured and unstructured data into a data lake and integrating data from AI interfaces (chatbots, copilots, APIs), the architecture enables teams to monitor trends, identify bottlenecks and fine-tune models or prompts. A centralized approach ensures that GenAI deployment aligns with strategic goals, enhances accountability and accelerates value realization across departments. Keep in mind that security, access control, regulatory compliance, risk mitigation (e.g., addressing risks such as bias and hallucination) and data privacy are integral building blocks of a scalable and production-ready solution.
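As a rough illustration of what such a pipeline might collect, the sketch below defines one interaction record and a small aggregation over it. The class name, the field names and the `summarize` helper are all hypothetical; a real schema would depend on your data lake and governance requirements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InteractionEvent:
    """One GenAI interaction as it might land in a central data lake.

    All field names are hypothetical; adapt them to your own pipeline.
    """
    use_case: str                       # e.g. "ai_assistant"
    interface: str                      # "chatbot", "copilot" or "api"
    latency_ms: float                   # end-to-end response time
    task_completed: bool                # did the user finish the task?
    tokens_used: int                    # input + output tokens
    survey_score: Optional[int] = None  # 1-5 rating, if the user answered


def summarize(events: list[InteractionEvent]) -> dict:
    """Aggregate events into a few usage and performance indicators."""
    if not events:
        raise ValueError("at least one event is required")
    n = len(events)
    rated = [e.survey_score for e in events if e.survey_score is not None]
    return {
        "interactions": n,
        "completion_rate": sum(e.task_completed for e in events) / n,
        "avg_latency_ms": sum(e.latency_ms for e in events) / n,
        "total_tokens": sum(e.tokens_used for e in events),
        "avg_survey_score": sum(rated) / len(rated) if rated else None,
    }
```

Keeping every touchpoint in one record shape is what lets the same events feed usage, performance and satisfaction views without separate collection efforts.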
Evaluation Framework 2S/2E
In the context of operational performance assessment, four foundational pillars – Satisfaction, Soundness, Efficiency and Effort – provide a comprehensive framework for evaluating solutions and enhancing operational rigor. These dimensions serve as key lenses through which business value can be realized, particularly when integrated with emerging technologies such as GenAI.
Recognizing that organizations vary in their strategic focus – some prioritizing customer-centric metrics, others emphasizing operational streamlining – this framework is deliberately adaptable. It supports both qualitative and quantitative assessments and is applicable across sectors including healthcare, manufacturing, retail and financial services.
By aligning pillar-specific metrics with use case-specific goals, organizations can systematically identify performance gaps, monitor improvements and inform AI deployment strategies. This structure enables data-driven continuous improvement while maintaining methodological rigor. The following sections outline the definitional scope and analytical value of each pillar within this decision-making framework.
Satisfaction
In customer-centric organizations, emotions play a pivotal role in shaping experience and driving loyalty. Positive emotional responses are closely linked to increased customer satisfaction, retention and long-term value. To fully capture the drivers of sustained business success, it is essential to evaluate both customer and employee satisfaction because both directly impact loyalty and productivity.
The service-profit chain framework [3, 4] offers a well-established model for understanding these dynamics, demonstrating the link between employee satisfaction, customer loyalty and overall profitability. This perspective is especially relevant in the age of AI, where human-AI collaboration adds new dimensions to workplace engagement and service delivery.
By assessing satisfaction across customers and employees – particularly in environments in which AI augments decision-making and service interactions – businesses can gain a more integrated view of operational health and long-term value creation. Survey-based measures of customer satisfaction, such as the customer satisfaction score (CSAT) and Net Promoter Score (NPS), are widely used across industries.
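For readers less familiar with these two survey metrics, the following sketch shows how they are commonly computed. It assumes CSAT is reported as the share of 4 or 5 ratings on a 1-5 scale, and uses the usual NPS convention: promoters score 9-10 and detractors 0-6 on a 0-10 scale. Your survey tool may use different cutoffs.

```python
def csat(ratings: list[int]) -> float:
    """CSAT in percent: share of satisfied responses (4 or 5 on a 1-5 scale)."""
    satisfied = sum(1 for r in ratings if r >= 4)
    return 100.0 * satisfied / len(ratings)


def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)
```

Computing both metrics for the periods before and after a GenAI rollout, for customers and for employees, gives a simple first view of the Satisfaction pillar.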
Morgan Stanley has built an AI assistant using GPT-4 that helps its tens of thousands of wealth managers quickly find and synthesize answers from a massive internal knowledge base of research reports and documents [5]. Let’s consider how this AI Assistant supports advisors and enables personalized client service. The AI Assistant is typically used to address research enquiries, administrative questions and specialized questions. This helps improve customer satisfaction because the advisor shows up informed and well prepared to deliver differentiated advice. The customer has a positive experience, and the organization gains a new client. Let’s assume that the advisor provides a pre- or post-AI survey response. Empowered advisors who found more time for meaningful suggestions and client interaction may be more likely to provide feedback (“Like” or “Dislike”). Now, can you identify the underlying cognitive and emotional drivers that influence client sentiment and shape organizational commitment or disengagement?
Efficiency
Operational efficiency is quantified by the ratio of outputs to inputs, emphasizing resource optimization and waste reduction. It reflects the system’s ability to deliver value with minimal inefficiencies. To ensure relevance and comparability, organizations benchmark these metrics against industry standards, enabling continuous improvement and strategic decision-making. This data-driven approach supports sustained operational excellence and aligns closely with performance modeling and optimization techniques central to operations research and management science (OR/MS) methodologies.
Let’s look at a few use cases in manufacturing, e-commerce and retail. Product catalogs and service manuals can be complex, making it hard for service technicians or employees to find the key piece of information they need to fix a broken part or make a sale. (When was the last time you looked at your car service manual?) If the employee uses GenAI to sift through the manuals or an unstructured product catalog, it can summarize the solution and provide step-by-step instructions to service the broken part or make a sales recommendation. In a Google GenAI benchmarking study, 74% of responding executives found this use case extremely or fairly valuable for manufacturers [6]. Key research-backed metrics in manufacturing include mean time to repair (MTTR), overall equipment effectiveness (OEE) and resource utilization rate. Efficiency is also assessed through cost per unit, time to value and first-pass yield, reflecting both speed and quality. For example, a lower MTTR after adopting a GenAI solution may indicate streamlined processes and fewer follow-ups, reducing downtime cost. Each operation can have different efficiency metrics. Break operations down by process and identify the metrics that help you improve efficiency.
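Three of the metrics named above can be sketched with their standard definitions: MTTR as total repair time divided by the number of repairs, OEE as availability times performance times quality, and first-pass yield as units that pass without rework divided by units started. The function names and the sample values are illustrative assumptions.

```python
def mttr(repair_hours: list[float]) -> float:
    """Mean time to repair: total repair time divided by number of repairs."""
    return sum(repair_hours) / len(repair_hours)


def oee(availability: float, performance: float, quality: float) -> float:
    """Overall equipment effectiveness; each factor is a ratio in [0, 1]."""
    return availability * performance * quality


def first_pass_yield(good_first_time: int, units_started: int) -> float:
    """Share of units that pass inspection without rework."""
    return good_first_time / units_started


if __name__ == "__main__":
    # Hypothetical repair durations (hours) before and after a GenAI rollout.
    before = mttr([4.0, 6.0, 5.0])
    after = mttr([3.0, 4.0, 2.0])
    print(f"MTTR before: {before:.1f} h, after: {after:.1f} h")
    print(f"OEE: {oee(0.90, 0.95, 0.98):.4f}")
    print(f"First-pass yield: {first_pass_yield(940, 1000):.2%}")
```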
Soundness
Are we efficiently addressing the right business problems? There’s often a tug-of-war between solving problems quickly and ensuring the robustness of those solutions. Interestingly, solution soundness can be assessed using a straightforward feedback loop, such as expert or employee evaluations. Embedding such evaluation mechanisms into operational workflows enhances continuous improvement. Combined with other industry metrics and downstream impact, this feedback mechanism helps organizations refine their AI strategy and improve adoption. This approach is especially valuable in environments leveraging AI-driven tools, in which speed must be balanced with precision and long-term business impact.
One study used GenAI to translate radiology reports into plain language for patients [7]. Radiology reports summarize expert opinions on medical images such as chest computed tomography (CT) and brain magnetic resonance imaging (MRI) scans. For patients from nonmedical backgrounds, the reports are often difficult to understand – using GenAI to convert them into plain language can help reduce patient anxiety, promote compliance and improve outcomes.
For the purpose of this analysis, we focus on how well the AI preserves meaning and reduces errors, which is critical in healthcare communication. In the study of radiology report translation, a subset of translated reports saw an overall length reduction of 26.7% (chest CT report) and 21.1% (brain MRI report). Therefore, the translation provided was efficient. The core question remains: Can the solution accurately translate complex medical content without losing or misrepresenting information?
The study used a structured expert review approach. Two professional radiologists assessed translation quality based on three key areas: 1) the number of places with information loss, 2) misinterpretation and 3) an overall quality score on a 5-point scale (with 5 being the highest quality). Results show that, for chest CT reports, 76% of outputs received the highest (5) quality rating. Even for the more complex brain MRI reports, 69% received top scores (4, 5), showing consistent performance. For our use case, these metrics directly demonstrate effectiveness: The AI supports faster, more reliable translations for patients.
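A structured expert review of this kind can be summarized with a few simple aggregates. The sketch below is not the study’s code. The record layout, function names and sample reviews are hypothetical, and only the three review criteria and the length-reduction measure come from the text.

```python
from dataclasses import dataclass


@dataclass
class ExpertReview:
    report_id: str
    reviewer: str
    info_loss_count: int          # places where information was lost
    misinterpretation_count: int  # places where meaning was distorted
    quality: int                  # overall score, 1 (worst) to 5 (best)


def share_at_or_above(reviews: list[ExpertReview], threshold: int) -> float:
    """Percent of reviews whose quality score is at least `threshold`."""
    hits = sum(1 for r in reviews if r.quality >= threshold)
    return 100.0 * hits / len(reviews)


def mean_errors(reviews: list[ExpertReview]) -> float:
    """Average number of flagged problems (loss + misinterpretation) per review."""
    total = sum(r.info_loss_count + r.misinterpretation_count for r in reviews)
    return total / len(reviews)


def length_reduction_pct(original_words: int, translated_words: int) -> float:
    """Percent reduction in length from the original to the translated report."""
    return 100.0 * (original_words - translated_words) / original_words


# Hypothetical reviews: two radiologists rating two translated reports.
reviews = [
    ExpertReview("ct-001", "radiologist_a", 0, 0, 5),
    ExpertReview("ct-001", "radiologist_b", 0, 0, 5),
    ExpertReview("ct-002", "radiologist_a", 1, 0, 4),
    ExpertReview("ct-002", "radiologist_b", 1, 1, 3),
]
```

Tracking the error counts alongside the quality share matters here: a translation can be shorter and highly rated overall while still containing a small number of high-risk misinterpretations.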
Now let’s flip the use case so that the translated reports are used by healthcare professionals. Would patients want clinicians to rely on the translation, given that information loss or misrepresentation could endanger a life? Maybe not. Hence, it is important to weigh both the efficiency and the soundness of your solution against the complexity of the problem and the use case.
Effort
Optimizing employee and customer effort directly contributes to business value by streamlining operations, reducing friction and enhancing satisfaction. When employee effort is optimized through automation of repetitive tasks, it allows them to focus on high-impact, value-creating activities such as personalized service, innovation or strategic problem-solving. Similarly, offering intuitive self-service tools, proactive issue resolution or simplified onboarding leads to higher customer satisfaction, greater loyalty and lower churn.
GenAI acts as the bridge that aligns both efforts. We argue that the collaborative aspect of AI will bring better productivity gains and AI solution development. Let’s look at our previously identified GenAI use cases: In financial services, AI Assistant helps advisors process regulations or customer histories faster, reducing both cognitive and operational load (optimizing advisors’ time toward value creation for clients). In healthcare, AI supports clinicians with documentation and administrative tasks, allowing more time for patient care (optimizing time from lengthy administrative tasks to better care). In e-commerce, AI enhances catalog search, recommendation and support interactions, ensuring customers quickly find what they need (optimizing customer search).
Conclusion
What makes this framework powerful is its systematic approach to synthesizing business value and insights. Do not look at the pillars in isolation; analyze how they interact, because the real insight comes from using them collectively to assess your solution. For example, by combining customer satisfaction scores with the classification of customer issues, you can refine your implementation and create more targeted outcomes. The value lies in applying this approach systematically to analyze and implement GenAI use cases.
To measure progress, select a few pillars and their identified metrics. Benchmark your performance before and after implementing GenAI, or start by testing it within a smaller segment of your operations to understand its impact. Don’t forget to compare your results with industry standards to see how you measure up. Lastly, check whether your GenAI providers offer industry standards or benchmarks you can work with; these give key input on operational health.
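A before-and-after benchmark across the four pillars can be as simple as the sketch below. The metric names, baseline values and current values are hypothetical. The only logic is a signed relative change, so that a positive number always means an improvement.

```python
from dataclasses import dataclass


@dataclass
class MetricReading:
    pillar: str             # "Satisfaction", "Soundness", "Efficiency" or "Effort"
    name: str
    baseline: float         # value before GenAI (assumed non-zero)
    current: float          # value after GenAI, or in the pilot segment
    higher_is_better: bool


def improvement_pct(m: MetricReading) -> float:
    """Relative change versus baseline, signed so that positive means better."""
    change = (m.current - m.baseline) / abs(m.baseline) * 100.0
    return change if m.higher_is_better else -change


# Hypothetical readings, one per pillar.
readings = [
    MetricReading("Satisfaction", "CSAT (%)", 72.0, 81.0, True),
    MetricReading("Soundness", "Expert quality score (1-5)", 4.0, 4.4, True),
    MetricReading("Efficiency", "MTTR (hours)", 5.0, 3.0, False),
    MetricReading("Effort", "Admin minutes per case", 30.0, 24.0, False),
]

if __name__ == "__main__":
    for m in readings:
        print(f"{m.pillar:<12} {m.name:<28} {improvement_pct(m):+.1f}%")
```

Reading the four rows side by side, rather than one at a time, is the point of the framework: an efficiency gain that coincides with a drop in soundness or satisfaction deserves a closer look before scaling.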
References
[1] Zhang, L.G. Pee and L. Cui, 2021, “Artificial intelligence in E-commerce fulfillment: A case study of resource orchestration at Alibaba’s Smart Warehouse,” International Journal of Information Management, Vol. 57, https://doi.org/10.1016/j.ijinfomgt.2020.102304.
[2] https://www.llama.com/
[3] L. Heskett, T.O. Jones, G.W. Loveman, W.E. Sasser, Jr., and L.A. Schlesinger, 2008, “Putting the Service-Profit Chain to Work,” Harvard Business Review, July/August, https://hbr.org/2008/07/putting-the-service-profit-chain-to-work.
[4] A. Kamakura, V. Mittal, F. de Rosa and J.A. Mazzon, 2002, “Assessing the Service-Profit Chain,” Marketing Science, Vol. 21, No. 3, pp. 294-317, https://doi.org/10.1287/mksc.21.3.294.140.
[5] McKinsey & Company, 2023, “Capturing the full value of generative AI in banking,” December 3, https://www.mckinsey.com/industries/financial-services/our-insights/capturing-the-full-value-of-generative-ai-in-banking.
[6] Sheridan and M. Breunig, 2023, “Five use cases for manufacturers to get started with generative AI,” Google Cloud Blog, October 9, https://cloud.google.com/blog/topics/manufacturing/five-generative-ai-use-cases-for-manufacturing.
[7] Lyu, J. Tan, M.E. Zapadka, J. Ponnatapura, C. Niu, K.J. Myers, et al., 2023, “Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential,” Visual Computing for Industry, Biomedicine, and Art, Vol. 6, Art. No. 9, https://doi.org/10.1186/s42492-023-00136-5.
Cigil Achenkunju is a curious professional adept at delivering advanced data analytics and AI strategy, with expertise spanning data science, machine learning and AI-driven product management. With a track record of delivering impactful solutions across finance, healthcare and e-commerce, Cigil consistently aids the development of cutting-edge AI-powered products. He has cultivated a unique blend of technical depth and strategic insight while guiding cross-functional teams, shaping AI product roadmaps and driving innovation at scale. As a thought leader, Cigil is deeply committed to mentoring the next generation of data professionals and staying at the forefront of emerging technologies and industry best practices. Connect with him on LinkedIn.