# Evaluate AI answer accuracy using ground truth, F1 score, and Google Sheets logging
This workflow automates the evaluation of AI agent responses by comparing them against reference (ground truth) answers. It combines LLM-based classification (TP/FP/FN) with semantic similarity scoring via OpenAI embeddings to generate a unified accuracy score. Designed for developers and analysts working on RAG systems and chatbots, it enables consistent, scalable quality assessment.
## Who it's for
- AI agent developers needing to test response correctness
- ML engineers evaluating RAG system performance
- Teams deploying chatbots who require quality metrics
- Analysts tracking model progress against benchmark data
## What the automation does
- Pulls question, AI response, and ground truth from Google Sheets
- Uses GPT-4 via LangChain to classify statements as true positives (TP), false positives (FP), or false negatives (FN)
- Computes semantic similarity using OpenAI Embeddings API
- Calculates final score as weighted average of F1 and similarity metrics
- Writes results back to Google Sheets for analysis
- Can be triggered by new row insertion or external message event
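
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual node code: it assumes the standard F1 formula over the TP/FP/FN counts, cosine similarity as the embedding comparison, and a hypothetical `f1_weight` of 0.5. The listing does not state the similarity measure or the weights, and the function names are illustrative.

```python
import math
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from statement counts; 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(f1: float, similarity: float, f1_weight: float = 0.5) -> float:
    """Weighted average of F1 and similarity; f1_weight is an assumed value."""
    return f1_weight * f1 + (1 - f1_weight) * similarity


# Example: 3 matching statements, 1 unsupported, 1 missing,
# with placeholder vectors standing in for real embeddings.
f1 = f1_score(tp=3, fp=1, fn=1)
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
score = combined_score(f1, sim)
```

Adjusting `f1_weight` shifts the balance between statement-level correctness and overall semantic closeness; the right value depends on how strictly answers should match the ground truth.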
## What's included
- Ready-to-use n8n workflow
- Trigger and handler logic powered by LangChain
- Integrations with Google Sheets API, OpenAI API, and Embeddings API
- Basic text guide for setup and adaptation
## Requirements for setup
- n8n account (cloud or self-hosted)
- Google Sheets access with read/write permissions
- OpenAI API key
- Installed dependencies: LangChain, OpenAI SDK
## Benefits and outcomes
- Objective, multi-factor assessment of AI response quality
- Eliminates manual review in testing workflows
- Enables tracking of model performance over time
- Centralized metric storage in Google Sheets
- Supports regression testing and A/B evaluation
- Ready-to-use data for dashboards and reporting
## Important: template only
You are purchasing a ready-made automation workflow template only. Rollout into your infrastructure, connection of specific accounts and services, 1:1 setup help, custom adjustments for non-standard stacks, and any consulting support are provided as a separate paid service at an individual rate. To discuss custom work or 1:1 help, get in touch via chat.