Session 8: Evaluating RAG pipelines
Retrieval Quality · Evaluation Metrics · Relevancy Analysis

Presenters

Masih Moloodian, Yasin Fakhar, Mohammad Amin Dadgar

Evaluating RAG pipelines

Our discussion of RAG pipeline evaluation started with a recap of key concepts: retrieval-augmented generation (RAG) itself, function calling, prompt engineering, LLM agents, and stateful LLM applications. The focus then shifted to evaluation, emphasizing why it is essential for ensuring relevance, reliability, and user satisfaction. Evaluation of a RAG pipeline mirrors its two main phases: retrieval and generation. The retrieval phase is measured with traditional information-retrieval metrics such as accuracy, precision, recall, and F1-score, whereas the generation phase is assessed with BLEU, ROUGE, METEOR, and human evaluation.

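As a concrete sketch of the retrieval-phase metrics above, the snippet below computes precision@k, recall@k, and F1 for a single query; the document IDs and the ground-truth relevant set are hypothetical, and real evaluations would average these over many queries.

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k=5):
    """Precision@k, recall@k, and F1 for one query.

    retrieved_ids: ranked list of document IDs returned by the retriever.
    relevant_ids:  set of ground-truth relevant document IDs (assumed known).
    """
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision@k": precision, "recall@k": recall, "f1": f1}

# Hypothetical example: 3 of the top-5 retrieved chunks are relevant.
print(retrieval_metrics(["d1", "d7", "d3", "d9", "d2"], {"d1", "d2", "d3", "d8"}, k=5))
```
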
We then explored evaluation methods, distinguishing between offline and online evaluation. Offline evaluation tests the system on a static dataset, providing structured, controlled insights, while online evaluation assesses performance in a live production environment, allowing real-time feedback and adaptation. Offline retrieval evaluation focuses on how well the system retrieves relevant data. One approach is similarity score analysis: computing descriptive statistics such as the first quartile (Q1), median, and third quartile (Q3) over retrieval similarity scores to detect anomalies and trends in retrieval performance. Another is relevancy score analysis, which leverages an LLM to judge the quality of retrieved data, ensuring that retrieved chunks align with user queries; both are sketched below.
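
A minimal sketch of the similarity score analysis, assuming we have already collected per-chunk similarity scores from retrieval runs; the sample scores and the IQR-based cutoff for flagging anomalies are our assumptions, not prescribed in the session.

```python
import numpy as np

# Hypothetical similarity scores collected from retrieval runs.
scores = np.array([0.82, 0.79, 0.91, 0.45, 0.88, 0.76, 0.83, 0.30, 0.85])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr  # standard Tukey fence for low outliers

print(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}")
print("anomalously low scores:", scores[scores < low_cutoff])
```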

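The relevancy score analysis can be sketched as an LLM-as-judge call. Here `ask_llm` is a hypothetical stand-in for whatever chat-completion client the pipeline uses, and the 1-to-5 rubric is an illustrative assumption.

```python
RELEVANCY_PROMPT = """Rate how relevant the retrieved chunk is to the user query,
on a scale of 1 (unrelated) to 5 (directly answers it). Reply with the number only.

Query: {query}
Chunk: {chunk}
"""

def relevancy_score(query: str, chunk: str, ask_llm) -> int:
    """Score one retrieved chunk with an LLM judge.

    ask_llm: hypothetical callable that sends a prompt to an LLM and
    returns its text response (a thin wrapper around any chat API).
    """
    reply = ask_llm(RELEVANCY_PROMPT.format(query=query, chunk=chunk))
    return int(reply.strip())  # the prompt asks for a bare number

# Per-query aggregation, e.g.:
# relevancies = [relevancy_score(query, c, ask_llm) for c in retrieved_chunks]
```
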
To deepen our understanding of retrieval effectiveness, we discussed how similarity and relevancy scores are combined for a holistic evaluation. Similarity scores indicate how well a retrieved document matches a query, while relevancy scores, calculated using an LLM, evaluate actual content alignment. Standard deviation measures how consistent these scores are across queries, and combining the mean, standard deviation, and range offers a more comprehensive view of retrieval quality. Together, these consolidated metrics highlight where retrieval reliability and alignment need improvement, helping ensure that the RAG pipeline consistently retrieves high-quality, contextually relevant information.

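One way to consolidate these statistics, assuming per-query similarity and relevancy scores have already been collected; the sample values and the report format are illustrative, not from the session.

```python
import statistics

def consolidate(scores):
    """Mean, standard deviation, and range for a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),   # low std => consistent retrieval
        "range": max(scores) - min(scores),
    }

# Hypothetical per-query scores from the two analyses above.
similarity = [0.82, 0.79, 0.91, 0.45, 0.88]
relevancy = [4, 4, 5, 2, 5]  # 1-5 LLM-judge ratings

report = {"similarity": consolidate(similarity), "relevancy": consolidate(relevancy)}
for name, stats in report.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```
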
Slides link: https://www.canva.com/design/DAGajo8SpWg/iRyWKo6ed8ajDUuYAFmBTw/edit?utm_content=DAGajo8SpWg&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton