Using TruLens and Pinecone to evaluate grounded LLM applications
We start by downloading a pre-embedded dataset from the pinecone-datasets library, allowing us to skip the embedding and preprocessing steps.
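As a rough sketch (the dataset name below is illustrative; list_datasets() shows what is actually available), loading a pre-embedded dataset looks like this:

```python
from pinecone_datasets import list_datasets, load_dataset

# Browse the catalog of pre-embedded datasets.
print(list_datasets()[:5])

# Load a dataset whose vectors were already embedded with text-embedding-ada-002,
# so no local embedding or preprocessing is required.
dataset = load_dataset("wikipedia-simple-text-embedding-ada-002-100K")
print(dataset.documents.head())
```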
We then upsert this data into a Pinecone index, which serves as our vectorstore.
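A minimal sketch of that step, assuming the dataset object loaded above and the pinecone-client v2 style API (the index name, environment, and text field are illustrative; newer Pinecone clients use a slightly different setup):

```python
import os
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east-1-aws")

index_name = "trulens-pinecone-demo"
if index_name not in pinecone.list_indexes():
    # ada-002 embeddings are 1536-dimensional; start with the cosine metric.
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

index = pinecone.Index(index_name)

# Upsert the pre-embedded vectors directly -- no embedding work needed here.
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)

# Wrap the index as a LangChain vectorstore; queries are embedded at search time.
embed = OpenAIEmbeddings(model="text-embedding-ada-002")
docsearch = Pinecone(index, embed.embed_query, text_key="text")
```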
With the vectorstore in place, we build a LangChain RetrievalQA chain as our app:
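A sketch of the chain, assuming the docsearch vectorstore from above (the model choice and example question are illustrative):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# A simple "stuff" RetrievalQA chain over the Pinecone-backed retriever.
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

print(chain.run("Which cities have hosted the Summer Olympics more than once?"))
```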
Next, we define the two feedback functions we will use for evaluation: qs_relevance and qa_relevance. They're defined as follows:
QS Relevance: query-statement relevance is the average of relevance (0 to 1) for each context chunk returned by the semantic search.
QA Relevance: question-answer relevance is the relevance (again, 0 to 1) of the final answer to the original question.
For QA Relevance, we use .on_input_output() to specify that the feedback function should be applied to both the input and output of the application.
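As a minimal sketch, assuming the OpenAI-based feedback provider in trulens_eval (import paths and provider names vary across versions):

```python
import numpy as np
from trulens_eval import Feedback, Tru, feedback

tru = Tru()

# LLM-based feedback provider used to score relevance.
openai_provider = feedback.OpenAI()

# QA Relevance: relevance of the final answer to the original question,
# applied to the app's main input and main output.
f_qa_relevance = Feedback(openai_provider.relevance).on_input_output()
```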
For QS Relevance, we use TruLens selectors to locate the context chunks retrieved by our application. Let’s break it down into simple parts:
The on_input call, which appears first, is a convenient shorthand stating that the first argument to qs_relevance (the question) is to be the main input of the app.
The on(Select...) line specifies where the statement argument to the implementation comes from. We want to evaluate the context chunks, which are an intermediate step of the LLM app. This form references the LangChain app object's call chain, which can be viewed from tru.run_dashboard(). This flexibility allows you to apply a feedback function to any intermediate step of your LLM app. Below is an example where TruLens displays how to select each piece of the context.
The aggregation method (np.mean) specifies how feedback outputs are to be aggregated. This only applies to cases where the argument specification names more than one value for an input or output.
Once defined, f_qs_relevance can now be run on apps/records and will automatically select the specified components of those apps/records.
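Putting these pieces together, a hedged sketch of the QS Relevance feedback function (the Select path below is illustrative; the exact call-chain path for your app is the one shown in the TruLens dashboard):

```python
from trulens_eval import Select

# QS Relevance: question/statement relevance, scored per retrieved context
# chunk and averaged with np.mean.
f_qs_relevance = (
    Feedback(openai_provider.qs_relevance)
    .on_input()
    .on(
        # Illustrative selector path into the RetrievalQA call chain.
        Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content
    )
    .aggregate(np.mean)
)
```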
To finish up, we just wrap our Retrieval QA app with TruLens along with a list of the feedback functions we will use for eval.
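A sketch of that wrapping step (the app_id is illustrative):

```python
from trulens_eval import TruChain

truchain = TruChain(
    chain,
    app_id="RetrievalQA-cosine",
    feedbacks=[f_qa_relevance, f_qs_relevance],
)

# Use the wrapped app as usual; records and feedback results appear in the dashboard.
truchain("Which cities have hosted the Summer Olympics more than once?")
tru.run_dashboard()
```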
To try a different distance metric, we simply create a new Pinecone index with euclidean or dotproduct as the metric and follow the rest of the steps above as is.
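For example, a sketch of the only line that changes (again using the pinecone-client v2 style; the index name is illustrative):

```python
alt_index_name = "trulens-pinecone-demo-dotproduct"
if alt_index_name not in pinecone.list_indexes():
    pinecone.create_index(alt_index_name, dimension=1536, metric="dotproduct")
```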
Because we are using OpenAI embeddings, which are normalized to length 1, dot product and cosine similarity are equivalent, and Euclidean distance will also yield the same ranking. See the OpenAI docs for more information. With the same document ranking, we should not expect a difference in response quality, but computation latency may vary across the metrics. Indeed, OpenAI advises that dot product computation may be a bit faster than cosine. We will be able to confirm this expected latency difference with TruLens.
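A quick numerical check of that claim for unit-length vectors (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=1536), rng.normal(size=1536)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # identical to dot here
euclid_sq = np.sum((a - b) ** 2)                         # equals 2 - 2 * dot

# Cosine similarity equals the dot product, and squared Euclidean distance is a
# decreasing function of it, so all three metrics rank documents identically.
assert np.isclose(cosine, dot)
assert np.isclose(euclid_sq, 2 - 2 * dot)
```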
We can also experiment with swapping in a different LLM, such as text-ada-001 from the LangChain LLM store. Adding in easy evaluation with TruLens allows us to quickly iterate through different components to find our optimal app configuration.
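A sketch of that swap, reusing the retriever and feedback functions from above (the app_id is illustrative):

```python
from langchain.llms import OpenAI

# Same chain, different completion model; everything else stays as-is.
ada_llm = OpenAI(model_name="text-ada-001", temperature=0)

chain_ada = RetrievalQA.from_chain_type(
    llm=ada_llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

truchain_ada = TruChain(
    chain_ada,
    app_id="RetrievalQA-text-ada-001",
    feedbacks=[f_qa_relevance, f_qs_relevance],
)
```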
Adjusting top_k, or the number of context chunks retrieved by the semantic search, may also help.
We can do so as follows:
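A minimal sketch of that change; how top_k is exposed depends on the LangChain version, and here we assume it flows through the retriever's search_kwargs:

```python
chain_top_k = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Assumed parameterization: limit retrieval to a single context chunk.
    retriever=docsearch.as_retriever(search_kwargs={"k": 1}),
)

truchain_top_k = TruChain(
    chain_top_k,
    app_id="RetrievalQA-top_k-1",
    feedbacks=[f_qa_relevance, f_qs_relevance],
)
```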
Note that the way top_k is implemented in LangChain's RetrievalQA is that the documents are still retrieved by the semantic search and only the top_k are passed to the LLM. Therefore, TruLens also captures all of the context chunks that are being retrieved. In order to calculate an accurate QS Relevance metric that matches what's being passed to the LLM, we only calculate the relevance of the top context chunk retrieved by slicing the input_documents passed into the TruLens Select function:
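A sketch of the sliced feedback function (the selector path again mirrors the call chain shown in the dashboard and may differ for your app):

```python
# Score only the top retrieved chunk by slicing the selector to [:1].
f_qs_relevance_top1 = (
    Feedback(openai_provider.qs_relevance)
    .on_input()
    .on(
        Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:1].page_content
    )
    .aggregate(np.mean)
)
```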
We can then compare each configuration of our app on qs_relevance, qa_relevance and latency!