In short
The hh.ru team shared its experience building an evaluation system for neural-network-based resume analysis. A good LLM evaluator turned out to be a product in its own right, requiring dedicated criteria, datasets, and metrics.
Creating an LLM judge is easy. It’s much harder to ensure that its evaluations can be trusted. We learned this firsthand while developing a resume analysis feature for the hh.ru AI assistant.
It quickly became clear that a good LLM evaluator is a standalone product with its own evaluation criteria, datasets, quality metrics, and operational costs. Naive approaches, such as a simple prompt, don’t work: the evaluations end up being inconsistent and unreliable.
We designed an evaluation system organized around rubrics. Each rubric has its own set of criteria, and the LLM evaluator is trained on annotated examples. This has improved the consistency of evaluations and reduced the number of errors.
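One way to check whether a judge can be trusted is to compare its labels with human annotations, rubric by rubric. The sketch below is only an illustration. The source does not say which metric the team used, so Cohen's kappa is an assumption here, and the rubric names, labels, and function names are all hypothetical.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Share of examples where the judge and the annotator agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_rubric(examples):
    """Group (rubric, human_label, judge_label) triples and score each rubric."""
    grouped = {}
    for rubric, human, judge in examples:
        humans, judges = grouped.setdefault(rubric, ([], []))
        humans.append(human)
        judges.append(judge)
    return {rubric: cohen_kappa(h, j) for rubric, (h, j) in grouped.items()}


# Hypothetical annotated examples: (rubric, human label, judge label).
sample = [
    ("structure", "pass", "pass"),
    ("structure", "fail", "fail"),
    ("structure", "pass", "pass"),
    ("structure", "fail", "fail"),
    ("experience", "pass", "pass"),
    ("experience", "pass", "fail"),
    ("experience", "fail", "pass"),
    ("experience", "fail", "fail"),
]

for rubric, kappa in agreement_by_rubric(sample).items():
    print(f"{rubric}: kappa = {kappa:.2f}")
```

A per-rubric score like this shows where the judge diverges from annotators, which a single overall accuracy figure would hide. Kappa near 1 means close agreement, and kappa near 0 means agreement no better than chance.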
Key takeaway: an LLM judge requires the same attention to quality as the main product. You can’t just take a model and hope it will evaluate correctly. You need dedicated evaluation criteria, datasets, and quality metrics for the judge itself.
This is the only way to build trust in automated evaluations.
Source: Habr