How to Create a Reliable LLM Judge: The hh.ru Experience

Mikhail T. (Sh0ny)
June 22, 2026

In short

The hh.ru team shared its experience developing an evaluation system for resume analysis using neural networks. It turned out that creating a good LLM evaluator is a separate product in its own right, requiring specialized criteria, datasets, and metrics.

Creating an LLM judge is easy. It’s much harder to ensure that its evaluations can be trusted. We learned this firsthand while developing a resume analysis feature for the hh.ru AI assistant.

The Trust Issue

It quickly became clear that a good LLM evaluator is a standalone product with its own evaluation criteria, datasets, quality metrics, and operational costs. Naive approaches, such as relying on a single simple prompt, don't work: the resulting evaluations are inconsistent and unreliable.
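One way to make "inconsistent" measurable is to run the same judge prompt several times on the same resume and see how often the verdicts agree. The sketch below is only an illustration, not the hh.ru implementation: the function name `consistency_rate` and the verdict labels are hypothetical, and the verdicts are hard-coded instead of coming from a real model call.

```python
from collections import Counter


def consistency_rate(verdicts):
    """Return the share of runs that agree with the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    # most_common(1) returns [(label, count)] for the most frequent label.
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


# Hypothetical verdicts from five runs of one prompt on one resume.
runs = ["good", "good", "poor", "good", "average"]
print(consistency_rate(runs))  # 0.6
```

A rate close to 1.0 means the judge repeats itself on the same input; a low rate suggests the prompt or the criteria are too loose to trust a single run.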

Judge Architecture

We designed an evaluation system that includes:

  • Clear evaluation criteria (e.g., alignment of skills, experience, and education).
  • Specialized datasets for training and validation.
  • Quality metrics that allow us to track the accuracy of evaluations.

Each rubric has its own set of criteria, and the LLM evaluator is trained on annotated examples. This improved the consistency of evaluations and reduced the number of errors.
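The pieces above (a rubric with criteria, annotated examples, and a quality metric) can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the team's actual design: the `Rubric` class, the rubric name `resume_fit`, the pass/fail labels, and the choice of Cohen's kappa as the agreement metric are all hypothetical, and the labels are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """A named rubric with its own list of criteria (hypothetical structure)."""
    name: str
    criteria: tuple[str, ...]


def cohen_kappa(judge, expert):
    """Agreement between judge and expert labels, corrected for chance."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge)
    # Observed agreement: share of items where both labels match.
    observed = sum(j == e for j, e in zip(judge, expert)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    judge_counts, expert_counts = Counter(judge), Counter(expert)
    expected = sum(judge_counts[k] * expert_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


rubric = Rubric("resume_fit", ("skills", "experience", "education"))

# Made-up labels for six annotated resumes under this rubric.
expert_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail"]

print(rubric.name, round(cohen_kappa(judge_labels, expert_labels), 3))
```

Kappa is used here instead of plain accuracy because it discounts agreement that would occur by chance when one label dominates; any agreement metric tracked against an expert-annotated set would serve the same purpose.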

Conclusions

Key takeaway: an LLM judge requires the same attention to quality as the main product. You can’t just take a model and hope it will evaluate correctly. You need:

  • Datasets annotated by experts.
  • Metrics for continuous monitoring.
  • Iterations to improve the model based on feedback.

This is the only way to build trust in automated evaluations.

Source: Habr
