
Achieving Greater Self-Consistency in Large Language Models
By Anthony Alcaraz, December 2023


When LLMs are used to evaluate qualities like the correctness, accuracy, or relevance of a piece of text, consistency is paramount. If an LLM exhibits inconsistent judgements, then its evaluations become unreliable and untrustworthy.

If an LLM evaluates the reasoning quality of arguments, but contradicts itself by rating an invalid argument as more logically sound than a perfectly valid one, then it fails as an arbiter of reason. Its evaluations lose credibility due to the model’s own lack of logical consistency.

When such inconsistencies appear, there is no stable basis for comparison between the LLM’s assessments of different pieces of text. If the model arbitrarily contradicts itself, then sentences cannot be reliably ranked against one another based on the model’s inconsistent scorings.

In essence, inconsistency destroys the grounds for comparison that evaluations aim to provide in the first place. If an LLM cannot demonstrate consistent application of assessment criteria, then using it to evaluate text loses all effectiveness and utility.

So, consistency in judgement and evaluation is mandatory for LLMs employed to score or judge textual qualities and features. Without stable assessments, grounded in a consistent understanding of the concepts being evaluated, LLM output cannot serve as a reliable basis for evaluation or scoring.

Sampling multiple solutions reveals that consistency between outputs correlates strongly with quality. However, existing consistency techniques rely on extracting and matching closed-form answers, which restricts their applicability. This article explores methods to enhance self-consistency without such constraints, while also grounding decisions in real-world knowledge.
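For context, the closed-form approach referenced above typically samples several completions and takes a majority vote over their extracted final answers. Below is a minimal sketch of that idea; the sampled answers are hypothetical stand-ins for completions you would obtain by querying a model several times at a non-zero temperature, and this is an illustration of the general voting scheme rather than any specific library's API.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and its agreement rate across samples."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)

# Hypothetical extracted answers from five sampled chains of thought.
sampled_answers = ["42", "42", "17", "42", "42"]

answer, agreement = majority_vote(sampled_answers)
print(f"Consensus answer: {answer} (agreement {agreement:.0%})")
```

The agreement rate doubles as a rough confidence signal: the more samples converge on the same answer, the more likely that answer is correct. The limitation, as noted above, is that this only works when a short, matchable answer can be extracted from each sample.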


The Need for Self-Consistency

Despite rapid progress, logical failures and falsehoods continue to hinder reliable reasoning in state-of-the-art models. For complex multi-step analysis or free-form generation, models often contradict themselves or invent unsupported facts.

This manifests in two key ways: inconsistent open-ended generation, and incoherent inferences. When performing…
