About IFScale
IFScale is a novel benchmark designed to evaluate large language model instruction-following performance as the number of instructions increases. Real-world applications often demand adherence to dozens or hundreds of simultaneous requirements, yet IFScale is the first benchmark to measure instruction-following performance at this scale. The benchmark task is to generate a business report while including a list of keywords specified in the instructions.
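To make the task concrete, below is a minimal sketch of how a keyword-inclusion task like this could be scored. The function name and the case-insensitive whole-word matching rule are illustrative assumptions, not the benchmark's official scoring procedure.

```python
import re

def score_keyword_inclusion(report: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the generated report.

    Illustrative only: the actual benchmark may apply different matching
    rules; here we count case-insensitive whole-word matches.
    """
    text = report.lower()
    hits = sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw.lower())}\b", text)
    )
    return hits / len(keywords) if keywords else 1.0

# Example: 2 of 3 required keywords appear in the report.
report = "Quarterly revenue grew, driven by logistics efficiency and churn reduction."
print(score_keyword_inclusion(report, ["revenue", "churn", "onboarding"]))  # ~0.67
```

As the instruction list grows from a handful of keywords to hundreds, this single score traces out the degradation curve for a given model.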
Our benchmark reveals three distinct performance-degradation patterns, a bias toward earlier instructions, and distinct categories of instruction-following errors under the cognitive load of increasing instructions. We evaluate 20 state-of-the-art models across seven major providers, yielding critical insights for the reliable deployment of LLMs in complex, multi-instruction scenarios.
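As a rough illustration of how the bias toward earlier instructions could be quantified, the sketch below averages per-position success rates across trials; the data layout (one boolean per instruction position per trial) is a hypothetical simplification, not the benchmark's internal representation.

```python
from collections import defaultdict

def accuracy_by_position(trials: list[list[bool]]) -> list[float]:
    """Average success rate at each instruction position across trials.

    Each trial is a list of booleans recording whether the instruction at
    that position (1st, 2nd, ...) was followed. A downward trend from
    early to late positions would indicate a primacy bias.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        for pos, followed in enumerate(trial):
            sums[pos] += followed
            counts[pos] += 1
    return [sums[p] / counts[p] for p in sorted(counts)]

# Toy data: instructions earlier in the prompt are followed more often.
trials = [[True, True, False], [True, False, False], [True, True, True]]
print(accuracy_by_position(trials))  # [1.0, ~0.67, ~0.33]
```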
Comprehensive Evaluation
Systematic evaluation revealing performance hierarchies and degradation patterns across state-of-the-art models, addressing the critical gap in understanding high-density instruction scenarios.
Multi-dimensional Analysis
Detailed exploration of standard-deviation trends, primacy effects, error types, and degradation-curve shapes for every model considered, highlighting common behavior and outliers.
IFScale Framework
A novel benchmark designed to characterize how models perform under the cognitive load of increasing instructions, providing essential insights for reliable deployment in complex, multi-instruction scenarios.