IFScale

Measuring Instruction Following Performance at Scale

About IFScale

IFScale is a novel benchmark designed to evaluate large language model instruction-following performance as the number of instructions increases. Real-world applications often demand adherence to dozens or hundreds of simultaneous requirements, and IFScale is the first benchmark to measure instruction-following performance at that scale. The benchmark task is to generate a business report while including a list of keywords specified in the instructions.
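As a rough illustration of how such a keyword-inclusion task could be scored, the sketch below computes the fraction of required keywords that appear in a generated report. The function name and matching rule are assumptions made for illustration; the benchmark's actual grader may use stricter or looser matching.

```python
import re

# Hypothetical scorer: fraction of required keywords present in the report.
# The real IFScale grader may match differently (e.g., allowing inflected forms).
def keyword_adherence(report: str, keywords: list[str]) -> float:
    text = report.lower()
    hits = sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    )
    return hits / len(keywords) if keywords else 1.0

# Example: 2 of 3 required keywords appear, so adherence is ~0.67.
report = "Quarterly revenue grew while logistics costs fell."
print(keyword_adherence(report, ["revenue", "logistics", "forecast"]))
```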

Our benchmark reveals three distinct performance degradation patterns, a bias towards earlier instructions, and several categories of instruction-following errors under the cognitive load of increasing instructions. We evaluate 20 state-of-the-art models across seven major providers, providing critical insights for reliable deployment of LLMs in complex, multi-instruction scenarios.
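The bias towards earlier instructions is a primacy effect: instructions placed early in the prompt are followed more reliably than later ones. A minimal sketch of how such an effect could be measured appears below, assuming pooled (position, followed) outcomes across runs; this is an illustration, not the paper's actual analysis code.

```python
from collections import defaultdict

def accuracy_by_position(results):
    # results: list of (position_index, followed: bool) pairs pooled
    # across runs. Assumed data shape, for illustration only.
    buckets = defaultdict(list)
    for pos, followed in results:
        buckets[pos].append(followed)
    # Mean adherence at each instruction position; a downward slope
    # from early to late positions indicates a primacy effect.
    return {pos: sum(v) / len(v) for pos, v in sorted(buckets.items())}

pooled = [(0, True), (0, True), (49, True), (49, False)]
print(accuracy_by_position(pooled))  # {0: 1.0, 49: 0.5}
```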

Comprehensive Evaluation

Systematic evaluation revealing performance hierarchies and degradation patterns across state-of-the-art models, addressing a critical gap in understanding high-density instruction scenarios.

Multi-dimensional Analysis

Detailed exploration of standard deviation patterns, primacy effects, error types, and degradation-curve patterns for all models considered, highlighting common trends and outliers.
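As one example of these dimensions, the sketch below computes the standard deviation of accuracy across repeated runs at each instruction density, which shows whether a model degrades smoothly or becomes erratic under load. The data shape is assumed for illustration.

```python
import statistics

def run_variability(scores_by_density):
    # scores_by_density: dict mapping instruction count to a list of
    # accuracy scores from repeated runs (assumed shape, illustrative).
    return {
        n: statistics.stdev(scores)
        for n, scores in sorted(scores_by_density.items())
        if len(scores) > 1  # stdev needs at least two samples
    }

# Example: run-to-run variance grows as instruction density increases.
print(run_variability({10: [0.98, 0.99, 0.97], 500: [0.60, 0.75, 0.52]}))
```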

IFScale Framework

A novel benchmark designed to characterize how models perform under the cognitive load of increasing instructions, providing essential insights for reliable deployment in complex, multi-instruction scenarios.

Interactive Visualization

Explore model performance across different instruction densities with our interactive plots. Switch between 3D and 2D views using the controls in the bottom-right corner.

How to Use the Interactive Visualization

  • View Selection: Use the buttons in the bottom-right (3D, Accuracy, Latency) to switch views
  • 3D View: Click and drag to rotate, mouse wheel to zoom
  • Model Filter: Click "Show All Models" (top-right) to toggle between top 5 and all models
  • Legend Interaction: Click model names in legends to show/hide specific models
  • Hover Details: Hover over points/lines for detailed information
  • Reset View: Double-click any plot to reset to the default view

The visualization defaults to showing the top 5 performing models. Switch between views to explore different aspects of model performance across instruction densities.

Leaderboard

Performance rankings based on instruction-following capability under varying cognitive loads. Models are ranked by high-density performance.
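As a sketch of that ranking criterion (hypothetical data shape; the actual leaderboard is generated from the benchmark's results), sorting models by their score at the highest evaluated density might look like this:

```python
def rank_models(scores, density=500):
    # scores: dict mapping model name -> {instruction_count: accuracy}.
    # Rank by accuracy at the highest evaluated density, mirroring the
    # leaderboard's high-density ordering. Assumed data shape.
    return sorted(scores, key=lambda m: scores[m][density], reverse=True)

scores = {"model-a": {500: 0.62}, "model-b": {500: 0.71}}
print(rank_models(scores))  # ['model-b', 'model-a']
```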

[Leaderboard table — columns: Rank, Model, Organization, 10 Instructions, 50 Instructions, 100 Instructions, 250 Instructions, 500 Instructions]

(r) indicates a hybrid model run with thinking enabled