The Model Metrics tab provides deep visibility into LLM model performance, cost, and usage. This is where you go to understand how your models are performing under real-world load, compare them against each other, and identify bottlenecks or cost drivers.
[Screenshot: Model Metrics tab showing requests per second, failure rates, latency percentiles, cost of inference, and token usage charts]

View By Selector

You can pivot all charts on this tab using the View by selector, which changes the grouping dimension across the entire page:
  • Models — groups by model name (default). Use it to compare performance across different LLM models.
  • Virtual Models — groups by virtual model / model alias. Use it to evaluate model routing configurations.
  • Users — groups by the username of the caller. Use it to debug user-specific issues or track per-user consumption.
  • Virtual Accounts — groups by virtual account. Use it to monitor usage by application or API key.
  • Teams — groups by team name. Use it to track costs per team for chargebacks or budget management.
  • Metadata — groups by custom metadata keys sent in request headers. Use it to create custom views (e.g. by tenant, environment, or feature).
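For the Metadata view to show anything, your clients must send custom metadata with each request. A minimal sketch of building such headers — the `x-metadata-` prefix and the `sk-example` key are illustrative assumptions, not a documented convention; check your gateway's documentation for the actual header format:

```python
# Sketch: attaching custom metadata to an LLM request so it can be
# grouped in the Model Metrics tab. The "x-metadata-" header prefix
# and the placeholder API key are illustrative assumptions.
def build_request_headers(api_key, metadata):
    headers = {"Authorization": f"Bearer {api_key}"}
    for key, value in metadata.items():
        # One header per metadata key, e.g. x-metadata-tenant_name
        headers[f"x-metadata-{key}"] = str(value)
    return headers

headers = build_request_headers(
    "sk-example",  # placeholder API key
    {"tenant_name": "acme-corp", "environment": "production"},
)
```

Each metadata key sent this way becomes available as a grouping dimension in the Metadata view.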

Top-Level Counters

Four headline metrics are displayed at the top:
  • Total Input Tokens — total tokens sent to models in the selected time range.
  • Total Output Tokens — total tokens generated by models.
  • Total Count of Requests — number of LLM API calls.
  • Total Cost of Tokens — aggregate cost in USD.
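The cost counter is derived from token volumes and per-token pricing. A sketch of the arithmetic, using hypothetical model names and per-million-token prices (real provider pricing differs):

```python
# Sketch: deriving an aggregate "Total Cost of Tokens" figure from
# token counts. The prices below are hypothetical, not real pricing.
PRICES_PER_MILLION = {
    "model-a": {"input": 3.00, "output": 15.00},  # USD, hypothetical
    "model-b": {"input": 0.50, "output": 1.50},
}

def cost_usd(model, input_tokens, output_tokens):
    p = PRICES_PER_MILLION[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

total = (
    cost_usd("model-a", 2_000_000, 500_000)   # 6.00 + 7.50
    + cost_usd("model-b", 1_000_000, 200_000) # 0.50 + 0.30
)
print(round(total, 2))  # → 14.3
```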

Performance Charts

Requests Per Second

Shows throughput over time, broken down by the selected dimension. Use this to identify traffic patterns, peak hours, and how load is distributed across models or users.

Request Failure Rate

Displays the percentage of requests that failed over time. A sudden spike here is an early warning of provider outages, quota exhaustion, or misconfiguration.

Request Failures Breakdown

A stacked bar chart showing the distribution of failures by error type across time. This makes it easy to see whether failures are dominated by a single error code or spread across multiple types.

Request Failure Rate By Error Type

Breaks down the failure rate by HTTP status code (4xx, 5xx, etc.). This helps distinguish between client-side errors (e.g. malformed requests) and provider-side issues (e.g. rate limiting, server errors).
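The split this chart makes can be sketched as a simple bucketing of HTTP status codes by class over a window of responses:

```python
from collections import Counter

# Sketch: computing the failure rate per status-code class (4xx vs 5xx)
# from a window of raw response codes, as the chart does.
def failure_rate_by_class(status_codes):
    total = len(status_codes)
    counts = Counter(f"{code // 100}xx" for code in status_codes if code >= 400)
    return {cls: n / total for cls, n in counts.items()}

# 429 (rate limit) and 400 fall in the 4xx class; 500 and 503 in 5xx
codes = [200, 200, 200, 429, 500, 503, 200, 400]
rates = failure_rate_by_class(codes)
print(rates)  # → {'4xx': 0.25, '5xx': 0.25}
```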

Latency Charts

Request Latency

The end-to-end time taken to process a request, from the moment the gateway receives it until the complete response is returned. Displayed with P50, P75, P90, and P99 percentile selectors. Use this to identify models with consistently high latency or detect latency regressions over time.
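For reference, here is how those percentile values relate to raw per-request latencies, as a minimal nearest-rank sketch (the sample latencies are made up for illustration):

```python
import math

# Sketch: nearest-rank percentiles over raw per-request latencies (ms).
def percentile(sorted_values, p):
    # Smallest value with at least p% of the data at or below it
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]

latencies_ms = sorted([120, 135, 150, 160, 180, 200, 240, 300, 450, 1200])
p50 = percentile(latencies_ms, 50)  # → 180
p75 = percentile(latencies_ms, 75)  # → 300
p90 = percentile(latencies_ms, 90)  # → 450
p99 = percentile(latencies_ms, 99)  # → 1200
```

Note how a single slow outlier (1200 ms) dominates P99 while leaving P50 untouched — which is why P99 is the percentile to watch for tail-latency issues.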

Time To First Token (TTFT)

The time elapsed until the first token of a response is received. This is the most important latency metric for streaming use cases — it directly impacts the perceived responsiveness of your application.

Inter Token Latency (ITL)

The average time between consecutive tokens in a streaming response. High ITL means your users experience stuttering or pauses in the response stream.

Time Per Output Token (TPOT)

The average time to generate each output token. This normalizes latency by output length, making it useful for comparing models that produce different response sizes.
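The three streaming metrics above can be derived from per-token arrival timestamps. Exact definitions vary between vendors (some compute TPOT excluding the time to first token); this sketch uses one common convention:

```python
# Sketch: deriving streaming latency metrics from per-token arrival
# timestamps, measured in seconds from the moment the request was sent.
def streaming_metrics(token_times):
    ttft = token_times[0]                      # Time To First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)                # Inter Token Latency (mean gap)
    tpot = token_times[-1] / len(token_times)  # Time Per Output Token (one convention)
    return ttft, itl, tpot

# Example: five tokens; the first arrives 0.4s in, the last at 1.2s
times = [0.4, 0.6, 0.8, 1.0, 1.2]
ttft, itl, tpot = streaming_metrics(times)
```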

Cost and Token Charts

Cost of Inference

Shows cost over time, broken down by the selected dimension. Use this to track spending trends, identify cost spikes, and compare the cost-effectiveness of different models.

Input Tokens

Input token volume over time. Helps you understand how prompt sizes are trending and which models or users are sending the most context.

Output Tokens

Output token volume over time. Useful for identifying models that generate verbose responses or users whose usage patterns lead to high output costs.

Filtering

Click the Filter button in the top bar to narrow down the data. You can filter by metadata fields like user email, model name, and more. Active filters are shown as tags below the View by selector, and you can clear them at any time.
[Screenshot: Model Metrics tab with filters applied, showing filtered results for a specific user and model]

Exporting Data

Click the export icon in the top-right corner to download aggregated metrics data. You can choose which dimensions to group the data by (Models, Virtual Models, Users, Virtual Accounts, Teams) and also include any custom metadata keys. The data can be downloaded as a CSV or fetched via API.
[Screenshot: Export aggregated data dialog showing grouping options and download button]
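Once downloaded, the CSV is easy to aggregate further. A sketch that sums cost per team — the column names (`team`, `cost_usd`) are assumptions, so check the headers of your actual export:

```python
import csv
import io

# Sketch: summing exported cost per team from the downloaded CSV.
# The inline data and column names are illustrative assumptions.
EXPORT = """team,total_requests,cost_usd
platform,1200,14.30
search,800,9.75
platform,300,2.10
"""

cost_by_team = {}
for row in csv.DictReader(io.StringIO(EXPORT)):
    team = row["team"]
    cost_by_team[team] = cost_by_team.get(team, 0.0) + float(row["cost_usd"])

print(cost_by_team)
```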

Common Use Cases

  • Compare models: Switch to the Models view and look at latency, cost, and error rates side by side. If a model has high P99 latency, it may be causing tail latency issues in your application.
  • Debug user issues: Switch to the Users view and filter for a specific user. Check if they are hitting higher error rates or experiencing worse latency than average.
  • Track team spending: Switch to the Teams view to see cost breakdowns for internal chargebacks or budget management.
  • Evaluate routing changes: Switch to the Virtual Models view to see if a routing rule change shifted traffic as expected and whether the new target model performs better.
  • Custom tenant analytics: Switch to the Metadata view and group by a custom key like tenant_name to build per-customer cost and usage reports.