# AI Coding Data Reference

This page documents every data point that Revenium collects from AI coding assistant integrations. Use this reference to understand exactly what telemetry is captured, how it's used, and what privacy guarantees apply.

***

## How Data Is Collected

All AI coding assistant data is collected via **OpenTelemetry (OTLP)** log records. Each coding tool has a dedicated integration that exports usage telemetry to Revenium's OTLP endpoint. No proprietary agents or background processes are involved — data flows through the standard OpenTelemetry protocol.

| Tool                     | Integration Method                           | Data Flow                                |
| ------------------------ | -------------------------------------------- | ---------------------------------------- |
| **Claude Code**          | `@revenium/cli` npm package                  | Claude Code hooks → OTLP logs → Revenium |
| **Gemini CLI SDK**       | `@revenium/cli` npm package                  | Gemini CLI → OTLP logs → Revenium        |
| **Gemini Go Middleware** | `github.com/revenium/revenium-go-sdk/google` | Go app → Completions API → Revenium      |
| **Cursor IDE**           | Admin API sync                               | Cursor Admin API → Revenium (periodic)   |

### Agent Identifiers

Each tool is identified by an **agent** value in the telemetry:

| Tool        | Agent Identifier |
| ----------- | ---------------- |
| Claude Code | `claude-code`    |
| Gemini CLI  | `gemini-cli`     |
| Cursor IDE  | `cursor-ide`     |

***

## Privacy Guarantees

{% hint style="success" %}
**Revenium never collects your code, prompts, or conversation content.** Only usage metadata is transmitted — token counts, model names, timestamps, and session identifiers. This applies to all integrations by default.
{% endhint %}

Specifically, the following are **never** sent in the default configuration:

* Source code or file contents
* Prompt text or system prompts
* AI response content
* API keys, credentials, or secrets
* Repository names or git history (diffs, commits, file contents)
* Screen content or clipboard data

{% hint style="info" %}
**Note on session metadata:** When backfilling historical Claude Code data, optional session metadata including the working directory and git branch name may be included if present in the local session logs. These provide context about where AI assistance was used. See [Claude Code > Session Metadata](#session-metadata) for details. No file contents, code, or git history are included.
{% endhint %}

***

## Common Data Points

The following data points are collected by **all** AI coding assistant integrations. These form the core telemetry schema that powers the [AI Coding Dashboard](https://docs.revenium.io/ai-coding-dashboard).

### Token Metrics

| Data Point                | Type    | Description                                                   |
| ------------------------- | ------- | ------------------------------------------------------------- |
| `inputTokenCount`         | Integer | Number of input tokens consumed in the request                |
| `outputTokenCount`        | Integer | Number of output tokens generated by the model                |
| `cacheReadTokenCount`     | Integer | Tokens served from the model's prompt cache (reduces cost)    |
| `cacheCreationTokenCount` | Integer | Tokens written to the model's prompt cache                    |
| `reasoningTokenCount`     | Integer | Extended thinking / chain-of-thought tokens (model-dependent) |
| `totalTokenCount`         | Integer | Sum of all token types for the request                        |

{% hint style="info" %}
Not all models or integrations populate every token type. `reasoningTokenCount` is sent by the Gemini Go middleware for models with extended thinking; the Claude Code SDK does not currently send it (so it may be zero in Claude Code data), and it is always null for Cursor IDE. `cacheCreationTokenCount` is always 0 for Gemini CLI because the Google API does not expose cache creation counts. More generally, `cacheReadTokenCount` and `cacheCreationTokenCount` depend on the model's prompt caching support.
{% endhint %}
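
As a concrete illustration, `totalTokenCount` is the sum of the other token fields. The helper below is a sketch that mirrors this page's field names; it is not Revenium SDK code:

```typescript
// Illustrative only — field names mirror this reference page, not the SDK.
interface TokenMetrics {
  inputTokenCount: number;
  outputTokenCount: number;
  cacheReadTokenCount: number;
  cacheCreationTokenCount: number;
  reasoningTokenCount: number;
}

// totalTokenCount is documented as the sum of all token types for a request.
function totalTokenCount(m: TokenMetrics): number {
  return (
    m.inputTokenCount +
    m.outputTokenCount +
    m.cacheReadTokenCount +
    m.cacheCreationTokenCount +
    m.reasoningTokenCount
  );
}
```

For a request with 1,500 input tokens, 2,000 output tokens, and 500 cache-read tokens, the total is 4,000.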

### Cost Metrics

| Data Point        | Type    | Description                                                                    |
| ----------------- | ------- | ------------------------------------------------------------------------------ |
| `totalCost`       | Decimal | Calculated cost in USD for this request, based on model pricing                |
| `cost_multiplier` | Float   | Subscription tier discount factor (e.g., 0.08 for Max 20x = 8% of API pricing) |
| `cost_source`     | String  | Always `coding_assistant` for AI coding tool traffic                           |
| `costType`        | String  | Always `AI` for AI coding assistant requests                                   |

### Model & Provider Identity

| Data Point         | Type   | Description                                                                                                                                                   |
| ------------------ | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`            | String | AI model name (e.g., `claude-opus-4-5-20251101`, `gemini-2.5-pro`)                                                                                            |
| `provider`         | String | AI provider identifier. Set by backend mappers: `ClaudeCode`, `GeminiCli`, `CursorIde`. The Gemini Go middleware may also send `google-genai` or `vertex-ai`. |
| `agent`            | String | Coding assistant identifier (`claude-code`, `gemini-cli`, `cursor-ide`)                                                                                       |
| `middlewareSource` | String | SDK or middleware version that generated the telemetry                                                                                                        |

### Timing

| Data Point        | Type      | Description                                                   |
| ----------------- | --------- | ------------------------------------------------------------- |
| `requestTime`     | Timestamp | When the request was initiated (ISO 8601 / epoch nanoseconds) |
| `requestDuration` | Integer   | Total request duration in milliseconds                        |

### Attribution

| Data Point         | OTLP Attribute                           | Type   | Description                                                                                                                           |
| ------------------ | ---------------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| `subscriber`       | `user.email`                             | String | Developer email address for usage attribution (optional, user-configured)                                                             |
| `organizationName` | `organization.id` or `organization.name` | String | Organization or company name/ID for cost rollup (optional). The backend prefers `organization.name`; falls back to `organization.id`. |
| `productName`      | `product.id` or `product.name`           | String | Product or project name/ID for cost rollup (optional). The backend prefers `product.name`; falls back to `product.id`.                |
| `traceId`          | `session.id`                             | String | Session identifier — groups requests within a single coding session                                                                   |
| `transactionId`    | `transaction_id`                         | String | Unique identifier for each individual request (used for deduplication)                                                                |

{% hint style="info" %}
The **Data Point** column shows the name as stored in the analytics database. The **OTLP Attribute** column shows the key name in the raw telemetry payload. The backend mapper translates between these formats during ingestion.
{% endhint %}

### Operational Classification

| Data Point      | Type   | Description                                                                                                                                                                                 |
| --------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `operationType` | String | Request classification (e.g., `CHAT`)                                                                                                                                                       |
| `stopReason`    | String | Why the model stopped generating. Revenium enum values: `END`, `TOKEN_LIMIT`, `ERROR`, `CANCELLED`. See [Gemini Stop Reason Mapping](#stop-reason-mapping) for tool-specific normalization. |
| `errorReason`   | String | Error description if the request failed (empty on success)                                                                                                                                  |

### Coding Assistant Account Linkage

| Data Point                      | Type   | Description                                                                        |
| ------------------------------- | ------ | ---------------------------------------------------------------------------------- |
| `coding_assistant_account_uuid` | String | Links telemetry to a specific coding assistant account for cross-session tracking  |
| `subscription_tier`             | String | Subscription plan identifier (see [Subscription Tiers](#subscription-tiers) below) |

{% hint style="info" %}
These fields are defined in the ClickHouse schema (Migration 15) and populated during data enrichment. The OTLP mappers extract `claude_code.account_uuid` from resource attributes where available. Full persistence is being rolled out incrementally.
{% endhint %}

***

## Claude Code Data Points

In addition to the [Common Data Points](#common-data-points) above, Claude Code captures the following:

### Extended Token Breakdown

| Data Point                 | Type    | Description                                                                                                |
| -------------------------- | ------- | ---------------------------------------------------------------------------------------------------------- |
| `cache_creation_5m_tokens` | Integer | Cache tokens with 5-minute ephemeral expiry                                                                |
| `cache_creation_1h_tokens` | Integer | Cache tokens with 1-hour extended expiry                                                                   |
| `total_input_tokens`       | Integer | Aggregate input tokens (input + cache creation + cache read) — used for context window threshold detection |

{% hint style="info" %}
These granular cache fields are available in backfilled data where Claude Code's session logs contain the breakdown. Real-time telemetry reports the aggregate `cacheCreationTokenCount`.
{% endhint %}

### Session Metadata

| Data Point                 | Type   | Description                                               |
| -------------------------- | ------ | --------------------------------------------------------- |
| `claude_code.version`      | String | Claude Code application version                           |
| `claude_code.cwd`          | String | Working directory during the session                      |
| `claude_code.git_branch`   | String | Git branch name in the working directory                  |
| `claude_code.speed`        | String | Speed/quality setting: `instant`, `normal`, or `thorough` |
| `claude_code.service_tier` | String | Anthropic API service tier used for the request           |

{% hint style="info" %}
Session metadata fields are extracted from Claude Code's local session logs during backfill. They provide context about how and where AI coding assistance was used, without capturing any code or prompt content. These fields are currently extracted and logged by the backend mapper; full ClickHouse persistence is pending a schema migration.
{% endhint %}

### Subscription Tiers

Claude Code subscriptions determine the `cost_multiplier` applied to usage costs:

| Tier           | `cost_multiplier` | Description                                                   |
| -------------- | ----------------- | ------------------------------------------------------------- |
| `pro`          | 0.16              | Anthropic Pro plan (16% of API pricing)                       |
| `max_5x`       | 0.16              | Anthropic Max 5x plan (16% of API pricing)                    |
| `max_20x`      | 0.08              | Anthropic Max 20x plan (8% of API pricing)                    |
| `team_premium` | 0.24              | Anthropic Team Premium plan (24% of API pricing)              |
| `enterprise`   | 0.05              | Anthropic Enterprise plan (5% of API pricing)                 |
| `api`          | 1.0               | Direct API usage (full API pricing, no subscription discount) |
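
To make the multipliers concrete, here is a hedged sketch of the discount arithmetic. The multiplier values come from the table above; the helper function and the fallback for unknown tiers are illustrative assumptions, not the Revenium backend implementation:

```typescript
// Sketch only: applying a subscription tier's cost_multiplier to API list
// pricing. Multiplier values are from the tier table on this page.
const COST_MULTIPLIERS: Record<string, number> = {
  pro: 0.16,
  max_5x: 0.16,
  max_20x: 0.08,
  team_premium: 0.24,
  enterprise: 0.05,
  api: 1.0,
};

function discountedCost(apiListCostUsd: number, tier: string): number {
  // Unknown tiers fall back to full API pricing (an assumption for this sketch).
  const multiplier = COST_MULTIPLIERS[tier] ?? 1.0;
  return apiListCostUsd * multiplier;
}
```

For example, a request that would cost $1.00 at API list pricing is recorded as $0.08 on a Max 20x plan.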

### Data Collection Modes

Claude Code supports two data collection modes:

| Mode          | Description                                                                                                                                                                                                                |
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Real-time** | Telemetry is exported automatically during each Claude Code session via OTLP hooks. Captures core token, cost, and timing metrics.                                                                                         |
| **Backfill**  | The `revenium-metering backfill` command scans local Claude Code session logs (`~/.claude/projects/`) and sends historical usage data. Captures extended token breakdown and session metadata in addition to core metrics. |

Backfill is idempotent — deterministic transaction IDs (SHA-256 hash of session ID, timestamp, model, and token counts) prevent duplicate records.
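
A deterministic transaction ID can be sketched as follows. The hashed fields match the description above (session ID, timestamp, model, and token counts), but the exact field order and separator used by the real CLI are assumptions:

```typescript
import { createHash } from "node:crypto";

// Sketch of a deterministic backfill transaction ID: SHA-256 over the
// session ID, timestamp, model, and token counts. The join format here
// is illustrative, not the actual CLI's hashing scheme.
function backfillTransactionId(
  sessionId: string,
  timestamp: string,
  model: string,
  inputTokens: number,
  outputTokens: number
): string {
  const material = [sessionId, timestamp, model, inputTokens, outputTokens].join("|");
  return createHash("sha256").update(material).digest("hex");
}
```

Because identical inputs always hash to the same ID, re-running a backfill emits records with identical transaction IDs, which the backend can deduplicate.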

***

## Gemini Data Points

Gemini data flows into Revenium through two independent integration paths:

|                     | CLI SDK                                                      | Go Middleware                                |
| ------------------- | ------------------------------------------------------------ | -------------------------------------------- |
| **Package**         | `@revenium/cli`                                              | `github.com/revenium/revenium-go-sdk/google` |
| **Use case**        | Metering developer Gemini CLI usage                          | Metering server-side Go applications         |
| **Runs on**         | Developer workstation (one-time setup)                       | Server-side, wraps `genai` Go client         |
| **Fields captured** | 26 common fields ([Common Data Points](#common-data-points)) | \~52 fields (26 common + 26 extended)        |
| **Protocol**        | OTLP/HTTP logs                                               | Revenium Completions API                     |

### Gemini CLI SDK Data Points

The CLI SDK configures Gemini CLI's native OTLP export to send telemetry to Revenium. It captures the [Common Data Points](#common-data-points) listed above — token metrics, cost, model identity, timing, and attribution.

{% hint style="info" %}
**Need extended timing, tracing, vision detection, or prompt capture?** These require the [Go Middleware](#gemini-go-middleware-data-points) integration below.
{% endhint %}

Gemini CLI operates in **real-time only** — there is no backfill capability. Telemetry is captured and exported as each Gemini CLI request completes.

### Gemini Go Middleware Data Points

In addition to the [Common Data Points](#common-data-points) above, the Go middleware captures the following extended fields:

#### Extended Timing

| Data Point            | Type      | Description                                             |
| --------------------- | --------- | ------------------------------------------------------- |
| `responseTime`        | Timestamp | When the response was fully received                    |
| `completionStartTime` | Timestamp | When the model began generating tokens                  |
| `timeToFirstToken`    | Integer   | Time from request start to first token, in milliseconds |

#### Streaming & Model Configuration

| Data Point    | Type    | Description                                                                                 |
| ------------- | ------- | ------------------------------------------------------------------------------------------- |
| `isStreamed`  | Boolean | Whether the response was streamed. (For comparison, this field is hardcoded to `true` in Gemini CLI telemetry and `false` for Cursor IDE.) |
| `temperature` | Float   | Temperature setting from the generation config                                              |

#### Additional Metadata

| Data Point             | Type    | Description                       |
| ---------------------- | ------- | --------------------------------- |
| `taskType`             | String  | Task type classification          |
| `taskId`               | String  | Task identifier                   |
| `subscriptionId`       | String  | Subscription identifier           |
| `modelSource`          | String  | Model source identifier           |
| `mediationLatency`     | Integer | Mediation latency in milliseconds |
| `responseQualityScore` | Float   | Response quality score            |
| `credentialAlias`      | String  | Credential alias for routing      |

#### Distributed Tracing

| Data Point            | Type    | Description                                                 |
| --------------------- | ------- | ----------------------------------------------------------- |
| `traceType`           | String  | Trace type classification (e.g., `completion`, `embedding`) |
| `traceName`           | String  | Human-readable trace name                                   |
| `environment`         | String  | Deployment environment (e.g., `production`, `development`)  |
| `region`              | String  | Cloud region for the request                                |
| `retryNumber`         | Integer | Retry attempt number (0 for first attempt)                  |
| `parentTransactionId` | String  | Parent transaction ID for request chaining                  |

#### Vision Content Detection

| Data Point                           | Type         | Description                                                              |
| ------------------------------------ | ------------ | ------------------------------------------------------------------------ |
| `hasVisionContent`                   | Boolean      | Whether the request contained image content                              |
| `attributes.vision_image_count`      | Integer      | Number of images detected in the request (nested in `attributes` object) |
| `attributes.vision_total_size_bytes` | Integer      | Total size of image data in bytes (nested in `attributes` object)        |
| `attributes.vision_media_types`      | String Array | MIME types of detected images (e.g., `["image/png", "image/jpeg"]`)      |

{% hint style="info" %}
Vision detection metadata is only populated when the Gemini request includes image or multimodal content. The `vision_*` fields are nested inside an `attributes` object in the payload. This helps track the adoption of vision capabilities in coding workflows.
{% endhint %}

#### Optional Prompt Capture

{% hint style="warning" %}
Prompt capture is **disabled by default** and must be explicitly enabled in the middleware configuration. When enabled, the following fields are populated. Organizations should review their data handling policies before enabling this feature.
{% endhint %}

| Data Point         | Type    | Description                                      |
| ------------------ | ------- | ------------------------------------------------ |
| `systemPrompt`     | String  | System prompt content                            |
| `inputMessages`    | String  | Input messages (JSON)                            |
| `outputResponse`   | String  | Model response content                           |
| `promptsTruncated` | Boolean | Whether content was truncated due to size limits |

#### Stop Reason Mapping

Gemini CLI normalizes Google's finish reasons to Revenium's internal `StopReason` enum:

| Gemini Finish Reason                                                         | Revenium StopReason         | Description                                                     |
| ---------------------------------------------------------------------------- | --------------------------- | --------------------------------------------------------------- |
| `STOP`                                                                       | `END`                       | Normal completion                                               |
| `MAX_TOKENS`                                                                 | `TOKEN_LIMIT`               | Token limit reached                                             |
| `SAFETY`, `BLOCKLIST`, `PROHIBITED_CONTENT`, `SPII`, `MODEL_ARMOR`           | `ERROR`                     | Content safety filter triggered                                 |
| `RECITATION`, `IMAGE_SAFETY`, `IMAGE_PROHIBITED_CONTENT`, `IMAGE_RECITATION` | `ERROR`                     | Recitation or image safety filter                               |
| `MALFORMED_FUNCTION_CALL`, `UNEXPECTED_TOOL_CALL`, `NO_IMAGE`                | `ERROR`                     | Tool call or image error                                        |
| `CANCELLED` / `CANCELED`                                                     | `CANCELLED`                 | Request cancelled                                               |
| `FINISH_REASON_UNSPECIFIED`, `OTHER`, `IMAGE_OTHER`                          | *(caller-supplied default)* | Returns the default stop reason provided by the calling context |
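
The normalization in the table can be sketched as a small mapping function. The function shape and name are illustrative; only the finish-reason-to-enum mapping itself comes from this page:

```typescript
type StopReason = "END" | "TOKEN_LIMIT" | "ERROR" | "CANCELLED";

// Sketch of the Gemini finish-reason normalization described above.
function normalizeFinishReason(finishReason: string, fallback: StopReason): StopReason {
  switch (finishReason) {
    case "STOP":
      return "END";
    case "MAX_TOKENS":
      return "TOKEN_LIMIT";
    case "CANCELLED":
    case "CANCELED":
      return "CANCELLED";
    case "FINISH_REASON_UNSPECIFIED":
    case "OTHER":
    case "IMAGE_OTHER":
      return fallback; // caller-supplied default
    default:
      // Safety, recitation, tool-call, and image-error reasons all map to ERROR.
      return "ERROR";
  }
}
```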

***

## Cursor IDE Data Points

In addition to the [Common Data Points](#common-data-points) above, Cursor IDE captures the following through its Admin API sync:

### Billing Classification

| Data Point                      | Type   | Description                                                                                                 |
| ------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------- |
| `billing.kind`                  | String | Cursor billing classification (`Included`, `Premium`, etc.) — determines whether usage counts against quota |
| `operation_type`                | String | Operation type from Cursor (e.g., request classification)                                                   |
| `stop_reason` / `finish_reason` | String | Finish reason from Cursor                                                                                   |

{% hint style="info" %}
When `billing.kind` is `Included`, the backend sets `billingSkipped = true`, `skipReason = FREE_TIER`, and forces `totalCost` to `null` — indicating the request was covered by the subscription and incurred no additional cost.
{% endhint %}
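
The free-tier handling can be sketched as follows. Field names mirror this page, but the function is an illustrative assumption, not the backend implementation:

```typescript
interface BillingOutcome {
  billingSkipped: boolean;
  skipReason: string | null;
  totalCost: number | null;
}

// Sketch of the Included/free-tier handling described above.
function classifyCursorBilling(billingKind: string, cost: number): BillingOutcome {
  if (billingKind === "Included") {
    // Covered by the subscription: no additional cost is recorded.
    return { billingSkipped: true, skipReason: "FREE_TIER", totalCost: null };
  }
  return { billingSkipped: false, skipReason: null, totalCost: cost };
}
```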

{% hint style="warning" %}
**Cursor IDE integration is under active development.** Additional fields such as `cursor.token_fee`, `cursor.requests_costs`, and `cursor.is_token_based` are planned but not yet mapped in the backend. This section will be updated as the integration matures.
{% endhint %}

### Data Collection Mode

Cursor IDE usage data is synced periodically from Cursor's Admin API and exported to Revenium via OTLP. Unlike Claude Code and Gemini CLI, telemetry is not captured in real time during each request — it is pulled at regular intervals after the fact.

***

## Derived Fields

The following fields are **not sent by the SDKs** but are calculated by the Revenium backend during ingestion:

| Field                           | Derivation                                            | Description                                  |
| ------------------------------- | ----------------------------------------------------- | -------------------------------------------- |
| `inputTokenCost`                | `inputTokenCount × model_input_cost_per_token`        | Cost attributed to input tokens              |
| `outputTokenCost`               | `outputTokenCount × model_output_cost_per_token`      | Cost attributed to output tokens             |
| `cacheCreationTokenCost`        | `cacheCreationTokenCount × model_cache_creation_cost` | Cost attributed to cache creation            |
| `cacheReadTokenCost`            | `cacheReadTokenCount × model_cache_read_cost`         | Cost attributed to cache reads               |
| `totalCost` (when not provided) | Sum of all token costs                                | Calculated when SDK sends zero or null cost  |
| `apiKey`                        | Extracted from `x-api-key` HTTP header                | Authentication key for tenant identification |
| `credentialId`                  | Extracted from `subscriber` JSON                      | Credential identifier for access control     |

***

## OTLP Transport Details

For teams implementing custom integrations or verifying data flow, here are the OTLP transport details:

### Endpoint

```
POST {base_url}/v1/logs
```

Where `base_url` is typically `https://api.revenium.ai/meter/v2/otlp`.

### Authentication

```
x-api-key: hak_XXXX_your_key_here
```

### Payload Format

All integrations use the OTLP/HTTP JSON format (`application/json`):

```json
{
  "resourceLogs": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "claude-code" } },
        { "key": "cost_multiplier", "value": { "doubleValue": 0.08 } }
      ]
    },
    "scopeLogs": [{
      "scope": { "name": "claude-code", "version": "1.0.0" },
      "logRecords": [{
        "timeUnixNano": "1711324800000000000",
        "body": { "stringValue": "claude_code.api_request" },
        "attributes": [
          { "key": "session.id", "value": { "stringValue": "sess-abc123" } },
          { "key": "model", "value": { "stringValue": "claude-opus-4-5-20251101" } },
          { "key": "input_tokens", "value": { "intValue": 1500 } },
          { "key": "output_tokens", "value": { "intValue": 2000 } },
          { "key": "cache_read_tokens", "value": { "intValue": 500 } },
          { "key": "cache_creation_tokens", "value": { "intValue": 0 } },
          { "key": "total_input_tokens", "value": { "intValue": 2000 } }
        ]
      }]
    }]
  }]
}
```

{% hint style="info" %}
The example above shows a Claude Code backfill payload with the core token attributes. The real-time test/connectivity payload (via `revenium-metering test`) uses `stringValue` for token fields and additionally sends `cost_usd` and `duration_ms`. Gemini CLI payloads follow the same OTLP structure with `service.name` set to `gemini-cli` and scope name set to `gemini_cli`.
{% endhint %}

***

## Related Documentation

* [AI Coding Dashboard](https://docs.revenium.io/ai-coding-dashboard) — Dashboard views and analysis features
* [Integration Options for AI Metering](https://docs.revenium.io/integration-options-for-ai-metering) — Setup instructions for all integrations
* [OpenTelemetry Integration](https://docs.revenium.io/opentelemetry-integration) — General OTLP integration guide
* [Cost & Performance Alerts](https://docs.revenium.io/cost-and-performance-alerts) — Alerting on coding assistant metrics
