An SRE‑driven approach to reduce telemetry spend while protecting decision‑grade signal. This playbook combines OpenTelemetry sampling, value‑based retention, and native GCP levers (Logging exclusions, BigQuery partitioning/TTL, Monitoring cardinality controls).
```yaml
processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      # "slow" requests are handled by the separate latency policy below,
      # so this policy is named for what it actually matches: errors.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      - name: key-endpoints
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/checkout", "/payment", "/login"]
      - name: default-probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      processors: [memory_limiter, resourcedetection, tail_sampling, batch]
```
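To reason about what this policy set costs, it helps to estimate the retained share of spans. A back-of-the-envelope sketch, where the traffic mix (error rate, slow rate, key-endpoint share) is an illustrative assumption rather than a measurement:

```python
# Rough estimate of how much trace volume the tail-sampling policies above
# retain. The traffic-mix numbers are illustrative assumptions.
def retained_fraction(error_rate, slow_rate, key_endpoint_rate, default_pct):
    # Traces matching any keep-all policy are retained in full; the rest
    # fall through to the probabilistic policy.
    keep_all = min(error_rate + slow_rate + key_endpoint_rate, 1.0)  # upper bound: buckets may overlap
    return keep_all + (1.0 - keep_all) * (default_pct / 100.0)

# Assumed mix: 1% errors, 2% slow (>500 ms), 5% key-endpoint traffic, 5% default sampling.
frac = retained_fraction(0.01, 0.02, 0.05, 5)
print(f"retained: {frac:.1%} of spans")  # ~12.6%
```

Even with every error, every slow request, and every key-endpoint trace kept in full, the overall retained volume stays close to the probabilistic floor.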
```hcl
resource "google_logging_project_sink" "app_sink" {
  name        = "logs-to-bq"
  destination = "bigquery.googleapis.com/projects/${var.project_id}/datasets/observability"

  # severity>=INFO drops DEBUG and DEFAULT entries; severity>=DEFAULT would
  # match everything and defeat the purpose of the filter.
  filter = <<-EOF
    resource.type=("k8s_container" OR "cloud_run_revision")
    severity>=INFO
    -jsonPayload.debug
    -textPayload:("healthz" OR "readiness" OR "liveness")
  EOF

  unique_writer_identity = true
}
```
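The sink filter only shapes what reaches BigQuery; the same log entries are still ingested into the `_Default` bucket, which is where Cloud Logging ingestion billing happens. A project-level exclusion stops probe noise at the source. A sketch (resource name and description are illustrative):

```hcl
# Sketch: drop health-check noise before it is ingested into the _Default
# bucket at all, instead of paying to ingest it and filtering later.
resource "google_logging_project_exclusion" "healthchecks" {
  name        = "exclude-healthchecks"
  description = "Probe traffic carries no decision-grade signal"
  filter      = "textPayload:(\"healthz\" OR \"readiness\" OR \"liveness\")"
}
```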
```hcl
resource "google_bigquery_dataset" "obs" {
  dataset_id = "observability"
  location   = var.region
}
```
```hcl
resource "google_bigquery_table" "traces" {
  dataset_id = google_bigquery_dataset.obs.dataset_id
  table_id   = "traces"

  # Column-based partitioning: "receiveTimestamp" must exist in the table
  # schema as a top-level TIMESTAMP (or DATE) column.
  time_partitioning {
    type  = "DAY"
    field = "receiveTimestamp"
  }

  deletion_protection = false
  labels              = { purpose = "observability" }
}
```
```hcl
resource "google_bigquery_table_iam_member" "ro" {
  dataset_id = google_bigquery_dataset.obs.dataset_id
  table_id   = google_bigquery_table.traces.table_id
  role       = "roles/bigquery.dataViewer"
  member     = "group:analytics@example.com"
}
```
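Day-partitioning pays off twice: expired partitions stop accruing storage, and partition pruning shrinks per-query scan costs. A quick sketch of the query side; the table size is an illustrative assumption and the $/TiB rate should be checked against current BigQuery on-demand pricing for your region:

```python
# An incident query that touches 3 days of a 90-day day-partitioned table
# scans ~1/30 of the bytes. Rate and table size below are assumptions.
ON_DEMAND_USD_PER_TIB = 6.25  # assumed on-demand list price

def query_cost(table_tib, days_scanned, days_retained, usd_per_tib=ON_DEMAND_USD_PER_TIB):
    scanned_tib = table_tib * (days_scanned / days_retained)
    return scanned_tib * usd_per_tib

full_scan = query_cost(2.0, 90, 90)  # no partition filter: scan everything
pruned    = query_cost(2.0, 3, 90)   # WHERE clause on receiveTimestamp prunes partitions
print(f"full: ${full_scan:.2f}, pruned: ${pruned:.2f}")
```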
High-cardinality metrics are the silent budget killer on GCP Monitoring. A single metric with user_id as a label can generate millions of time series. The OTel Collector can aggregate before export:
```yaml
processors:
  metricstransform:
    transforms:
      - include: http_request_duration
        action: update
        operations:
          - action: aggregate_labels
            label_set: [method, status_class, service]
            aggregation_type: sum
```
This collapses high-cardinality labels (like full URL paths) into bounded sets (method + status class) before the data ever leaves the collector. The raw high-cardinality data is still available in traces via exemplars — you just don't pay to store it as metric series.
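The arithmetic behind this is stark. A sketch with illustrative label counts (the numbers below are assumptions, not measurements):

```python
# Time-series cardinality before vs. after the aggregate_labels step above.
paths, methods, statuses, services = 5_000, 5, 20, 12
before = paths * methods * statuses * services  # full URL path + raw status code
status_classes = 5                              # 1xx..5xx
after = methods * status_classes * services     # bounded label set
print(before, after, f"{before // after}x fewer series")
```

Every label in the set multiplies cardinality, so removing even one unbounded label (the URL path) collapses the series count by orders of magnitude.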
Not all telemetry data has the same shelf life, so we use a three-tier retention strategy: hot, warm, and cold. The key insight: most telemetry data is only valuable during the first 72 hours. Keeping 100% of traces for 90 days means paying hot-tier prices for cold-tier utility.
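One concrete lever for the cold tier is partition-level TTL, so expiry happens automatically instead of via cleanup jobs. A sketch (the dataset name and the 90-day window are illustrative assumptions; tune per tier):

```hcl
# Sketch: let BigQuery expire day partitions automatically instead of
# paying to keep everything hot. 90 days here is an assumed cold-tier TTL.
resource "google_bigquery_dataset" "obs_cold" {
  dataset_id                      = "observability_cold"
  location                        = var.region
  default_partition_expiration_ms = 90 * 24 * 60 * 60 * 1000 # 90 days
}
```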
Here's a representative cost breakdown for a platform processing ~50K requests/minute across 12 services on GCP:
| Category | Before | After | Reduction |
|---|---|---|---|
| Cloud Logging ingestion | $2,400/mo | $720/mo | 70% |
| Cloud Trace spans | $1,800/mo | $540/mo | 70% |
| Cloud Monitoring metrics | $900/mo | $450/mo | 50% |
| BigQuery storage/query | $600/mo | $300/mo | 50% |
| Total | $5,700/mo | $2,010/mo | 65% |
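The totals in the table reconcile; a quick check using the per-category figures above:

```python
# Sanity-check the before/after totals and the overall reduction.
before = {"logging": 2400, "trace": 1800, "monitoring": 900, "bigquery": 600}
after  = {"logging": 720,  "trace": 540,  "monitoring": 450, "bigquery": 300}
total_before, total_after = sum(before.values()), sum(after.values())
reduction = 1 - total_after / total_before
print(total_before, total_after, f"{reduction:.0%}")  # 5700 2010 65%
```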
Cost optimization is only valuable if you keep the signals that matter. After applying all of the optimizations above, we verified that no decision-grade signal was lost: the tail-sampling policies still capture every error and key-endpoint trace, and the aggregated metrics still drive the same dashboards and alerts.
The trick is that the 40-60% of telemetry we stopped ingesting was genuinely low-value: health check logs, debug-level noise, duplicate traces from retries, and metrics with unbounded cardinality that no dashboard ever queried.
The playbook extends to any cloud provider with similar levers. The principle is universal: sample at the edge, aggregate at the collector, tier by value, and verify you kept what matters. For more on the full observability framework these optimizations sit within, see Comprehensive Observability on GCP with OpenTelemetry.