Analyzing data from Query History export

You can use Query History export to bring Query History data to an external monitoring solution for further analysis.

In this recipe, we will show you how to export Query History data to Amazon S3, and then analyze it using Cube by reading the data from S3 using DuckDB.

Configuration

The following Vector configuration exports Query History data to Amazon S3 and also outputs it to the console of the Vector agent in your Cube Cloud deployment.

In the example below, we are using the aws_s3 sink to export data to the cube-query-history-export-demo bucket in Amazon S3, but you can use any other storage solution that Vector supports.

[sinks.aws_s3]
type = "aws_s3"
inputs = [
  "query-history"
]
bucket = "cube-query-history-export-demo"
region = "us-east-2"
compression = "gzip"
 
[sinks.aws_s3.auth]
access_key_id = "$CUBE_CLOUD_MONITORING_AWS_ACCESS_KEY_ID"
secret_access_key = "$CUBE_CLOUD_MONITORING_AWS_SECRET_ACCESS_KEY"
 
[sinks.aws_s3.encoding]
codec = "json"
 
[sinks.aws_s3.healthcheck]
enabled = false
 
 
[sinks.my_console]
type = "console"
inputs = [
  "query-history"
]
target = "stdout"
encoding = { codec = "json" }

You'll also need to set the following environment variables on the Settings → Environment variables page of your Cube Cloud deployment:

CUBE_CLOUD_MONITORING_AWS_ACCESS_KEY_ID=your-access-key-id
CUBE_CLOUD_MONITORING_AWS_SECRET_ACCESS_KEY=your-secret-access-key
 
CUBEJS_DB_DUCKDB_S3_ACCESS_KEY_ID=your-access-key-id
CUBEJS_DB_DUCKDB_S3_SECRET_ACCESS_KEY=your-secret-access-key
CUBEJS_DB_DUCKDB_S3_REGION=us-east-2
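Once the export is running, Vector writes newline-delimited, gzip-compressed JSON records to the bucket. For orientation, a single record might look roughly like the following. The field names are those referenced in the data model below, but the values here are invented and the exact schema of your export may differ:

{
  "trace_id": "a1b2c3d4",
  "deployment_id": 123,
  "environment_name": null,
  "api_type": "rest",
  "api_query": "{\"measures\":[\"requests.count\"]}",
  "security_context": "{\"user\":\"alice\"}",
  "cache_type": "miss",
  "start_time_unix_ms": 1700000000000,
  "end_time_unix_ms": 1700000000250,
  "api_response_duration_ms": 250,
  "status": "success",
  "error_message": null
}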

Data modeling

The following is an example data model for analyzing Query History export data stored in a bucket in Amazon S3. The data is read directly from S3 using DuckDB.

With this data model, you can run queries that aggregate data by dimensions such as status, environment_name, and api_type, and calculate metrics like count, total_duration, or avg_duration:

cubes:
  - name: requests
    sql: >
      SELECT
        *,
        api_response_duration_ms / 1000 AS api_response_duration,
        EPOCH_MS(start_time_unix_ms) AS start_time,
        EPOCH_MS(end_time_unix_ms) AS end_time
      FROM read_json_auto('s3://cube-query-history-export-demo/**/*.log.gz')
    
    dimensions:
      - name: trace_id
        sql: trace_id
        type: string
        primary_key: true
 
      - name: deployment_id
        sql: deployment_id
        type: number
 
      - name: environment_name
        sql: environment_name
        type: string
 
      - name: api_type
        sql: api_type
        type: string
 
      - name: api_query
        sql: api_query
        type: string
 
      - name: security_context
        sql: security_context
        type: string
 
      - name: cache_type
        sql: cache_type
        type: string
 
      - name: start_time
        sql: start_time
        type: time
 
      - name: end_time
        sql: end_time
        type: time
 
      - name: duration
        sql: api_response_duration
        type: number
 
      - name: status
        sql: status
        type: string
 
      - name: error_message
        sql: error_message
        type: string
 
      - name: user_name
        sql: "SUBSTRING(security_context::JSON ->> 'user', 3, LENGTH(security_context::JSON ->> 'user') - 4)"
        type: string
 
    segments:
      - name: production_environment
        sql: "{environment_name} IS NULL"
 
      - name: errors
        sql: "{status} <> 'success'"
 
    measures:
      - name: count
        type: count
 
      - name: count_non_production
        description: >
          Counts all non-production environments.
          See for details: https://cube.dev/docs/product/workspace/environments
        type: count
        filters:
          - sql: "{environment_name} IS NOT NULL"
 
      - name: total_duration
        type: sum
        sql: "{duration}"
        
      - name: avg_duration
        type: number
        sql: "{total_duration} / {count}"
 
      - name: median_duration
        type: number
        sql: "MEDIAN({duration})"
 
      - name: min_duration
        type: min
        sql: "{duration}"
 
      - name: max_duration
        type: max
        sql: "{duration}"
        
    pre_aggregations:
      - name: count_and_durations_by_status_and_start_date
        measures:
          - count
          - min_duration
          - max_duration
          - total_duration
        dimensions:
          - status
        time_dimension: start_time
        granularity: hour
        refresh_key:
          sql: SELECT MAX(end_time) FROM {requests.sql()}
          every: 10 minutes
 
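The SQL and measure definitions above involve a bit of arithmetic that is easy to sanity-check outside DuckDB: EPOCH_MS converts Unix milliseconds to a timestamp, api_response_duration_ms / 1000 converts milliseconds to seconds, and avg_duration is defined as total_duration divided by count. A minimal pure-Python sketch of the same logic, using invented sample durations:

```python
# Pure-Python sketch of the arithmetic in the model above.
# The sample values are invented for illustration only.
from datetime import datetime, timezone
from statistics import median

# EPOCH_MS(start_time_unix_ms): Unix milliseconds -> timestamp
start_time = datetime.fromtimestamp(1700000000000 / 1000, tz=timezone.utc)

# api_response_duration_ms / 1000 -> duration in seconds, per request
durations_ms = [250, 1200, 400, 90]
durations = [ms / 1000 for ms in durations_ms]

count = len(durations)                 # measure: count
total_duration = sum(durations)        # measure: total_duration (sum)
avg_duration = total_duration / count  # measure: avg_duration
median_duration = median(durations)    # measure: median_duration

print(count, total_duration, avg_duration, median_duration)
```

Note that avg_duration computed as a quotient of two measures stays additive-safe when queries are served from the pre-aggregation, which is why the model derives it from total_duration and count rather than averaging raw rows.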

Result

Example query in Playground:
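As one example of a query you might run in Playground, the following Cube JSON query (a sketch using member names from the model above) requests hourly request counts and maximum durations for failed requests only:

{
  "measures": ["requests.count", "requests.max_duration"],
  "timeDimensions": [
    {
      "dimension": "requests.start_time",
      "granularity": "hour"
    }
  ],
  "segments": ["requests.errors"]
}

Because this query uses only members covered by the count_and_durations_by_status_and_start_date pre-aggregation's measures, time dimension, and granularity, it can be served from the pre-aggregated data.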