Dataset Tools — MCP Tool Reference

Dataset tools implement the medallion data pipeline: extracting survey responses into Bronze datasets, applying post-stratification weighting to create Silver datasets, and refining into Gold datasets for export. Tools also provide quality metrics, semantic text coding, and multi-format export.

list_datasets

List datasets with optional filters.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| stage | string | No | | Filter by stage: "bronze", "silver", or "gold" |
| source_type | string | No | | Filter: "campaign", "project", "upload", "weighted", or "refined" |
| campaign_id | string | No | | Filter by campaign ID |
| limit | integer | No | 100 | Maximum results to return |

Returns: { items: [...], count, limit }


get_dataset

Get detailed dataset information including column schema and metrics.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID |

Returns: Full dataset object with id, name, description, stage, source_type, row_count, column_count, column_schema, has_weights, weight_column, weighting_method, weighting_config, quality_score, processing_status, strategy_id.


create_bronze_dataset

Extract survey responses from one or more campaigns into a Bronze dataset.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | | Dataset name |
| campaign_id | string | No | | Single campaign to extract from |
| project_id | string | No | | Project ID (required with campaign_ids) |
| campaign_ids | string[] | No | | List of campaign IDs to extract from |
| include_demographics | boolean | No | true | Include respondent demographic columns |

Returns: { dataset_id, name, stage: "bronze", ... } with extraction report.

Source Selection

Provide either campaign_id alone, or both project_id and campaign_ids together.
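The source-selection rule can be sketched as a small validator. This is illustrative only (the function name and error-raising behavior are assumptions, not part of the tool API); the error string matches the one documented under Common Errors.

```python
def validate_bronze_source(campaign_id=None, project_id=None, campaign_ids=None):
    """Enforce the create_bronze_dataset source rule: either campaign_id
    alone, or project_id together with campaign_ids. Illustrative sketch."""
    if campaign_id and not (project_id or campaign_ids):
        return True
    if project_id and campaign_ids and not campaign_id:
        return True
    raise ValueError(
        "Either campaign_id or both project_id and campaign_ids are required"
    )
```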


apply_raking

Apply post-stratification weighting (iterative proportional fitting) to a Bronze dataset, creating a Silver dataset.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Bronze dataset UUID |

Returns: { bronze_dataset_id, silver_dataset_id, stage: "silver", ... } with raking report.

Stage validation: Only Bronze datasets can be weighted. An error is returned for other stages.


create_gold_dataset

Create a refined Gold dataset from a Silver or Bronze source, optionally transforming columns.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| source_id | string | Yes | | Source dataset UUID (Silver or Bronze) |
| name | string | No | "{source_name} (Gold)" | Name for the Gold dataset |
| fields | object[] | No | | Column transformations |

Each fields entry:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Original column name |
| name | string | New display name |
| deleted | boolean | true to exclude column |

Returns: { gold_dataset_id, name, stage: "gold", source_id, source_stage, row_count, column_count }
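The effect of a fields list on a column schema can be sketched as follows. This is a hypothetical helper for illustration, not the server's implementation; only the entry shape (id, name, deleted) comes from the reference above.

```python
def apply_field_transforms(columns, fields):
    """Apply create_gold_dataset-style field entries to a list of column
    names: deleted=True drops the column, a 'name' value renames it, and
    columns without an entry pass through unchanged. Illustrative sketch."""
    by_id = {f["id"]: f for f in fields}
    result = []
    for col in columns:
        spec = by_id.get(col)
        if spec is None:
            result.append(col)          # no transformation requested
        elif spec.get("deleted"):
            continue                    # excluded from the Gold dataset
        else:
            result.append(spec.get("name", col))  # optional rename
    return result
```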


export_dataset

Export a dataset in the specified format.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID |
| format | string | No | "csv" | Export format: "csv", "xlsx", "spss", or "parquet" |

Returns: { path, ... } with format-specific metadata.

Export Formats

| Format | Extension | Description |
| --- | --- | --- |
| csv | .csv | Universal format for most tools |
| xlsx | .xlsx | Excel with summary and schema sheets |
| spss | .sav | SPSS with variable and value labels |
| parquet | .parquet | Columnar format with full metadata (Snappy compression) |

get_dataset_quality

Get sample representativeness metrics comparing dataset demographics against a sampling strategy's targets.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID |
| strategy_id | string | No | | Sampling strategy UUID. Auto-detected from campaign if omitted |

Returns: Quality metrics including overall score, composite error, RMSE, and per-factor deviation analysis.
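The RMSE component can be made concrete with a sketch. This assumes per-cell deviation means observed sample proportion minus strategy target proportion; the server's exact formula and cell weighting are not specified here, so treat this as an interpretation, not the implementation.

```python
import math

def deviation_rmse(observed, targets):
    """Root-mean-square deviation between observed demographic proportions
    and strategy target proportions, one cell per target key. Sketch of
    the kind of representativeness metric reported by get_dataset_quality."""
    cells = list(targets)
    squared = [(observed.get(c, 0.0) - targets[c]) ** 2 for c in cells]
    return math.sqrt(sum(squared) / len(squared))
```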


compare_dataset_quality

Compare representativeness metrics between Bronze (unweighted) and Silver (weighted) datasets to evaluate raking effectiveness.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| bronze_dataset_id | string | Yes | | Bronze dataset UUID |
| silver_dataset_id | string | Yes | | Silver dataset UUID |
| strategy_id | string | No | | Sampling strategy UUID. Auto-detected if omitted |

Returns: Comparison metrics showing improvement from post-stratification weighting.


get_dataset_response_quality

Analyze response-level quality metrics for a dataset. Evaluates answer diversity, response patterns, internal consistency, and acquiescence bias from the actual survey responses, complementing get_dataset_quality, which measures sample representativeness.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID (Bronze or Silver) |

Returns:

```json
{
  "per_question": [
    {
      "column_name": "q1",
      "item_id": "q1",
      "item_kind": "Question",
      "item_title": "Preferred method",
      "control_type": "radio",
      "normalized_entropy": 0.87,
      "effective_num_answers": 3.2,
      "option_usage_ratio": 1.0,
      "item_variance": null,
      "item_nonresponse_rate": 0.02,
      "n_responses": 98,
      "n_missing": 2,
      "n_total": 100,
      "response_distribution": {"a": 0.35, "b": 0.40, "c": 0.25}
    }
  ],
  "straightlining": [
    {
      "item_id": "g1",
      "item_title": "Satisfaction Battery",
      "item_kind": "QuestionGroup",
      "straightlining_score": 0.08,
      "n_respondents": 95,
      "n_straightliners": 8
    }
  ],
  "reliability": [
    {
      "item_id": "g1",
      "item_title": "Satisfaction Battery",
      "item_kind": "QuestionGroup",
      "cronbachs_alpha": 0.82,
      "n_items": 4,
      "n_respondents": 90
    }
  ],
  "mean_normalized_entropy": 0.75,
  "acquiescence_bias_index": 0.52,
  "overall_nonresponse_rate": 0.04,
  "n_questions_analyzed": 12,
  "n_categorical": 8,
  "n_numeric": 2,
  "n_text": 2,
  "sample_size": 100,
  "warnings": []
}
```

Metric Reference

Per-question (categorical):

| Metric | Range | Description |
| --- | --- | --- |
| normalized_entropy | 0–1 | Answer diversity (1.0 = uniform, 0.0 = single value) |
| effective_num_answers | 1–k | How many options carry meaningful weight |
| option_usage_ratio | 0–1 | Fraction of options that received responses |
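These three categorical metrics follow standard information-theoretic definitions, which a short sketch makes concrete (edge-case handling in the actual server may differ):

```python
import math

def categorical_diversity(distribution, n_options):
    """Diversity metrics for a categorical item, from its response
    distribution (proportions summing to 1) and the number of offered
    options. Standard definitions; shown for illustration."""
    probs = [p for p in distribution.values() if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)   # Shannon entropy
    max_entropy = math.log(n_options)                # uniform over k options
    return {
        "normalized_entropy": entropy / max_entropy,   # 1.0 = uniform usage
        "effective_num_answers": math.exp(entropy),    # perplexity of answers
        "option_usage_ratio": len(probs) / n_options,  # options actually used
    }
```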

Per-question (numeric):

| Metric | Range | Description |
| --- | --- | --- |
| item_variance | 0+ | Response spread (weighted if Silver dataset) |

Group metrics (QuestionGroup, MatrixQuestion):

| Metric | Range | Description |
| --- | --- | --- |
| straightlining_score | 0–1 | Identical-answer proportion across sub-items |
| cronbachs_alpha | 0–1 | Internal consistency (3+ sub-items required) |
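Both group metrics have standard textbook definitions, sketched below for clarity. The formulas are the conventional ones; the server's treatment of missing answers within a group is not specified here.

```python
def _variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def straightlining_score(rows):
    """Proportion of respondents giving the identical answer to every
    sub-item in a group (rows = one answer list per respondent)."""
    flat = sum(1 for r in rows if len(set(r)) == 1)
    return flat / len(rows)

def cronbachs_alpha(item_scores):
    """Cronbach's alpha for a battery: item_scores holds one list of
    respondent scores per sub-item, in the same respondent order."""
    k = len(item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_var = sum(_variance(s) for s in item_scores)
    return (k / (k - 1)) * (1 - item_var / _variance(totals))
```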

Dataset aggregates:

| Metric | Description | Concern Threshold |
| --- | --- | --- |
| mean_normalized_entropy | Average diversity across categorical questions | < 0.3 |
| acquiescence_bias_index | Agree-side proportion on Likert items (0.5 = no bias) | > 0.6 |
| overall_nonresponse_rate | Mean skip rate across questions | > 0.15 |
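The concern thresholds can be applied mechanically to a get_dataset_response_quality result. The helper below is a hypothetical client-side convenience, not part of the tool API; the thresholds come straight from the table.

```python
def flag_response_quality(report):
    """Return a list of quality concerns for a response-quality report,
    using the documented concern thresholds. Illustrative helper."""
    flags = []
    if report["mean_normalized_entropy"] < 0.3:
        flags.append("low answer diversity")
    if report["acquiescence_bias_index"] > 0.6:
        flags.append("acquiescence bias")
    if report["overall_nonresponse_rate"] > 0.15:
        flags.append("high nonresponse")
    return flags
```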

Weight Awareness

For Silver datasets, entropy, variance, and acquiescence metrics use the weight column automatically. Straightlining uses unweighted respondent counts since weighting doesn't affect response patterns.


code_text_responses

Automatically code open-ended text responses using semantic clustering. Creates a Gold dataset with binary indicator columns for each discovered theme.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Bronze or Silver dataset UUID |
| columns | string[] | No | all Textarea columns | Text columns to code |
| labeling_method | string | No | "tfidf" | "tfidf" for keyword-based labels or "llm" for AI-generated labels |
| min_cluster_size | integer | No | 5 | Minimum responses to form a cluster |
| similarity_threshold | float | No | 0.65 | Cosine similarity threshold for multi-label assignment |
| multi_label | boolean | No | true | true for binary indicator columns, false for a single exclusive category column |

Returns:

```json
{
  "gold_dataset_id": "...",
  "name": "Survey Data (Coded)",
  "stage": "gold",
  "source_id": "...",
  "row_count": 500,
  "column_count": 35,
  "coding_results": [
    {
      "source_column": "q_feedback",
      "n_themes": 4,
      "n_responses": 480,
      "n_missing": 20,
      "avg_labels_per_response": 1.3,
      "theme_frequencies": {"Product quality": 120, "Customer service": 95},
      "indicator_columns": ["q_feedback_theme_1", "q_feedback_theme_2"]
    }
  ]
}
```

Uses bge-m3 embeddings via Ollama and HDBSCAN density-based clustering.
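The multi-label assignment step governed by similarity_threshold can be sketched as follows. The embeddings would come from bge-m3 in the real pipeline and the cluster centroids from HDBSCAN; here both are assumed precomputed, and the function itself is illustrative.

```python
import math

def assign_themes(embedding, centroids, threshold=0.65):
    """Tag a response with every theme whose cluster centroid has cosine
    similarity >= threshold to the response embedding. Sketch of the
    multi-label assignment step; not the server's implementation."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return [name for name, c in centroids.items() if cos(embedding, c) >= threshold]
```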


assign_strategy_to_dataset

Link a sampling strategy to a dataset for quality analysis. Enables quality metrics that compare actual demographics to strategy targets.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID |
| strategy_id | string | Yes | | Sampling strategy UUID |

Returns: { dataset_id, strategy_id, strategy_name, dataset_name, dataset_stage }


delete_dataset

Delete a dataset and optionally its data file.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | Yes | | Dataset UUID |
| delete_file | boolean | No | false | Also delete the Parquet data file |

Returns: { deleted: true, dataset_id, dataset_info: { name, stage, row_count, data_path }, file_deleted }


Error Handling

Stage Validation

The medallion pipeline enforces stage progression:

| Operation | Required Source Stage | Error |
| --- | --- | --- |
| apply_raking | Bronze | "Can only weight bronze datasets (got stage={stage})" |
| create_gold_dataset | Silver or Bronze | "Can only create gold from silver or bronze (got stage={stage})" |
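A client can pre-check stage progression before calling the tools. The guard below mirrors the documented rules and error wording; the function itself is a hypothetical convenience, not part of the API.

```python
def check_stage(operation, stage):
    """Raise if a dataset's stage is not a valid source for the given
    operation, per the medallion stage-progression rules. Sketch only."""
    allowed = {
        "apply_raking": {"bronze"},
        "create_gold_dataset": {"silver", "bronze"},
    }
    messages = {
        "apply_raking": "Can only weight bronze datasets",
        "create_gold_dataset": "Can only create gold from silver or bronze",
    }
    if stage not in allowed[operation]:
        raise ValueError(f"{messages[operation]} (got stage={stage})")
```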

Common Errors

| Scenario | Response |
| --- | --- |
| Dataset not found | {"error": "Dataset {id} not found"} |
| Strategy not found | {"error": "Sampling strategy {id} not found"} |
| No data file | {"error": "Dataset {id} has no data file"} |
| Invalid export format | {"error": "Invalid format. Allowed: csv, xlsx, spss, parquet"} |
| No codeable columns | {"error": "No columns could be coded (too few responses or no Textarea columns found)"} |
| Missing source params | {"error": "Either campaign_id or both project_id and campaign_ids are required"} |