Dataset Tools — MCP Tool Reference
Dataset tools implement the medallion data pipeline: they extract survey responses into Bronze datasets, apply post-stratification weighting to create Silver datasets, and refine those into Gold datasets for export. The tools also provide quality metrics, semantic text coding, and multi-format export.
list_datasets
List datasets with optional filters.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| stage | string | No | — | Filter by stage: "bronze", "silver", or "gold" |
| source_type | string | No | — | Filter: "campaign", "project", "upload", "weighted", or "refined" |
| campaign_id | string | No | — | Filter by campaign ID |
| limit | integer | No | 100 | Maximum results to return |
Returns: { items: [...], count, limit }
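As a rough illustration of how a client might invoke this tool, the sketch below wraps the arguments in a JSON-RPC 2.0 `tools/call` request, which is the request shape MCP uses for tool invocation. The helper name `build_tool_call` is hypothetical, and the transport (stdio, HTTP) is outside the scope of this reference.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Wrap tool arguments in a JSON-RPC 2.0 `tools/call` request body.

    Hypothetical helper: parameters left as None are dropped so the
    server can apply its documented defaults.
    """
    args = {key: value for key, value in arguments.items() if value is not None}
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": args},
    }


# List up to 25 Silver datasets; the unused filters are omitted from the request.
request = build_tool_call(
    "list_datasets",
    {"stage": "silver", "source_type": None, "campaign_id": None, "limit": 25},
)
print(json.dumps(request, indent=2))
```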
get_dataset
Get detailed dataset information including column schema and metrics.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID |
Returns: Full dataset object with id, name, description, stage, source_type, row_count, column_count, column_schema, has_weights, weight_column, weighting_method, weighting_config, quality_score, processing_status, strategy_id.
create_bronze_dataset
Extract survey responses from one or more campaigns into a Bronze dataset.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | — | Dataset name |
| campaign_id | string | No | — | Single campaign to extract from |
| project_id | string | No | — | Project ID (required with campaign_ids) |
| campaign_ids | string[] | No | — | List of campaign IDs to extract from |
| include_demographics | boolean | No | true | Include respondent demographic columns |
Returns: { dataset_id, name, stage: "bronze", ... } with extraction report.
Source Selection
Provide either campaign_id alone, or both project_id and campaign_ids together.
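The source-selection rule can be checked on the client before the call is made. The following is a minimal sketch; `resolve_source` is a hypothetical helper, and rejecting a mix of `campaign_id` with the project parameters is an assumption, since this reference does not say how the server treats that case.

```python
def resolve_source(campaign_id=None, project_id=None, campaign_ids=None):
    """Return the source arguments for create_bronze_dataset, or raise.

    Accepts either campaign_id alone, or project_id together with a
    non-empty campaign_ids list. Mixed input is rejected (assumption).
    """
    if campaign_id and not (project_id or campaign_ids):
        return {"campaign_id": campaign_id}
    if project_id and campaign_ids and not campaign_id:
        return {"project_id": project_id, "campaign_ids": list(campaign_ids)}
    # Same wording as the documented "Missing source params" error.
    raise ValueError(
        "Either campaign_id or both project_id and campaign_ids are required"
    )


single = resolve_source(campaign_id="c-1")
multi = resolve_source(project_id="p-1", campaign_ids=["c-1", "c-2"])
print(single, multi)
```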
apply_raking
Apply post-stratification weighting (iterative proportional fitting) to a Bronze dataset, creating a Silver dataset.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Bronze dataset UUID |
Returns: { bronze_dataset_id, silver_dataset_id, stage: "silver", ... } with raking report.
Stage validation: Only Bronze datasets can be weighted. The tool returns an error for a dataset at any other stage.
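To make iterative proportional fitting concrete, here is a minimal pure-Python sketch of raking on two demographic factors. It illustrates the general technique only: the tool's actual convergence criteria, weight trimming, and the way targets are read from a sampling strategy are not documented here, so the function name, arguments, and sample data are all assumptions.

```python
def rake(rows, targets, max_iter=100, tol=1e-6):
    """Iterative proportional fitting over categorical factors.

    rows:    list of dicts mapping factor -> category, one per respondent
    targets: {factor: {category: target proportion}}, each summing to 1
    Returns one weight per row, scaled to a mean of 1.
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_shift = 0.0
        for factor, target in targets.items():
            total = sum(weights)
            # Current weighted share of each category of this factor.
            shares = {}
            for row, w in zip(rows, weights):
                shares[row[factor]] = shares.get(row[factor], 0.0) + w / total
            # Scale every row so the factor's margins match the targets.
            for i, row in enumerate(rows):
                ratio = target[row[factor]] / shares[row[factor]]
                weights[i] *= ratio
                max_shift = max(max_shift, abs(ratio - 1.0))
        if max_shift < tol:
            break
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


# Illustrative sample: 60% female and 70% young, raked to 50/50 on both factors.
rows = (
    [{"gender": "F", "age": "young"}] * 4
    + [{"gender": "F", "age": "old"}] * 2
    + [{"gender": "M", "age": "young"}] * 3
    + [{"gender": "M", "age": "old"}] * 1
)
targets = {"gender": {"F": 0.5, "M": 0.5}, "age": {"young": 0.5, "old": 0.5}}
weights = rake(rows, targets)
print([round(w, 3) for w in weights])
```

Each pass rescales the weights so one factor's margins match its targets, then moves on to the next factor; the loop stops once no adjustment ratio differs from 1 by more than `tol`.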
create_gold_dataset
Create a refined Gold dataset from a Silver or Bronze source, optionally transforming columns.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string | Yes | — | Source dataset UUID (Silver or Bronze) |
| name | string | No | — | Name for Gold dataset. Defaults to "{source_name} (Gold)" |
| fields | object[] | No | — | Column transformations |
Each fields entry:
| Field | Type | Description |
|---|---|---|
| id | string | Original column name |
| name | string | New display name |
| deleted | boolean | true to exclude column |
Returns: { gold_dataset_id, name, stage: "gold", source_id, source_stage, row_count, column_count }
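The effect of a `fields` list can be previewed locally. This sketch applies renames and deletions to a list of column names; `apply_fields` is a hypothetical helper, and passing unlisted columns through unchanged is an assumption, since this reference does not state what happens to columns that have no entry.

```python
def apply_fields(columns, fields):
    """Return (original name, display name) pairs for the columns that survive.

    Columns with deleted=True are dropped. Columns without an entry keep
    their original name (assumption).
    """
    by_id = {field["id"]: field for field in fields}
    result = []
    for column in columns:
        spec = by_id.get(column, {})
        if spec.get("deleted"):
            continue
        result.append((column, spec.get("name", column)))
    return result


columns = ["respondent_id", "q1", "q2", "weight"]
fields = [
    {"id": "q1", "name": "Preferred method"},
    {"id": "q2", "deleted": True},
]
preview = apply_fields(columns, fields)
print(preview)
# [('respondent_id', 'respondent_id'), ('q1', 'Preferred method'), ('weight', 'weight')]
```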
export_dataset
Export a dataset in the specified format.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID |
| format | string | No | "csv" | Export format: "csv", "xlsx", "spss", or "parquet" |
Returns: { path, ... } with format-specific metadata.
Export Formats
| Format | Extension | Description |
|---|---|---|
| csv | .csv | Universal format for most tools |
| xlsx | .xlsx | Excel with summary and schema sheets |
| spss | .sav | SPSS with variable and value labels |
| parquet | .parquet | Columnar format with full metadata (Snappy compression) |
get_dataset_quality
Get sample representativeness metrics comparing dataset demographics against a sampling strategy's targets.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID |
| strategy_id | string | No | — | Sampling strategy UUID. Auto-detected from campaign if omitted |
Returns: Quality metrics including overall score, composite error, RMSE, and per-factor deviation analysis.
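The exact formulas behind these metrics are not given in this reference. As a hedged illustration of what a per-factor deviation analysis and an RMSE over those deviations could look like, the sketch below compares observed sample proportions with target proportions. The function names and data are hypothetical, and the tool's own scoring may differ.

```python
import math


def per_factor_deviation(observed, targets):
    """Observed minus target proportion for every category of every factor."""
    return {
        factor: {
            category: observed.get(factor, {}).get(category, 0.0) - share
            for category, share in target.items()
        }
        for factor, target in targets.items()
    }


def rmse(deviations):
    """Root mean squared deviation across all factor categories."""
    errors = [d for by_cat in deviations.values() for d in by_cat.values()]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


observed = {"gender": {"F": 0.6, "M": 0.4}, "age": {"young": 0.7, "old": 0.3}}
targets = {"gender": {"F": 0.5, "M": 0.5}, "age": {"young": 0.5, "old": 0.5}}
devs = per_factor_deviation(observed, targets)
score = rmse(devs)
print(devs, round(score, 4))
```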
compare_dataset_quality
Compare representativeness metrics between Bronze (unweighted) and Silver (weighted) datasets to evaluate raking effectiveness.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| bronze_dataset_id | string | Yes | — | Bronze dataset UUID |
| silver_dataset_id | string | Yes | — | Silver dataset UUID |
| strategy_id | string | No | — | Sampling strategy UUID. Auto-detected if omitted |
Returns: Comparison metrics showing improvement from post-stratification weighting.
get_dataset_response_quality
Analyze response-level quality metrics for a dataset. The tool evaluates answer diversity, response patterns, internal consistency, and acquiescence bias from the actual survey responses. It complements get_dataset_quality, which measures sample representativeness.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID (Bronze or Silver) |
Returns:
{
  "per_question": [
    {
      "column_name": "q1",
      "item_id": "q1",
      "item_kind": "Question",
      "item_title": "Preferred method",
      "control_type": "radio",
      "normalized_entropy": 0.87,
      "effective_num_answers": 3.2,
      "option_usage_ratio": 1.0,
      "item_variance": null,
      "item_nonresponse_rate": 0.02,
      "n_responses": 98,
      "n_missing": 2,
      "n_total": 100,
      "response_distribution": {"a": 0.35, "b": 0.40, "c": 0.25}
    }
  ],
  "straightlining": [
    {
      "item_id": "g1",
      "item_title": "Satisfaction Battery",
      "item_kind": "QuestionGroup",
      "straightlining_score": 0.08,
      "n_respondents": 95,
      "n_straightliners": 8
    }
  ],
  "reliability": [
    {
      "item_id": "g1",
      "item_title": "Satisfaction Battery",
      "item_kind": "QuestionGroup",
      "cronbachs_alpha": 0.82,
      "n_items": 4,
      "n_respondents": 90
    }
  ],
  "mean_normalized_entropy": 0.75,
  "acquiescence_bias_index": 0.52,
  "overall_nonresponse_rate": 0.04,
  "n_questions_analyzed": 12,
  "n_categorical": 8,
  "n_numeric": 2,
  "n_text": 2,
  "sample_size": 100,
  "warnings": []
}
Metric Reference
Per-question (categorical):
| Metric | Range | Description |
|---|---|---|
| normalized_entropy | 0–1 | Answer diversity (1.0 = uniform, 0.0 = single value) |
| effective_num_answers | 1–k | How many options carry meaningful weight |
| option_usage_ratio | 0–1 | Fraction of options that received responses |
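One common way to compute metrics with these ranges is Shannon entropy divided by log k, its exponential (the "effective number" of answers), and the share of options with a non-zero response. The sketch below uses those textbook definitions as an assumption; this reference does not confirm that the tool uses exactly these formulas.

```python
import math


def categorical_metrics(distribution, n_options=None):
    """Diversity metrics for one categorical question.

    distribution: {option: proportion}, proportions summing to 1
    n_options:    number of options offered (defaults to len(distribution))
    """
    probs = [p for p in distribution.values() if p > 0]
    k = n_options or len(distribution)
    entropy = -sum(p * math.log(p) for p in probs)  # Shannon entropy in nats
    return {
        "normalized_entropy": entropy / math.log(k) if k > 1 else 0.0,
        "effective_num_answers": math.exp(entropy),
        "option_usage_ratio": len(probs) / k,
    }


uniform = categorical_metrics({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25})
single = categorical_metrics({"a": 1.0, "b": 0.0, "c": 0.0})
print(uniform)
print(single)
```

Under these definitions a uniform distribution gives a normalized entropy of 1 and an effective answer count equal to the number of options, while a single-value distribution gives 0 and 1 respectively.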
Per-question (numeric):
| Metric | Range | Description |
|---|---|---|
| item_variance | 0+ | Response spread (weighted if Silver dataset) |
Group metrics (QuestionGroup, MatrixQuestion):
| Metric | Range | Description |
|---|---|---|
| straightlining_score | 0–1 | Identical-answer proportion across sub-items |
| cronbachs_alpha | 0–1 | Internal consistency (3+ sub-items required) |
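The two group metrics can be sketched with their standard definitions: the share of respondents who give the same answer to every sub-item, and Cronbach's alpha computed from item and total-score variances. This is an illustration under those assumptions, with hypothetical function names and data. Note that the textbook alpha formula can fall below 0 for negatively correlated items; how the tool maps such cases onto the 0–1 range is not documented here.

```python
from statistics import variance


def straightlining_score(responses):
    """Fraction of respondents whose sub-item answers are all identical."""
    straightliners = sum(1 for answers in responses if len(set(answers)) == 1)
    return straightliners / len(responses)


def cronbachs_alpha(responses):
    """Cronbach's alpha for complete numeric responses (rows = respondents)."""
    k = len(responses[0])
    if k < 3:
        raise ValueError("Cronbach's alpha needs 3 or more sub-items")
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Five respondents answering a three-item battery on a 1-5 scale.
battery = [[4, 4, 4], [2, 3, 2], [5, 4, 5], [1, 2, 1], [3, 3, 3]]
sl_score = straightlining_score(battery)
alpha = cronbachs_alpha(battery)
print(sl_score, round(alpha, 3))
```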
Dataset aggregates:
| Metric | Description | Concern Threshold |
|---|---|---|
| mean_normalized_entropy | Average diversity across categorical questions | < 0.3 |
| acquiescence_bias_index | Agree-side proportion on Likert items (0.5 = no bias) | > 0.6 |
| overall_nonresponse_rate | Mean skip rate across questions | > 0.15 |
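The concern thresholds above translate directly into a client-side check on the tool's result. A minimal sketch follows; `flag_concerns` is a hypothetical helper, not part of the tool.

```python
# Concern thresholds from the dataset aggregates table.
CONCERN_CHECKS = {
    "mean_normalized_entropy": lambda value: value < 0.3,
    "acquiescence_bias_index": lambda value: value > 0.6,
    "overall_nonresponse_rate": lambda value: value > 0.15,
}


def flag_concerns(result):
    """Return the names of aggregate metrics that cross their threshold."""
    return [
        name
        for name, is_concern in CONCERN_CHECKS.items()
        if name in result and is_concern(result[name])
    ]


healthy = {
    "mean_normalized_entropy": 0.75,
    "acquiescence_bias_index": 0.52,
    "overall_nonresponse_rate": 0.04,
}
weak = {
    "mean_normalized_entropy": 0.21,
    "acquiescence_bias_index": 0.66,
    "overall_nonresponse_rate": 0.10,
}
print(flag_concerns(healthy), flag_concerns(weak))
```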
Weight Awareness
For Silver datasets, entropy, variance, and acquiescence metrics use the weight column automatically. Straightlining uses unweighted respondent counts since weighting doesn't affect response patterns.
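To illustrate what using the weight column means for a distribution-based metric, this sketch computes a response distribution with and without weights. The helper name and data are hypothetical; skipping missing answers (`None`) is an assumption about how nonresponse is handled.

```python
def weighted_distribution(answers, weights=None):
    """Share of each answer, optionally weighted per respondent.

    Missing answers (None) are skipped. Without weights, every respondent
    counts equally, as for a Bronze dataset.
    """
    if weights is None:
        weights = [1.0] * len(answers)
    totals = {}
    for answer, weight in zip(answers, weights):
        if answer is None:
            continue
        totals[answer] = totals.get(answer, 0.0) + weight
    grand_total = sum(totals.values())
    return {answer: total / grand_total for answer, total in totals.items()}


answers = ["a", "a", "b", "b"]
unweighted = weighted_distribution(answers)
weighted = weighted_distribution(answers, [1.5, 1.5, 0.5, 0.5])
print(unweighted, weighted)
```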
code_text_responses
Automatically code open-ended text responses using semantic clustering. Creates a Gold dataset with binary indicator columns for each discovered theme.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Bronze or Silver dataset UUID |
| columns | string[] | No | — | Text columns to code. Defaults to all Textarea columns |
| labeling_method | string | No | "tfidf" | "tfidf" for keyword-based or "llm" for AI-generated labels |
| min_cluster_size | integer | No | 5 | Minimum responses to form a cluster |
| similarity_threshold | float | No | 0.65 | Cosine similarity threshold for multi-label assignment |
| multi_label | boolean | No | true | true for binary indicators, false for exclusive category |
Returns:
{
  "gold_dataset_id": "...",
  "name": "Survey Data (Coded)",
  "stage": "gold",
  "source_id": "...",
  "row_count": 500,
  "column_count": 35,
  "coding_results": [
    {
      "source_column": "q_feedback",
      "n_themes": 4,
      "n_responses": 480,
      "n_missing": 20,
      "avg_labels_per_response": 1.3,
      "theme_frequencies": {"Product quality": 120, "Customer service": 95},
      "indicator_columns": ["q_feedback_theme_1", "q_feedback_theme_2"]
    }
  ]
}
Uses bge-m3 embeddings via Ollama and HDBSCAN density-based clustering.
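The role of `similarity_threshold` and `multi_label` can be illustrated with a small sketch that compares a response embedding to one centroid per theme. This is an assumption about the mechanics, not the tool's implementation: the use of cluster centroids, the toy 2-dimensional vectors, and the behaviour of exclusive mode (pick the closest theme, ignoring the threshold) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def assign_themes(embedding, centroids, similarity_threshold=0.65, multi_label=True):
    """Binary indicators per theme, or the single closest theme."""
    sims = {theme: cosine(embedding, c) for theme, c in centroids.items()}
    if multi_label:
        # One 0/1 indicator per theme; a response can match several themes.
        return {theme: int(sim >= similarity_threshold) for theme, sim in sims.items()}
    # Exclusive mode (assumption): the most similar theme wins.
    return max(sims, key=sims.get)


centroids = {"theme_1": [1.0, 0.0], "theme_2": [0.6, 0.8], "theme_3": [0.0, 1.0]}
response = [1.0, 0.2]
indicators = assign_themes(response, centroids)
exclusive = assign_themes(response, centroids, multi_label=False)
print(indicators, exclusive)
```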
assign_strategy_to_dataset
Link a sampling strategy to a dataset for quality analysis. Enables quality metrics that compare actual demographics to strategy targets.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID |
| strategy_id | string | Yes | — | Sampling strategy UUID |
Returns: { dataset_id, strategy_id, strategy_name, dataset_name, dataset_stage }
delete_dataset
Delete a dataset and optionally its data file.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | string | Yes | — | Dataset UUID |
| delete_file | boolean | No | false | Also delete the Parquet data file |
Returns: { deleted: true, dataset_id, dataset_info: { name, stage, row_count, data_path }, file_deleted }
Error Handling
Stage Validation
The medallion pipeline enforces stage progression:
| Operation | Required Source Stage | Error |
|---|---|---|
| apply_raking | Bronze | "Can only weight bronze datasets (got stage={stage})" |
| create_gold_dataset | Silver or Bronze | "Can only create gold from silver or bronze (got stage={stage})" |
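The stage rules above can be mirrored in a client-side pre-check, for example to avoid a round trip. A minimal sketch, assuming the server wraps these messages in the same `{"error": ...}` shape used for other errors; `check_stage` is a hypothetical helper.

```python
# Allowed source stages and error templates, taken from the table above.
STAGE_RULES = {
    "apply_raking": (
        {"bronze"},
        "Can only weight bronze datasets (got stage={stage})",
    ),
    "create_gold_dataset": (
        {"silver", "bronze"},
        "Can only create gold from silver or bronze (got stage={stage})",
    ),
}


def check_stage(operation, stage):
    """Return None if the stage is allowed, else an error payload."""
    allowed, template = STAGE_RULES[operation]
    if stage in allowed:
        return None
    return {"error": template.format(stage=stage)}


print(check_stage("apply_raking", "bronze"))
print(check_stage("apply_raking", "silver"))
```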
Common Errors
| Scenario | Response |
|---|---|
| Dataset not found | {"error": "Dataset {id} not found"} |
| Strategy not found | {"error": "Sampling strategy {id} not found"} |
| No data file | {"error": "Dataset {id} has no data file"} |
| Invalid export format | {"error": "Invalid format. Allowed: csv, xlsx, spss, parquet"} |
| No codeable columns | {"error": "No columns could be coded (too few responses or no Textarea columns found)"} |
| Missing source params | {"error": "Either campaign_id or both project_id and campaign_ids are required"} |
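Because every error in the table is returned as an object with an `error` key, a client can turn these payloads into exceptions in one place. The sketch below assumes the tool result has already been decoded into a Python dict; `DatasetToolError` and `unwrap` are hypothetical names.

```python
class DatasetToolError(Exception):
    """Raised when a dataset tool returns an {"error": ...} payload."""


def unwrap(result):
    """Return the result unchanged, or raise if it carries an error message."""
    if isinstance(result, dict) and "error" in result:
        raise DatasetToolError(result["error"])
    return result


ok = unwrap({"dataset_id": "d-1", "stage": "bronze"})
try:
    unwrap({"error": "Dataset d-404 not found"})
except DatasetToolError as exc:
    message = str(exc)
print(ok, message)
```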