Semantic Clustering

Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.

Overview

When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:

Embeds text responses into vector space using a multilingual embedding model
Clusters similar responses using density-based algorithms
Labels each cluster with a descriptive theme name
Assigns themes back to individual responses as indicator columns

This is the first step of deriving a Bundle's Silver stage. Coding runs before weighting, so the discovered categories are available as weighting variables in the same derive. The result is a Silver dataset in which each open-ended question produces one or more binary indicator columns (e.g., q_feedback_price, q_feedback_support) alongside the weight column, so themes can be analyzed quantitatively in the same dataset that holds the weighted cases.

How It Works

Pipeline Architecture

Textarea responses (strings)
         │
         ▼
┌─────────────────┐
│   Embedding     │  Local bge-m3 (1024-dim vectors)
└────────┬────────┘
         ▼
┌─────────────────┐
│   Clustering    │  UMAP dimensionality reduction → HDBSCAN
└────────┬────────┘
         ▼
┌─────────────────┐
│   Labeling      │  c-TF-IDF keywords or LLM summaries
└────────┬────────┘
         ▼
┌─────────────────┐
│   Assignment    │  Cosine similarity → binary indicators
└────────┬────────┘
         ▼
Silver dataset with theme columns (weighting runs next, same derive)

Step 1: Embedding

Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model on the platform's local CPU-only inference service. Semantically similar responses produce vectors that are close together in the embedding space.

Step 2: Clustering

The high-dimensional embeddings are processed in two stages:

UMAP reduces dimensionality while preserving local structure
HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters

Responses that don't fit any cluster are marked as noise (unclustered).
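To make the noise handling concrete, here is a minimal sketch of how cluster labels might be grouped after this step. It assumes the clustering output is a list of integer labels in which -1 marks noise, which is the convention used by common HDBSCAN implementations; the platform's internal representation is not documented here, and `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

NOISE = -1  # conventional HDBSCAN label for points outside every cluster


def group_by_cluster(responses, labels):
    """Group response texts by cluster label, keeping noise separate."""
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    clusters = defaultdict(list)
    noise = []
    for text, label in zip(responses, labels):
        if label == NOISE:
            noise.append(text)
        else:
            clusters[label].append(text)
    return dict(clusters), noise


responses = [
    "Too expensive for what it does",
    "The price went up again",
    "Support never answered my ticket",
    "Waited days for a support reply",
    "I like turtles",
]
labels = [0, 0, 1, 1, -1]  # hypothetical output of the clustering step

clusters, noise = group_by_cluster(responses, labels)
noise_share = len(noise) / len(responses)  # fraction of unclustered responses
```

A high noise share is a useful early signal that responses are too short or too varied to form dense groups.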

Step 3: Labeling

Each cluster receives a descriptive label using one of two methods:

Method	Speed	Quality	Description
c-TF-IDF	Fast	Good	Extracts distinctive keywords from each cluster using class-based TF-IDF
LLM	~1s/cluster	High	Generates natural-language theme labels using an LLM
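The c-TF-IDF idea can be sketched in a few lines. This is an illustration under assumptions, not the platform's implementation: it uses one common formulation (term frequency within the cluster, multiplied by log(1 + A / f), where A is the average word count per cluster and f is the term's frequency across all clusters), a naive lowercase tokenizer, and no stop-word removal. The function name `ctfidf_keywords` is hypothetical.

```python
import math
import re
from collections import Counter


def ctfidf_keywords(clusters, top_n=2):
    """Return the top_n highest-scoring terms per cluster (class-based TF-IDF)."""
    # Treat each cluster as a single document and count its terms.
    counts = {
        cid: Counter(re.findall(r"[a-z]+", " ".join(texts).lower()))
        for cid, texts in clusters.items()
    }
    # Term frequency across all clusters, and average words per cluster.
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)

    keywords = {}
    for cid, c in counts.items():
        n_words = sum(c.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in c.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cid] = ranked[:top_n]
    return keywords


clusters = {
    0: ["the price is too high", "the price keeps rising"],
    1: ["the support team is slow", "the support chat is slow"],
}
keywords = ctfidf_keywords(clusters)
```

Terms that are frequent in one cluster but rare elsewhere score highest, which is why "price" leads the first cluster here. Without stop-word filtering, common words such as "the" can still rank near the top.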

Step 4: Assignment

Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:

Multi-label mode (default): Each response can belong to multiple themes. Produces one binary indicator column (0/1) per theme, set to 1 where the similarity exceeds the threshold.
Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
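The two assignment modes can be illustrated with a small sketch. It assumes each theme is represented by a centroid vector and uses toy 3-dimensional vectors in place of the 1024-dimensional embeddings; `assign_themes` and the example values are hypothetical, and single-best mode is shown as a plain argmax since the text does not say whether a threshold applies there.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """Return 0/1 indicators per theme, or the single best theme name."""
    sims = {theme: cosine(embedding, c) for theme, c in centroids.items()}
    if multi_label:
        # One indicator per theme: 1 where similarity exceeds the threshold.
        return {theme: int(sim > threshold) for theme, sim in sims.items()}
    # Single-best mode: the theme with the highest similarity.
    return max(sims, key=sims.get)


centroids = {"price": [1.0, 0.0, 0.0], "support": [0.0, 1.0, 0.0]}
response = [0.8, 0.6, 0.0]  # similarity is about 0.8 to price, 0.6 to support

strict = assign_themes(response, centroids, threshold=0.65)
loose = assign_themes(response, centroids, threshold=0.50)
best = assign_themes(response, centroids, multi_label=False)
```

With the default threshold the response is coded for price only; lowering the threshold to 0.50 also codes it for support, which is the trade-off the similarity threshold controls.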

Using Semantic Clustering

Prerequisites

A Bundle whose Bronze dataset is ready and contains at least one Textarea column
Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
Embedding service available (bge-m3 model on the platform's local CPU-only inference service)

Via Balansor UI

1. On the pipeline board, click "Derive" on the Bundle's Silver node
2. Balansor auto-detects Textarea columns from the Bronze schema; if there are none, coding is skipped automatically and weighting runs directly
3. Review the parameters (or accept the defaults):
   - Labeling method: c-TF-IDF (fast) or LLM (higher quality)
   - Minimum cluster size: Smallest group to form a theme (default: 5)
   - Similarity threshold: How closely a response must match a theme (default: 0.65)
   - Multi-label: Whether responses can belong to multiple themes
4. The Silver node shows "working" while coding and weighting run in the background
5. When it reaches "ready", review the discovered categories on the Bundle details page (click the Bundle's title on the board): accept or reject each category per column, save, and re-derive. The next derive keeps only the accepted indicator columns. Weighting targets (Bundle strategy, manual entry, auto-detect, plus any accepted indicator columns) are configured in the derive dialog; see Data Analysis for the full workflow

Via MCP / API

The code_open_ends tool (alias: derive_silver) starts the background Bronze→Silver derive over a Bundle's ready Bronze. It returns immediately with the Silver dataset in a processing state; poll get_dataset(dataset_id) until its processing_status reaches ready or error:

import time

# Start the background Bronze→Silver derive; the call returns immediately
result = code_open_ends(bundle_id="<bundle-id>")
silver_id = result["silver_dataset_id"]

# Poll until the dataset reaches a terminal state
while True:
    dataset = get_dataset(dataset_id=silver_id)
    if dataset["processing_status"] in ("ready", "error"):
        break
    time.sleep(5)  # wait between polls

code_open_ends always sources the Bundle's own Bronze — there is no dataset_id/columns parameter to point it at an arbitrary dataset. Discovered themes are automatically coded across every Textarea column in the questionnaire's schema.
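An unbounded polling loop can hang if a derive never finishes, so a client may want a timeout. The sketch below is one way to do that, under assumptions: `wait_for_dataset` is a hypothetical helper, the `get_dataset` tool is passed in as a callable, and the stub and the "processing" status value are stand-ins so the example runs without the platform.

```python
import time


def wait_for_dataset(get_dataset, dataset_id, interval=5.0, timeout=600.0,
                     sleep=time.sleep):
    """Poll get_dataset until processing_status is terminal or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        dataset = get_dataset(dataset_id=dataset_id)
        if dataset["processing_status"] in ("ready", "error"):
            return dataset
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset {dataset_id} not terminal after {timeout}s")
        sleep(interval)


# Stand-in for the real get_dataset tool so the sketch runs on its own.
_statuses = iter(["processing", "processing", "ready"])


def fake_get_dataset(dataset_id):
    return {"id": dataset_id, "processing_status": next(_statuses)}


final = wait_for_dataset(fake_get_dataset, "silver-123", interval=0,
                         sleep=lambda seconds: None)
```

Callers should still check whether the returned status is "ready" or "error" before reading the Silver dataset.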

The review checkpoint is human-only. Accepting or rejecting discovered categories happens on the Bundle details page — there is deliberately no MCP/API write for it. Before re-deriving, check the review state with the read-only get_bundle_coding tool (also GET /api/v1/bundles/{id}/coding): it returns the discovered categories per source column, the human-accepted set, the Silver node status, and a pending_review flag. A second code_open_ends call re-derives against whatever accepted set a human last saved — a column with no accepted entry keeps everything it discovers — so an agent driving the pipeline should surface pending_review: true to its user rather than silently re-deriving:

state = get_bundle_coding(bundle_id="<bundle-id>")
if state["pending_review"]:
    # ask the researcher to review categories on the Bundle details page first
    ...

Output Format

Indicator Columns

For each theme discovered in a text column, a binary indicator column is added to the Bundle's Silver dataset:

Column Name	Type	Values	Description
`{source}_price`	indicator	0 / 1	Response mentions pricing themes
`{source}_support`	indicator	0 / 1	Response mentions customer support
`{source}_ease_of_use`	indicator	0 / 1	Response mentions usability

The original text column is preserved alongside the new indicators.

Column Schema Metadata

Each indicator column includes metadata in the dataset's column schema:

control_type: "indicator" — distinguishes coded columns from survey controls
labels: {0: "No", 1: "Yes"} — value labels for export
coding_metadata — source column, method, theme label, threshold, total themes

Integration with the Bundle Pipeline

Semantic clustering is the first half of the Bronze→Silver derive. Coding runs first, then weighting, and both write to the same Silver dataset:

🥉 Bronze          🥈 Silver                🥇 Gold
(raw extraction) → (coded, then weighted) → (analysis-ready)

Input: the Bundle's own Bronze dataset — no other source is selectable
Output: the Bundle's Silver dataset, carrying the original data, the new indicator columns, and the _weight column
Downstream: Gold refinement (rename/reorder/computed fields) and export both work from this Silver, so coded categories are available for cross-tabulation everywhere downstream
Export: Silver and Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels, once ready
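As an illustration of the downstream cross-tabulation, the sketch below computes the weighted share of respondents coded for a theme within each group. It assumes Silver rows are available as plain dictionaries; the `age_group` column, the sample values, and the `weighted_theme_share` helper are hypothetical, while `q_feedback_price` and `_weight` follow the column names used in this page.

```python
from collections import defaultdict


def weighted_theme_share(rows, indicator, group_by, weight="_weight"):
    """Weighted share of rows with indicator == 1, per group."""
    totals = defaultdict(float)
    hits = defaultdict(float)
    for row in rows:
        w = row[weight]
        totals[row[group_by]] += w
        hits[row[group_by]] += w * row[indicator]
    return {g: hits[g] / totals[g] for g in totals if totals[g] > 0}


rows = [
    {"age_group": "18-34", "q_feedback_price": 1, "_weight": 1.5},
    {"age_group": "18-34", "q_feedback_price": 0, "_weight": 0.5},
    {"age_group": "35+", "q_feedback_price": 1, "_weight": 1.0},
    {"age_group": "35+", "q_feedback_price": 0, "_weight": 3.0},
]
shares = weighted_theme_share(rows, "q_feedback_price", "age_group")
```

In this toy data, the price theme accounts for 75% of the weighted 18-34 group and 25% of the 35+ group, even though each group has one coded respondent out of two.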

Designing Surveys for Clustering

QML Textarea Control

Open-ended questions use the Textarea control in QML:

- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000

Best Practices

Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
Use multiple open-ended questions: Each targets a different aspect (likes, dislikes, use cases)
Set a reasonable maxLength: a limit of 500-1000 characters leaves room for substantive responses
Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters

Configuration Reference

Parameter	Default	Description
`labeling_method`	`"tfidf"`	Cluster labeling strategy: `"tfidf"` or `"llm"`
`min_cluster_size`	`5`	Minimum responses to form a cluster
`similarity_threshold`	`0.65`	Cosine similarity threshold for theme assignment
`multi_label`	`true`	Allow responses to match multiple themes
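For client code that tracks these settings, the defaults in the table can be captured in a small validated container. This is a sketch only: `CodingParams` is a hypothetical class, not part of the platform, and the validation bounds (at least 2 responses per cluster, a threshold within the cosine range of -1 to 1) are assumptions rather than documented limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingParams:
    """Hypothetical container mirroring the documented defaults."""
    labeling_method: str = "tfidf"
    min_cluster_size: int = 5
    similarity_threshold: float = 0.65
    multi_label: bool = True

    def __post_init__(self):
        if self.labeling_method not in ("tfidf", "llm"):
            raise ValueError("labeling_method must be 'tfidf' or 'llm'")
        if self.min_cluster_size < 2:  # assumed lower bound
            raise ValueError("min_cluster_size must be at least 2")
        if not -1.0 <= self.similarity_threshold <= 1.0:  # cosine range
            raise ValueError("similarity_threshold must be within [-1, 1]")


defaults = CodingParams()
# Fewer, larger clusters with stricter assignment.
tuned = CodingParams(min_cluster_size=8, similarity_threshold=0.75)
```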

Tuning Tips

Low cluster count? Decrease min_cluster_size (try 3-4)
Too many small clusters? Increase min_cluster_size (try 8-10)
Responses assigned to too many themes? Increase similarity_threshold (try 0.75)
Most responses unclustered? Decrease similarity_threshold (try 0.50) or check if responses are too short/diverse

Troubleshooting

"No Textarea columns found": The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.

Too few clusters discovered: HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.

Embedding service unavailable: The platform's local CPU-only inference service must be running with the bge-m3 model. Check service health on the monitoring dashboard.

LLM labeling fails: If using labeling_method="llm", the inference model must be available. Fall back to "tfidf", which runs locally without external dependencies.