Semantic Clustering¶
Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.
Overview¶
When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:
- Embeds text responses into vector space using a multilingual embedding model
- Clusters similar responses using density-based algorithms
- Labels each cluster with a descriptive theme name
- Assigns themes back to individual responses as indicator columns
The result is a Gold dataset in which each open-ended question produces one or more binary indicator columns (e.g., q_feedback_price, q_feedback_support).
How It Works¶
Pipeline Architecture¶
Textarea responses (strings)
│
▼
┌─────────────────┐
│ Embedding │ Local bge-m3 (1024-dim vectors)
└────────┬────────┘
▼
┌─────────────────┐
│ Clustering │ UMAP dimensionality reduction → HDBSCAN
└────────┬────────┘
▼
┌─────────────────┐
│ Labeling │ c-TF-IDF keywords or LLM summaries
└────────┬────────┘
▼
┌─────────────────┐
│ Assignment │ Cosine similarity → binary indicators
└────────┬────────┘
▼
Gold dataset with theme columns
Step 1: Embedding¶
Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model on the platform's local CPU-only inference service. Semantically similar responses produce vectors that are close together in the embedding space.
Step 2: Clustering¶
The high-dimensional embeddings are processed in two stages:
- UMAP reduces dimensionality while preserving local structure
- HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters
Responses that don't fit any cluster are marked as noise (unclustered).
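The idea of density-based clustering with a noise label can be sketched in a few lines. This is a minimal illustration only: it uses plain DBSCAN as a simpler stand-in for HDBSCAN, and it runs on hand-written 2-D points standing in for UMAP output. The `eps` radius is a DBSCAN parameter that HDBSCAN does not need, and the function name and sample values are assumptions made for the example, not part of the platform.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby = neighbors(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # core point: keep growing the cluster
        cluster_id += 1
    return labels


# Two dense groups and one isolated response (hypothetical reduced embeddings).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The number of clusters is never passed in; it falls out of where the points are dense, and the isolated point is left unclustered.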
Step 3: Labeling¶
Each cluster receives a descriptive label using one of two methods:
| Method | Speed | Quality | Description |
|---|---|---|---|
| c-TF-IDF | Fast | Good | Extracts distinctive keywords from each cluster using class-based TF-IDF |
| LLM | ~1s/cluster | High | Generates natural-language theme labels using an LLM |
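As a rough illustration of the c-TF-IDF method, the sketch below scores each term by its frequency inside a cluster, scaled by how rare the term is across all clusters. It follows one common formulation (term frequency times `log(1 + average words per cluster / total term frequency)`); the platform's exact weighting, tokenization, and stop-word handling are not documented here, so the function name, the regex tokenizer, and the sample responses are all assumptions.

```python
import math
import re
from collections import Counter


def ctfidf_keywords(clusters, top_n=3):
    """Return the top_n most distinctive terms for each cluster.

    clusters maps a cluster id to the list of responses in that cluster.
    """
    # Treat each cluster as one concatenated "class document".
    term_counts = {
        cid: Counter(re.findall(r"[a-z']+", " ".join(texts).lower()))
        for cid, texts in clusters.items()
    }
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(term_counts)

    keywords = {}
    for cid, counts in term_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        keywords[cid] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return keywords


clusters = {
    0: ["too expensive for what it offers", "the price is too high", "expensive pricing"],
    1: ["support was slow to respond", "slow support replies", "support never answered"],
}
keywords = ctfidf_keywords(clusters)
```

No stop-word list is applied in this sketch, so filler words can rank highly in small clusters; a production labeler would normally filter them.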
Step 4: Assignment¶
Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:
- Multi-label mode (default): Each response can belong to multiple themes. Produces binary indicator columns (0/1) for each theme where similarity exceeds the threshold.
- Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
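The two assignment modes can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names, the 3-dimensional toy vectors, and the theme names are assumptions. The source does not say whether single-best mode also applies the threshold, so this sketch simply returns the closest theme.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """centroids maps a theme name to its centroid vector."""
    sims = {theme: cosine(embedding, c) for theme, c in centroids.items()}
    if multi_label:
        # One 0/1 indicator per theme; 1 where similarity exceeds the threshold.
        return {theme: int(s > threshold) for theme, s in sims.items()}
    # Single-best: the one closest theme (threshold not applied in this sketch).
    return max(sims, key=sims.get)


centroids = {"price": [1.0, 0.0, 0.0], "support": [0.0, 1.0, 0.0]}

mixed = assign_themes([1.0, 1.0, 0.0], centroids)      # close to both themes
focused = assign_themes([1.0, 0.2, 0.0], centroids)    # close to "price" only
single = assign_themes([1.0, 0.2, 0.0], centroids, multi_label=False)
```

In multi-label mode a response that sits between two centroids sets both indicators, which is why raising the threshold reduces the number of themes per response.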
Using Semantic Clustering¶
Prerequisites¶
- A Bronze or Silver dataset containing at least one `Textarea` column
- Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
- Embedding service available (bge-m3 model on the platform's local CPU-only inference service)
Via Balansor UI¶
- Navigate to Refine from the main menu
- Select a bundle, then choose a source dataset (Bronze or Silver)
- Switch to the Semantic Clustering tab
- Balansor auto-detects `Textarea` columns available for coding
- Configure parameters:
  - Labeling method: c-TF-IDF (fast) or LLM (higher quality)
  - Minimum cluster size: Smallest group to form a theme (default: 5)
  - Similarity threshold: How closely a response must match a theme (default: 0.65)
  - Multi-label: Whether responses can belong to multiple themes
- Click "Run Clustering"
- A new Gold dataset is created with indicator columns
Refine page combines both operations
The Refine page has two tabs: Field Transformation (rename, delete, reorder, computed fields) and Semantic Clustering. Both produce Gold datasets from the same source. See Data Analysis for the full Refine workflow.
Via MCP / API¶
Use the code_text_responses tool:
code_text_responses(
    dataset_id="<bronze-or-silver-id>",
    columns=None,             # Auto-detect all Textarea columns
    labeling_method="tfidf",  # or "llm"
    min_cluster_size=5,
    similarity_threshold=0.65,
    multi_label=True,
)
When columns is None, all Textarea columns are automatically discovered from the dataset's column schema.
Output Format¶
Indicator Columns¶
For each theme discovered in a text column, a binary indicator column is added:
| Column Name | Type | Values | Description |
|---|---|---|---|
| {source}_price | indicator | 0 / 1 | Response mentions pricing themes |
| {source}_support | indicator | 0 / 1 | Response mentions customer support |
| {source}_ease_of_use | indicator | 0 / 1 | Response mentions usability |
The original text column is preserved alongside the new indicators.
Column Schema Metadata¶
Each indicator column includes metadata in the dataset's column schema:
- `control_type: "indicator"`: distinguishes coded columns from survey controls
- `labels: {0: "No", 1: "Yes"}`: value labels for export
- `coding_metadata`: source column, method, theme label, threshold, total themes
Integration with Medallion Architecture¶
Semantic clustering fits into the Gold refinement stage:
- Input: Bronze (raw) or Silver (weighted) dataset
- Output: Gold dataset with original data + indicator columns
- Weights preserved: If coding a Silver dataset, the `_weight` column carries through
- Export: Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels
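Because the weight column travels with the indicators, a weighted theme prevalence is a short calculation. The sketch below assumes the Gold dataset has been loaded as a list of row dictionaries; the function name and sample rows are hypothetical, while `_weight` and `q_feedback_price` follow the column names used on this page.

```python
def weighted_share(rows, indicator, weight="_weight"):
    """Weighted proportion of rows whose indicator column equals 1."""
    total = sum(row[weight] for row in rows)
    hits = sum(row[weight] for row in rows if row[indicator] == 1)
    return hits / total if total else 0.0


# Hypothetical rows from a Gold dataset coded from a Silver source.
rows = [
    {"q_feedback_price": 1, "_weight": 2.0},
    {"q_feedback_price": 0, "_weight": 1.0},
    {"q_feedback_price": 1, "_weight": 1.0},
]
share = weighted_share(rows, "q_feedback_price")
```

For a Bronze source without weights, the same calculation with every weight set to 1 reduces to a simple proportion.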
Designing Surveys for Clustering¶
QML Textarea Control¶
Open-ended questions use the Textarea control in QML:
- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000
Best Practices¶
- Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
- Use multiple open-ended questions: Each targets a different aspect (likes, dislikes, use cases)
- Set reasonable maxLength: 500-1000 characters encourages substantive responses
- Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
- Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters
Configuration Reference¶
| Parameter | Default | Description |
|---|---|---|
| labeling_method | "tfidf" | Cluster labeling strategy: "tfidf" or "llm" |
| min_cluster_size | 5 | Minimum responses to form a cluster |
| similarity_threshold | 0.65 | Cosine similarity threshold for theme assignment |
| multi_label | true | Allow responses to match multiple themes |
Tuning Tips¶
- Low cluster count? Decrease `min_cluster_size` (try 3-4)
- Too many small clusters? Increase `min_cluster_size` (try 8-10)
- Responses assigned to too many themes? Increase `similarity_threshold` (try 0.75)
- Most responses unclustered? Decrease `similarity_threshold` (try 0.50) or check if responses are too short/diverse
Troubleshooting¶
"No Textarea columns found"
The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.
Too few clusters discovered
HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.
Embedding service unavailable
The platform's local CPU-only inference service must be running with the bge-m3 model. Check service health on the monitoring dashboard.
LLM labeling fails
If using labeling_method="llm", the inference model must be available. If it is not, fall back to "tfidf", which runs locally without external dependencies.