Semantic Clustering¶
Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.
Overview¶
When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:
- Embeds text responses into vector space using a multilingual embedding model
- Clusters similar responses using density-based algorithms
- Labels each cluster with a descriptive theme name
- Assigns themes back to individual responses as indicator columns
The result is a Gold dataset where each open-ended question produces one or more binary indicator columns (e.g., `q_feedback_price`, `q_feedback_support`), enabling quantitative analysis of qualitative data.
How It Works¶
Pipeline Architecture¶
```
Textarea responses (strings)
        │
        ▼
┌─────────────────┐
│    Embedding    │  Ollama bge-m3 (1024-dim vectors)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Clustering    │  UMAP dimensionality reduction → HDBSCAN
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Labeling     │  c-TF-IDF keywords or LLM summaries
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Assignment    │  Cosine similarity → binary indicators
└────────┬────────┘
         │
         ▼
Gold dataset with theme columns
```
Step 1: Embedding¶
Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model via Ollama. Semantically similar responses produce vectors that are close together in the embedding space.
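As a concrete sketch, a single response can be embedded with one HTTP call to Ollama's embeddings endpoint. The localhost URL, request payload, and response shape below follow Ollama's documented defaults, but are assumptions about this deployment:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed default Ollama endpoint

def build_payload(text: str, model: str = "bge-m3") -> dict:
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str) -> list[float]:
    """POST the text to Ollama and return its embedding vector (1024-dim for bge-m3)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Calling `embed` for every response yields the matrix of vectors that the clustering stage consumes.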
Step 2: Clustering¶
The high-dimensional embeddings are processed in two stages:
- UMAP reduces dimensionality while preserving local structure
- HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters
Responses that don't fit any cluster are marked as noise (unclustered).
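The two-stage reduction-then-cluster step can be sketched with the third-party `umap-learn` and `hdbscan` packages (the document names the UMAP and HDBSCAN algorithms; these specific packages and parameters are assumptions, not Balansor's internals):

```python
from typing import Sequence

def count_clusters(labels: Sequence[int]) -> int:
    """Number of clusters found; HDBSCAN marks noise (unclustered) points with -1."""
    return len(set(labels) - {-1})

def cluster_embeddings(embeddings, min_cluster_size: int = 5):
    """Reduce high-dimensional embeddings with UMAP, then cluster with HDBSCAN.

    Requires the third-party `umap-learn` and `hdbscan` packages; imported
    lazily so the rest of this module works without them installed.
    """
    import umap
    import hdbscan

    # UMAP preserves local structure while collapsing 1024 dims to a few
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)
    # HDBSCAN needs no predetermined cluster count; -1 labels are noise
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)
```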
Step 3: Labeling¶
Each cluster receives a descriptive label using one of two methods:
| Method | Speed | Quality | Description |
|---|---|---|---|
| c-TF-IDF | Fast | Good | Extracts distinctive keywords from each cluster using class-based TF-IDF |
| LLM | ~1s/cluster | High | Generates natural-language theme labels using an LLM |
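c-TF-IDF treats each cluster as one "class document" and scores terms that are frequent in that cluster but rare across the others. A minimal illustrative implementation, using the BERTopic-style weighting `tf * log(1 + A / f_t)` (the production implementation may differ):

```python
import math
from collections import Counter

def ctfidf_keywords(clusters: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
    """Toy class-based TF-IDF: top distinctive keywords per cluster."""
    # Term frequency per cluster, treating each cluster as one document
    class_tf = {name: Counter(" ".join(docs).lower().split())
                for name, docs in clusters.items()}
    # f_t: total frequency of each term across all clusters
    total = Counter()
    for tf in class_tf.values():
        total.update(tf)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(class_tf)
    scores = {
        name: {t: tf[t] * math.log(1 + avg_words / total[t]) for t in tf}
        for name, tf in class_tf.items()
    }
    return {name: [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:top_n]]
            for name, s in scores.items()}
```

On a toy input like `{"a": ["too expensive", "price is high", "lower the price"], "b": ["support was slow", "great support team"]}`, the top keyword for each cluster is the term repeated within it but absent elsewhere.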
Step 4: Assignment¶
Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:
- Multi-label mode (default): Each response can belong to multiple themes. Produces binary indicator columns (0/1) for each theme where similarity exceeds the threshold.
- Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
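Both modes can be illustrated with plain cosine similarity against per-theme centroids (function and parameter names here are illustrative, not Balansor's actual API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """Assign themes by similarity to cluster centroids.

    multi_label=True  -> dict of binary indicators (one per theme)
    multi_label=False -> single best theme name, or None below threshold
    """
    sims = {name: cosine(embedding, c) for name, c in centroids.items()}
    if multi_label:
        return {name: int(s >= threshold) for name, s in sims.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```

With centroids for "price" and "support" and a response embedding close to the price centroid, multi-label mode yields `{"price": 1, "support": 0}` while single-best mode yields `"price"`.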
Using Semantic Clustering¶
Prerequisites¶
- A Bronze or Silver dataset containing at least one `Textarea` column
- Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
- Ollama embedding service available (bge-m3 model)
Via Balansor UI¶
- Navigate to Datasets and select a Bronze or Silver dataset
- Open the Refine tab
- Balansor auto-detects `Textarea` columns available for coding
- Configure parameters:
  - Labeling method: c-TF-IDF (fast) or LLM (higher quality)
  - Minimum cluster size: Smallest group to form a theme (default: 5)
  - Similarity threshold: How closely a response must match a theme (default: 0.65)
  - Multi-label: Whether responses can belong to multiple themes
- Click "Code Text Responses"
- A new Gold dataset is created with indicator columns
Via MCP / API¶
Use the `code_text_responses` tool:

```python
code_text_responses(
    dataset_id="<bronze-or-silver-id>",
    columns=None,               # Auto-detect all Textarea columns
    labeling_method="tfidf",    # or "llm"
    min_cluster_size=5,
    similarity_threshold=0.65,
    multi_label=True,
)
```
When `columns` is `None`, all `Textarea` columns are automatically discovered from the dataset's column schema.
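The auto-detection can be pictured as a simple filter over the dataset's column schema. The schema shape below (a mapping from column name to metadata with a `control_type` field) is an assumption for illustration:

```python
def detect_textarea_columns(schema: dict) -> list[str]:
    """Return the names of columns whose control type is Textarea."""
    return [name for name, meta in schema.items()
            if meta.get("control_type") == "Textarea"]
```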
Output Format¶
Indicator Columns¶
For each theme discovered in a text column, a binary indicator column is added:
| Column Name | Type | Values | Description |
|---|---|---|---|
| `{source}_price` | indicator | 0 / 1 | Response mentions pricing themes |
| `{source}_support` | indicator | 0 / 1 | Response mentions customer support |
| `{source}_ease_of_use` | indicator | 0 / 1 | Response mentions usability |
The original text column is preserved alongside the new indicators.
Column Schema Metadata¶
Each indicator column includes metadata in the dataset's column schema:
- `control_type: "indicator"` — distinguishes coded columns from survey controls
- `labels: {0: "No", 1: "Yes"}` — value labels for export
- `coding_metadata` — source column, method, theme label, threshold, total themes
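Put together, a coded column's schema entry might look like the following sketch (the exact field names inside `coding_metadata` and all values are illustrative, not Balansor's actual schema):

```python
# Hypothetical schema entry for an indicator column coded from q_feedback
indicator_schema_entry = {
    "control_type": "indicator",        # distinguishes coded columns from survey controls
    "labels": {0: "No", 1: "Yes"},      # value labels carried into exports
    "coding_metadata": {
        "source_column": "q_feedback",  # illustrative source column name
        "method": "tfidf",              # labeling method used
        "theme_label": "price",         # theme this indicator encodes
        "similarity_threshold": 0.65,   # assignment threshold applied
        "total_themes": 4,              # illustrative theme count
    },
}
```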
Integration with Medallion Architecture¶
Semantic clustering fits into the Gold refinement stage:
- Input: Bronze (raw) or Silver (weighted) dataset
- Output: Gold dataset with original data + indicator columns
- Weights preserved: If coding a Silver dataset, the `_weight` column carries through
- Export: Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels
Designing Surveys for Clustering¶
QML Textarea Control¶
Open-ended questions use the Textarea control in QML:
```yaml
- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000
```
Best Practices¶
- Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
- Use multiple open-ended questions: Each targets a different aspect (likes, dislikes, use cases)
- Set reasonable maxLength: 500-1000 characters encourages substantive responses
- Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
- Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters
Configuration Reference¶
| Parameter | Default | Description |
|---|---|---|
| `labeling_method` | `"tfidf"` | Cluster labeling strategy: `"tfidf"` or `"llm"` |
| `min_cluster_size` | `5` | Minimum responses to form a cluster |
| `similarity_threshold` | `0.65` | Cosine similarity threshold for theme assignment |
| `multi_label` | `true` | Allow responses to match multiple themes |
Tuning Tips¶
- Low cluster count? Decrease `min_cluster_size` (try 3-4)
- Too many small clusters? Increase `min_cluster_size` (try 8-10)
- Responses assigned to too many themes? Increase `similarity_threshold` (try 0.75)
- Most responses unclustered? Decrease `similarity_threshold` (try 0.50) or check if responses are too short/diverse
Troubleshooting¶
"No Textarea columns found"
The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.
Too few clusters discovered
HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.
Embedding service unavailable
The Ollama service with the bge-m3 model must be running. Check service health on the monitoring dashboard.
LLM labeling fails
If using `labeling_method="llm"`, the inference model must be available. Fall back to `"tfidf"`, which runs locally without external dependencies.