Semantic Clustering¶
Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.
Overview¶
When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:
- Embeds text responses into vector space using a multilingual embedding model
- Clusters similar responses using density-based algorithms
- Labels each cluster with a descriptive theme name
- Assigns themes back to individual responses as indicator columns
The result is a Gold dataset where each open-ended question produces one or more binary indicator columns (e.g., `q_feedback_price`, `q_feedback_support`), enabling quantitative analysis of qualitative data.
How It Works¶
Pipeline Architecture¶
```
Textarea responses (strings)
        │
        ▼
┌─────────────────┐
│    Embedding    │  Ollama bge-m3 (1024-dim vectors)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Clustering    │  UMAP dimensionality reduction → HDBSCAN
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Labeling     │  c-TF-IDF keywords or LLM summaries
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Assignment    │  Cosine similarity → binary indicators
└────────┬────────┘
         │
         ▼
Gold dataset with theme columns
```
Step 1: Embedding¶
Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model via Ollama. Semantically similar responses produce vectors that are close together in the embedding space.
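As a concrete sketch, a single response can be embedded with one HTTP call to Ollama's embeddings endpoint. The localhost URL, request payload, and response shape below follow Ollama's documented defaults, but are assumptions about this deployment:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed default Ollama endpoint

def build_payload(text: str, model: str = "bge-m3") -> dict:
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str) -> list[float]:
    """POST the text to Ollama and return its embedding vector (1024-dim for bge-m3)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Calling `embed` for every response yields the matrix of vectors that the clustering stage consumes.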
Step 2: Clustering¶
The high-dimensional embeddings are processed in two stages:
- UMAP reduces dimensionality while preserving local structure
- HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters
Responses that don't fit any cluster are marked as noise (unclustered).
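The two-stage reduction-then-cluster step can be sketched with the third-party `umap-learn` and `hdbscan` packages (the document names the UMAP and HDBSCAN algorithms; these specific packages and parameters are assumptions, not Balansor's internals):

```python
from typing import Sequence

def count_clusters(labels: Sequence[int]) -> int:
    """Number of clusters found; HDBSCAN marks noise (unclustered) points with -1."""
    return len(set(labels) - {-1})

def cluster_embeddings(embeddings, min_cluster_size: int = 5):
    """Reduce high-dimensional embeddings with UMAP, then cluster with HDBSCAN.

    Requires the third-party `umap-learn` and `hdbscan` packages; imported
    lazily so the rest of this module works without them installed.
    """
    import umap
    import hdbscan

    # UMAP preserves local structure while collapsing 1024 dims to a few
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)
    # HDBSCAN needs no predetermined cluster count; -1 labels are noise
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)
```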
Step 3: Labeling¶
Each cluster receives a descriptive label using one of two methods:
| Method | Speed | Quality | Description |
|---|---|---|---|
| c-TF-IDF | Fast | Good | Extracts distinctive keywords from each cluster using class-based TF-IDF |
| LLM | ~1s/cluster | High | Generates natural-language theme labels using an LLM |
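c-TF-IDF treats each cluster as one "class document" and scores terms that are frequent in that cluster but rare across the others. A minimal illustrative implementation, using the BERTopic-style weighting `tf * log(1 + A / f_t)` (the production implementation may differ):

```python
import math
from collections import Counter

def ctfidf_keywords(clusters: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
    """Toy class-based TF-IDF: top distinctive keywords per cluster."""
    # Term frequency per cluster, treating each cluster as one document
    class_tf = {name: Counter(" ".join(docs).lower().split())
                for name, docs in clusters.items()}
    # f_t: total frequency of each term across all clusters
    total = Counter()
    for tf in class_tf.values():
        total.update(tf)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(class_tf)
    scores = {
        name: {t: tf[t] * math.log(1 + avg_words / total[t]) for t in tf}
        for name, tf in class_tf.items()
    }
    return {name: [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:top_n]]
            for name, s in scores.items()}
```

On a toy input like `{"a": ["too expensive", "price is high", "lower the price"], "b": ["support was slow", "great support team"]}`, the top keyword for each cluster is the term repeated within it but absent elsewhere.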
Step 4: Assignment¶
Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:
- Multi-label mode (default): Each response can belong to multiple themes. Produces binary indicator columns (0/1) for each theme where similarity exceeds the threshold.
- Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
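Both modes can be illustrated with plain cosine similarity against per-theme centroids (function and parameter names here are illustrative, not Balansor's actual API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """Assign themes by similarity to cluster centroids.

    multi_label=True  -> dict of binary indicators (one per theme)
    multi_label=False -> single best theme name, or None below threshold
    """
    sims = {name: cosine(embedding, c) for name, c in centroids.items()}
    if multi_label:
        return {name: int(s >= threshold) for name, s in sims.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```

With centroids for "price" and "support" and a response embedding close to the price centroid, multi-label mode yields `{"price": 1, "support": 0}` while single-best mode yields `"price"`.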
Using Semantic Clustering¶
Prerequisites¶
- A Bronze or Silver dataset containing at least one `Textarea` column
- Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
- Ollama embedding service available (bge-m3 model)
Via Balansor UI¶
- Navigate to Datasets and select a Bronze or Silver dataset
- Open the Refine tab
- Balansor auto-detects `Textarea` columns available for coding
- Configure parameters:
  - Labeling method: c-TF-IDF (fast) or LLM (higher quality)
  - Minimum cluster size: Smallest group to form a theme (default: 5)
  - Similarity threshold: How closely a response must match a theme (default: 0.65)
  - Multi-label: Whether responses can belong to multiple themes
- Click "Code Text Responses"
- A new Gold dataset is created with indicator columns
Via MCP / API¶
Use the `code_text_responses` tool:

```python
code_text_responses(
    dataset_id="<bronze-or-silver-id>",
    columns=None,               # Auto-detect all Textarea columns
    labeling_method="tfidf",    # or "llm"
    min_cluster_size=5,
    similarity_threshold=0.65,
    multi_label=True,
)
```
When `columns` is `None`, all `Textarea` columns are automatically discovered from the dataset's column schema.
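The auto-detection can be pictured as a simple filter over the dataset's column schema. The schema shape below (a mapping from column name to metadata with a `control_type` field) is an assumption for illustration:

```python
def detect_textarea_columns(schema: dict) -> list[str]:
    """Return the names of columns whose control type is Textarea."""
    return [name for name, meta in schema.items()
            if meta.get("control_type") == "Textarea"]
```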
Output Format¶
Indicator Columns¶
For each theme discovered in a text column, a binary indicator column is added:
| Column Name | Type | Values | Description |
|---|---|---|---|
| `{source}_price` | indicator | 0 / 1 | Response mentions pricing themes |
| `{source}_support` | indicator | 0 / 1 | Response mentions customer support |
| `{source}_ease_of_use` | indicator | 0 / 1 | Response mentions usability |
The original text column is preserved alongside the new indicators.
Column Schema Metadata¶
Each indicator column includes metadata in the dataset's column schema:
- `control_type: "indicator"` — distinguishes coded columns from survey controls
- `labels: {0: "No", 1: "Yes"}` — value labels for export
- `coding_metadata` — source column, method, theme label, threshold, total themes
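Put together, a coded column's schema entry might look like the following sketch (the exact field names inside `coding_metadata` and all values are illustrative, not Balansor's actual schema):

```python
# Hypothetical schema entry for an indicator column coded from q_feedback
indicator_schema_entry = {
    "control_type": "indicator",        # distinguishes coded columns from survey controls
    "labels": {0: "No", 1: "Yes"},      # value labels carried into exports
    "coding_metadata": {
        "source_column": "q_feedback",  # illustrative source column name
        "method": "tfidf",              # labeling method used
        "theme_label": "price",         # theme this indicator encodes
        "similarity_threshold": 0.65,   # assignment threshold applied
        "total_themes": 4,              # illustrative theme count
    },
}
```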
Integration with Medallion Architecture¶
Semantic clustering fits into the Gold refinement stage:
- Input: Bronze (raw) or Silver (weighted) dataset
- Output: Gold dataset with original data + indicator columns
- Weights preserved: If coding a Silver dataset, the `_weight` column carries through
- Export: Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels
Designing Surveys for Clustering¶
QML Textarea Control¶
Open-ended questions use the Textarea control in QML:
```yaml
- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000
```
Best Practices¶
- Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
- Use multiple open-ended questions: Each targets a different aspect (likes, dislikes, use cases)
- Set reasonable maxLength: 500-1000 characters encourages substantive responses
- Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
- Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters
Configuration Reference¶
| Parameter | Default | Description |
|---|---|---|
| `labeling_method` | `"tfidf"` | Cluster labeling strategy: `"tfidf"` or `"llm"` |
| `min_cluster_size` | `5` | Minimum responses to form a cluster |
| `similarity_threshold` | `0.65` | Cosine similarity threshold for theme assignment |
| `multi_label` | `true` | Allow responses to match multiple themes |
Tuning Tips¶
- Low cluster count? Decrease `min_cluster_size` (try 3-4)
- Too many small clusters? Increase `min_cluster_size` (try 8-10)
- Responses assigned to too many themes? Increase `similarity_threshold` (try 0.75)
- Most responses unclustered? Decrease `similarity_threshold` (try 0.50) or check if responses are too short/diverse
Troubleshooting¶
"No Textarea columns found"
The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.
Too few clusters discovered
HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.
Embedding service unavailable
The Ollama service with the bge-m3 model must be running. Check service health on the monitoring dashboard.
LLM labeling fails
If using `labeling_method="llm"`, the inference model must be available. Fall back to `"tfidf"`, which runs locally without external dependencies.