Semantic Clustering

Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.

Overview

When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:

  1. Embeds text responses into vector space using a multilingual embedding model
  2. Clusters similar responses using density-based algorithms
  3. Labels each cluster with a descriptive theme name
  4. Assigns themes back to individual responses as indicator columns

The result is a Gold dataset where each open-ended question produces one or more binary indicator columns (e.g., q_feedback_price, q_feedback_support), enabling quantitative analysis of qualitative data.

How It Works

Pipeline Architecture

Textarea responses (strings)
         │
         ▼
┌─────────────────┐
│   Embedding     │  Ollama bge-m3 (1024-dim vectors)
└────────┬────────┘
         ▼
┌─────────────────┐
│   Clustering    │  UMAP dimensionality reduction → HDBSCAN
└────────┬────────┘
         ▼
┌─────────────────┐
│   Labeling      │  c-TF-IDF keywords or LLM summaries
└────────┬────────┘
         ▼
┌─────────────────┐
│   Assignment    │  Cosine similarity → binary indicators
└────────┬────────┘
         ▼
Gold dataset with theme columns

Step 1: Embedding

Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model via Ollama. Semantically similar responses produce vectors that are close together in the embedding space.
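The "close together" property can be illustrated with cosine similarity on toy vectors. The values below are invented stand-ins, not real bge-m3 output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim stand-ins for 1024-dim bge-m3 embeddings (illustrative only).
v_price_1 = np.array([0.9, 0.1, 0.0, 0.1])   # "too expensive"
v_price_2 = np.array([0.8, 0.2, 0.1, 0.0])   # "pricing is high"
v_support = np.array([0.1, 0.0, 0.9, 0.3])   # "great support team"

# Responses about the same topic score much higher than unrelated ones.
print(cosine_similarity(v_price_1, v_price_2))  # high (~0.98)
print(cosine_similarity(v_price_1, v_support))  # low  (~0.14)
```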

Step 2: Clustering

The high-dimensional embeddings are processed in two stages:

  • UMAP reduces dimensionality while preserving local structure
  • HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters

Responses that don't fit any cluster are marked as noise (unclustered).
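The noise convention can be demonstrated with scikit-learn. This sketch uses DBSCAN as a readily available stand-in for HDBSCAN (the actual pipeline uses UMAP followed by HDBSCAN); both algorithms label unclustered points -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Pretend these are UMAP-reduced embeddings: two dense groups plus one outlier.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # e.g. responses about pricing
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # e.g. responses about support
    [10.0, -10.0],                         # off-topic response -> noise
])

# No cluster count is given up front; density alone determines the groups.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # two clusters, with the outlier labeled -1
```

In the real pipeline, HDBSCAN's min_cluster_size parameter plays the role that min_samples sketches here.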

Step 3: Labeling

Each cluster receives a descriptive label using one of two methods:

Method     Speed         Quality   Description
c-TF-IDF   Fast          Good      Extracts distinctive keywords from each cluster using class-based TF-IDF
LLM        ~1s/cluster   High      Generates natural-language theme labels using an LLM
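The keyword method can be sketched in a few lines. This is a simplified class-based TF-IDF in the spirit of BERTopic's c-TF-IDF; the exact tokenization and weighting used in the pipeline are assumptions:

```python
import math
from collections import Counter

def ctfidf_keywords(clusters, top_n=3):
    """Rank terms per cluster by tf(t, c) * log(1 + A / f(t)), where f(t) is
    the term's total frequency across clusters and A the average cluster size."""
    tfs = {name: Counter(" ".join(docs).split()) for name, docs in clusters.items()}
    totals = Counter()
    for tf in tfs.values():
        totals.update(tf)
    avg_words = sum(totals.values()) / len(tfs)  # A
    return {
        name: [t for t, _ in sorted(
            ((t, n * math.log(1 + avg_words / totals[t])) for t, n in tf.items()),
            key=lambda kv: -kv[1])[:top_n]]
        for name, tf in tfs.items()
    }

# Invented example responses, already grouped by the clustering step.
clusters = {
    "theme_0": ["too expensive", "the price is high", "expensive for what you get"],
    "theme_1": ["support team was great", "helpful support", "quick support reply"],
}
print(ctfidf_keywords(clusters))  # distinctive terms rise to the top per theme
```

Terms that are frequent within one cluster but rare elsewhere score highest, which is why the extracted keywords read as theme labels.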

Step 4: Assignment

Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:

  • Multi-label mode (default): Each response can belong to multiple themes. Produces binary indicator columns (0/1) for each theme where similarity exceeds the threshold.
  • Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
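A minimal sketch of both modes, assuming centroids are mean embeddings per cluster and using the default 0.65 threshold (vectors and names are invented for illustration):

```python
import numpy as np

def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """Return {theme: 0/1} indicators, or the single best theme name."""
    sims = {theme: float(np.dot(embedding, c) /
                         (np.linalg.norm(embedding) * np.linalg.norm(c)))
            for theme, c in centroids.items()}
    if multi_label:
        return {theme: int(s >= threshold) for theme, s in sims.items()}
    return max(sims, key=sims.get)  # single-best mode: one categorical value

# Toy 3-dim centroids (real vectors are 1024-dim).
centroids = {"price": np.array([1.0, 0.0, 0.0]),
             "support": np.array([0.0, 1.0, 0.0])}
response = np.array([0.8, 0.7, 0.0])  # touches both themes above threshold

print(assign_themes(response, centroids))                     # binary indicators
print(assign_themes(response, centroids, multi_label=False))  # best single theme
```

Note the difference in output shape: multi-label yields one 0/1 column per theme, while single-best collapses to one categorical value per response.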

Using Semantic Clustering

Prerequisites

  • A Bronze or Silver dataset containing at least one Textarea column
  • Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
  • Ollama embedding service available (bge-m3 model)

Via Balansor UI

  1. Navigate to Datasets and select a Bronze or Silver dataset
  2. Open the Refine tab
  3. Balansor auto-detects Textarea columns available for coding
  4. Configure parameters:
    • Labeling method: c-TF-IDF (fast) or LLM (higher quality)
    • Minimum cluster size: Smallest group to form a theme (default: 5)
    • Similarity threshold: How closely a response must match a theme (default: 0.65)
    • Multi-label: Whether responses can belong to multiple themes
  5. Click "Code Text Responses"
  6. A new Gold dataset is created with indicator columns

Via MCP / API

Use the code_text_responses tool:

code_text_responses(
    dataset_id="<bronze-or-silver-id>",
    columns=None,              # Auto-detect all Textarea columns
    labeling_method="tfidf",   # or "llm"
    min_cluster_size=5,
    similarity_threshold=0.65,
    multi_label=True
)

When columns is None, all Textarea columns are automatically discovered from the dataset's column schema.

Output Format

Indicator Columns

For each theme discovered in a text column, a binary indicator column is added:

Column Name            Type        Values   Description
{source}_price         indicator   0 / 1    Response mentions pricing themes
{source}_support       indicator   0 / 1    Response mentions customer support
{source}_ease_of_use   indicator   0 / 1    Response mentions usability

The original text column is preserved alongside the new indicators.
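Because indicators are plain 0/1 columns, they slot directly into standard quantitative tooling. A sketch with pandas, using a hypothetical segment variable and invented data alongside the column names from the table above:

```python
import pandas as pd

# A miniature Gold dataset: original text preserved next to coded indicators
# (all values are invented for illustration).
gold = pd.DataFrame({
    "segment": ["consumer", "consumer", "business", "business"],
    "q_feedback": ["too pricey", "love the support", "support is slow", "expensive"],
    "q_feedback_price": [1, 0, 0, 1],
    "q_feedback_support": [0, 1, 1, 0],
})

# Cross-tabulate a discovered theme against a structured variable.
print(pd.crosstab(gold["segment"], gold["q_feedback_price"]))
```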

Column Schema Metadata

Each indicator column includes metadata in the dataset's column schema:

  • control_type: "indicator" — distinguishes coded columns from survey controls
  • labels: {0: "No", 1: "Yes"} — value labels for export
  • coding_metadata — source column, method, theme label, threshold, total themes
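Put together, one indicator column's schema entry might look like the following (the documented keys are from the list above; the nesting and field names inside coding_metadata are assumptions, not the exact stored format):

```python
# Illustrative schema entry for a single coded indicator column.
indicator_schema = {
    "control_type": "indicator",          # distinguishes coded columns
    "labels": {0: "No", 1: "Yes"},        # value labels for export
    "coding_metadata": {                  # provenance of the coding step
        "source_column": "q_feedback",
        "method": "tfidf",
        "theme_label": "price",
        "threshold": 0.65,
        "total_themes": 3,
    },
}
print(indicator_schema["labels"][1])
```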

Integration with Medallion Architecture

Semantic clustering fits into the Gold refinement stage:

🥉 Bronze  →  🥈 Silver  →  🥇 Gold
 (Raw)        (Weighted)    (Coded + Weighted)
  • Input: Bronze (raw) or Silver (weighted) dataset
  • Output: Gold dataset with original data + indicator columns
  • Weights preserved: If coding a Silver dataset, the _weight column carries through
  • Export: Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels

Designing Surveys for Clustering

QML Textarea Control

Open-ended questions use the Textarea control in QML:

- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000

Best Practices

  • Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
  • Use multiple open-ended questions: Each targets a different aspect (likes, dislikes, use cases)
  • Set reasonable maxLength: 500-1000 characters encourages substantive responses
  • Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
  • Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters

Configuration Reference

Parameter              Default   Description
labeling_method        "tfidf"   Cluster labeling strategy: "tfidf" or "llm"
min_cluster_size       5         Minimum responses to form a cluster
similarity_threshold   0.65      Cosine similarity threshold for theme assignment
multi_label            true      Allow responses to match multiple themes

Tuning Tips

  • Low cluster count? Decrease min_cluster_size (try 3-4)
  • Too many small clusters? Increase min_cluster_size (try 8-10)
  • Responses assigned to too many themes? Increase similarity_threshold (try 0.75)
  • Most responses unclustered? Decrease similarity_threshold (try 0.50) or check if responses are too short/diverse

Troubleshooting

"No Textarea columns found"
The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.

Too few clusters discovered
HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.

Embedding service unavailable
The Ollama service with the bge-m3 model must be running. Check service health on the monitoring dashboard.

LLM labeling fails
If using labeling_method="llm", the inference model must be available. Fall back to "tfidf", which runs locally without external dependencies.