Semantic Clustering

Semantic clustering transforms open-ended text responses into structured, quantitative themes. This enables statistical analysis of qualitative data — turning free-text answers into coded variables ready for cross-tabulation, weighting, and export.

Overview

When surveys include open-ended questions (using the Textarea control), respondents provide unstructured text. Semantic clustering automatically:

  1. Embeds text responses into vector space using a multilingual embedding model
  2. Clusters similar responses using density-based algorithms
  3. Labels each cluster with a descriptive theme name
  4. Assigns themes back to individual responses as indicator columns

The result is a Gold dataset where each open-ended question produces one or more binary indicator columns (e.g., q_feedback_price, q_feedback_support), enabling quantitative analysis of qualitative data.

How It Works

Pipeline Architecture

Textarea responses (strings)
         ▼
┌──────────────────┐
│   Embedding      │  Local bge-m3 (1024-dim vectors)
└────────┬─────────┘
         ▼
┌──────────────────┐
│   Clustering     │  UMAP dimensionality reduction → HDBSCAN
└────────┬─────────┘
         ▼
┌──────────────────┐
│   Labeling       │  c-TF-IDF keywords or LLM summaries
└────────┬─────────┘
         ▼
┌──────────────────┐
│   Assignment     │  Cosine similarity → binary indicators
└────────┬─────────┘
         ▼
Gold dataset with theme columns

Step 1: Embedding

Each text response is converted into a 1024-dimensional vector using the bge-m3 multilingual embedding model on the platform's local CPU-only inference service. Semantically similar responses produce vectors that are close together in the embedding space.

Step 2: Clustering

The high-dimensional embeddings are processed in two stages:

  • UMAP reduces dimensionality while preserving local structure
  • HDBSCAN identifies dense clusters of similar responses without requiring a predetermined number of clusters

Responses that don't fit any cluster are marked as noise (unclustered).
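
As a rough illustration of what the clustering step hands to the later stages, the sketch below groups responses by cluster label and separates the unclustered ones. It assumes the common HDBSCAN convention of labeling noise points -1; the function name and the sample data are hypothetical and not part of the platform.

```python
from collections import defaultdict

NOISE = -1  # label HDBSCAN implementations conventionally give to noise points


def group_by_cluster(responses, labels):
    """Split responses into clusters and a list of unclustered (noise) responses."""
    clusters = defaultdict(list)
    noise = []
    for text, label in zip(responses, labels):
        if label == NOISE:
            noise.append(text)
        else:
            clusters[label].append(text)
    return dict(clusters), noise


# Illustrative data only: labels as a clustering step might return them.
responses = [
    "Too expensive",
    "Price is high",
    "Support was slow",
    "asdf",
    "Slow replies from support",
]
labels = [0, 0, 1, -1, 1]

clusters, noise = group_by_cluster(responses, labels)
print(clusters)
print(noise)
```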

Step 3: Labeling

Each cluster receives a descriptive label using one of two methods:

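| Method   | Speed       | Quality | Description                                                               |
|----------|-------------|---------|---------------------------------------------------------------------------|
| c-TF-IDF | Fast        | Good    | Extracts distinctive keywords from each cluster using class-based TF-IDF  |
| LLM      | ~1s/cluster | High    | Generates natural-language theme labels using an LLM                      |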
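
To make the c-TF-IDF option more concrete, here is a minimal sketch of one common formulation: each cluster is treated as a single document, and a term's score is its normalized frequency in the cluster times log(1 + A / f), where A is the average word count per cluster and f is the term's frequency across all clusters. Whether the platform uses exactly this weighting is an assumption. The function name and the simple English-only tokenizer are hypothetical.

```python
import math
import re
from collections import Counter


def ctfidf_keywords(clusters, top_n=3):
    """Return the top_n highest-scoring terms for each cluster.

    clusters maps a cluster id to the list of response strings in it.
    """
    # Treat each cluster as one document and count its terms.
    term_counts = {
        cid: Counter(re.findall(r"[a-z']+", " ".join(texts).lower()))
        for cid, texts in clusters.items()
    }
    # Term frequency across all clusters, and average words per cluster.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(term_counts)

    keywords = {}
    for cid, counts in term_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cid] = ranked[:top_n]
    return keywords


# Illustrative clusters only.
clusters = {
    0: ["the price is too high", "the price went up again"],
    1: ["the support team is slow", "support never answered"],
}
keywords = ctfidf_keywords(clusters)
print(keywords)
```

Terms that appear in every cluster (such as "the") are down-weighted by the logarithmic factor, so the terms that are frequent in one cluster but rare elsewhere rise to the top.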

Step 4: Assignment

Themes are assigned back to individual responses based on cosine similarity between the response embedding and cluster centroids:

  • Multi-label mode (default): Each response can belong to multiple themes. Produces binary indicator columns (0/1) for each theme where similarity exceeds the threshold.
  • Single-best mode: Each response belongs to exactly one theme. Produces a single categorical column.
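
The two assignment modes can be sketched as follows. This is a simplified illustration, not the platform's implementation: the function names are hypothetical, the vectors are 2-dimensional toys rather than 1024-dimensional embeddings, and it assumes single-best mode picks the most similar theme regardless of the threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_themes(embedding, centroids, threshold=0.65, multi_label=True):
    """Assign themes to one response.

    centroids maps a theme name to its centroid vector.
    Multi-label mode returns a dict of 0/1 indicators per theme;
    single-best mode returns the name of the most similar theme.
    """
    sims = {theme: cosine(embedding, c) for theme, c in centroids.items()}
    if multi_label:
        return {theme: int(s > threshold) for theme, s in sims.items()}
    return max(sims, key=sims.get)


# Toy centroids and response embedding, for illustration only.
centroids = {"price": [1.0, 0.0], "support": [0.0, 1.0]}
response = [0.8, 0.6]  # similarity is about 0.8 to "price" and 0.6 to "support"

multi = assign_themes(response, centroids)
loose = assign_themes(response, centroids, threshold=0.5)
single = assign_themes(response, centroids, multi_label=False)
print(multi, loose, single)
```

Lowering the threshold from 0.65 to 0.5 lets the same response pick up the second theme, which is the trade-off the similarity threshold controls.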

Using Semantic Clustering

Prerequisites

  • A Bronze or Silver dataset containing at least one Textarea column
  • Sufficient responses for clustering (minimum 5 per cluster; realistically 30+ total text responses)
  • Embedding service available (bge-m3 model on the platform's local CPU-only inference service)

Via Balansor UI

  1. Navigate to Refine from the main menu
  2. Select a bundle, then choose a source dataset (Bronze or Silver)
  3. Switch to the Semantic Clustering tab
  4. Balansor auto-detects Textarea columns available for coding
  5. Configure parameters:
    • Labeling method: c-TF-IDF (fast) or LLM (higher quality)
    • Minimum cluster size: Smallest group to form a theme (default: 5)
    • Similarity threshold: How closely a response must match a theme (default: 0.65)
    • Multi-label: Whether responses can belong to multiple themes
  6. Click "Run Clustering"
  7. A new Gold dataset is created with indicator columns

Refine page combines both operations

The Refine page has two tabs: Field Transformation (rename, delete, reorder, computed fields) and Semantic Clustering. Both produce Gold datasets from the same source. See Data Analysis for the full Refine workflow.

Via MCP / API

Use the code_text_responses tool:

code_text_responses(
    dataset_id="<bronze-or-silver-id>",
    columns=None,              # Auto-detect all Textarea columns
    labeling_method="tfidf",   # or "llm"
    min_cluster_size=5,
    similarity_threshold=0.65,
    multi_label=True
)

When columns is None, all Textarea columns are automatically discovered from the dataset's column schema.
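
For clients that talk to the MCP server directly, a tool invocation is a JSON-RPC 2.0 request with the method tools/call. The sketch below only builds that request body; how the request reaches the server (transport, authentication) depends on your deployment and is not shown. Passing an explicit null for columns, rather than omitting the key, is an assumption.

```python
import json

# Arguments mirror the code_text_responses call shown above.
arguments = {
    "dataset_id": "<bronze-or-silver-id>",
    "columns": None,  # serialized as JSON null: auto-detect Textarea columns
    "labeling_method": "tfidf",
    "min_cluster_size": 5,
    "similarity_threshold": 0.65,
    "multi_label": True,  # serialized as JSON true
}

# Standard MCP envelope for calling a tool by name.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "code_text_responses", "arguments": arguments},
}

payload = json.dumps(request, indent=2)
print(payload)
```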

Output Format

Indicator Columns

For each theme discovered in a text column, a binary indicator column is added:

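| Column Name          | Type      | Values | Description                         |
|----------------------|-----------|--------|-------------------------------------|
| {source}_price       | indicator | 0 / 1  | Response mentions pricing themes    |
| {source}_support     | indicator | 0 / 1  | Response mentions customer support  |
| {source}_ease_of_use | indicator | 0 / 1  | Response mentions usability         |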

The original text column is preserved alongside the new indicators.

Column Schema Metadata

Each indicator column includes metadata in the dataset's column schema:

  • control_type: "indicator" — distinguishes coded columns from survey controls
  • labels: {0: "No", 1: "Yes"} — value labels for export
  • coding_metadata — source column, method, theme label, threshold, total themes
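
One practical use of this metadata is picking out the coded columns programmatically. The sketch below shows what a schema entry might look like and filters on control_type. Only control_type, labels, and the existence of coding_metadata come from the list above; the key names inside coding_metadata, the overall dict layout, and the helper function are hypothetical.

```python
# Hypothetical column schema. The keys inside coding_metadata and the
# "Textarea" entry are illustrative guesses, not the documented format.
column_schema = {
    "q_feedback": {"control_type": "Textarea"},
    "q_feedback_price": {
        "control_type": "indicator",
        "labels": {0: "No", 1: "Yes"},
        "coding_metadata": {
            "source_column": "q_feedback",
            "method": "tfidf",
            "theme_label": "price",
            "threshold": 0.65,
            "total_themes": 3,
        },
    },
}


def indicator_columns(schema):
    """Return the names of columns produced by semantic clustering."""
    return [
        name
        for name, meta in schema.items()
        if meta.get("control_type") == "indicator"
    ]


coded = indicator_columns(column_schema)
print(coded)
```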

Integration with Medallion Architecture

Semantic clustering fits into the Gold refinement stage:

🥉 Bronze  →  🥈 Silver  →  🥇 Gold
 (Raw)        (Weighted)    (Coded + Weighted)

  • Input: Bronze (raw) or Silver (weighted) dataset
  • Output: Gold dataset with original data + indicator columns
  • Weights preserved: If coding a Silver dataset, the _weight column carries through
  • Export: Gold datasets export to CSV, Excel, SPSS, or Parquet with full variable labels
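
Because the _weight column travels with the indicators, a weighted theme share can be computed directly from an export. The following is a minimal sketch using a toy CSV in place of a real Gold export; the column names follow the examples in this page, while the data values and the helper function are made up for illustration.

```python
import csv
import io

# Toy stand-in for a Gold CSV export; real exports will have more columns.
gold_csv = """q_feedback,q_feedback_price,q_feedback_support,_weight
Too expensive,1,0,1.2
Support was slow,0,1,0.8
Pricey and slow support,1,1,1.0
"""


def weighted_share(rows, indicator, weight="_weight"):
    """Weighted proportion of rows where the indicator column equals 1."""
    total = sum(float(r[weight]) for r in rows)
    hits = sum(float(r[weight]) for r in rows if r[indicator] == "1")
    return hits / total if total else 0.0


rows = list(csv.DictReader(io.StringIO(gold_csv)))
price_share = weighted_share(rows, "q_feedback_price")
support_share = weighted_share(rows, "q_feedback_support")
print(round(price_share, 3), round(support_share, 3))
```

In multi-label mode the shares across themes can add up to more than 1, since a single response may count toward several themes.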

Designing Surveys for Clustering

QML Textarea Control

Open-ended questions use the Textarea control in QML:

- id: q_feedback
  kind: Question
  title: "What could we improve?"
  input:
    control: Textarea
    placeholder: "Share your suggestions..."
    maxLength: 1000

Best Practices

  • Ask focused questions: "What do you like most?" produces more clusterable responses than "Any comments?"
  • Use multiple open-ended questions: Have each one target a different aspect (likes, dislikes, use cases)
  • Set a reasonable maxLength: A limit of 500-1000 characters encourages substantive responses
  • Combine with structured questions: Demographics and rating scales enable cross-tabulation with discovered themes
  • Plan for sample size: Aim for 30+ responses per open-ended question for meaningful clusters

Configuration Reference

| Parameter            | Default | Description                                       |
|----------------------|---------|---------------------------------------------------|
| labeling_method      | "tfidf" | Cluster labeling strategy: "tfidf" or "llm"       |
| min_cluster_size     | 5       | Minimum responses to form a cluster               |
| similarity_threshold | 0.65    | Cosine similarity threshold for theme assignment  |
| multi_label          | true    | Allow responses to match multiple themes          |

Tuning Tips

  • Too few clusters? Decrease min_cluster_size (try 3-4)
  • Too many small clusters? Increase min_cluster_size (try 8-10)
  • Responses assigned to too many themes? Increase similarity_threshold (try 0.75)
  • Most responses unclustered? Decrease similarity_threshold (try 0.50), or check whether responses are too short or too diverse

Troubleshooting

"No Textarea columns found"

The dataset's source questionnaire has no Textarea controls. Use structured controls (Radio, Slider) for quantitative questions and Textarea for open-ended ones.

Too few clusters discovered

HDBSCAN requires sufficient density. Ensure at least 30 text responses and that the question elicits varied answers.

Embedding service unavailable

The platform's local CPU-only inference service must be running with the bge-m3 model. Check service health on the monitoring dashboard.

LLM labeling fails

If using labeling_method="llm", the inference model must be available. If it is not, fall back to "tfidf", which runs locally without external dependencies.