Quality Metrics Reference

Askalot evaluates survey data quality across two complementary dimensions grounded in the Total Survey Error framework (Groves et al., 2009):

  • Sample Representativeness — Do respondent demographics match the target population? Quantifies coverage and nonresponse error.
  • Response Quality — Are survey questions producing informative, diverse answers? Detects measurement error.

Both dimensions must be acceptable for data to be fit for analysis.


Sample Representativeness Metrics

These metrics compare the actual demographic distribution of survey respondents against target distributions defined in a sampling strategy.

RMSE (Root Mean Square Error)

Overall deviation between sample and target distributions across all demographic categories.

\[ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (a_i - t_i)^2} \]

Where \(a_i\) is the actual proportion and \(t_i\) is the target proportion for category \(i\).

RMSE penalizes large deviations more than many small ones due to the squared term. A single demographic category that is 10 percentage points off-target contributes more to RMSE than five categories each off by 2 points.

| RMSE | Interpretation |
|---|---|
| < 0.02 | Excellent — sample closely matches targets |
| 0.02–0.05 | Acceptable — minor deviations, weighting should correct |
| 0.05–0.10 | Concerning — significant deviations, investigate causes |
| > 0.10 | Problematic — structural sampling issues, weighting may not suffice |

MAE (Mean Absolute Error)

Average per-category deviation. More robust to outlier categories than RMSE.

\[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |a_i - t_i| \]

When MAE is much smaller than RMSE, one or two categories are severely off while others are fine. When MAE ≈ RMSE, deviations are uniformly distributed.
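As a sketch of both metrics in plain Python (illustrative proportions, not Askalot's implementation) — note how one 10-point deviation pushes RMSE above MAE:

```python
import math

def rmse(actual, target):
    """Root mean square deviation between sample and target proportions."""
    n = len(actual)
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(actual, target)) / n)

def mae(actual, target):
    """Mean absolute deviation between sample and target proportions."""
    return sum(abs(a - t) for a, t in zip(actual, target)) / len(actual)

# One category is 10 points off-target; the others nearly match.
actual = [0.60, 0.200, 0.100, 0.10]
target = [0.50, 0.225, 0.125, 0.15]

print(rmse(actual, target))  # ~0.0586 — pulled up by the one large gap
print(mae(actual, target))   # 0.05
```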

Chi-Square Test

Tests whether observed deviations from targets are statistically significant versus random sampling noise.

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

Where \(O_i\) is the observed count and \(E_i\) is the expected count based on target proportions.

Sample Size Sensitivity

With large samples (N > 500), chi-square will flag almost any deviation as statistically significant, even if the practical magnitude is trivial. Use RMSE/MAE to assess practical significance, and chi-square for statistical significance. A significant chi-square with RMSE < 0.02 may not be actionable.
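A stdlib-Python sketch of that sensitivity (illustrative counts; compare the statistic to the critical value for k − 1 degrees of freedom, e.g. 3.84 at df = 1, α = 0.05):

```python
def chi_square(observed_counts, target_props):
    """Chi-square statistic: observed counts vs. counts expected
    under the target proportions."""
    n = sum(observed_counts)
    expected = [p * n for p in target_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))

# The same 2-point deviation (52% observed vs. a 50% target):
print(chi_square([520, 480], [0.5, 0.5]))    # 1.6  — not significant at df=1
print(chi_square([5200, 4800], [0.5, 0.5]))  # 16.0 — highly significant
```

The deviation's practical magnitude is identical in both cases (RMSE = 0.02); only the sample size changed.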

Max Deviation

The largest single-category gap between actual and target proportions. Identifies the most problematic demographic category.

| Max Deviation | Interpretation |
|---|---|
| < 0.05 | All categories within acceptable range |
| 0.05–0.10 | One category moderately off — likely correctable by weighting |
| > 0.10 | Structural gap — at least one group is substantially misrepresented |

Composite Quality Score

The overall quality score (0–1) combines RMSE across all factors:

| Score | Rating | Interpretation |
|---|---|---|
| > 0.8 | Good | Sample adequately represents target population |
| 0.6–0.8 | Acceptable | Some factors off-target, weighting recommended |
| 0.4–0.6 | Poor | Significant representativeness issues, investigate root causes |
| < 0.4 | Critical | Sample does not represent target population |

Per-Factor Review

A high overall score can mask one poorly represented factor. Always review the per-factor breakdown — the factor with the lowest score often drives the most bias in survey estimates.


Weighting Diagnostics

After raking (iterative proportional fitting) produces a Silver dataset, these metrics evaluate whether weighting improved representativeness without introducing excessive variance.
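A compact sketch of the raking loop itself (illustrative Python with made-up factor names, not Askalot's implementation; assumes every target category is present in the sample):

```python
def rake(rows, targets, iterations=25):
    """Iterative proportional fitting: repeatedly rescale weights so the
    weighted marginals of each factor match its target proportions."""
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for factor, props in targets.items():
            total = sum(weights)
            # Current weighted proportion of each category of this factor.
            current = {cat: 0.0 for cat in props}
            for w, row in zip(weights, rows):
                current[row[factor]] += w / total
            # Scale each respondent's weight by target/current for their category.
            for i, row in enumerate(rows):
                weights[i] *= props[row[factor]] / current[row[factor]]
    return weights

rows = [
    {"sex": "f", "age": "young"}, {"sex": "f", "age": "old"},
    {"sex": "m", "age": "young"}, {"sex": "m", "age": "old"},
    {"sex": "m", "age": "old"},
]
targets = {"sex": {"f": 0.5, "m": 0.5}, "age": {"young": 0.4, "old": 0.6}}
weights = rake(rows, targets)  # weighted marginals now match both factors
```

Each pass fixes one factor's marginal and slightly disturbs the others; the loop converges when all targets are simultaneously satisfiable.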

Design Effect (DEFF)

Variance inflation from unequal weighting (Kish, 1965):

\[ \text{DEFF} = 1 + \text{CV}^2_w \]

Where \(\text{CV}_w\) is the coefficient of variation of the weights.

| DEFF | Interpretation |
|---|---|
| < 1.5 | Minimal impact on precision |
| 1.5–2.0 | Moderate precision loss — acceptable for most studies |
| 2.0–3.0 | Substantial — review weight distribution |
| > 3.0 | Severe — weighting compensating for structural sample problems |

Effective Sample Size

The equivalent unweighted sample size after accounting for the design effect:

\[ n_{\text{eff}} = \frac{n}{\text{DEFF}} = \frac{(\sum w_i)^2}{\sum w_i^2} \]

| \(n_{\text{eff}} / n\) | Interpretation |
|---|---|
| > 0.80 | Minimal efficiency loss from weighting |
| 0.50–0.80 | Moderate loss — some groups heavily weighted |
| < 0.50 | Severe loss — weighting compensating too aggressively |
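Both quantities follow directly from the weight column, and \(n/\text{DEFF}\) equals \((\sum w_i)^2 / \sum w_i^2\) exactly. A sketch with illustrative weights:

```python
def design_effect(weights):
    """Kish design effect: 1 + CV^2 of the weights (population variance)."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return 1 + var / mean ** 2

def effective_n(weights):
    """Effective sample size: (sum w)^2 / sum w^2 — identical to n / DEFF."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

weights = [0.8, 0.8, 1.0, 1.2, 2.2]
print(design_effect(weights))  # ~1.19
print(effective_n(weights))    # ~4.21 of 5 respondents
```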

Weight Quality Benchmarks

| Metric | Good | Acceptable | Problematic |
|---|---|---|---|
| Weight CV | < 0.5 | 0.5–1.0 | > 1.0 |
| Max/min weight ratio | < 4:1 | 4:1–6:1 | > 6:1 |

Interpreting Design Effect

A design effect of 1.5 means your effective sample size is 67% of your actual sample size. Weights exceeding a 4:1 ratio (max/min) signal structural problems that weighting alone cannot fix — consider adjusting recruitment strategy instead.

Completion Rate

Proportion of surveys reaching "completed" status out of all surveys created for a campaign. Provides a high-level measure of fieldwork success and potential nonresponse bias.


Response Quality Metrics

These metrics evaluate survey responses themselves — independent of sampling strategy.

Per-Question Metrics

Normalized Entropy (Categorical Questions)

Measures answer diversity for radio and dropdown questions. Based on Shannon entropy (Shannon, 1948), normalized to a 0–1 scale:

\[ H_{\text{norm}} = \frac{H}{\log k} = \frac{-\sum_{i=1}^{k} p_i \log p_i}{\log k} \]

Where \(p_i\) is the proportion of responses selecting option \(i\) and \(k\) is the number of available options.

| \(H_{\text{norm}}\) | Interpretation |
|---|---|
| > 0.7 | Good diversity — responses spread across options |
| 0.3–0.7 | Moderate — some options dominate |
| < 0.3 | Low diversity — responses concentrated on few options (ceiling/floor effect) |

A value of 1.0 indicates a perfectly uniform distribution; a value approaching 0.0 means nearly all respondents chose the same option.

Effective Number of Answers

The Hill number of order 1 (Hill, 1973) — how many options carry meaningful weight:

\[ N_{\text{eff}} = \exp(H) = \exp\!\left(-\sum_{i=1}^{k} p_i \log p_i\right) \]

If a question has 5 options but \(N_{\text{eff}} = 2.1\), only about 2 options are receiving substantial responses. More intuitive than raw entropy for identifying underutilized response scales.
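Both diversity measures can be sketched together for one categorical question (illustrative counts, not Askalot's implementation):

```python
import math

def diversity(counts):
    """Shannon entropy over per-option response counts, returned as
    (normalized entropy, effective number of answers)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # zero-count options drop out
    h = -sum(p * math.log(p) for p in probs)
    h_norm = h / math.log(len(counts)) if len(counts) > 1 else 0.0
    return h_norm, math.exp(h)

# 5 options, but responses pile onto the first two:
h_norm, n_eff = diversity([50, 45, 3, 1, 1])
print(n_eff)  # ~2.5 — the scale is effectively used as a 2-option question
```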

Option Usage Ratio

Fraction of available options that received at least one response:

\[ \text{OUR} = \frac{|\{i : c_i > 0\}|}{k} \]

A ratio below 0.5 suggests some options are irrelevant and could be merged or removed.

Item Variance (Numeric Questions)

Response spread for slider and number questions. For weighted (Silver) datasets, uses weighted variance:

\[ \sigma^2_w = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{n} w_i} \]

A variance of zero means all respondents gave the same answer — a strong signal of a poorly calibrated scale or ceiling/floor effect.
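A minimal sketch of the weighted variance (illustrative values; for unweighted datasets, pass all-ones weights and it reduces to the ordinary variance):

```python
def weighted_variance(values, weights):
    """Weighted population variance of numeric responses."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total

print(weighted_variance([7, 7, 7], [1.0, 0.6, 1.4]))  # 0.0 — everyone answered 7
```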

Item Non-Response Rate

Proportion of respondents who skipped the question:

\[ \text{NRR} = \frac{n_{\text{missing}}}{n_{\text{total}}} \]

| Non-Response Rate | Interpretation |
|---|---|
| < 5% | Normal |
| 5–10% | Monitor — may indicate sensitivity or confusion |
| 10–20% | Investigate wording, placement, or sensitivity |
| > 20% | Redesign the question |

Increasing non-response rates toward the end of the survey typically indicate respondent fatigue.

Group Metrics

For question groups and matrix questions with multiple sub-items.

Straightlining Score

Proportion of respondents giving identical answers to all sub-items in a group (Krosnick, 1991):

\[ S = \frac{n_{\text{straightliners}}}{n_{\text{respondents}}} \]

| Straightlining | Interpretation |
|---|---|
| < 5% | Normal — no action needed |
| 5–15% | Investigate — may indicate confusing question design |
| > 15% | Structural problem — redesign the matrix or add attention checks |

Weight Independence

Straightlining is computed from unweighted respondent counts, since weighting adjusts demographic balance but does not affect individual response patterns.
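The check itself is simple; a sketch taking one answer list per respondent (illustrative data):

```python
def straightlining_score(group_responses):
    """Share of respondents giving the identical answer to every sub-item
    in a group. Unweighted by design — weights don't change response patterns."""
    flagged = sum(1 for answers in group_responses if len(set(answers)) == 1)
    return flagged / len(group_responses)

# 4 respondents x 5 matrix sub-items; the first respondent straightlines.
print(straightlining_score([[3, 3, 3, 3, 3],
                            [1, 2, 4, 3, 5],
                            [2, 2, 3, 2, 4],
                            [5, 4, 5, 4, 3]]))  # 0.25
```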

Cronbach's Alpha (Internal Consistency)

Measures whether sub-items in a group measure the same underlying construct (Cronbach, 1951). Requires 3 or more sub-items:

\[ \alpha = \frac{k}{k-1} \left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_{\text{total}}}\right) \]

Where \(k\) is the number of sub-items, \(\sigma^2_i\) is the variance of sub-item \(i\), and \(\sigma^2_{\text{total}}\) is the variance of the total score.

| \(\alpha\) | Interpretation |
|---|---|
| > 0.8 | Good internal consistency |
| 0.6–0.8 | Acceptable |
| < 0.6 | Poor — sub-items may not measure the same construct |
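A sketch of the computation, taking one score list per sub-item over the same respondents (illustrative; uses population variances throughout):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for k >= 3 sub-items; items[i] holds sub-item i's
    scores for every respondent, in the same respondent order."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total score
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three perfectly aligned sub-items yield alpha = 1.0.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # 1.0
```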

Dataset-Level Aggregates

| Metric | Formula | Concern Threshold |
|---|---|---|
| Mean Normalized Entropy | Average \(H_{\text{norm}}\) across categorical questions | < 0.3 |
| Acquiescence Bias Index | Proportion of "agree" responses across Likert items | > 0.6 |
| Overall Non-Response Rate | Mean skip rate across all questions | > 0.15 |

Acquiescence Bias Index

Detects systematic tendency to agree with statements regardless of content. Computed only for Likert-type agree/disagree scales:

\[ \text{ABI} = \frac{\sum \text{agree responses}}{\sum \text{total Likert responses}} \]

A value of 0.5 indicates no bias (balanced agreement/disagreement). Values above 0.6 suggest respondents are agreeing regardless of question content — consider adding reverse-coded items.

Weight Awareness

For Silver (weighted) datasets, entropy, variance, and acquiescence metrics automatically use the weight column. Straightlining and speeder detection use unweighted respondent counts since weighting doesn't affect response patterns.

Speeder Detection

Identifies respondents who completed the survey unusually fast, suggesting inattentive or automated responding (Kunz & Hadler, 2020):

\[ \text{threshold} = \frac{\text{median completion time}}{3} \]

Respondents completing below the threshold are flagged as speeders.

Minimum Sample Requirement

Speeder detection requires at least 3 completed surveys to compute a meaningful median. With fewer surveys, the check is skipped.
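A sketch of both rules together (times in seconds, illustrative):

```python
import statistics

def flag_speeders(completion_seconds):
    """Flag completions faster than one third of the median time.
    Returns all-False when fewer than 3 completed surveys exist."""
    if len(completion_seconds) < 3:
        return [False] * len(completion_seconds)
    threshold = statistics.median(completion_seconds) / 3
    return [t < threshold for t in completion_seconds]

print(flag_speeders([300, 320, 280, 80, 310]))
# [False, False, False, True, False] — 80 s is under the 100 s threshold
```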

Multi-Flag Aggregation (Baseball Rule)

No single quality check should be used alone to exclude respondents. The baseball rule (CloudResearch, 2024) aggregates multiple independent quality flags per respondent and recommends exclusion only when 3 or more flags co-occur:

| Flag | Trigger |
|---|---|
| Speeder | Completion time below ⅓ of median |
| Straightliner | Identical responses across all sub-items in any question group |
| High non-response | More than 30% of questions skipped |

This multi-indicator approach reduces false positives — a respondent who happens to be fast but answers thoughtfully is not penalized.
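The aggregation reduces to counting flags per respondent; a sketch (flag names are illustrative, threshold of 3 per the baseball rule):

```python
def should_exclude(flags, min_flags=3):
    """Recommend exclusion only when several independent quality flags
    co-occur for the same respondent (the baseball rule)."""
    return sum(1 for f in flags.values() if f) >= min_flags

fast_but_thoughtful = {"speeder": True, "straightliner": False, "high_nonresponse": False}
print(should_exclude(fast_but_thoughtful))  # False — a single flag never excludes
```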


Interpreting Results

Common Patterns and Actions

| Observation | Possible Cause | Action |
|---|---|---|
| Low entropy (< 0.3) on a question | Ceiling/floor effect or poorly balanced options | Review option wording; consider splitting the dominant category |
| High non-response (> 15%) on a question | Confusing or sensitive question | Simplify wording or make optionality explicit |
| High straightlining (> 15%) in a group | Inattentive responding or undifferentiated sub-items | Add attention checks; vary scale anchors; shorten the battery |
| Low Cronbach's alpha (< 0.5) | Sub-items don't measure the same construct | Review item selection; consider removing weak items |
| High acquiescence (> 60%) | Yea-saying bias | Add reverse-coded items; vary scale direction |
| Low option usage (< 50%) | Unused options are irrelevant | Merge or remove rarely chosen options |
| High RMSE (> 0.05) | Systematic sampling bias | Adjust quotas; diversify recruitment channels |
| Large design effect (> 2.0) | Extreme weight adjustments | Improve sample balance before relying on weighting |
| Speeders flagged (> 5%) | Inattentive or incentive-driven responding | Review incentive structure; add attention checks |
| Multi-flag respondents | Multiple quality issues co-occur | Exclude from analysis; document exclusion criteria |

Two Dimensions, One Picture

Sample representativeness and response quality are independent dimensions. A dataset can have excellent representativeness but poor response quality, or vice versa.

| Representativeness | Response Quality | Assessment |
|---|---|---|
| Good | Good | Data is fit for analysis |
| Good | Poor | Sample matches targets but responses are unreliable — investigate measurement issues |
| Poor | Good | Responses are genuine but the sample is biased — apply weighting or adjust recruitment |
| Poor | Poor | Both structural and measurement problems — address recruitment and questionnaire design |

Methodological Foundation

The quality analysis framework is grounded in established survey research methodology:

| Framework | Application |
|---|---|
| Total Survey Error (Groves et al., 2009) | Organizes metrics by error source: coverage, sampling, nonresponse, measurement |
| AAPOR Standards | Response rate calculations, disposition codes, quality reporting |
| Kish (1965) | Design effects, effective sample size under complex designs |
| Krosnick (1991) | Satisficing theory — straightlining and acquiescence as response quality indicators |
| Shannon (1948) | Information entropy as a measure of response diversity |
| Cronbach (1951) | Internal consistency reliability for multi-item scales |
| Hill (1973) | Effective number of species (answers) — Hill numbers for diversity measurement |
| Kalton & Flores-Cervantes (2003) | Weighting methods — raking, calibration, weight trimming |
| Kunz & Hadler (2020) | Web paradata — response time analysis, speeder detection thresholds |
| CloudResearch (2024) | Multi-flag quality aggregation — the baseball rule for respondent exclusion |