Quality Metrics Reference¶
Askalot evaluates survey data quality across two complementary dimensions grounded in the Total Survey Error framework (Groves et al., 2009):
- Sample Representativeness — Do respondent demographics match the target population? Quantifies coverage and nonresponse error.
- Response Quality — Are survey questions producing informative, diverse answers? Detects measurement error.
Both dimensions must be acceptable for data to be fit for analysis.
Sample Representativeness Metrics¶
These metrics compare the actual demographic distribution of survey respondents against target distributions defined in a sampling strategy.
RMSE (Root Mean Square Error)¶
Overall deviation between sample and target distributions across all demographic categories.
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (a_i - t_i)^2}
$$

Where \(a_i\) is the actual proportion, \(t_i\) is the target proportion for category \(i\), and \(n\) is the number of categories.
RMSE penalizes large deviations more than many small ones due to the squared term. A single demographic category that is 10 percentage points off-target contributes more to RMSE than five categories each off by 2 points.
| RMSE | Interpretation |
|---|---|
| < 0.02 | Excellent — sample closely matches targets |
| 0.02–0.05 | Acceptable — minor deviations, weighting should correct |
| 0.05–0.10 | Concerning — significant deviations, investigate causes |
| > 0.10 | Problematic — structural sampling issues, weighting may not suffice |
MAE (Mean Absolute Error)¶
Average per-category deviation. More robust to outlier categories than RMSE:

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |a_i - t_i|
$$
When MAE is much smaller than RMSE, one or two categories are severely off while the rest are close to target. When MAE ≈ RMSE, deviations are of similar magnitude across all categories.
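As a sketch, both metrics (plus max deviation, described below) can be computed directly from the two proportion tables. Category names and values here are illustrative, not Askalot output:

```python
# Representativeness metrics from actual vs. target proportions.
# Categories and proportions are illustrative.
import math

actual = {"18-34": 0.22, "35-54": 0.40, "55+": 0.38}
target = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

deviations = [actual[c] - target[c] for c in target]
rmse = math.sqrt(sum(d * d for d in deviations) / len(deviations))
mae = sum(abs(d) for d in deviations) / len(deviations)
max_dev = max(abs(d) for d in deviations)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  max deviation={max_dev:.3f}")
```

Here two categories are 8 points off in opposite directions, so RMSE (≈0.065) lands in the "concerning" band even though MAE (≈0.053) looks milder.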
Chi-Square Test¶
Tests whether observed deviations from targets are statistically significant versus random sampling noise.
$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
$$

Where \(O_i\) is the observed count and \(E_i\) is the expected count based on target proportions.
Sample Size Sensitivity
With large samples (N > 500), chi-square will flag almost any deviation as statistically significant, even if the practical magnitude is trivial. Use RMSE/MAE to assess practical significance, and chi-square for statistical significance. A significant chi-square with RMSE < 0.02 may not be actionable.
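A minimal stdlib-only sketch of the goodness-of-fit statistic, with illustrative counts. For a p-value, compare against a critical value table or use `scipy.stats.chisquare`:

```python
# Chi-square goodness-of-fit against target proportions (illustrative counts).
observed = {"male": 230, "female": 270}   # observed respondent counts
targets = {"male": 0.5, "female": 0.5}    # target proportions
n = sum(observed.values())

chi2 = sum(
    (observed[c] - n * targets[c]) ** 2 / (n * targets[c])
    for c in targets
)
df = len(targets) - 1   # degrees of freedom

# 3.841 is the critical value at alpha = 0.05 for df = 1
print(f"chi2={chi2:.2f}, df={df}, significant={chi2 > 3.841}")
```

With N = 500 and a 4-point deviation, the statistic (3.2) stays below the 5% critical value, illustrating the note above: always read chi-square alongside RMSE/MAE.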
Max Deviation¶
The largest single-category gap between actual and target proportions. Identifies the most problematic demographic category.
| Max Deviation | Interpretation |
|---|---|
| < 0.05 | All categories within acceptable range |
| 0.05–0.10 | One category moderately off — likely correctable by weighting |
| > 0.10 | Structural gap — at least one group is substantially misrepresented |
Composite Quality Score¶
The overall quality score (0–1) combines RMSE across all factors:
| Score | Rating | Interpretation |
|---|---|---|
| > 0.8 | Good | Sample adequately represents target population |
| 0.6–0.8 | Acceptable | Some factors off-target, weighting recommended |
| 0.4–0.6 | Poor | Significant representativeness issues, investigate root causes |
| < 0.4 | Critical | Sample does not represent target population |
Per-Factor Review
A high overall score can mask one poorly represented factor. Always review the per-factor breakdown — the factor with the lowest score often drives the most bias in survey estimates.
Weighting Diagnostics¶
After raking (iterative proportional fitting) produces a Silver dataset, these metrics evaluate whether weighting improved representativeness without introducing excessive variance.
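For intuition, raking can be sketched in a few lines: alternately rescale the weights so that each factor's weighted margins match its targets. The records and targets below are illustrative; a production implementation adds per-cell convergence checks and weight trimming:

```python
# Minimal raking (iterative proportional fitting) over two factors.
# Records and targets are illustrative, not Askalot's data model.
records = [
    {"gender": "m", "age": "young"}, {"gender": "m", "age": "old"},
    {"gender": "f", "age": "young"}, {"gender": "f", "age": "young"},
    {"gender": "f", "age": "old"},
]
targets = {
    "gender": {"m": 0.5, "f": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = [1.0] * len(records)

for _ in range(50):                       # fixed iteration cap
    for factor, marginals in targets.items():
        # current weighted total per category of this factor
        totals = {}
        for r, w in zip(records, weights):
            totals[r[factor]] = totals.get(r[factor], 0.0) + w
        grand = sum(totals.values())
        # rescale so this factor's weighted margins hit the targets
        for i, r in enumerate(records):
            cat = r[factor]
            weights[i] *= marginals[cat] * grand / totals[cat]

shares = {}
for r, w in zip(records, weights):
    shares[r["gender"]] = shares.get(r["gender"], 0.0) + w
print({k: round(v / sum(weights), 3) for k, v in shares.items()})
```

Each pass preserves the total weight while pulling one factor's margins onto target; alternating passes converge when every cell of the cross-classification is populated.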
Design Effect (DEFF)¶
Variance inflation from unequal weighting (Kish, 1965):
$$
\text{DEFF} = 1 + \text{CV}_w^2
$$

Where \(\text{CV}_w\) is the coefficient of variation of the weights.
| DEFF | Interpretation |
|---|---|
| < 1.5 | Minimal impact on precision |
| 1.5–2.0 | Moderate precision loss — acceptable for most studies |
| 2.0–3.0 | Substantial — review weight distribution |
| > 3.0 | Severe — weighting compensating for structural sample problems |
Effective Sample Size¶
The equivalent unweighted sample size after accounting for the design effect:

$$
n_{\text{eff}} = \frac{n}{\text{DEFF}}
$$
| \(n_{\text{eff}} / n\) | Interpretation |
|---|---|
| > 0.80 | Minimal efficiency loss from weighting |
| 0.50–0.80 | Moderate loss — some groups heavily weighted |
| < 0.50 | Severe loss — weighting compensating too aggressively |
Weight Quality Benchmarks¶
| Metric | Good | Acceptable | Problematic |
|---|---|---|---|
| Weight CV | < 0.5 | 0.5–1.0 | > 1.0 |
| Max/min weight ratio | < 4:1 | 4:1–6:1 | > 6:1 |
Interpreting Design Effect
A design effect of 1.5 means your effective sample size is 67% of your actual sample size. Weights exceeding a 4:1 ratio (max/min) signal structural problems that weighting alone cannot fix — consider adjusting recruitment strategy instead.
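All of the diagnostics above can be computed from the weight vector alone; this sketch uses illustrative weights:

```python
# Weighting diagnostics from a vector of raking weights (illustrative values).
import statistics

weights = [0.6, 0.8, 1.0, 1.0, 1.2, 1.4, 2.0]
n = len(weights)

mean_w = statistics.fmean(weights)
cv_w = statistics.pstdev(weights) / mean_w   # coefficient of variation
deff = 1 + cv_w ** 2                         # Kish's approximation
n_eff = n / deff                             # effective sample size
ratio = max(weights) / min(weights)          # max/min weight ratio

print(f"CV={cv_w:.2f}  DEFF={deff:.2f}  n_eff={n_eff:.1f}  max/min={ratio:.1f}:1")
```

These weights give DEFF ≈ 1.14 and a max/min ratio of about 3.3:1, comfortably inside the "good" benchmarks above.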
Completion Rate¶
Proportion of surveys reaching "completed" status out of all surveys created for a campaign. Provides a high-level measure of fieldwork success and potential nonresponse bias.
Response Quality Metrics¶
These metrics evaluate survey responses themselves — independent of sampling strategy.
Per-Question Metrics¶
Normalized Entropy (Categorical Questions)¶
Measures answer diversity for radio and dropdown questions. Based on Shannon entropy (Shannon, 1948), normalized to a 0–1 scale:
$$
H_{\text{norm}} = \frac{-\sum_{i=1}^{k} p_i \ln p_i}{\ln k}
$$

Where \(p_i\) is the proportion of responses selecting option \(i\) and \(k\) is the number of available options.
| \(H_{\text{norm}}\) | Interpretation |
|---|---|
| > 0.7 | Good diversity — responses spread across options |
| 0.3–0.7 | Moderate — some options dominate |
| < 0.3 | Low diversity — responses concentrated on few options (ceiling/floor effect) |
A value of 1.0 means perfectly uniform distribution. A value approaching 0.0 means nearly all respondents chose the same option.
Effective Number of Answers¶
The Hill number of order 1 (Hill, 1973) — how many options carry meaningful weight:

$$
N_{\text{eff}} = e^{H} = \exp\left(-\sum_{i=1}^{k} p_i \ln p_i\right)
$$
If a question has 5 options but \(N_{\text{eff}} = 2.1\), only about 2 options are receiving substantial responses. More intuitive than raw entropy for identifying underutilized response scales.
Option Usage Ratio¶
Fraction of available options that received at least one response:

$$
\text{usage} = \frac{\#\{\, i : n_i > 0 \,\}}{k}
$$

Where \(n_i\) is the response count for option \(i\).
A ratio below 0.5 suggests some options are irrelevant and could be merged or removed.
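All three diversity metrics for a single categorical question can be sketched together. Option labels and counts are illustrative; terms with zero responses are dropped, following the \(0 \ln 0 = 0\) convention:

```python
# Diversity metrics for one categorical question (illustrative counts).
import math

counts = {"A": 50, "B": 30, "C": 20, "D": 0, "E": 0}   # responses per option
k = len(counts)
total = sum(counts.values())
probs = [c / total for c in counts.values() if c > 0]  # drop zero-count options

entropy = -sum(p * math.log(p) for p in probs)   # Shannon entropy, in nats
h_norm = entropy / math.log(k)                   # normalized to [0, 1]
n_eff = math.exp(entropy)                        # Hill number of order 1
usage = sum(1 for c in counts.values() if c > 0) / k

print(f"H_norm={h_norm:.2f}  N_eff={n_eff:.2f}  usage={usage:.1f}")
```

With responses concentrated on 3 of 5 options, \(N_{\text{eff}} \approx 2.8\) and usage is 0.6: two options are dead weight and candidates for merging.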
Item Variance (Numeric Questions)¶
Response spread for slider and number questions. For weighted (Silver) datasets, uses weighted variance:

$$
\sigma^2_w = \frac{\sum_i w_i \,(x_i - \bar{x}_w)^2}{\sum_i w_i}, \qquad \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}
$$

Where \(w_i\) is the weight and \(x_i\) the numeric response of respondent \(i\).
A variance of zero means all respondents gave the same answer — a strong signal of a poorly calibrated scale or ceiling/floor effect.
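A minimal sketch of the weighted mean and variance, with illustrative values and weights:

```python
# Weighted variance for a numeric question (illustrative values and weights).
values = [3.0, 5.0, 7.0, 4.0]     # numeric responses
weights = [0.8, 1.2, 1.0, 1.0]    # raking weights from the Silver dataset

w_sum = sum(weights)
w_mean = sum(w * x for w, x in zip(weights, values)) / w_sum
w_var = sum(w * (x - w_mean) ** 2 for w, x in zip(weights, values)) / w_sum

print(f"weighted mean={w_mean:.3f}, weighted variance={w_var:.4f}")
```

Setting all weights to 1.0 reduces this to the ordinary (Bronze) mean and variance.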
Item Non-Response Rate¶
Proportion of respondents who skipped the question:
| Non-Response Rate | Interpretation |
|---|---|
| < 5% | Normal |
| 5–10% | Monitor — may indicate sensitivity or confusion |
| 10–20% | Investigate wording, placement, or sensitivity |
| > 20% | Redesign the question |
Increasing non-response rates toward the end of the survey typically indicate respondent fatigue.
Group Metrics¶
For question groups and matrix questions with multiple sub-items.
Straightlining Score¶
Proportion of respondents giving identical answers to all sub-items in a group (Krosnick, 1991):
| Straightlining | Interpretation |
|---|---|
| < 5% | Normal — no action needed |
| 5–15% | Investigate — may indicate confusing question design |
| > 15% | Structural problem — redesign the matrix or add attention checks |
Weight Independence
Straightlining is computed from unweighted respondent counts, since weighting adjusts demographic balance but does not affect individual response patterns.
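A sketch of the straightlining computation, treating each respondent's sub-item answers as one row (values are illustrative):

```python
# Straightlining score for one matrix question (illustrative responses).
# Each row holds one respondent's answers to the group's sub-items.
responses = [
    [4, 4, 4, 4],   # straightliner: identical answer on every sub-item
    [5, 3, 4, 2],
    [1, 1, 1, 1],   # straightliner
    [2, 3, 2, 4],
    [3, 4, 3, 3],
]

straightliners = sum(1 for row in responses if len(set(row)) == 1)
score = straightliners / len(responses)   # unweighted, per the note above

print(f"straightlining={score:.0%}")
```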
Cronbach's Alpha (Internal Consistency)¶
Measures whether sub-items in a group measure the same underlying construct (Cronbach, 1951). Requires 3 or more sub-items:
$$
\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_{\text{total}}} \right)
$$

Where \(k\) is the number of sub-items, \(\sigma^2_i\) is the variance of sub-item \(i\), and \(\sigma^2_{\text{total}}\) is the variance of the total score.
| \(\alpha\) | Interpretation |
|---|---|
| > 0.8 | Good internal consistency |
| 0.6–0.8 | Acceptable |
| < 0.6 | Poor — sub-items may not measure the same construct |
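Alpha can be sketched from per-item and total-score variances; the scores below are illustrative:

```python
# Cronbach's alpha for a 3-item group (illustrative scores).
import statistics

items = [
    [4, 5, 3, 4, 2],   # sub-item 1: one score per respondent
    [3, 5, 2, 4, 1],   # sub-item 2
    [4, 4, 3, 5, 2],   # sub-item 3
]
k = len(items)

item_vars = [statistics.pvariance(item) for item in items]
totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
total_var = statistics.pvariance(totals)

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha={alpha:.2f}")
```

Because these three items rise and fall together across respondents, the total-score variance dwarfs the sum of item variances and alpha comes out high (≈0.94).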
Dataset-Level Aggregates¶
| Metric | Formula | Concern Threshold |
|---|---|---|
| Mean Normalized Entropy | Average \(H_{\text{norm}}\) across categorical questions | < 0.3 |
| Acquiescence Bias Index | Proportion of "agree" responses across Likert items | > 0.6 |
| Overall Non-Response Rate | Mean skip rate across all questions | > 0.15 |
Acquiescence Bias Index¶
Detects a systematic tendency to agree with statements regardless of content. Computed only for Likert-type agree/disagree scales.
A value of 0.5 indicates no bias (balanced agreement/disagreement). Values above 0.6 suggest respondents are agreeing regardless of question content — consider adding reverse-coded items.
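A minimal sketch, assuming a 5-point scale where responses of 4 and 5 count as "agree" (the cutoff is an assumption of this sketch, not a documented Askalot default):

```python
# Acquiescence bias index over pooled Likert responses (illustrative values).
# Assumption: 5-point scale, 4 and 5 counted as "agree".
likert = [5, 4, 4, 2, 5, 4, 3, 5, 4, 4]   # responses pooled across items

agree = sum(1 for r in likert if r >= 4)
abi = agree / len(likert)

print(f"acquiescence index={abi:.2f}")   # above 0.6 suggests yea-saying
```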
Weight Awareness
For Silver (weighted) datasets, entropy, variance, and acquiescence metrics automatically use the weight column. Straightlining and speeder detection use unweighted respondent counts since weighting doesn't affect response patterns.
Speeder Detection¶
Identifies respondents who completed the survey unusually fast, suggesting inattentive or automated responding (Kunz & Hadler, 2020):
$$
t_r < \tfrac{1}{3} \cdot \operatorname{median}(t)
$$

Respondents whose completion time \(t_r\) falls below one-third of the median completion time are flagged as speeders.
Minimum Sample Requirement
Speeder detection requires at least 3 completed surveys to compute a meaningful median. With fewer surveys, the check is skipped.
Multi-Flag Aggregation (Baseball Rule)¶
No single quality check should be used alone to exclude respondents. The baseball rule (CloudResearch, 2024) aggregates multiple independent quality flags per respondent and recommends exclusion only when 3 or more flags co-occur:
| Flag | Trigger |
|---|---|
| Speeder | Completion time below ⅓ of median |
| Straightliner | Identical responses across all sub-items in any question group |
| High non-response | More than 30% of questions skipped |
This multi-indicator approach reduces false positives — a respondent who happens to be fast but answers thoughtfully is not penalized.
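The three flags can be aggregated per respondent as sketched below. Field names and values are illustrative, not Askalot's data model:

```python
# Baseball-rule aggregation of three quality flags (illustrative records).
import statistics

respondents = [
    {"id": "r1", "seconds": 300, "straightlined": False, "skip_rate": 0.05},
    {"id": "r2", "seconds": 80,  "straightlined": True,  "skip_rate": 0.40},
    {"id": "r3", "seconds": 290, "straightlined": False, "skip_rate": 0.10},
    {"id": "r4", "seconds": 310, "straightlined": True,  "skip_rate": 0.00},
]

median_time = statistics.median(r["seconds"] for r in respondents)
threshold = median_time / 3   # speeder cutoff: one-third of the median

excluded = []
for r in respondents:
    flags = sum([
        r["seconds"] < threshold,   # speeder
        r["straightlined"],         # straightliner
        r["skip_rate"] > 0.30,      # high non-response
    ])
    if flags >= 3:                  # baseball rule: three strikes
        excluded.append(r["id"])

print(f"excluded: {excluded}")
```

Only `r2`, which trips all three flags, is excluded; `r4` straightlined once but is otherwise fine, so it survives.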
Interpreting Results¶
Common Patterns and Actions¶
| Observation | Possible Cause | Action |
|---|---|---|
| Low entropy (< 30%) on a question | Ceiling/floor effect or poorly balanced options | Review option wording; consider splitting dominant category |
| High non-response (> 15%) on a question | Confusing or sensitive question | Simplify wording or make optionality explicit |
| High straightlining (> 15%) in a group | Inattentive responding or undifferentiated sub-items | Add attention checks; vary scale anchors; shorten battery |
| Low Cronbach's alpha (< 0.6) | Sub-items don't measure the same construct | Review item selection; consider removing weak items |
| High acquiescence (> 60%) | Yea-saying bias | Add reverse-coded items; vary scale direction |
| Low option usage (< 50%) | Unused options are irrelevant | Merge or remove rarely-chosen options |
| High RMSE (> 0.05) | Systematic sampling bias | Adjust quotas, diversify recruitment channels |
| Large design effect (> 2.0) | Extreme weight adjustments | Improve sample balance before relying on weighting |
| Speeders flagged (> 5%) | Inattentive or incentive-driven responding | Review incentive structure; add attention checks |
| Multi-flag respondents | Multiple quality issues co-occur | Exclude from analysis; document exclusion criteria |
Two Dimensions, One Picture¶
Sample representativeness and response quality are independent dimensions. A dataset can have excellent representativeness but poor response quality, or vice versa.
| Representativeness | Response Quality | Assessment |
|---|---|---|
| Good | Good | Data is fit for analysis |
| Good | Poor | Sample matches targets but responses are unreliable — investigate measurement issues |
| Poor | Good | Responses are genuine but sample is biased — apply weighting or adjust recruitment |
| Poor | Poor | Both structural and measurement problems — address recruitment and questionnaire design |
Methodological Foundation¶
The quality analysis framework is grounded in established survey research methodology:
| Framework | Application |
|---|---|
| Total Survey Error (Groves et al., 2009) | Organizes metrics by error source: coverage, sampling, nonresponse, measurement |
| AAPOR Standards | Response rate calculations, disposition codes, quality reporting |
| Kish (1965) | Design effects, effective sample size under complex designs |
| Krosnick (1991) | Satisficing theory — straightlining, acquiescence as response quality indicators |
| Shannon (1948) | Information entropy as a measure of response diversity |
| Cronbach (1951) | Internal consistency reliability for multi-item scales |
| Hill (1973) | Effective number of species (answers) — Hill numbers for diversity measurement |
| Kalton & Flores-Cervantes (2003) | Weighting methods — raking, calibration, weight trimming |
| Kunz & Hadler (2020) | Web paradata — response time analysis, speeder detection thresholds |
| CloudResearch (2024) | Multi-flag quality aggregation — baseball rule for respondent exclusion |
Related Documentation¶
- Data Analysis Guide — Bronze/Silver/Gold pipeline, weighting, export
- AI-Assisted Result Analysis — AI analyst agent for automated quality reports
- MCP Dataset Tools — Programmatic access to quality metrics
- Campaign Management — Sampling strategies and respondent pools