Quality Metrics Reference

Askalot evaluates survey data quality across two complementary dimensions grounded in the Total Survey Error framework (Groves et al., 2009):

  • Sample Representativeness — Do respondent demographics match the target population? Quantifies coverage and nonresponse error.
  • Response Quality — Are survey questions producing informative, diverse answers? Detects measurement error.

Both dimensions must be acceptable for data to be fit for analysis.


Sample Representativeness Metrics

These metrics compare the actual demographic distribution of survey respondents against target distributions defined in a sampling strategy.

RMSE (Root Mean Square Error)

Overall deviation between sample and target distributions across all demographic categories.

\[ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (a_i - t_i)^2} \]

Where \(a_i\) is the actual proportion and \(t_i\) is the target proportion for category \(i\).

RMSE penalizes large deviations more than many small ones due to the squared term. A single demographic category that is 10 percentage points off-target contributes more to RMSE than five categories each off by 2 points.

| RMSE | Interpretation |
|---|---|
| < 0.02 | Excellent — sample closely matches targets |
| 0.02–0.05 | Acceptable — minor deviations, weighting should correct |
| 0.05–0.10 | Concerning — significant deviations, investigate causes |
| > 0.10 | Problematic — structural sampling issues, weighting may not suffice |

MAE (Mean Absolute Error)

Average per-category deviation. More robust to outlier categories than RMSE.

\[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |a_i - t_i| \]

When MAE is much smaller than RMSE, one or two categories are severely off while others are fine. When MAE ≈ RMSE, deviations are uniformly distributed.
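As a sketch of both metrics in plain Python (illustrative proportions, not Askalot's implementation) — note how one 10-point deviation pushes RMSE above MAE:

```python
import math

def rmse(actual, target):
    """Root mean square deviation between sample and target proportions."""
    n = len(actual)
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(actual, target)) / n)

def mae(actual, target):
    """Mean absolute deviation between sample and target proportions."""
    return sum(abs(a - t) for a, t in zip(actual, target)) / len(actual)

# One category is 10 points off-target; the others nearly match.
actual = [0.60, 0.200, 0.100, 0.10]
target = [0.50, 0.225, 0.125, 0.15]

print(rmse(actual, target))  # ~0.0586 — pulled up by the one large gap
print(mae(actual, target))   # 0.05
```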

Chi-Square Test

Tests whether observed deviations from targets are statistically significant versus random sampling noise.

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

Where \(O_i\) is the observed count and \(E_i\) is the expected count based on target proportions.

Sample Size Sensitivity

With large samples (N > 500), chi-square will flag almost any deviation as statistically significant, even if the practical magnitude is trivial. Use RMSE/MAE to assess practical significance, and chi-square for statistical significance. A significant chi-square with RMSE < 0.02 may not be actionable.
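A stdlib-Python sketch of that sensitivity (illustrative counts; compare the statistic to the critical value for k − 1 degrees of freedom, e.g. 3.84 at df = 1, α = 0.05):

```python
def chi_square(observed_counts, target_props):
    """Chi-square statistic: observed counts vs. counts expected
    under the target proportions."""
    n = sum(observed_counts)
    expected = [p * n for p in target_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))

# The same 2-point deviation (52% observed vs. a 50% target):
print(chi_square([520, 480], [0.5, 0.5]))    # 1.6  — not significant at df=1
print(chi_square([5200, 4800], [0.5, 0.5]))  # 16.0 — highly significant
```

The deviation's practical magnitude is identical in both cases (RMSE = 0.02); only the sample size changed.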

Max Deviation

The largest single-category gap between actual and target proportions. Identifies the most problematic demographic category.

| Max Deviation | Interpretation |
|---|---|
| < 0.05 | All categories within acceptable range |
| 0.05–0.10 | One category moderately off — likely correctable by weighting |
| > 0.10 | Structural gap — at least one group is substantially misrepresented |

Composite Quality Score

The overall quality score (0–1) combines RMSE across all factors:

| Score | Rating | Interpretation |
|---|---|---|
| > 0.8 | Good | Sample adequately represents target population |
| 0.6–0.8 | Acceptable | Some factors off-target, weighting recommended |
| 0.4–0.6 | Poor | Significant representativeness issues, investigate root causes |
| < 0.4 | Critical | Sample does not represent target population |

Per-Factor Review

A high overall score can mask one poorly represented factor. Always review the per-factor breakdown — the factor with the lowest score often drives the most bias in survey estimates.


Weighting Diagnostics

After raking (iterative proportional fitting) produces a Silver dataset, these metrics evaluate whether weighting improved representativeness without introducing excessive variance.
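A compact sketch of the raking loop itself (illustrative Python with made-up factor names, not Askalot's implementation; assumes every target category is present in the sample):

```python
def rake(rows, targets, iterations=25):
    """Iterative proportional fitting: repeatedly rescale weights so the
    weighted marginals of each factor match its target proportions."""
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for factor, props in targets.items():
            total = sum(weights)
            # Current weighted proportion of each category of this factor.
            current = {cat: 0.0 for cat in props}
            for w, row in zip(weights, rows):
                current[row[factor]] += w / total
            # Scale each respondent's weight by target/current for their category.
            for i, row in enumerate(rows):
                weights[i] *= props[row[factor]] / current[row[factor]]
    return weights

rows = [
    {"sex": "f", "age": "young"}, {"sex": "f", "age": "old"},
    {"sex": "m", "age": "young"}, {"sex": "m", "age": "old"},
    {"sex": "m", "age": "old"},
]
targets = {"sex": {"f": 0.5, "m": 0.5}, "age": {"young": 0.4, "old": 0.6}}
weights = rake(rows, targets)  # weighted marginals now match both factors
```

Each pass fixes one factor's marginal and slightly disturbs the others; the loop converges when all targets are simultaneously satisfiable.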

Design Effect (DEFF)

Variance inflation from unequal weighting (Kish, 1965):

\[ \text{DEFF} = 1 + \text{CV}^2_w \]

Where \(\text{CV}_w\) is the coefficient of variation of the weights.

| DEFF | Interpretation |
|---|---|
| < 1.5 | Minimal impact on precision |
| 1.5–2.0 | Moderate precision loss — acceptable for most studies |
| 2.0–3.0 | Substantial — review weight distribution |
| > 3.0 | Severe — weighting compensating for structural sample problems |

Effective Sample Size

The equivalent unweighted sample size after accounting for the design effect:

\[ n_{\text{eff}} = \frac{n}{\text{DEFF}} = \frac{(\sum w_i)^2}{\sum w_i^2} \]

| \(n_{\text{eff}} / n\) | Interpretation |
|---|---|
| > 0.80 | Minimal efficiency loss from weighting |
| 0.50–0.80 | Moderate loss — some groups heavily weighted |
| < 0.50 | Severe loss — weighting compensating too aggressively |
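Both quantities follow directly from the weight column, and \(n/\text{DEFF}\) equals \((\sum w_i)^2 / \sum w_i^2\) exactly. A sketch with illustrative weights:

```python
def design_effect(weights):
    """Kish design effect: 1 + CV^2 of the weights (population variance)."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return 1 + var / mean ** 2

def effective_n(weights):
    """Effective sample size: (sum w)^2 / sum w^2 — identical to n / DEFF."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

weights = [0.8, 0.8, 1.0, 1.2, 2.2]
print(design_effect(weights))  # ~1.19
print(effective_n(weights))    # ~4.21 of 5 respondents
```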

Weight Quality Benchmarks

| Metric | Good | Acceptable | Problematic |
|---|---|---|---|
| Weight CV | < 0.5 | 0.5–1.0 | > 1.0 |
| Max/min weight ratio | < 4:1 | 4:1–6:1 | > 6:1 |

Interpreting Design Effect

A design effect of 1.5 means your effective sample size is 67% of your actual sample size. Weights exceeding a 4:1 ratio (max/min) signal structural problems that weighting alone cannot fix — consider adjusting recruitment strategy instead.

Completion Rate

Proportion of surveys reaching "completed" status out of all surveys created for a campaign. Provides a high-level measure of fieldwork success and potential nonresponse bias.


Response Quality Metrics

These metrics evaluate survey responses themselves — independent of sampling strategy.

Per-Question Metrics

Normalized Entropy (Categorical Questions)

Measures answer diversity for radio and dropdown questions. Based on Shannon entropy (Shannon, 1948), normalized to a 0–1 scale:

\[ H_{\text{norm}} = \frac{H}{\log k} = \frac{-\sum_{i=1}^{k} p_i \log p_i}{\log k} \]

Where \(p_i\) is the proportion of responses selecting option \(i\) and \(k\) is the number of available options.

| \(H_{\text{norm}}\) | Interpretation |
|---|---|
| > 0.7 | Good diversity — responses spread across options |
| 0.3–0.7 | Moderate — some options dominate |
| < 0.3 | Low diversity — responses concentrated on few options (ceiling/floor effect) |

A value of 1.0 indicates a perfectly uniform distribution; a value approaching 0.0 means nearly all respondents chose the same option.

Effective Number of Answers

The Hill number of order 1 (Hill, 1973) — how many options carry meaningful weight:

\[ N_{\text{eff}} = \exp(H) = \exp\!\left(-\sum_{i=1}^{k} p_i \log p_i\right) \]

If a question has 5 options but \(N_{\text{eff}} = 2.1\), only about 2 options are receiving substantial responses. More intuitive than raw entropy for identifying underutilized response scales.
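Both diversity measures can be sketched together for one categorical question (illustrative counts, not Askalot's implementation):

```python
import math

def diversity(counts):
    """Shannon entropy over per-option response counts, returned as
    (normalized entropy, effective number of answers)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # zero-count options drop out
    h = -sum(p * math.log(p) for p in probs)
    h_norm = h / math.log(len(counts)) if len(counts) > 1 else 0.0
    return h_norm, math.exp(h)

# 5 options, but responses pile onto the first two:
h_norm, n_eff = diversity([50, 45, 3, 1, 1])
print(n_eff)  # ~2.5 — the scale is effectively used as a 2-option question
```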

Option Usage Ratio

Fraction of available options that received at least one response:

\[ \text{OUR} = \frac{|\{i : c_i > 0\}|}{k} \]

A ratio below 0.5 suggests some options are irrelevant and could be merged or removed.

Item Variance (Numeric Questions)

Response spread for slider and number questions. For weighted (Silver) datasets, uses weighted variance:

\[ \sigma^2_w = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{n} w_i} \]

A variance of zero means all respondents gave the same answer — a strong signal of a poorly calibrated scale or ceiling/floor effect.
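A minimal sketch of the weighted variance (illustrative values; for unweighted datasets, pass all-ones weights and it reduces to the ordinary variance):

```python
def weighted_variance(values, weights):
    """Weighted population variance of numeric responses."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total

print(weighted_variance([7, 7, 7], [1.0, 0.6, 1.4]))  # 0.0 — everyone answered 7
```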

Item Non-Response Rate

Proportion of respondents who skipped the question:

\[ \text{NRR} = \frac{n_{\text{missing}}}{n_{\text{total}}} \]

| Non-Response Rate | Interpretation |
|---|---|
| < 5% | Normal |
| 5–10% | Monitor — may indicate sensitivity or confusion |
| 10–20% | Investigate wording, placement, or sensitivity |
| > 20% | Redesign the question |

Increasing non-response rates toward the end of the survey typically indicate respondent fatigue.

Group Metrics

For question groups and matrix questions with multiple sub-items.

Straightlining Score

Proportion of respondents giving identical answers to all sub-items in a group (Krosnick, 1991):

\[ S = \frac{n_{\text{straightliners}}}{n_{\text{respondents}}} \]

| Straightlining | Interpretation |
|---|---|
| < 5% | Normal — no action needed |
| 5–15% | Investigate — may indicate confusing question design |
| > 15% | Structural problem — redesign the matrix or add attention checks |

Weight Independence

Straightlining is computed from unweighted respondent counts, since weighting adjusts demographic balance but does not affect individual response patterns.
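The check itself is simple; a sketch taking one answer list per respondent (illustrative data):

```python
def straightlining_score(group_responses):
    """Share of respondents giving the identical answer to every sub-item
    in a group. Unweighted by design — weights don't change response patterns."""
    flagged = sum(1 for answers in group_responses if len(set(answers)) == 1)
    return flagged / len(group_responses)

# 4 respondents x 5 matrix sub-items; the first respondent straightlines.
print(straightlining_score([[3, 3, 3, 3, 3],
                            [1, 2, 4, 3, 5],
                            [2, 2, 3, 2, 4],
                            [5, 4, 5, 4, 3]]))  # 0.25
```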

Cronbach's Alpha (Internal Consistency)

Measures whether sub-items in a group measure the same underlying construct (Cronbach, 1951). Requires 3 or more sub-items:

\[ \alpha = \frac{k}{k-1} \left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_{\text{total}}}\right) \]

Where \(k\) is the number of sub-items, \(\sigma^2_i\) is the variance of sub-item \(i\), and \(\sigma^2_{\text{total}}\) is the variance of the total score.

| \(\alpha\) | Interpretation |
|---|---|
| > 0.8 | Good internal consistency |
| 0.6–0.8 | Acceptable |
| < 0.6 | Poor — sub-items may not measure the same construct |
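A sketch of the computation, taking one score list per sub-item over the same respondents (illustrative; uses population variances throughout):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for k >= 3 sub-items; items[i] holds sub-item i's
    scores for every respondent, in the same respondent order."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total score
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three perfectly aligned sub-items yield alpha = 1.0.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # 1.0
```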

Dataset-Level Aggregates

| Metric | Formula | Concern Threshold |
|---|---|---|
| Mean Normalized Entropy | Average \(H_{\text{norm}}\) across categorical questions | < 0.3 |
| Acquiescence Bias Index | Proportion of "agree" responses across Likert items | > 0.6 |
| Overall Non-Response Rate | Mean skip rate across all questions | > 0.15 |

Acquiescence Bias Index

Detects systematic tendency to agree with statements regardless of content. Computed only for Likert-type agree/disagree scales:

\[ \text{ABI} = \frac{\sum \text{agree responses}}{\sum \text{total Likert responses}} \]

A value of 0.5 indicates no bias (balanced agreement/disagreement). Values above 0.6 suggest respondents are agreeing regardless of question content — consider adding reverse-coded items.

Weight Awareness

For Silver (weighted) datasets, entropy, variance, and acquiescence metrics automatically use the weight column. Straightlining and speeder detection use unweighted respondent counts since weighting doesn't affect response patterns.

Speeder Detection

Identifies respondents who completed the survey unusually fast, suggesting inattentive or automated responding (Kunz & Hadler, 2020):

\[ \text{threshold} = \frac{\text{median completion time}}{3} \]

Respondents completing below the threshold are flagged as speeders.

Minimum Sample Requirement

Speeder detection requires at least 3 completed surveys to compute a meaningful median. With fewer surveys, the check is skipped.
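A sketch of both rules together (times in seconds, illustrative):

```python
import statistics

def flag_speeders(completion_seconds):
    """Flag completions faster than one third of the median time.
    Returns all-False when fewer than 3 completed surveys exist."""
    if len(completion_seconds) < 3:
        return [False] * len(completion_seconds)
    threshold = statistics.median(completion_seconds) / 3
    return [t < threshold for t in completion_seconds]

print(flag_speeders([300, 320, 280, 80, 310]))
# [False, False, False, True, False] — 80 s is under the 100 s threshold
```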

Multi-Flag Aggregation (Baseball Rule)

No single quality check should be used alone to exclude respondents. The baseball rule (CloudResearch, 2024) aggregates multiple independent quality flags per respondent and recommends exclusion only when 3 or more flags co-occur:

| Flag | Trigger |
|---|---|
| Speeder | Completion time below ⅓ of median |
| Straightliner | Identical responses across all sub-items in any question group |
| High non-response | More than 30% of questions skipped |

This multi-indicator approach reduces false positives — a respondent who happens to be fast but answers thoughtfully is not penalized.
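The aggregation reduces to counting flags per respondent; a sketch (flag names are illustrative, threshold of 3 per the baseball rule):

```python
def should_exclude(flags, min_flags=3):
    """Recommend exclusion only when several independent quality flags
    co-occur for the same respondent (the baseball rule)."""
    return sum(1 for f in flags.values() if f) >= min_flags

fast_but_thoughtful = {"speeder": True, "straightliner": False, "high_nonresponse": False}
print(should_exclude(fast_but_thoughtful))  # False — a single flag never excludes
```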


Interpreting Results

Common Patterns and Actions

| Observation | Possible Cause | Action |
|---|---|---|
| Low entropy (< 0.3) on a question | Ceiling/floor effect or poorly balanced options | Review option wording; consider splitting the dominant category |
| High non-response (> 15%) on a question | Confusing or sensitive question | Simplify wording or make optionality explicit |
| High straightlining (> 15%) in a group | Inattentive responding or undifferentiated sub-items | Add attention checks; vary scale anchors; shorten the battery |
| Low Cronbach's alpha (< 0.5) | Sub-items don't measure the same construct | Review item selection; consider removing weak items |
| High acquiescence (> 60%) | Yea-saying bias | Add reverse-coded items; vary scale direction |
| Low option usage (< 50%) | Unused options are irrelevant | Merge or remove rarely chosen options |
| High RMSE (> 0.05) | Systematic sampling bias | Adjust quotas; diversify recruitment channels |
| Large design effect (> 2.0) | Extreme weight adjustments | Improve sample balance before relying on weighting |
| Speeders flagged (> 5%) | Inattentive or incentive-driven responding | Review incentive structure; add attention checks |
| Multi-flag respondents | Multiple quality issues co-occur | Exclude from analysis; document exclusion criteria |

Two Dimensions, One Picture

Sample representativeness and response quality are independent dimensions. A dataset can have excellent representativeness but poor response quality, or vice versa.

| Representativeness | Response Quality | Assessment |
|---|---|---|
| Good | Good | Data is fit for analysis |
| Good | Poor | Sample matches targets but responses are unreliable — investigate measurement issues |
| Poor | Good | Responses are genuine but the sample is biased — apply weighting or adjust recruitment |
| Poor | Poor | Both structural and measurement problems — address recruitment and questionnaire design |

Methodological Foundation

The quality analysis framework is grounded in established survey research methodology:

| Framework | Application |
|---|---|
| Total Survey Error (Groves et al., 2009) | Organizes metrics by error source: coverage, sampling, nonresponse, measurement |
| AAPOR Standards | Response rate calculations, disposition codes, quality reporting |
| Kish (1965) | Design effects, effective sample size under complex designs |
| Krosnick (1991) | Satisficing theory — straightlining and acquiescence as response quality indicators |
| Shannon (1948) | Information entropy as a measure of response diversity |
| Cronbach (1951) | Internal consistency reliability for multi-item scales |
| Hill (1973) | Effective number of species (answers) — Hill numbers for diversity measurement |
| Kalton & Flores-Cervantes (2003) | Weighting methods — raking, calibration, weight trimming |
| Kunz & Hadler (2020) | Web paradata — response time analysis, speeder detection thresholds |
| CloudResearch (2024) | Multi-flag quality aggregation — the baseball rule for respondent exclusion |