Analysis of Rating Experiments
D-prime analysis of ratings is treated by Macmillan & Creelman (1991) in Ch. 3. Recall that a rating experiment uses one stimulus per trial with a response set greater than two (the rating scale). The responses can be made either directly along the scale, or as a categorical choice plus a confidence rating; the analysis is the same in both cases.
M&C's example is an experiment with a "yes-no" type response plus a 3-point confidence rating; these responses are treated as a single 6-point scale. So there are 6 response columns rather than 2, and the row total is the total of all 6 columns. (Note how the direction of the confidence scale is reversed across the yes-no responses, in order to give a single rating scale.) For each cell, the proportion of responses is calculated from #responses/total #responses, just as before. The trick is getting H and FA from 6 columns, and the answer is to get several Hs and FAs. Cumulative probabilities are calculated by summing responses from left to right: the value (proportion) in each cell is replaced by the sum of that cell plus all the cells to its left. The values in the first column are unchanged by this operation, and the values in the last column are now all 1.0. The value in each cell is now a Hit rate (or False Alarm rate) for a division of the responses into 2 response categories at that column; the division at each successive column represents a different "decision rule" that the subject could have applied to a 2-way choice.
M&C's example (partial):
Raw responses (Table 3.1 top):
responses:  | YES 3 | YES 2 | YES 1 | NO 1 | NO 2 | NO 3 | total
YES         |    49 |    94 |    75 |   60 |   75 |   22 |   375
NO          |     8 |    37 |    45 |   60 |  113 |  112 |   375
Treating only the yes/no responses categorically (underneath Table 3.1):
responses:  | YES (total) | NO (total) | total
YES         |         218 |        157 |   375
NO          |          90 |        285 |   375
from which H and FA can be calculated:
H = 218/375 = .58; FA = 90/375 = .24
and from those, d' can be calculated as d' = z(H) - z(FA) = 0.91.
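For a single H, FA pair this is easy to check in code; here is a minimal sketch in Python (not part of the original handout; it assumes scipy is available):

```python
from scipy.stats import norm

H = 218 / 375    # hit rate, about .58
FA = 90 / 375    # false-alarm rate, .24

# d' is the difference of the z-transformed rates,
# where z() is the inverse of the standard normal CDF
d_prime = norm.ppf(H) - norm.ppf(FA)
print(round(d_prime, 2))    # about 0.91
```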
Now the proportions across all the columns (Table 3.2 top):
responses:  | YES 3 | YES 2 | YES 1 | NO 1 | NO 2 | NO 3 | total
YES         |  .131 |  .251 |  .200 | .160 | .200 | .059 |  1.00
NO          |  .021 |  .099 |  .120 | .160 | .301 | .299 |  1.00
And the cumulative proportions (H and FA) (Table 3.3 top):
responses:  | YES 3 | YES 2 | YES 1 | NO 1 | NO 2 | NO 3
YES = H     |  .131 |  .382 |  .582 | .742 | .942 | 1.00
NO = FA     |  .021 |  .120 |  .240 | .400 | .701 | 1.00
Then d' can be calculated for each H, FA pair, i.e. for each column except the last (where both rates are 1.0 and z is undefined). M&C show the z scores and d' (Table 3.4 top):
responses:   |  YES 3 |  YES 2 |  YES 1 |   NO 1 |  NO 2
yes = z(H)   | -1.121 | -0.301 |  0.207 |  0.649 | 1.573
no = z(FA)   | -2.037 | -1.175 | -0.706 | -0.253 | 0.527
d'           |  0.916 |  0.874 |  0.913 |  0.902 | 1.046
The d' values are quite similar across decision rules, even though the Hit rates vary.
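The whole procedure (cumulate the proportions, z-transform each H, FA pair, take the differences) is easy to script. Here is a minimal sketch in Python working from the raw counts in Table 3.1 (again assuming scipy; its output agrees with Table 3.4 up to rounding, since the published table rounds the proportions to three decimals before z-transforming them):

```python
import numpy as np
from scipy.stats import norm

# raw response counts from Table 3.1, columns YES 3 ... NO 3
yes_trials = np.array([49, 94, 75, 60, 75, 22])    # "YES" stimuli
no_trials  = np.array([8, 37, 45, 60, 113, 112])   # "NO" stimuli

# cumulative proportions from left to right (Table 3.3);
# each value is the H or FA rate for a criterion placed at that column
H  = np.cumsum(yes_trials) / yes_trials.sum()
FA = np.cumsum(no_trials)  / no_trials.sum()

# drop the last column, where both rates are 1.0 and z is undefined
zH, zFA = norm.ppf(H[:-1]), norm.ppf(FA[:-1])
d_prime = zH - zFA              # one d' per decision rule (Table 3.4)

print(np.round(d_prime, 3))     # close to Table 3.4's d' values
```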
Rietveld & van Hout (1993), Statistical Techniques for the Study of Language and Language Behaviour, Ch. 5 is more concerned with the validity of ratings, in terms of reliability and agreement, and these can be checked inter-rater or intra-rater. So your first step will be to decide what you want to check about your rating responses.
Reliability looks at whether ratings co-vary, which is independent of whether they use the same absolute values. Rating scores are seen as a combination of the "true" scores plus "error", and the reliability R is the proportion of the variance in the scores that is "true" variance, i.e. R = true variance / (true variance + error variance).
Don't just average all the judges as an estimate of the true score! The mean equals the true score only with infinitely many judges. Instead, first see how reliable your judges' ratings are.
Example in Rietveld & van Hout, Table 5.1 p. 191, in file rietveld_ex.sav, 10 items rated by 4 judges on a 10-point scale. Rows are the rated objects (the stimuli), columns are the raters.
Because we compare the "true" variance, due to the objects, to the variance due to the raters (and potentially, any object x rater interaction), this is like a 2-way ANOVA. Indeed, the SPSS Reliability procedure can be used to do an ANOVA.
The SPSS procedure assumes that there are multiple raters and the Raters factor is fixed (more on this below); that said, it works as follows to calculate Cronbach's Alpha:
With the file rietveld_ex open: Analyze - Scale - Reliability Analysis (leave "Model" at "Alpha") - Statistics - ANOVA Table = F test
This gives Rietveld & van Hout's Table 5.7 on p. 202. "Alpha" = Cronbach's Alpha, the reliability coefficient for raters as a fixed factor.
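If you want to verify the Alpha value outside SPSS, it can be computed directly from the ratings matrix. A minimal sketch in Python (not from Rietveld & van Hout; the ratings below are made up for illustration, since the rietveld_ex data are not reproduced here):

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's Alpha for an objects (rows) x raters (columns) matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                           # number of raters
    rater_vars = X.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = X.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# hypothetical ratings: 10 objects rated by 4 judges on a 10-point scale
ratings = np.array([
    [6, 7, 5, 6], [2, 3, 2, 4], [8, 9, 7, 8], [5, 6, 5, 5], [3, 4, 2, 3],
    [9, 9, 8, 9], [4, 5, 4, 4], [7, 8, 6, 7], [1, 2, 1, 2], [5, 5, 4, 6],
])
print(round(cronbach_alpha(ratings), 3))
```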
Crucially, with multiple raters, raters can be treated as either a fixed or a random factor, and these situations require different calculations of the reliability coefficient R and tell you different things about your raters' reliability: with raters fixed, R applies only to these particular raters; with raters random, R generalizes to other raters sampled from the same population. Since the objects (stimuli) are always treated as random, the model is either mixed (raters are fixed) or random (raters are random).
The MS terms in the ANOVA output from Reliability can be used to calculate other versions of R for these other situations. Rietveld & van Hout's Table 5.8 shows 4 reliability coefficients, for raters fixed vs. random, and for a single rater vs. a group of raters. You can also use the MS terms for an F test of whether R > 0, where the F ratio is formed from the SPSS MS "between people" over the MS "residual", with numerator df = (n-1) and denominator df = (n-1)(k-1); see section 5.3.2 for more.
But SPSS Reliability will also compute the "Intraclass Correlation Coefficient" (ICC). Rietveld & van Hout mention ICC as another name for R, but do not mention the SPSS procedure. In Analyze - Scale - Reliability Analysis - Statistics - Intraclass correlation coefficient, there is a choice of three models (single rater, multiple raters random, multiple raters fixed), so this would seem to be a way to go beyond Cronbach's Alpha.
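The MS terms, and the coefficients built from them, can also be computed by hand. Here is a sketch of the two-way decomposition for an objects x raters matrix, the F test for R > 0 described above, the single-rater coefficients for the fixed and random cases, and the group (mean-of-raters) coefficient for the fixed case, which equals Cronbach's Alpha. These are the standard intraclass-correlation formulas and should correspond to Rietveld & van Hout's Table 5.8, but check them against the book for your own design; the data are again made up:

```python
import numpy as np
from scipy.stats import f

def reliability_ms(X):
    """Mean squares from a two-way decomposition of an objects x raters matrix
    (one rating per cell): MS between objects ("between people"),
    MS between raters, and MS residual."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ss_objects = k * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_raters  = n * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_resid   = ((X - grand) ** 2).sum() - ss_objects - ss_raters
    return (ss_objects / (n - 1), ss_raters / (k - 1),
            ss_resid / ((n - 1) * (k - 1)), n, k)

# hypothetical ratings: 6 objects (rows) by 3 raters (columns)
ratings = np.array([[7, 8, 6], [3, 4, 3], [5, 6, 5],
                    [9, 9, 8], [2, 3, 2], [6, 7, 6]])
ms_obj, ms_rat, ms_res, n, k = reliability_ms(ratings)

# F test for R > 0: MS "between people" over MS "residual",
# with df = (n - 1) and (n - 1)(k - 1)
F = ms_obj / ms_res
p = f.sf(F, n - 1, (n - 1) * (k - 1))

# single-rater reliability, raters fixed vs. raters random
R_single_fixed  = (ms_obj - ms_res) / (ms_obj + (k - 1) * ms_res)
R_single_random = (ms_obj - ms_res) / (ms_obj + (k - 1) * ms_res
                                       + k * (ms_rat - ms_res) / n)
# reliability of the mean of the k raters, raters fixed = Cronbach's Alpha
R_group_fixed = (ms_obj - ms_res) / ms_obj
print(F, p, R_single_fixed, R_single_random, R_group_fixed)
```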
Here's another important design for us: comparing reliability coefficients (are they significantly different?), e.g. when raters from 2 language backgrounds rate the same materials. This is tested with the M statistic; see section 5.3.3.
In contrast, Agreement looks at how similar the absolute levels of the ratings are, rather than whether they covary. The correlation coefficient is generally not a good tool here, as it's too sensitive to covariance and not sensitive enough to absolute values. Instead, read Rietveld & van Hout's section 5.4 and compare T, Kendall's W, and Cohen's Kappa with respect to your own design.
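For orientation only, here is a small sketch (not from the book) of two common agreement-oriented statistics: Kendall's W for a set of raters ranking the same objects, and Cohen's Kappa for two raters making categorical judgements. The T statistic and the variants discussed in section 5.4 are not reproduced here, and the data in the usage lines are made up:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(X):
    """Kendall's coefficient of concordance (no tie correction).
    X: objects (rows) x raters (columns); ratings are converted to ranks."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    ranks = np.column_stack([rankdata(X[:, j]) for j in range(m)])
    R = ranks.sum(axis=1)                       # rank sum per object
    S = ((R - R.mean()) ** 2).sum()
    return 12 * S / (m ** 2 * (n ** 3 - n))

def cohens_kappa(a, b):
    """Cohen's Kappa for two raters' categorical judgements of the same items."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = np.mean(a == b)                     # observed agreement
    p_exp = sum(np.mean(a == c) * np.mean(b == c)   # chance agreement
                for c in np.union1d(a, b))
    return (p_obs - p_exp) / (1 - p_exp)

# hypothetical usage
print(kendalls_w([[6, 7, 5], [2, 3, 2], [8, 9, 7], [5, 6, 5]]))
print(cohens_kappa(["g", "b", "g", "g"], ["g", "b", "b", "g"]))
```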
Prepared by Pat Keating in Spring 2004