1. Taking response bias into account
The model of discrimination performance
discussed in the previous
file assumes that when listeners do not hear a difference, or are not
sure, they respond "same" or "different" randomly, so that performance is
at chance. But there is no guarantee that listeners will do that.
Suppose you were a subject in a discrimination
task and you wanted to show 100% discrimination. You could answer "different"
to every item, and you would then get 100% correct on the different
pairs. You would of course also get 0% correct on the same pairs,
because you answered "different" to all of them. In many studies, the
same pairs are not analyzed at all, and this response strategy would
work well. Does this result, 100% correct, mean that you discriminated
the pairs very well? Clearly not; you don't even have to have listened
to them.
Compare this response strategy with an opposite
one -- suppose that you are very conservative in answering "different", and
only do so when you are quite sure that the stimuli are different. That
is, you don't respond at random when you do not hear a difference, or are
not sure; you consistently respond "same". You might then get 100% correct
on the same pairs, but you will probably have 0% correct on at least
some of the different pairs (small step sizes, within-category pairs).
Clearly you might nonetheless be discriminating better than a person who
readily answers "different". This pattern of results is common in AX discrimination
studies.
The point is that % correct on the different
pairs alone is not a very meaningful measure of discrimination.
It becomes meaningful when interpreted in terms of the listener's response
bias, or tendency to respond "same" or "different". The responses
to the same pairs can be used as an indication of response bias.
2. (Signal) Detection theory
Detection theory attributes responses to a combination of sensitivity and bias. Sensitivity is what
we are interested in, while bias is what we have to take into account to
recover sensitivity. The presentation that follows comes directly from
Macmillan and Creelman's 1991 Detection Theory: A User's Guide (known
here as “Detection for Dummies”).
Using detection theory, we conceive of sensitivity
as (broadly) detecting a signal (e.g. against background noise, or compared
to another signal), and model how a perceiver decides whether a signal is
present. An experiment presents signals and non-signals to subjects,
who try to detect all and only the signals. The traditional way of
viewing such an experiment, and naming the possible outcomes, is as follows. “Yes” here represents the presence of a signal or
difference to be detected; our different and same labels are
added for convenience in thinking about AX discrimination:
                           Response: Different ("yes")   Response: Same ("no")
Stimuli: YES (different)   HIT                           MISS
Stimuli: NO (same)         FALSE ALARM                   CORRECT REJECTION
This scheme is also used to organize and
tabulate subjects' responses. That is, the (raw) number of HITS etc.
is entered into the 4 cells. Notice that if the numbers of Signal (Different)
and No-signal (Same) stimuli are the same in the experiment, then the total
number of responses in the top row will equal the total number in the bottom
row (or, more generally, the total for each row is known in advance from the
design of the experiment); however, the total number of responses in the
YES response column will not necessarily be the same as the total number in
the NO response column, and neither number can be known in advance.
But, if you know the number of YES and NO trials in the experiment, you know
the value in one column from the value in the other. E.g. if there are
20 different trials, and a subject has 5 hits, then that subject must
have 15 misses. So, only 2 of the 4 numbers in the table (1 per row),
plus the total numbers of trials, are needed to characterize a subject's performance.
These are conventionally the Hits and False Alarms, and these are then given
as proportions of the row totals, which are in turn viewed as estimates of
probabilities of responses:
hit rate H: proportion of YES trials to which the subject responded YES = P("yes" | YES)
false alarm rate F: proportion of NO trials to which the subject responded YES = P("yes" | NO)
The table can be rewritten with these and
the other 2 rates, with each row totalling to 1.0; but the results of interest
are the pair (H,F). (Compare these to the total proportion correct,
which is (Hits + Correct rejections)/all responses.)
Consider then that the perfect subject's
performance is (1,0), while a random subject has H=F and our subject who always
answers YES has (1,1). Intuitively, the best subject maximizes H (and
thus minimizes the Miss rate) and minimizes F (and thus maximizes the Correct
Rejection rate); and thus the larger the difference between H and F, the
better the subject's sensitivity. The statistic d' ("d-prime") is a measure of this
difference; it is the distance between the means of the Noise and Signal+Noise
distributions, in standard-deviation units. However, d' is not simply H-F;
rather, it is the difference between the z-transforms of these 2 rates:
d' = z(H) - z(F)
where neither H nor F can be 0 or 1 (if
so, adjust slightly up or down). Note that z-scores can be positive
or negative so you have to watch the signs in the subtraction.
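Here is a minimal sketch of this computation in Python (scipy assumed). The 1/(2N) adjustment for rates of exactly 0 or 1 is one common convention, not the only one:

    # d' = z(H) - z(F), where z is the inverse of the normal CDF
    from scipy.stats import norm

    def dprime(hits, misses, false_alarms, correct_rejections):
        """Compute d' from the four cells of the response table."""
        n_signal = hits + misses
        n_noise = false_alarms + correct_rejections
        H = hits / n_signal
        F = false_alarms / n_noise
        # rates of 0 or 1 must be adjusted slightly up or down;
        # 1/(2N) is one common convention (an assumption here)
        H = min(max(H, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
        F = min(max(F, 1 / (2 * n_noise)), 1 - 1 / (2 * n_noise))
        return norm.ppf(H) - norm.ppf(F)  # z-scores can be negative; the signs matter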
Background:
z-transform. A range of
values is cast as a normal distribution, with standard deviations around
the mean. The mean value is set to 0, and the range of most values
is about 3 standard deviations above and below the mean. So each value
is some number of SD units above or below the mean. This transform
is valuable in allowing comparison of measures with different ranges of
absolute values, and in taking into account the inherent variability of
different measures. For example, Wightman et al. (1992), J.
Acoust. Soc. Am. 91, 1707-1717, comparing lengthening before different break
indices in a corpus with uncontrolled final consonants and vowels, used
transformed duration measurements because different segments have different
absolute durations and different degrees of variability.
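To make the transform concrete, a minimal sketch in Python (the duration values are hypothetical, purely to illustrate the computation):

    # z-transform: express each value in SD units above/below the mean
    import numpy as np

    durations = np.array([80.0, 95.0, 110.0, 70.0, 120.0])  # hypothetical ms values
    z = (durations - durations.mean()) / durations.std(ddof=1)
    print(z)  # mean of z is 0; each entry is in standard-deviation units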
Of course, whether you use the original
proportions or their transforms, when H = F, then d' = 0. This will
be true whether the "yes" rate is near 1 or near 0. The highest possible
d' (greatest sensitivity) is 6.93; the effective limit (using .99 and .01)
is 4.65; typical values are up to 2.0; and 69% correct for both different
and same trials corresponds to a d' of 1.0.
There are other sensitivity measures - e.g.
a transform other than z (or even no transform at all), or differential weighting
of H and F, and even alternate versions of d' (see below) - but this is the
one you usually see in speech research.
3. How to get d' for your data.
You could calculate H and F, convert them
to z-scores, and subtract them. M&C's table A5.1 in Appendix 5 gives
the z-score conversions.
M&C's first example of getting H and F:

                    # responses "different"   # responses "same"   total # responses
stimuli different   20                        5                    25
stimuli same        10                        15                   25
So the hit rate H is 20/25, or .8
the miss rate is 5/25, or .2 (these 2 add up to 1.0)
the false alarm rate is 10/25, or .4
the correct rejection rate is 15/25, or .6 (these 2 add up to 1.0)
and the (H,F) pair is (.8,.4)
z(H) = 0.842 and z(F) = -0.253
d' = 0.842 - (-0.253) = 1.095
But probably you will want to do it more automatically. M&C's Appendix 6 provides some information on available programs, the late Tom Wickens still has a website on a UCLA server with a downloadable program, and a search online will turn up several options.
Colin Wilson has provided his Excel formula: d' = NORMINV(hit-rate,0,1) - NORMINV(false-alarm-rate,0,1), where Excel's NORMINV "returns the inverse of the normal cumulative distribution for the specified mean and standard deviation", 0 being the specified mean and 1 the specified SD. But see 6 below for different d'
calculations for different experimental designs, including our AX discrimination.
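The dprime sketch above reproduces both the worked example and the Excel formula; scipy's norm.ppf(p) computes the same function as Excel's NORMINV(p,0,1):

    from scipy.stats import norm

    # the worked example: 20 hits, 5 misses, 10 false alarms, 15 correct rejections
    print(dprime(20, 5, 10, 15))          # 1.095
    # the same number via the NORMINV-style formula
    print(norm.ppf(0.8) - norm.ppf(0.4))  # 1.095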
4. Some uses of d' in the discrimination literature
d' is often plotted instead of, or in addition to, % correct / % "different" responses
to different pairs. A recent example is in Francis & Ciocca (2003), JASA 114(3),
p. 1614 (figure not reproduced here).
Best et al. (1981), Perc. &
Psychophys. 29:191-211, calculate d' for obtained and predicted discrimination,
and compare these by ANOVA; their figure on p. 211 plots obtained minus
predicted d' (figure not reproduced here).
Godfrey et al. (1981) and others have studies
with small numbers of trials for each pair (especially same pairs,
and especially in studies with kids), and in these cases d' is calculated
not for individual stimulus pairs, but for each subject, combining all different
pairs and all same pairs. Alternatively, a d' for each pair is
sometimes seen for a group of subjects (e.g. Francis & Ciocca 2003); this
has the advantage that perfect scores on any pair are unlikely for a whole
group (and thus require no adjusting down from 1.0). See J. Sussman
(1993) for averaging H and F, so that d' scores are group scores, tested with a
G test. Clearly in small experiments we have no choice
but to average something; see Ch. 11 in M&C on how to average carefully.
Sometimes H and CR are added to give a proportion
correct (for all pairs, not just for different pairs), which is then
arcsine-transformed and analyzed in the usual way. See Sussman &
Carney, Francis & Ciocca.
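The usual variance-stabilizing form of the arcsine transform is 2·arcsin(√p); whether these studies used exactly this variant is not stated, so treat this as a sketch:

    # arcsine (angular) transform of a proportion p -- the common
    # variance-stabilizing form; the exact variant used in these papers
    # is an assumption here
    import math

    def arcsine_transform(p):
        return 2 * math.asin(math.sqrt(p))

    print(arcsine_transform(0.8))  # ~2.214 (radians)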
But see M&C p.100ff on the dangers
of proportion correct ("an unexpectedly theory-laden statistic").
5. Bias
Bias is measured as the inclination of the
subject to say "yes" (or "no"). The bias measure c is a function
of the sum of the z-transformed rates, z(H) + z(F). But no one in speech
research seems to report it, so we won't cover how to calculate it.
6. Two models of AX discrimination performance
A "same-different" experiment
uses 2 or more stimuli in a trial and calls for a "same/different" response.
M&C propose that there are really 2 different kinds of these, with different
likely subject strategies and therefore different appropriate models for d'
(pp. 143ff). In "fixed" designs the 2 stimuli in a pair are the same
across trials in a block, and subjects are likely to apply an independent-observation strategy, estimating the category for each stimulus
and then comparing the category estimates. For this strategy, d' is
calculated in the usual way. In "roving" designs the 2 stimuli vary
from trial to trial, and subjects are likely to apply a differencing strategy,
applying a threshold of difference to decide if 2 stimuli are different enough
to count as different. For this strategy, d' is calculated differently,
e.g. using the table of H vs. FA values in M&C's appendix A5.4.
It would seem that speech experiments almost always use a roving design,
and thus call for the differencing model; on the other hand, the
independent-observation model is closer in spirit to the idea of categorical
perception. Both approaches to d' are seen in the speech perception
literature. To pursue the issue of whether categorical perception can
be modeled as a differencing strategy, see Macmillan, Kaplan, and Creelman
1977, "The psychophysics of categorical perception", Psych Review
84: 452-71.
How much difference will this make in analyzing
an experiment? Consider M&C's example in (3) above, for the (.8,.4)
pair: d' was 1.095. This pair is found
on p. 347 of Table A5.4, with a d' of 2.35. Compare also the values
in the file "some sample dprime.xls". (Some values still need to be
looked up - try this yourself.) The differences can be large, but it
might not matter when comparing values calculated by the same method.
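If you would rather compute the differencing-model value than look it up, here is a sketch. It assumes the standard differencing model for same-different (the observer responds "different" when the absolute difference between the two observations exceeds a criterion k), solves numerically for d', and reproduces the 2.35 for the (.8,.4) pair:

    # differencing model for same-different (cf. M&C Table A5.4):
    # on same trials the difference of the two observations is N(0, 2);
    # on different trials it is N(d', 2); respond "different" if |diff| > k
    from scipy.stats import norm
    from scipy.optimize import brentq

    def dprime_differencing(H, F):
        # recover the criterion k from the false alarm rate: F = 2*Phi(-k/sqrt(2))
        k = -norm.ppf(F / 2) * 2 ** 0.5
        # hit rate the model predicts for a given d'
        def predicted_H(d):
            return norm.cdf((d - k) / 2 ** 0.5) + norm.cdf((-d - k) / 2 ** 0.5)
        # solve predicted_H(d) = H for d (monotone in d; assumes H > F)
        return brentq(lambda d: predicted_H(d) - H, 0.0, 10.0)

    print(dprime_differencing(0.8, 0.4))  # ~2.35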
7. Applying detection theory to identification data
In some papers we see pairs of items along a continuum treated as
signal vs. noise for purposes of computing a d-prime for identification responses.
For example, Massaro 1989, starting p. 410: “The probabilities
of responding /r/ are transformed into z scores. The d’ between two
adjacent levels along the /l/-/r/ continuum is simply the positive difference
between the respective z scores.” (example given) “A d’ value was computed
for each of the two pairs of adjacent levels along the /l/-/r/ continuum.”
Then he reports an ANOVA on these d’ values, with 2 within-subject factors
(context: 3 levels, and stimpair: 2 levels).
In effect, by subtracting like this, the responses (say, the /l/ responses)
to one stimulus are treated as the HITS, and the responses (with the same
response category) to the next stimulus over are treated as the FALSE ALARMS.
Another example, which I quote here a bit, is Iverson & Kuhl
(1996). Starting on p. 1134: estimating the “perceptual distances”
between stimuli from identification responses:
“Through the application of detection theory, identification percentages
can also be used to estimate the perceptual distances separating tokens (Macmillan
and Creelman, 1991). Within this theoretical framework, the z-transformed
identification probability for each token, z(p), indicates its location relative
to the category boundary. The absolute value of this measure indicates each
token’s distance from the boundary in standard-deviation units. The sign
of this measure indicates whether each token is within (positive) or out
of (negative) the category. For example, z(p)=0.0 for tokens that are identified
as a member of the category on 50% of trials, z(p)=2.3 for tokens that are
identified as a member of the category on 99% of trials, and z(p)=-2.3 for
tokens that are identified as a member of the category on 1% of trials. The
perceptual distances between pairs of tokens (d’) can then be found by subtracting
these location measures; tokens that are at similar locations will have a
small d’ and tokens that are at dissimilar locations will have a large d’.
In other words, d’ will be greater to the extent that tokens are identified
differently.”
“The identification judgments were used to estimate perceptual distances
by calculating the z transform of the mean /l/ identification percentage
for each token and then taking the absolute value of the difference for each
pair of tokens. The z transform reaches infinity when percentages equal 0
or 100, so tokens with 0% /l/ identifications were assigned values of 1%
and tokens with 100% /l/ identifications were assigned values of 99% (Macmillan
and Creelman, 1991).”
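A sketch of this identification-based d', following the recipe in these quotes (the identification proportions below are hypothetical; the 1%/99% floor and ceiling are the adjustment Iverson & Kuhl describe):

    # perceptual distance between two continuum steps: d' = |z(p1) - z(p2)|,
    # where p is the proportion of (say) /l/ identifications for a token
    from scipy.stats import norm

    def identification_dprime(p1, p2):
        # clip 0/1 to .01/.99, as in Iverson & Kuhl (1996)
        p1 = min(max(p1, 0.01), 0.99)
        p2 = min(max(p2, 0.01), 0.99)
        return abs(norm.ppf(p1) - norm.ppf(p2))

    # hypothetical /l/ identification proportions along a continuum
    props = [1.00, 0.95, 0.60, 0.10, 0.00]
    for a, b in zip(props, props[1:]):
        print(identification_dprime(a, b))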
Pairwise comparison of identification responses is described by M&C on pp. 212-13. However, it’s not clear that they would use it directly as a measure of perceptual distance.
8. Comparing identification with discrimination responses.
Above we see ways to get sensitivity scores for discrimination
and identification responses. Once they are both in terms of d-prime,
a common currency, they can be compared directly. That is, no need
to "predict" an expected discrimination from identification. If discrimination
is constrained by categorization, then the two sensitivity functions should
be the same. An example of this, though without explanation, is in
Schouten et al. (2003).
9. Some terminology and models in M&C
"one-interval
discrimination": "one-interval" means one stimulus in a trial; "discrimination"
means telling the different stimuli of the experiment (not of the trial!)
apart by selecting different responses from the available set -- so, confusingly,
this refers to what we call identification.
"two-interval": 2 stimuli per trial. These are of 2 types: the "same-different"
design discussed in 6 above, and
"2AFC": which one of these 2 stimuli?
(e.g. which one have you seen before (recognition), which one came first (temporal
order), which one matches the prompt (identification))
M&C say that a 2AFC design is better
than yes-no recognition when a priori familiarity is a likely confound; BUT
because 2AFC is easier for subjects, the z-score difference must be scaled
down by a factor of roughly .7 (i.e. 1/√2) to yield d'. They also note that
as this design tends to minimize bias (at least for simultaneous visual
presentation of stimulus pairs), percent correct is not a bad measure of
sensitivity.
Prepared by Pat Keating, Spring 2004, updated Fall 2005