# Calibration and overconfidence

What confidence do people place in their erroneous estimates? In section 1 on availability, I discussed an experiment on perceived risk, in which subjects overestimated the probability of newsworthy causes of death in a way that correlated with their selective reporting in newspapers. Slovic et al. (1982) also observed:

> A particularly pernicious aspect of heuristics is that people typically have great confidence in judgments based upon them. In another followup to the study on causes of death, people were asked to indicate the odds that they were correct in choosing the more frequent of two lethal events (Fischhoff, Slovic, and Lichtenstein, 1977)... In Experiment 1, subjects were reasonably well calibrated when they gave odds of 1:1, 1.5:1, 2:1, and 3:1. That is, their percentage of correct answers was close to the appropriate percentage correct, given those odds. However, as odds increased from 3:1 to 100:1, there was little or no increase in accuracy. Only 73% of the answers assigned odds of 100:1 were correct (instead of 99.1%). Accuracy "jumped" to 81% at 1000:1 and to 87% at 10,000:1. For answers assigned odds of 1,000,000:1 or greater, accuracy was 90%; the appropriate degree of confidence would have been odds of 9:1... In summary, subjects were frequently wrong at even the highest odds levels. Moreover, they gave many extreme odds responses. More than half of their judgments were greater than 50:1. Almost one-fourth were greater than 100:1... 30% of the respondents in Experiment 1 gave odds greater than 50:1 to the incorrect assertion that homicides are more frequent than suicides.
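The relationship between stated odds and the accuracy they imply can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are my own, and the data are just the odds and accuracy figures quoted in the passage above. Odds of x:1 correspond to a probability of x/(x+1), so 100:1 implies roughly 99% correct, while an observed accuracy of 90% warrants odds of only about 9:1.

```python
def odds_to_probability(odds_for: float, odds_against: float = 1.0) -> float:
    """Probability of being correct implied by odds of odds_for:odds_against."""
    return odds_for / (odds_for + odds_against)


def probability_to_odds(p: float) -> float:
    """Odds in favor, expressed as x:1, implied by a probability p < 1."""
    return p / (1.0 - p)


# (stated odds x:1, observed fraction correct), as quoted in the passage above
reported = [(100, 0.73), (1_000, 0.81), (10_000, 0.87), (1_000_000, 0.90)]

for stated_odds, accuracy in reported:
    implied = odds_to_probability(stated_odds)
    warranted = probability_to_odds(accuracy)
    print(f"{stated_odds:>9,}:1 implies {implied:.4%} correct; "
          f"observed {accuracy:.0%} warrants about {warranted:.1f}:1")
```

Read this way, every extreme odds level in the quoted experiment corresponds to warranted odds in the single digits.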

This extraordinary-seeming result is quite common within the heuristics and biases literature, where it is known as overconfidence. Suppose I ask you for your best guess as to an uncertain quantity, such as the number of "Physicians and Surgeons" listed in the Yellow Pages of the Boston phone directory, or total U.S. egg production in millions. You will generate some value, which surely will not be exactly correct; the true value will be more or less than your guess. Next I ask you to name a lower bound such that you are 99% confident that the true value lies above this bound, and an upper bound such that you are 99% confident the true value lies beneath this bound. These two bounds form your 98% confidence interval. If you are well-calibrated, then on a test with one hundred such questions, around 2 questions will have answers that fall outside your 98% confidence interval.

Alpert and Raiffa (1982) asked subjects a collective total of 1000 general-knowledge questions like those described above; 426 of the true values lay outside the subjects' 98% confidence intervals. If the subjects were properly calibrated there would have been approximately 20 surprises. Put another way: Events to which subjects assigned a probability of 2% happened 42.6% of the time.
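The scoring procedure described above is simple enough to sketch in code. This is a minimal illustration, not the procedure Alpert and Raiffa actually used: the function name and the four example intervals are hypothetical, and only the totals (1000 questions, 426 surprises, 98% intervals) come from the text.

```python
def surprise_rate(intervals, true_values):
    """Fraction of true values that fall outside their (low, high) interval."""
    surprises = sum(
        1
        for (low, high), truth in zip(intervals, true_values)
        if not (low <= truth <= high)
    )
    return surprises / len(true_values)


# Hypothetical answers from one subject: (lower bound, upper bound) per question
intervals = [(100, 500), (2_000, 9_000), (10, 40), (1, 3)]
true_values = [650, 5_000, 38, 7]  # hypothetical true answers
print("toy surprise rate:", surprise_rate(intervals, true_values))

# Totals reported in the text for 98% confidence intervals
n_questions = 1000
nominal_miss_rate = 0.02  # a calibrated subject misses 2% of the time
observed_surprises = 426

expected_surprises = round(n_questions * nominal_miss_rate)
observed_rate = observed_surprises / n_questions
print("expected surprises if calibrated:", expected_surprises)
print("observed surprise rate:", observed_rate)
print("ratio of observed to nominal miss rate:", observed_rate / nominal_miss_rate)
```

The same function applies to any interval-estimation quiz: collect the bounds, look up the true values, and compare the surprise rate with the nominal miss rate.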

Another group of 35 subjects was asked to estimate 99.9% confident upper and lower bounds. They received 40% surprises. Another 35 subjects were asked for "minimum" and "maximum" values and were surprised 47% of the time. Finally, a fourth group of 35 subjects was asked for "astonishingly low" and "astonishingly high" values; they recorded 38% surprises.

In a second experiment, a new group of subjects was given a first set of questions, scored, given feedback, told about the results of previous experiments, and had the concept of calibration explained to them at length; they were then asked to provide 98% confidence intervals for a new set of questions. The post-training subjects were surprised 19% of the time, a substantial improvement over their pre-training score of 34% surprises, but still a far cry from the well-calibrated value of 2% surprises.

Similar failure rates have been found for experts. Hynes and Vanmarcke (1976) asked seven internationally known geotechnical engineers to predict the height of an embankment that would cause a clay foundation to fail and to specify confidence bounds around this estimate that were wide enough to have a 50% chance of enclosing the true height. None of the bounds specified enclosed the true failure height. Christensen-Szalanski and Bushyhead (1981) reported physician estimates for the probability of pneumonia for 1,531 patients examined because of a cough. At the highest calibrated bracket of stated confidences, with average verbal probabilities of 88%, the proportion of patients actually having pneumonia was less than 20%.
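To get a rough sense of how surprising the geotechnical result is, one can ask how often seven well-calibrated 50% intervals would all miss. The sketch below assumes the seven engineers' errors were independent, which is my simplifying assumption and not something the study establishes; since all seven were predicting the same embankment, the true chance of a joint miss could well be higher.

```python
n_experts = 7          # engineers in the study
hit_probability = 0.5  # each interval was meant to have a 50% chance of a hit

# Chance that every calibrated interval misses, ASSUMING independent errors
p_all_miss = (1 - hit_probability) ** n_experts

print(f"P(all {n_experts} intervals miss) = {p_all_miss:.4f}")
print(f"which is about 1 in {round(1 / p_all_miss)}")
```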

In the words of Alpert and Raiffa (1982): 'For heaven's sake, Spread Those Extreme Fractiles! Be honest with yourselves! Admit what you don't know!'

Lichtenstein et al. (1982) reviewed the results of fourteen papers on thirty-four experiments performed by twenty-three researchers studying human calibration. The overwhelmingly strong result was that people are overconfident. In the modern field, overconfidence is no longer noteworthy; but it continues to show up, in passing, in nearly any experiment where subjects are allowed to assign extreme probabilities.

Overconfidence applies forcefully to the domain of planning, where it is known as the planning fallacy. Buehler et al. (1994) asked psychology students to predict an important variable - the delivery time of their psychology honors thesis. They waited until students approached the end of their year-long projects, and then asked the students when they realistically expected to submit their thesis, and also when they would submit the thesis "if everything went as poorly as it possibly could." On average, the students took 55 days to complete their thesis: 22 days longer than they had anticipated, and 7 days longer than their worst-case predictions.

Buehler et al. (1995) asked students for times by which the student was 50% sure, 75% sure, and 99% sure they would finish their academic project. Only 13% of the participants finished their project by the time assigned a 50% probability level, only 19% finished by the time assigned a 75% probability, and 45% finished by the time of their 99% probability level. Buehler et al. (2002) wrote: "The results for the 99% probability level are especially striking: Even when asked to make a highly conservative forecast, a prediction that they felt virtually certain that they would fulfill, students' confidence in their time estimates far exceeded their accomplishments."
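The three figures above amount to a small calibration table: for each stated probability of finishing on time, compare it with the fraction of students who actually did. The sketch below only rearranges the numbers reported in the text; the function name and the "percentage points" framing are mine.

```python
# (stated probability of finishing by the named time, observed fraction who did)
reported = [(0.50, 0.13), (0.75, 0.19), (0.99, 0.45)]


def calibration_gaps(pairs):
    """Return (stated, observed, stated - observed) for each confidence level."""
    return [(stated, observed, stated - observed) for stated, observed in pairs]


for stated, observed, gap in calibration_gaps(reported):
    print(f"stated {stated:.0%}, observed {observed:.0%}, "
          f"overconfident by {gap * 100:.0f} percentage points")
```

A well-calibrated group would show gaps near zero at every level; here the gap is large at all three.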

Newby-Clark et al. (2000) found that asking subjects for their predictions based on realistic "best guess" scenarios, and asking subjects for their hoped-for "best case" scenarios, produced indistinguishable results. When asked for their "most probable" case, people tend to envision everything going exactly as planned, with no unexpected delays or unforeseen catastrophes: the same vision as their "best case". Reality, it turns out, usually delivers results somewhat worse than the "worst case".

This paper discusses overconfidence after discussing the confirmation bias and the sub-problem of the disconfirmation bias. The calibration research is dangerous knowledge - so tempting to apply selectively. "How foolish my opponent is, to be so certain of his arguments! Doesn't he know how often people are surprised on their certainties?" If you realize that expert opinions have less force than you thought, you had better also realize that your own thoughts have much less force than you thought, so that it takes less force to compel you away from your preferred belief. Otherwise you become slower to react to incoming evidence. You are left worse off than if you had never heard of calibration. That is why - despite frequent great temptation - I avoid discussing the research on calibration unless I have previously spoken of the confirmation bias, so that I can deliver this same warning.

Note also that an expert strongly confident in their opinion is quite a different matter from a calculation made strictly from actuarial data, or strictly from a precise, precisely confirmed model. Of all the times an expert has ever stated, even from strict calculation, that an event has a probability of $10^{-6}$, they have undoubtedly been wrong more often than one time in a million. But if combinatorics could not correctly predict that a lottery ticket has a $10^{-8}$ chance of winning, ticket sellers would go broke.