Why Is Monitoring Accuracy So Poor in Number Line Estimation? The Importance of Valid Cues and Systematic Variability for U.S. College Students

Fitzsimmons, Charles J. ; Thompson, Clarissa A.
In: Metacognition and Learning, Vol. 19 (2024), Issue 1, pp. 21-52

Metacognitive monitoring, recognizing when one is accurate or not, is important because judgments of one's performance or knowledge often relate to control decisions, such as help seeking. Unfortunately, children and adults struggle to accurately monitor their performance during number-magnitude estimation. People's accuracy in estimating number magnitudes is related to math achievement and health risk comprehension. Thus, poor monitoring of number-magnitude estimation performance could pose problems when completing math tasks or making health decisions. Here, we evaluated why monitoring accuracy was so poor during number-line estimation, whether it was greater in the presence of a cue that was predictive of performance or when accounting for spatial skills, and the relation between monitoring judgments and control. Monitoring accuracy was greater in a condition in which familiarity, a cue adults commonly rely on to monitor their performance in this task, was predictive of estimation accuracy, compared to a condition in which familiarity was misleading. Although indices of monitoring accuracy did not improve when accounting for spatial skills, reducing variability by dichotomizing estimation performance into accurate or not improved monitoring accuracy metrics. Accurate monitoring was important because adults were more likely to ask for help when they were less confident in their estimate. Taken together, our data on monitoring accuracy suggested that (a) using cues predictive of accuracy is important for monitoring in number-line estimation, (b) adults are poor at detecting small differences in their performance, and (c) prior estimates of monitoring accuracy in number-line estimation may underestimate people's true monitoring ability.

Keywords: Metacognitive monitoring; Numerical cognition; Measurement; Metacognitive control; Number-line estimation; Spatial skills

Part of the data and writing in this study were included in the first author's dissertation. We report all measures included in the study, and all data and analytic scripts are available on the OSF project page (https://osf.io/g7rxn/). All analyses were run in R. This study was pre-registered on OSF (https://osf.io/myqjp/).

Metacognitive monitoring involves identifying one's own knowledge or recognizing when one is correct or incorrect. Accurate monitoring is important because people often self-regulate their learning by making control decisions, like seeking help, checking for errors, or restudying, based on their judgments about their performance and learning (e.g., Dunlosky & Hertzog, [17]; Dunlosky & Metcalfe, [18]; Efklides, [20]; Metcalfe, [42], [43]; Nelson & Narens, [50]). Thus, it is important to understand whether and how people monitor their performance in a variety of tasks. One task in which children and adults struggle to monitor their performance is number-line estimation. Number-line estimation involves thinking about how big a to-be-estimated number is (e.g., 1/2) relative to the two endpoints of the line (e.g., 0 and 1), then marking the physical location of the number on the line (e.g., with a slider or hatch mark). People's accuracy in estimating magnitudes is related to their math achievement (Fazio et al., [21]; Siegler & Thompson, [69]; Siegler et al., [70]), and improvements in magnitude knowledge are causally related to improved math performance (Booth & Siegler, [4], [5]). However, in a variety of different numerical ranges (e.g., fractions 0–1; 0–100; 0–1,000; 0–100,000; 0–1 billion), children and adults were poor at judging which numbers they estimated more accurately than others (Fitzsimmons et al., [26]; Fitzsimmons et al., [25]; Fitzsimmons & Thompson, [23]; Rivers et al., [64]; Wall et al., [77]). For example, children and adults were more confident when they estimated 1/2 than when they estimated 3/8, but they estimated both numbers with similar levels of accuracy (Fitzsimmons & Thompson, [23]; Fitzsimmons et al., [24]). Although the correspondence between their monitoring judgments and performance was above chance, indices of monitoring accuracy in all studies using the number-line estimation task suggest that people of various ages are poor at monitoring their number-magnitude estimates (Mγ of 0.10 to 0.35, where a γ of 1 is perfect monitoring).

The aims of the current study were to investigate why people are so poor at monitoring their number-line estimation accuracy, to estimate the upper limits of monitoring accuracy in this task, and to investigate the role of monitoring judgments in item-level control decisions. Accurate monitoring of number-magnitude knowledge is important because people interact with numbers when making health and financial decisions (for a review, see Peters, [60]). Recognizing when one understands or does not understand a number could have implications for informed decision making. Additionally, examining monitoring accuracy and control in number-line estimation can inform theories of metacognition more broadly.

We asked two questions about monitoring accuracy: (1) Are adults better at monitoring their performance when there is a cue available that is predictive of performance (i.e., a diagnostic cue)? And (2) are indices of adults' monitoring accuracy greater when noise in magnitude-estimation performance is reduced by controlling for spatial skills that we hypothesized influence estimation accuracy? Guided by inferential-judgment models (Brunswik, [9]; Koriat, [36]), we designed stimuli to test the upper limits of monitoring accuracy in this task by asking participants to estimate numbers whose surface-level features distinguished among a mixture of easy and difficult numbers. Then, we tested models of self-regulation that predict that control decisions will be more useful when monitoring is more accurate.

Why is monitoring accuracy so poor in number-line estimation?

Inferential-judgment models, like Brunswik's Lens Model ([9]; see also Koriat's [36] cue-utilization framework), provide a framework for understanding how people make judgments about their own performance. According to these frameworks, people judge their performance based on a variety of cues (i.e., sources of information), such as familiarity or perceived difficulty, to infer their accuracy or performance (for an example of inferential-judgment models in the domain of metamemory, see Bröder & Undorf, [8]). Monitoring judgments will be accurate to the extent that people rely on cues that are predictive of performance (i.e., diagnostic cues).

One reason monitoring accuracy is poor in number-line estimation is that people rely on cues that are not valid predictors of their performance (i.e., non-diagnostic cues). In a series of studies, Fitzsimmons, Thompson, and colleagues (Fitzsimmons & Thompson, [23]; Fitzsimmons et al., [24], [26]) found that children's and adults' item-level confidence judgments were related to their familiarity with the numbers. Unfortunately, familiarity was not a strong predictor of children's or adults' fraction-estimation accuracy, and as a result, their monitoring accuracy was poor. Thus, in line with predictions from the Lens Model (Brunswik, [9]; Hammond & Stewart, [32]), monitoring accuracy may be greater when familiarity predicts estimation accuracy. We tested whether manipulating stimuli so that unfamiliar fractions were difficult to accurately represent and estimate, and familiar fractions were easy to accurately represent and estimate, led to higher monitoring accuracy.

An additional reason monitoring accuracy may be poor in numerical-magnitude estimation is variability in performance that is unassociated with number representations. Performance on the number-line estimation task relies, at least in part, on participants' approximate magnitude representations (i.e., how big the to-be-estimated number is relative to the endpoints; Opfer et al., [57], [58]; Thompson et al., [73]; Clarke & Beck, [11]; Dehaene, [12], [13]; Fazio et al., [21]; Leibovich et al., [41]; but see Barth & Paladino, 2011), their strategy use (Fazio et al., [22]; Fitzsimmons et al., [25]; Peters et al., [59]; Sidney et al., [68]; Siegler & Thompson, [69]; Sullivan et al., [72]), their familiarity with the number and endpoints (Ebersbach et al., [19]), and their spatial skills for accurately marking the intended location on the line (Gilligan et al., [29]; Jirout & Newcombe, [35]; Jirout et al., [34]; Möhring et al., [44], [45]; Newcombe et al., [52]). Thus, even when individuals have accurate representations (e.g., fifth graders accurately estimate fractions on 0–1 lines; Braithwaite & Siegler, [7]), estimates are rarely exact. That is, even though an adult likely recognizes that 1/2 is equal to 0.5 and should be placed exactly in the middle of the line, they may estimate this number as being located at 0.47 or 0.52. Thus, compared to problems with an exact answer, such as in arithmetic, estimating number magnitudes involves additional variability unassociated with magnitude knowledge, such as spatial skills (e.g., Gilligan et al., [29]; Jirout et al., [34]; Möhring et al., [45]). If an individual relies on a cue (e.g., their familiarity with the number) to infer their understanding or representation of that number, they may not be considering other factors that impact their performance on the task (e.g., their spatial skill at marking the physical location of the number on the line), which could limit indices of monitoring accuracy.

In the current study, we tested whether reducing variability in estimation accuracy that is theorized to be independent of number-magnitude knowledge improved indices of monitoring accuracy. One source of number-estimation variability is an individual's spatial-scaling skills (Gilligan et al., [29]; Jirout et al., [34]; Möhring et al., [45]; Newcombe et al., [52]). Spatial scaling involves thinking about different-sized objects that have proportionally equivalent spatial areas, such as when one scales up a map to think of how the distance between two dots corresponds to a distance in the environment (see Fig. 1). This skill accounts for unique variance in children's number-line estimation precision, potentially because children need to scale their mental representation of number magnitudes to the physical size of the number line (e.g., Gilligan et al., [29]; Jirout et al., [34]; Möhring et al., [45]). We reasoned that people do not consider their spatial skills when judging their estimation accuracy. Therefore, controlling for variability in number-line estimation accuracy due to spatial-scaling skills may improve indices of monitoring accuracy.

Fig. 1 Scaled trial of the spatial-localization task adapted from Frick and Newcombe ([27]). In the original version of the task, participants were tasked with estimating the location of a white egg in a green "farm field" based on a map that was located to the side of the target field. Please see the online article for a color version of this figure

Monitoring accuracy is important because judgments sometimes relate to control

Models of self-regulation (Dunlosky & Hertzog, [17]; Metcalfe, [43]; Nelson & Narens, [50]) predict that people often make decisions, such as submitting an assignment for a reward or asking for help, based on monitoring judgments, and people often do so regardless of whether their judgments are aligned with performance (e.g., Fyfe et al., [28]; Nelson & Fyfe, [51]; O'Leary & Sloutsky, [54], [55]; Wall et al., [77]). For example, first, second, and fourth graders submitted their number-line estimates for a reward based on their confidence in estimating the numbers more than their accuracy in estimating them (Wall et al., [77]). Similarly, when children solved right-blank equivalence problems (e.g., 3 + 7 = 3 + __), they were much more likely to ask for help when they had lower confidence than when they had higher confidence (Fyfe et al., [28]; Nelson & Fyfe, [51]). However, children often incorrectly answered the right-blank equivalence problems with high confidence; thus, their control decisions were not very useful (i.e., they did not ask for help when they were incorrect because they were highly confident). In both number-line estimation and equivalence problems, children's confidence judgments were likely biased by the surface-level features of the problems, which subsequently biased their control decisions. If surface-level features of problems can bias judgments, then help seeking during number-line estimation may also be biased. However, when judgments better align with performance (i.e., better monitoring accuracy), help seeking should be more useful. In the current study, we investigated whether help seeking varied between conditions in which we expected surface-level features of problems to lead to better or worse monitoring accuracy.

The current study

Given that monitoring accuracy can be important for making self-regulatory control decisions like help seeking or restudying (e.g., Fyfe et al., [28]; Metcalfe, [43]; Nelson & Fyfe, [51]), and that number-line estimation assesses magnitude knowledge that is linked to math achievement, financial decision making, and medical decision making (Peters, [60]; Siegler et al., [71]), we investigated why adults are so poor at monitoring their performance in this task and whether some factors may positively influence their monitoring accuracy. We randomly assigned adults to a diagnostic-cue condition in which half of the items were raised to a power (e.g., (27/36)³) or to a non-diagnostic cue condition in which half of the items had large components (e.g., 27/64). We labeled the condition with fractions raised to a power of two or three the diagnostic-cue condition because we expected subjective familiarity to be diagnostic (i.e., predictive) of estimation accuracy in this condition. We expected that the surface-level features of the fractions (i.e., raised to a power or composed of large whole-number components) would influence adults' subjective familiarity and confidence on these items, even though the magnitudes of both number types were the same (i.e., 0.422 for both examples). However, we expected that items raised to a power would be difficult to estimate accurately, given that adults make systematic estimation errors when the to-be-estimated numbers or the endpoints of the line are unfamiliar (e.g., Chesney & Matthews, [10]; Opfer & DeVries, [56]; Thompson & Opfer, [74]). Thus, even though people may infrequently encounter fractions raised to a power in their everyday life, these stimuli provide an ideal test case for predictions from inferential-judgment models (Brunswik, [9]; Koriat, [36]).

We made three pre-registered (https://osf.io/myqjp) hypotheses. We hypothesized that participants would be more confident in their estimates of fractions that they rated as more familiar (H1a). However, we hypothesized that familiarity would be a better predictor of performance in the diagnostic-cue condition, in which half of the items were raised to a power (e.g., (27/36)³), compared to the non-diagnostic cue condition, in which fraction magnitudes had small (e.g., 1/2) or large components (e.g., 27/64) (H1b). If adults relied on familiarity to make their confidence judgments, monitoring accuracy should be greater when familiarity was predictive of performance compared to a condition when it was not. Thus, we hypothesized that indices of adults' monitoring accuracy would be greater in the diagnostic-cue condition compared to the non-diagnostic cue condition (H2a). Additionally, if variability in number-line estimation unassociated with number knowledge limits monitoring accuracy, then removing non-number-knowledge variability in performance should increase monitoring accuracy scores. Thus, we hypothesized that monitoring accuracy would be greater when removing estimation variability due to spatial-scaling skills (see Fig. 1) compared to when this variability was unaccounted for (H2b).

After completing the number-line estimation task, adults completed a help-seeking task in which they could ask for help identifying where to estimate each number on the line. Consistent with models of self-regulation (Metcalfe, [43]; Nelson & Narens, [50]), we expected all participants to ask for help on items that they were less confident in estimating (H3a). However, we predicted that confidence would be a better predictor of estimation performance in the diagnostic- than the non-diagnostic cue condition. Thus, we also hypothesized that participants' control decisions would be better predicted by estimation accuracy in the diagnostic-cue than in the non-diagnostic cue condition (H3b).

Given that our stimuli manipulation was essential to test predictions from inferential-judgment models (Brunswik, [9]; Koriat, [36]) and models of monitoring and control (Nelson & Narens, [50]), we pilot tested whether fractions raised to a power were objectively more difficult to estimate on number lines.

Study 1

We pilot tested the materials for this study to confirm that adults were less accurate and confident in their estimates of fractions raised to a power relative to simplified fractions. We aimed to recruit 13 participants through the Kent State University subject pool, based on an effect of dz = 0.74 for a one-tailed paired-samples t-test with 80% power and alpha set to 0.05. This is the effect size that would be observed if adults estimated fractions raised to a power with 12% error, which is about the level at which fourth and fifth graders estimated fractions across component sizes in prior work (see Table 6 in Fitzsimmons & Thompson, [23]), and estimated small-component fractions with 7% error (the observed mean for adults in Fitzsimmons & Thompson, [23]); that is, a mean difference of 5 percentage points in percent absolute error (PAE). We expected that adults would estimate fractions raised to a power with at least as much error as fifth graders estimated large-component fractions, given that adults rarely encounter these types of items. Sixteen adults participated in the pilot study.
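This sample-size target can be reproduced with base R's power.t.test, entering the standardized paired difference (dz = 0.74) as delta with sd = 1 (a minimal sketch, not our original script):

# Sample size for a one-tailed paired t-test detecting dz = 0.74
# with 80% power and alpha = .05; delta equals dz when sd = 1.
power.t.test(delta = 0.74, sd = 1, sig.level = 0.05, power = 0.80,
             type = "paired", alternative = "one.sided")
# Returns n of about 13 pairs, i.e., 13 participants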

Adults estimated twelve fraction magnitudes, some with small whole-number components (e.g., 1/2) and others raised to a power (e.g., (12/15)²), under a three-second-per-trial time constraint. In prior work (Fitzsimmons et al., [25]), estimating under a time constraint did not reliably affect estimation accuracy relative to estimating under no time constraint. As expected, participants were less accurate in estimating, F(1, 15) = 12.47, p = 0.003, ηp² = 0.45, less confident in estimating, F(1, 15) = 32.60, p < 0.001, ηp² = 0.69, and less familiar with, F(1, 15) = 315.83, p < 0.001, ηp² = 0.96, fractions raised to a power compared to small-component fractions. This confirmed our expectation that adults had difficulty estimating fractions raised to a power of two or three. Thus, even though the surface-level features of fractions (e.g., component size) impact adults' confidence and familiarity in cases when they estimate accurately (e.g., on large-component items like 15/30 in Fitzsimmons et al., [26]), our pilot demonstrated that there are instances when adults' confidence and familiarity better align with their estimation accuracy.

Study 2

Method

Design

We randomly assigned participants to a diagnostic-cue condition (n = 70) or to a non-diagnostic cue condition (n = 67). In the diagnostic-cue condition, half of the fractions were raised to a power of two or three; in the non-diagnostic cue condition, half of the items had large components but were not raised to a power of two or three (see Table 1).

Table 1 Fraction stimuli between conditions

Item   Non-Diagnostic   Diagnostic   Magnitude
1      9/72             (38/76)³     0.125
2      16/74            (9/15)³      0.216
3      2/9              2/9          0.222
4      1/4              1/4          0.250
5      2/7              2/7          0.286
6      16/54            (38/57)³     0.296
7      16/49            (12/21)²     0.326
8      27/64            (51/68)³     0.422
9      4/9              4/9          0.444
10     1/2              1/2          0.500
11     5/9              5/9          0.556
12     22/38            (40/48)³     0.579
13     16/25            (12/15)²     0.640
14     25/36            (35/42)²     0.694
15     5/7              5/7          0.714
16     3/4              3/4          0.750
17     64/81            (32/36)²     0.790
18     52/61            (12/13)²     0.852
19     7/8              7/8          0.875
20     8/9              8/9          0.889

Tasks

Number-line estimation

Participants estimated 20 fraction magnitudes (see Table 1 for stimuli) on 0–1 number lines, one at a time, under a three-second time constraint. For participants randomly assigned to the non-diagnostic cue condition, items included a mix of small- and large-component fractions (e.g., 9/72, 1/4, 8/27, 25/36). For those in the diagnostic-cue condition, half of the fractions were raised to a power of two or three (e.g., (38/76)³, (24/36)³, (35/42)²). The magnitudes of the items were equivalent between conditions (rounded to three decimal places) and ranged from 0.125 to 0.889; at least one magnitude was located within each quarter of the range, and all base fractions had one- or two-digit components. Half of the base fractions that were raised to a power in the diagnostic group had larger whole-number components than the components of the equivalent magnitude in the non-diagnostic group, and the other half had smaller whole-number components. Items were presented in the same pre-randomized order in both conditions. We measured performance on this task as percent absolute error (PAE), which is the percentage of absolute deviation from the correct location on the line. If a participant estimated 1/2 as being located at 0.4, their PAE for that trial was 10% (|0.4 − 0.5| / 1 × 100). PAE is a measure of error, such that higher values reflect less accurate estimates.
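For concreteness, this scoring rule can be sketched in R as follows (function and variable names are illustrative, not from our analytic scripts):

# Percent absolute error (PAE) for number-line estimates: absolute
# deviation from the correct location, scaled by the line's range
# (1 for 0-1 lines), expressed as a percentage.
pae <- function(estimate, true, line_range = 1) {
  abs(estimate - true) / line_range * 100
}
pae(0.4, 0.5)  # returns 10: estimating 1/2 at 0.40 yields 10% error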

Spatial-localization

We adapted a spatial-localization task from prior work (Möhring et al., [45]) to administer online. Participants marked the location of a target on a referent line using a map that was scaled down by a factor of four (see Fig. 1) or a map that was the same size as the referent spatial area (i.e., a map-to-referent ratio of 1:1, Fig. 2). Thus, on scaled trials, a distance in the map corresponded to four times that distance in the referent space. A scaling factor of four was used in prior work that evaluated the relation between spatial scaling and number-line estimation (Möhring et al., [45]). We included the unscaled trials to account for shared method variance between the localization task and the number-line estimation task. Each location in the localization task corresponded to the location of a magnitude estimated in the number-line task. For example, participants estimated the location of 1/2 in the number-line task (i.e., 0.5), and they also estimated the location of a target on the referent line that was marked halfway across the map line. To ensure the scaling factor was equivalent for all participants, each image contained both the map and the referent area. The map was left justified and 187.5 × 10 pixels; the referent area was 750 × 40 pixels. Thus, even for participants who used different sized screens, the proportional difference between the map and referent space was equivalent (i.e., 1:4).
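The 1:4 ratio can be illustrated with the stated pixel widths (a sketch; the names are ours):

map_width      <- 187.5   # map width in pixels
referent_width <- 750     # referent-area width in pixels
scale_factor   <- referent_width / map_width   # 4, i.e., a 1:4 map-to-referent ratio
# A target 93.75 pixels from the left edge of the map (halfway across)
# corresponds to 375 pixels in the referent space (also halfway across):
93.75 * scale_factor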

Fig. 2 Unscaled trial from the scaling task

Participants marked the locations of 40 targets in 20 different locations, each corresponding to the location of one of the 20 to-be-estimated magnitudes from the number-line estimation task. They first completed 20 unscaled trials (i.e., 1:1 map-to-referent ratio) followed by 20 scaled trials (i.e., 1:4 map-to-referent-area ratio), for a total of 40 trials. This blocked format was used in prior work with this task (Frick & Newcombe, [27]; Möhring et al., [45]). Similar to number-line estimation, we calculated the average percent absolute error (PAE) by taking the absolute deviation between the true location and the estimate and dividing by the range of the possible area (750 pixels), separately for scaled and unscaled trials.

Confidence judgments

Immediately after each trial in the number-line estimation and spatial-scaling tasks, participants rated their confidence on a slider from not sure at all to totally sure. Responses were recorded on a 0–100 point scale.

Familiarity judgments

As in prior work (Fitzsimmons & Thompson, [23]; Fitzsimmons et al., [24]), participants rated their familiarity with the numbers they estimated in the estimation task in the same order as they estimated from 0 = not familiar at all, to 100 = very familiar. Participants typed a number between 0 and 100 in a textbox to indicate their familiarity. We instructed participants that, "How familiar you are means how much you remember seeing or working with the fraction before."

Metacognitive control

After estimating all numbers on number lines, participants indicated which items they would like to receive help on (Fyfe et al., [28]; Nelson & Fyfe, [51]). We instructed participants that, "For this next task, imagine that you're in a math class and you are preparing for a test on this content. Imagine that you want to do the best because if you do, you will win a gift card of your choice. On the page below, your prior estimates are displayed. You cannot change your prior estimate. Your job is to indicate which items you would like to get a hint from the instructor on how to estimate the number and where it goes on the line. For each of your estimates, please indicate whether you would or would not like a hint if given the opportunity."

Participants' number-line estimates were displayed on a single page as a reminder of where they estimated each number; they were unable to modify their prior estimate. Immediately below each of their displayed estimates, we asked, "Would you like a hint on how to estimate this number?" similar to how children's help-seeking behaviors were investigated by Nelson and Fyfe ([51]). Participants did not receive any hints on the problems.

Demographic information

Participants reported their gender, race, age, the number of math courses they have taken, and the type of device they used to complete the survey. In addition to asking participants to report their device, Qualtrics recorded the operating system used to complete the survey and the size of their browser window.

Participants

We pre-registered (https://osf.io/myqjp) that we would recruit at least 120 and up to 165 participants from the psychology department's subject pool at Kent State University. We based this sample size on a simulated power analysis conducted in R using the simr package (Green, 2016) to detect an interaction effect of -0.05 in a linear mixed effects model with by-subject and by-item random intercepts and a by-subject random slope for person-mean centered confidence judgments. We aimed to reach 80% power to test our primary hypotheses about monitoring accuracy, and we accounted for up to 10% attrition.
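Our pre-registration describes this simulation at a high level; the following is a hedged sketch of how such a power simulation can be run with simr, with the model structure following the description above (the data object and variable names are hypothetical):

library(lme4)
library(simr)
# Assumed model: item-level error predicted by person-mean-centered
# confidence, condition, and their interaction, with by-subject and
# by-item random intercepts and a by-subject slope for confidence.
fit <- lmer(pae ~ cj_c * condition + (cj_c | subject) + (1 | item),
            data = pilot_data)
fixef(fit)["cj_c:condition"] <- -0.05  # set the interaction to the target effect
powerSim(fit, test = fixed("cj_c:condition"), nsim = 1000)  # proportion of significant simulations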

A total of 226 participants opened the survey, but 55 were prevented from participating because they had a screen size smaller than the pre-registered minimum of 11 inches, as determined by a test of the pixel width of their browser at maximum size. We pre-registered the 11-inch computer or laptop screen-size requirement to eliminate the possibility that performance on the spatial-localization task was limited due to noise associated with smaller touch-screen use. We further excluded 5 participants who completed less than 10% of the survey. An additional 13 participants reported that they were less than 18 years old; we only had IRB approval to recruit adults 18 years old and older. We did not collect any identifying information that could be linked to parental permission forms, and thus their data were removed from analyses. We excluded an additional 16 participants who failed to meet pre-registered inclusion criteria: one participant for responding in the same location on the line on 19 of 20 estimation trials, eight for having a number-line estimation PAE that was 2 SD greater than the mean PAE for their randomly assigned condition, four for having an estimation error 2 SD greater than the mean for scaled trials in the spatial-localization task, and four for having an estimation error 2 SD away from the mean for unscaled trials. We also tested whether participants who incorrectly answered the attention-check question, which asked how long they would have to estimate in the timed number-line task (n = 7), differed from the remaining participants in their estimation error. However, average number-line estimation PAE and spatial-localization error for scaled and unscaled trials were similar between those who correctly answered the attention-check item and those who did not (ps >.15). Thus, we retained these participants for analyses.
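The 2 SD exclusion rule can be sketched as follows, assuming a data frame with one row per participant (dplyr syntax; data and column names are hypothetical):

library(dplyr)
# One row per participant with their mean number-line PAE;
# drop anyone more than 2 SD above their own condition's mean.
d_kept <- d %>%
  group_by(condition) %>%
  filter(mean_pae <= mean(mean_pae) + 2 * sd(mean_pae)) %>%
  ungroup()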

The remaining 137 participants (Mage = 19.28, SD = 1.62 years; range 18–30) primarily identified as White (n = 116; 84.67%) and female (n = 109; 79.56%). The large proportion of female participants is likely because there is a greater proportion of female Psychology majors in the department's participant pool. We confirmed there were no gender differences in monitoring accuracy, operationalized as gamma correlations between confidence and estimation accuracy (p =.851). On average, participants reported completing 5 (M = 4.74, SD = 1.54, range: 1–9) of the ten math courses we asked about in the demographics section.

Procedure

Participants completed the survey on their own device. To qualify, participants had to take the survey on a laptop or desktop computer (i.e., no mobile devices or tablets) with an 11-inch screen or larger. We coded the Qualtrics survey so that it excluded participants from participating if the survey was opened with a mobile device or if the browser size was less than or equal to 1190 × 760 CSS pixels.[1] This browser size is less than the CSS pixel width and height for the most popular 11-inch monitor sizes reported by screensiz.es in 2021 (Hammer, n.d., https://screensiz.es/chromebook). After participants provided informed consent, we instructed them to maximize their browser window and to not use a calculator at any point. They completed the number-line estimation task, then the control-decisions task, then the spatial-localization task, then rated their familiarity with the numbers, and concluded by answering demographic questions.

Results

Checks of random assignment

First, we confirmed there were no differences between conditions in the proportion of males and non-males (χ²(1) = 0.00, p = 1), self-reported age (t(113.27) = 0.08, p = 0.940), average number of math courses taken (t(133.96) = 1.19, p = 0.238), and average PAE in the spatial-localization task for scaled trials (t(132.71) = 0.49, p = 0.628) and unscaled trials (t(132.88) = 0.49, p = 0.625). We also tested whether participants' number-line estimation PAE differed when their response time was longer than versus within the three-second time constraint[2] to estimate numbers. Participants exceeded the time limit on 15.69% of trials (n = 430 out of 2,740 total estimates, calculated as the number of participants times the number of trials per participant), an average of 3.14 out of 20 possible trials. Participants were slightly, but not significantly, more accurate when they estimated within the time constraint (MPAE = 13.54%, SD = 7.16%) compared to when their response time was longer than the time constraint (MPAE = 16.11%, SD = 12.80%), t(125) = 1.94, p = 0.055. This suggests that participants whose response times were longer than the time constraint did not use this additional time to correctly calculate the exact magnitude of the to-be-estimated number. Given that the time constraint was put in place to prevent people from calculating exact magnitudes, we do not discuss this difference further.

Analytic overview

Below, we report descriptive statistics and correlations between conditions (see Tables 2, 3, and 4). Then, we report preliminary analyses testing whether the manipulation of item type between conditions affected number-line estimation accuracy, confidence judgments, and familiarity. Then, we report tests of the three primary hypotheses related to cue use, monitoring accuracy, and control decisions.

Table 2 Descriptive statistics between conditions

                                    Non-diagnostic cue      Diagnostic cue          Condition comparison
Variable                            Mean      SD            Mean      SD            t         p
1. Mean PAE SC fractions            7.42%     5.28%         9.34%     6.06%         -1.98     .050
2. Mean PAE LC fractions            11.72%    6.62%         27.63%    6.96%         -13.71    <.001
3. Mean unscaled PAE                2.80%     1.30%         2.92%     1.41%         -0.49     .625
4. Mean scaled PAE                  4.40%     2.10%         4.22%     2.30%         0.49      .628
5. Mean CJ SC fractions             72.19     11.63         68.42     16.35         1.56      .121
6. Mean CJ LC fractions             53.42     14.15         34.64     17.92         6.82      <.001
7. Mean familiarity SC fractions    76.46     20.83         87.89     15.93         -3.59     <.001
8. Mean familiarity LC fractions    29.00     22.58         17.72     23.20         2.88      .005

PAE = percent absolute error, with higher values indicating less accurate estimates. CJ = confidence judgments. LC = large components, SC = small components. The large-component fractions in the diagnostic-cue condition were all raised to a power of two or three, but in the non-diagnostic cue condition they were not

Descriptive statistics and preliminary analyses

In Tables 2, 3, and 4, we report descriptive statistics and correlations between conditions. For these tables, and for continuity with prior research, we also calculated within-individual gamma correlations for monitoring accuracy, cue use, cue diagnosticity, and monitoring-based control behaviors. Monitoring accuracy was operationalized as the within-individual gamma correlation between confidence and PAE, cue use as the relation between familiarity and confidence, cue diagnosticity as the relation between familiarity and PAE, and monitoring-based control as the relation between control decisions and confidence. We recognize that there are known limitations in using gamma to operationalize monitoring accuracy and therefore used mixed-effects models as our primary analytic approach (see Murayama et al., [49]). Mixed-effects models consider by-subject and by-item random effects, have lower Type I error rates than a wide array of measures of monitoring accuracy, and can be implemented with continuous outcome measures such as the error measure we used in the current study. However, our data are publicly available so that others can examine whether different measurement approaches, such as signal-detection approaches, yield similar results.

Table 3 Descriptive statistics between conditions for average within-individual gamma correlations for monitoring accuracy, cue use, cue validity, and control decisions

                                     Non-diagnostic cue         Diagnostic-cue             Condition comparison
Variable                             n     Mean      SD         n     Mean      SD         t        p
1. CJ-PAE Gamma                      67    0.24*     0.16       70    0.35*     0.17       -3.97    <.001
2. CJ-PAE Gamma SC Only              67    0.29*     0.27       70    0.26*     0.24       0.80     .427
3. CJ-PAE Gamma LC Only              67    0.06      0.28       70    0.08*     0.26       -0.42    .676
4. CJ-Familiarity Gamma              67    0.42*     0.22       69    0.61*     0.23       -4.93    <.001
5. CJ-Familiarity Gamma SC           61    0.47*     0.31       55    0.47*     0.34       -0.07    .947
6. CJ-Familiarity Gamma LC           63    0.12*     0.36       38    0.06      0.40       0.74     .462
7. Familiarity-PAE Gamma             67    0.19*     0.20       69    0.50*     0.24       -8.16    <.001
8. Familiarity-PAE Gamma SC          61    0.17*     0.35       55    0.26*     0.33       -1.43    .155
9. Familiarity-PAE Gamma LC          63    -0.10*    0.36       39    0.07      0.36       -2.30    .024
10. CJ-Control Gamma                 56    -0.45*    0.35       60    -0.79*    0.23       6.14     <.001
11. CJ-Control Gamma SC              33    -0.58*    0.45       21    -0.80*    0.23       2.29     .026
12. CJ-Control Gamma LC              47    -0.10     0.46       14    -0.34     0.60       1.36     .190
13. CJ-PAE Spatial-Scaling Gamma     66    0.08*     0.15       69    0.06*     0.15       0.91     .364

PAE = percent absolute error, with higher values indicating less accurate estimates. We recoded gamma correlations with PAE by multiplying values by -1 such that positive gamma correlations between PAE and confidence or familiarity reflect more accurate monitoring and more diagnostic cues (i.e., higher cj associated with more accurate estimates). CJ = confidence judgments. LC = large components, SC = small components. The large-component fractions in the diagnostic-cue condition were all raised to a power of two or three, but in the non-diagnostic cue condition they were not. Values marked with an * are reliably different from zero. n = number of participants in that analysis; participants for each measure varied due to invariance on one variable (e.g., all CJs the same) for some participants

Table 4 Correlations among variables in the experimental (above diagonal) and control group (below diagonal)

Variable                                             1      2      3      4      5      6      7      8      9      10     11     12     13
1. Mean PAE SC fractions                             –      0.18   0.13   0.28   -0.61  -0.20  -0.08  0.20   -0.53  -0.53  -0.56  0.18   0.18
2. Mean PAE LC fractions                             0.47   –      -0.01  -0.08  -0.21  -0.28  0.15   -0.20  0.16   0.02   0.33   0.10   0.09
3. Mean unscaled PAE                                 0.33   0.21   –      0.53   -0.08  -0.01  -0.04  -0.07  -0.11  -0.11  -0.16  -0.06  -0.05
4. Mean scaled PAE                                   0.24   0.17   0.17   –      -0.17  0.03   -0.31  0.05   -0.21  -0.30  -0.36  0.03   0.14
5. Mean CJ SC fractions                              -0.45  -0.59  -0.18  -0.07  –      0.54   0.09   0.11   0.43   0.42   0.24   -0.18  -0.19
6. Mean CJ LC fractions                              -0.14  -0.59  -0.09  0.15   0.72   –      -0.10  0.50   -0.09  -0.34  -0.14  0.32   -0.15
7. Mean familiarity SC fractions                     -0.30  -0.31  -0.08  -0.02  0.20   0.08   –      0.15   0.27   0.35   0.41   -0.05  0.07
8. Mean familiarity LC fractions                     -0.10  -0.10  <0.01  -0.15  0.16   0.11   0.30   –      -0.21  -0.30  -0.28  0.06   -0.16
9. Within-individual CJ-PAE gamma                    -0.14  0.15   -0.02  0.01   0.07   -0.21  -0.04  -0.07  –      0.57   0.56   -0.23  -0.02
10. Within-individual CJ-Familiarity gamma           -0.39  -0.16  -0.28  -0.24  0.29   -0.14  0.23   -0.20  0.12   –      0.55   -0.57  -0.15
11. Within-individual Familiarity-PAE gamma          -0.40  0.08   0.27   -0.21  0.15   -0.21  -0.01  -0.11  0.46   0.29   –      -0.08  -0.13
12. Within-individual CJ-control gamma               0.16   -0.07  0.11   0.45   -0.06  0.26   0.09   0.20   -0.27  -0.44  -0.25  –      0.06
13. Within-individual CJ-PAE spatial-scaling gamma   -0.35  -0.08  -0.18  -0.12  0.10   -0.15  0.09   0.11   0.02   0.19   0.16   -0.22  –

PAE = percent absolute error, with higher values indicating less accurate estimates. CJ = confidence judgments. LC = large components, SC = small components. The large-component fractions in the diagnostic-cue condition were all raised to a power of two or three, but in the non-diagnostic cue condition they were not. Bolded correlations at least p <.05. Italicized correlations p <.10

Consistent with hypotheses that we tested in linear mixed effects models below, indices of monitoring accuracy, cue use, and cue diagnosticity were all greater in the diagnostic-cue condition compared to the non-diagnostic cue condition when considering all items together (Table 3). However, when considering small- or large-component items independently, there were few differences in metacognitive monitoring, cue use, and cue diagnosticity between conditions. This highlights the importance of variability and contrast for monitoring accuracy, a point we revisit in the discussion.

As can be seen in Table 4, number-line estimation accuracy was positively related between small- and large-component fractions in the non-diagnostic cue condition, but not in the diagnostic-cue condition. In the diagnostic-cue condition, large-component fractions were raised to a power of two or three. As can be seen in Fig. 3, adults estimated fractions raised to a power of two or three as approximately one half, suggesting that they did not understand these numbers. This lack of understanding of the underlying magnitudes is likely why there was no relation between estimation accuracy for small- and large-component fractions in the diagnostic-cue condition. Additionally, comparing Tables 3 and 4 highlights the importance of considering within-person relations when evaluating monitoring accuracy and cue use. For example, the average within-person gamma correlation between familiarity and estimation accuracy for small-component fractions was reliably different from zero in both conditions (Mγ = 0.17 and Mγ = 0.26), but the average Pearson correlation was non-significant in the experimental group (r = -0.08) and significant in the control group (r = -0.30). Thus, it is important to account for individual differences in average metacognitive judgments and item-level variability when considering monitoring accuracy and cue use, something the mixed-effects models reported below account for (Murayama et al., [49]).

Fig. 3 Estimated magnitude by to-be-estimated magnitude in the non-diagnostic cue (left) and diagnostic-cue (right) conditions. The solid line represents perfect estimation accuracy (i.e., a slope of 1). Large-component fractions in the diagnostic-cue condition (right) were raised to a power of two or three

We next tested whether the manipulation of item type impacted estimation accuracy, confidence, and familiarity between conditions in a series of 2 (condition: experimental vs. control) × 2 (item type: small- vs. large-component fractions) between-within ANOVAs. Unless otherwise noted, the effects reported below replicated in analogous mixed-effects models with by-item and by-subject random intercepts and a by-subject random slope for item type. We report the more parsimonious mixed ANOVAs. As can be seen in Fig. 4, participants were more accurate in estimating, more confident in their estimates of, and more familiar with small-component fractions compared to large-component fractions. As expected, these differences were larger in the experimental condition, in which large-component fractions were raised to a power of two or three.

Fig. 4 Effects of item type and experimental condition on PAE, confidence, and familiarity. PAE = percent absolute error, which is inversely related to accuracy. In the experimental condition, large-component fractions were raised to a power of two or three. Results displayed are model-estimated means, and error bars represent one standard error from the mean based on the model

Number-line estimation PAE

There was a main effect of condition, F(1, 135) = 105.99, p <.001, ηp² = 0.44, an effect of item type, F(1, 135) = 320.45, p <.001, ηp² = 0.70, and a condition by item-type interaction, F(1, 135) = 122.89, p <.001, ηp² = 0.48, on estimation accuracy. Estimation accuracy was greater in the non-diagnostic cue compared to the diagnostic-cue condition and for small-component fractions compared to large-component fractions. The interaction revealed that participants estimated small-component fractions with similar levels of accuracy in the experimental and control groups (p =.204), but participants estimated large-component fractions more accurately in the non-diagnostic cue condition than in the diagnostic-cue condition, p <.001 (Fig. 4). Additionally, the difference in estimation accuracy between small- and large-component fractions was larger in the diagnostic-cue condition (Mdif = 18.29%, SE = 0.88%, p <.001) than in the non-diagnostic condition (Mdif = 4.30%, SE = 0.90%, p <.001). In a similar mixed-effects model that included by-item and by-subject random intercepts and a by-subject random slope for item type, there was only a marginal difference in estimation accuracy between small- and large-component fractions in the non-diagnostic cue condition (Mdif = 4.30%, SE = 1.69%, p =.076).

Confidence and familiarity judgments

Participants were more confident, F(1, 135) = 23.43, p <.001, ηp² = 0.15, in their estimates of fractions in the diagnostic-cue compared to non-diagnostic cue condition. There was no effect of condition on familiarity judgments, F(1, 135) < 0.01, p =.980, ηp² < 0.001. However, participants were more confident in estimating, F(1, 135) = 506.19, p <.001, ηp² = 0.79, and were more familiar with, F(1, 135) = 705.10, p <.001, ηp² = 0.84, small-component relative to large-component fractions. A condition by item-type interaction on confidence, F(1, 135) = 41.25, p <.001, ηp² = 0.23, and familiarity, F(1, 135) = 26.28, p <.001, ηp² = 0.16, indicated that the difference in confidence and familiarity between small- and large-component fractions was smaller in the non-diagnostic cue condition compared to the diagnostic-cue condition. Additionally, participants were equally confident in their estimates of small-component fractions between conditions, p =.488. However, familiarity with small-component fractions was greater in the diagnostic compared to non-diagnostic condition. This could be because participants saw small-component fractions along with fractions raised to a power of two or three, which made small-component fractions seem more familiar.

Cue use, monitoring accuracy, and control decisions

We tested our primary hypotheses about cue use, cue diagnosticity, and monitoring accuracy in a series of pre-registered linear mixed-effects models, as described by Murayama et al. ([49]), using the lme4 package (Bates et al., [3]) in R (R Core Team, [61]; Version 4.1.2). In each of these models, we included by-item and by-subject random intercepts, as well as a by-subject random slope for the continuous predictor (which varied by model). All random effects were justified by nested model comparisons of deviance. All models were fit with restricted maximum likelihood estimation (REML), and degrees of freedom were approximated using Satterthwaite's method via the lmerTest package (Kuznetsova et al., [40]). Given that we thought the variability between items would be important for metacognition (as shown in Table 3), we did not examine cue use and monitoring within each item type.

To account for individual differences in the predictor of interest, we mean centered the continuous predictor within each individual participant (i.e., familiarity_ij − mean familiarity_j, where familiarity_ij is the item-level familiarity for item i and participant j, and mean familiarity_j is participant j's average familiarity). Condition was entered into the model with control coded as -0.5 and experimental coded as 0.5. When there were significant interactions with condition, we examined the simple slope of the continuous predictor within each experimental condition. The most complex models of cue use and monitoring accuracy led to convergence issues due to the scaling of the variables. Thus, we rescaled familiarity and confidence by dividing each of the within-person mean centered values by 100. This led to convergence in all cases.

Familiarity cue use

We hypothesized that familiarity would be a cue participants used to rate their confidence across both conditions (H1a). We predicted item-level confidence from item-level familiarity (within-person mean centered), experimental condition, and their interaction in a linear mixed effects model. If participants relied on familiarity as a cue, then item-level familiarity would predict item-level confidence, such that participants were more confident when they estimated a number that they also rated as more familiar.
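The centering, rescaling, and this cue-use model can be sketched together as follows (lme4/lmerTest syntax; the data object and variable names are hypothetical, not our original script):

library(dplyr)
library(lme4)
library(lmerTest)  # Satterthwaite-approximated degrees of freedom
d <- d %>%
  group_by(subject) %>%
  # person-mean center familiarity, then divide by 100 to aid convergence
  mutate(fam_c = (familiarity - mean(familiarity)) / 100) %>%
  ungroup() %>%
  mutate(condition = ifelse(group == "experimental", 0.5, -0.5))
# By-item and by-subject random intercepts plus a by-subject random
# slope for the continuous predictor; fit with REML.
m_cue <- lmer(confidence ~ fam_c * condition + (fam_c | subject) + (1 | item),
              data = d, REML = TRUE)
summary(m_cue)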

As expected, participants relied on familiarity as a cue to make their confidence judgments, but the relation between familiarity and confidence was stronger in the diagnostic-cue condition and non-significant in the non-diagnostic cue condition. There was an effect of familiarity, b = 14.27, t(320.93) = 6.76, p <.001, of condition, b = -11.24, t(134.92) = 4.82, p <.001, and a condition by familiarity interaction, b = 19.68, t(125.17) = 6.21, p <.001. The effect of familiarity indicated that participants were 0.14 units higher in confidence when they were one unit greater in familiarity on a particular item than their own average familiarity.[3] The interaction indicated that familiarity predicted confidence in the diagnostic condition, b = 24.11, t(206.52) = 9.41, p <.001, but not in the non-diagnostic condition, b = 4.43, t(229.16) = 1.63, p =.104 (Fig. 5). Thus, participants in the diagnostic-cue condition were more confident when they estimated items that they were more familiar with, whereas participants in the non-diagnostic cue condition were not. This lack of relation between familiarity and confidence in the non-diagnostic cue condition is inconsistent with prior work (Fitzsimmons & Thompson, [23]; Fitzsimmons et al., [25]) and with the results when examining gamma correlations (see Table 3). We revisit this point in the discussion.

Fig. 5 Predicted confidence from person-mean-centered familiarity (left) and predicted PAE from familiarity (right) by condition. Higher percent absolute error reflects lower accuracy. Effects are based on modeled estimates. Gray around the lines represents ± 1 standard error

We also hypothesized that familiarity would predict accuracy across conditions (i.e., would be diagnostic of performance) but would be a stronger predictor of accuracy in the diagnostic-cue compared to non-diagnostic cue condition (H1b). To test this hypothesis, we re-ran the mixed model described above, predicting item-level estimation accuracy (as PAE). Participants in the experimental group estimated less accurately than those in the control group, b = 8.91, t(135.01) = 10.30, p <.001. In line with our hypothesis (H1b), familiarity predicted performance, b = -7.63, t(327.13) = 6.20, p <.001, and the effect of familiarity on performance varied by condition, b = -17.50, t(121.76) = 10.52, p <.001. The interaction between condition and familiarity indicated that familiarity related to estimation accuracy in the diagnostic-cue condition, b = -16.38, t(216.02) = 11.51, p <.001, but not in the non-diagnostic cue condition, b = 1.12, t(266.15) = 0.72, p =.471 (comparison of slopes, p <.001; see Fig. 5). This suggests that participants' confidence judgments should accurately predict their performance, especially in the diagnostic-cue condition, because these participants' familiarity predicted both their confidence and their performance.

Monitoring accuracy

We predicted that monitoring accuracy would be greater in the diagnostic-cue than the non-diagnostic cue condition (H2a) and that monitoring accuracy would be greater after removing noise due to spatial-localization skills (H2b). We tested these predictions by predicting item-level PAE from item-level confidence (within-person mean centered), condition, and their interaction. Accurate monitoring would be indicated by a significant relation between item-level confidence judgments and item-level PAE. As above, the final model included by-item and by-subject random intercepts, and a by-subject random slope for confidence judgments. We also rescaled confidence by dividing person mean centered confidence by 100 in order to get the final model to converge.

As expected, adults could accurately monitor their estimation performance, but they were better able to judge their estimation accuracy in the diagnostic-cue than the non-diagnostic cue condition. Participants estimated more accurately in the non-diagnostic cue condition than in the diagnostic-cue condition, b = 8.89, t(135.19) = 10.28, p <.001. As predicted, participants were more accurate (i.e., had lower PAE) when they were more confident, b = -14.73, t(271.38) = 9.92, p <.001, indicating accurate metacognitive monitoring. And, as expected, a condition by confidence interaction, b = -15.65, t(133.90) = 6.62, p <.001, indicated that the relation between confidence and performance varied by condition. Simple slope tests indicated that participants' confidence better predicted their performance in the diagnostic-cue condition, b = -22.56, t(166.08) = 12.76, p <.001, than in the non-diagnostic-cue condition, b = -6.90, t(237.89) = 3.42, p =.001 (comparison of slopes, p <.001; see Fig. 6). Thus, as expected, adults were better able to judge when they were more and less accurate in the condition where familiarity was more predictive of performance than the condition in which familiarity was less predictive of performance.

Fig. 6 Predicted percent absolute error from within-person mean centered confidence by condition (i.e., monitoring accuracy). Higher percent absolute error reflects lower accuracy. Effects are based on modeled estimates. Gray around the lines represents ± 1 standard error

We also expected that monitoring accuracy would be limited in the number-line estimation task by variability in performance that reflects spatial-localization skills rather than number knowledge. We tested whether reducing this noise in estimation performance improved indices of monitoring accuracy by adding estimation error from the scaled and unscaled trials of the spatial-localization task to the mixed-effects model described above. There were 89 missing trials from the spatial-localization task (44 scaled, 45 unscaled; participants were not forced to respond on these trials), so we re-ran the model on monitoring accuracy only for those with complete data so we could directly compare the coefficients between models with and without controlling for spatial-localization accuracy.

When adding percentage of estimation error from the scaled and unscaled versions of the spatial-localization tasks as covariates to the model with confidence judgments, condition, and the confidence by condition interaction, there was little change in the strength of the relation between confidence and PAE. In the model without spatial-localization, person mean centered confidence predicted PAE, b = -15.04, p <.001. When adding spatial-localization to the model, the coefficient was nearly identical, b = -15.03, p <.001. This suggests that monitoring accuracy, as operationalized in the current study, may not be limited due to estimation error associated with spatial skills.

However, recall that spatial-localization PAE was unrelated to number-line estimation PAE for large-component fractions in the diagnostic-cue condition (Table 4). Additionally, spatial-scaling PAE, b = 10.77, t(2193.78) = 1.51, p =.132, and unscaled PAE, b = 7.94, t(2210.70) = 0.83, p =.409, were both unrelated to number-line estimation PAE in the mixed-effects model. Thus, it could be that the relation between confidence and PAE increased only in the control condition or only for small-component fractions. However, simple slope tests indicated that the slope for confidence predicting PAE within the control condition was similar between models that did or did not control for localization performance (without localization: b = -7.13, p =.001; controlling for localization: b = -7.09, p =.001). Similarly, the slope between confidence and PAE for small-component fractions across conditions was similar without controlling for localization accuracy, b = -13.28, p <.001, and when controlling for localization accuracy, b = -13.22, p <.001.

Thus, spatial-localization skills in number-line estimation do not appear to limit metrics of people's monitoring accuracy. It could be that people take into account some degree of their estimation error when they judge their confidence in the number-line estimation task. We revisit this issue and alternative explanations in the discussion.

The upper limit of monitoring accuracy

We tested whether dichotomizing estimation performance into accurate (< 5% error) or inaccurate (> 5% error) increased the gamma correlations between performance and confidence judgments as an index of monitoring accuracy in the number-line estimation and spatial-localization tasks. Dichotomizing accuracy led to a stronger gamma correlation than when gamma was calculated with continuous error in the non-diagnostic cue condition (dichotomized: Mγ = 0.36, SD = 0.29; continuous: Mγ = 0.24, SD = 0.16), t(66) = 4.76, p < 0.001, and diagnostic-cue condition (dichotomized: Mγ = 0.50, SD = 0.31; continuous: Mγ = 0.36, SD = 0.17), t(69) = 5.45, p < 0.001. However, there was not a reliable increase in gamma correlations for the spatial-localization task. Gamma correlations were similar when performance on the spatial-localization task was dichotomized or measured as percentage of error in the non-diagnostic cue condition (dichotomized: Mγ = 0.11, SD = 0.35; continuous: Mγ = 0.08, SD = 0.15), t(64) = 0.90, p = 0.371, and diagnostic-cue condition (dichotomized: Mγ = 0.12, SD = 0.36; continuous: Mγ = 0.06, SD = 0.15), t(67) = 1.86, p = 0.067.

We also estimated the maximum upper limit of monitoring accuracy in the number-line estimation task. We expected that there are likely individual differences in the cutoff point at which an individual considers an estimate accurate or inaccurate. For each individual participant, we recalculated gamma correlations between confidence and accuracy, where accuracy was coded in increasing 1% increments from 1% error to 20% error. We calculated gamma correlations between confidence and accuracy for each cutoff point and selected the highest gamma value. For example, a cutoff of 3% error may lead to the highest gamma for one participant, but a cutoff of 4% error may lead to the highest for another. Thus, in these analyses, a variety of accuracy cutoffs were used across participants to account for potential individual differences in their judgments. The maximum possible gamma indicated highly accurate monitoring in the diagnostic-cue condition (Mγ = 0.80, SD = 0.20) and accurate monitoring in the non-diagnostic cue condition (Mγ = 0.73, SD = 0.24). Thus, we estimate the upper limit of monitoring accuracy in this task as quite high, at approximately 0.80.
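This cutoff search can be sketched for a single participant with a small self-contained Goodman-Kruskal gamma (the vectors confidence and pae stand in for hypothetical trial-level data):

# Goodman-Kruskal gamma from concordant and discordant pairs (ties dropped)
gk_gamma <- function(x, y) {
  conc <- 0; disc <- 0
  n <- length(x)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) conc <- conc + 1
      if (s < 0) disc <- disc + 1
    }
  }
  (conc - disc) / (conc + disc)
}
# Dichotomize accuracy at every cutoff from 1% to 20% error (1 = accurate,
# so positive gamma reflects accurate monitoring) and keep the maximum.
gammas <- sapply(1:20, function(k) gk_gamma(confidence, as.numeric(pae < k)))
max(gammas, na.rm = TRUE)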

Although these results seem promising, dichotomizing performance on the number-line estimation task inherently reduces the potential variability among estimates that are close to the correct location. One reason the improvement is likely greater in the number-line estimation task compared to the spatial-localization task is because there are underlying differences in adults' understanding of the to-be-estimated number, and there are cues (e.g., the to-be-estimated number, the component size, whether the item is raised to a power or not) that may activate some of their understanding in the number-line task that are unavailable in the spatial-localization task. This is further supported by the difference in monitoring accuracy between the diagnostic- and non-diagnostic cue conditions when performance was dichotomized.

Control behaviors

We next evaluated whether participants' confidence and number-line estimation accuracy predicted the likelihood that they would ask for help on particular items. Participants saw their prior estimates and indicated whether or not they would like a hint on each individual item. We hypothesized that lower confidence would predict a greater likelihood of help-seeking behaviors for all participants (H3a), but that estimation accuracy would be a stronger predictor of the likelihood of seeking help in the diagnostic than in the non-diagnostic condition (H3b). To evaluate these hypotheses, we predicted the likelihood that participants sought help on an individual item (coded as 0 = no help and 1 = requested help) from their confidence on the item, their PAE on the item, condition, the interaction between confidence and condition, and the interaction between PAE and condition. Confidence and performance were both within-person mean centered. Our final model, as justified by nested model comparisons, included by-item and by-subject random intercepts and a by-subject random slope for confidence judgments. To achieve convergence, we used the Nelder-Mead optimizer as implemented in lme4. The model with by-subject random slopes for both confidence and performance did not converge. The fixed effects were similar between the model with a by-subject random slope for estimation accuracy and the model with a by-subject random slope for confidence; we report effects from the latter. For the effects reported from this mixed-effects logistic model, we report Type III Wald chi-squared significance tests (the results do not differ when reporting z-tests).
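
As a sketch of this model specification (again with hypothetical column names: help coded 0/1, confidence_cwc and pae_cwc within-person mean centered, plus condition, subject, and item), the model can be fit with lme4 and tested with car::Anova:

library(lme4)
library(car)  # Anova() for Type III Wald chi-squared tests

help_fit <- glmer(
  help ~ confidence_cwc * condition + pae_cwc * condition +
    (1 + confidence_cwc | subject) + (1 | item),
  data = dat, family = binomial,
  control = glmerControl(optimizer = "Nelder_Mead"))

Anova(help_fit, type = 3)  # Type III Wald chi-squared tests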

As expected, participants were more likely to ask for help on items that they rated with lower confidence than their own average confidence, b = -2.34, χ2(1) = 26.24, OR = 0.10, p < .001. However, an interaction between condition and confidence, b = -2.99, χ2(1) = 13.72, p < .001, indicated that confidence was more strongly related to the likelihood of seeking help in the diagnostic-cue condition, b = -3.83, SE = 0.65, OR = 0.02, than in the non-diagnostic cue condition, b = -0.84, SE = 0.57, OR = 0.22 (comparison of slopes, p < .001; see Fig. 7). Across conditions, adults were more likely to ask for help on problems that they estimated with more error, b = 3.44, χ2(1) = 26.45, OR = 31.29, p < .001, and this effect did not vary between conditions, b = 0.12, χ2(1) = 0.01, p = .929 (see Fig. 7).

Graph: Fig. 7Predicted probability of help seeking from confidence (left) and accuracy (as percent absolute error; right) by condition. Higher percent absolute error reflects lower accuracy. Effects are based on modeled estimates. Gray around the lines represents ± 1 standard error. We extended the range of confidence and percent absolute error for these figures to estimate predicted values to clearly illustrate when the probability of help seeking approaches zero and 100%. The observed range of within-person mean centered confidence was from -68.60 to 68.15. The observed range of within-person mean centered percent absolute error was -28.69 to 66.36

As expected, participants were more likely to seek help in the diagnostic-cue condition than in the non-diagnostic cue condition, b = 1.20, χ2(1) = 5.81, OR = 3.33, p = .016. This greater frequency of help seeking was largely driven by participants seeking help more often on the large-component trials in the diagnostic-cue condition. For example, participants asked for help on more large-component trials in the diagnostic-cue condition (M = 8.59, SD = 2.96) than in the non-diagnostic cue condition (M = 5.68, SD = 3.43), t(121.28) = 5.18, p < .001, but they asked for help on a similar number of small-component trials in the diagnostic-cue (M = 1.73, SD = 3.11) and non-diagnostic cue (M = 1.89, SD = 2.79) conditions, t(129.97) = 0.31, p = .758. Moreover, in a logistic mixed-effects model predicting the likelihood of help seeking from item type (small- vs. large-component), condition, estimation accuracy (as PAE), and all two- and three-way interactions, with by-item and by-subject random intercepts and a by-subject random slope for item type, participants were more likely to ask for help on large-component than small-component items, b = 7.51, SE = 0.95, χ2(1) = 63.03, OR = 1,822.28, p < .001. The effect of item type on the likelihood of seeking help also held in a model that included within-person mean centered confidence judgments, b = 6.85, p < .001. Thus, even when accounting for estimation accuracy or confidence judgments, participants were influenced by the surface-level features of the problems when deciding which items to ask for help on.

We also explored the average PAE for items on which people did and did not ask for help, between conditions. Eight participants did not respond to at least one control trial, and an additional 21 participants provided invariant or incomplete responses: eight asked for help on all items, 10 never asked for help, and three asked for help on some items but did not respond to others. To estimate PAE by help seeking and condition more accurately while retaining these invariant or missing responses, we ran a linear mixed-effects model predicting item-level PAE from control decisions (did not ask for help: -0.5; asked for help: 0.5), condition (non-diagnostic cue: -0.5; diagnostic cue: 0.5), and their interaction. We included by-item and by-subject random intercepts and a by-subject random slope for control decisions, as justified by nested model comparisons of fit. Participants estimated less accurately on items that they asked for help on than on items they did not ask for help on, b = 6.88, t(225.14) = 8.67, p < .001, and less accurately in the diagnostic-cue than in the non-diagnostic cue condition, b = 8.67, t(111.68) = 9.25, p < .001. However, an interaction between condition and help seeking qualified these effects, b = 10.95, t(116.35) = 8.03, p < .001. Follow-up comparisons of estimated marginal means indicated that there was no difference in estimation accuracy in the non-diagnostic cue condition between items that participants asked for help on (M = 10.47%, SE = 1.44%) and did not ask for help on (M = 9.07%, SE = 1.22%), p = .543. However, in the diagnostic-cue condition, participants were much less accurate on problems that they asked for help on (M = 24.62%, SE = 1.38%) than on problems they did not ask for help on (M = 12.27%, SE = 1.25%), p < .001 (see Fig. 8).
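
A minimal sketch of this model and the follow-up contrasts, again with hypothetical column names (help_c and cond_c holding the -0.5/0.5 codes described above):

library(lmerTest)
library(emmeans)

dat$help_c <- ifelse(dat$help == 1, 0.5, -0.5)                  # asked for help = 0.5
dat$cond_c <- ifelse(dat$condition == "diagnostic", 0.5, -0.5)  # diagnostic cue = 0.5

pae_fit <- lmer(pae ~ help_c * cond_c + (1 + help_c | subject) + (1 | item), data = dat)

# Estimated marginal means: help seeking vs. not, within each condition
emmeans(pae_fit, pairwise ~ help_c | cond_c,
        at = list(help_c = c(-0.5, 0.5), cond_c = c(-0.5, 0.5)))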

Graph: Fig. 8Estimated percent absolute error (PAE) by help seeking and experimental condition. Percent absolute error is inversely related to accuracy (i.e., lower PAE is more accurate). Effects are based on modeled estimates. Gray around the lines represents ± 1 standard error

Discussion

In the current study, we evaluated why monitoring accuracy is often so poor during number-line estimation, whether it is greater when there is a salient cue that is predictive of performance or when accounting for spatial skills, and how monitoring judgments relate to control decisions. Metacognitive monitoring, the ability to recognize when one is accurate or not, is important because judgments of one's own performance or knowledge can be related to control decisions, such as asking for help on a problem (Nelson & Fyfe, [51]; Nelson & Narens, [50]; O'Leary & Sloutsky, [54], [55]; Roebers & Spiess, [65]; Roebers et al., [66]; Wall et al., [77]). We theorized that people are poor at accurately monitoring their number-line estimation performance because (a) they lack diagnostic cues to distinguish when they are more and less accurate, and (b) there is additional variability in estimation accuracy unassociated with number-magnitude representations.

Consistent with our first hypothesis and inferential-judgment models of monitoring (Brunswik, [9]; Koriat, [36]), we found that monitoring accuracy was greater in the diagnostic-cue condition compared to a non-diagnostic cue condition. In the diagnostic-cue condition, we manipulated the surface features of fractions (i.e., some raised to a power of two or three) so that participants' subjective familiarity systematically covaried with accuracy. Thus, the same cue, familiarity with stimuli, can sometimes be diagnostic and other times non-diagnostic of performance. Accurate judgments of performance were important because adults asked for help on items that they were less confident in estimating, consistent with theories of self-regulated learning (e.g., Dunlosky & Hertzog, [17]; Metcalfe, [42], [43]; Nelson & Narens, [50]). As a result of their confidence being more strongly related to their actual performance, help-seeking choices better distinguished between more and less accurate estimates among participants in the diagnostic-cue than in the non-diagnostic cue condition.

Although monitoring accuracy was greater in the diagnostic-cue compared to the non-diagnostic cue condition, it was still poor in the current study. For example, in the diagnostic-cue condition, the average within-individual gamma correlation was only 0.36, and the slope for confidence predicting PAE was -0.23. A slope of -0.23 suggests that when participants were 1 unit higher than their own average confidence, they were 0.23 points lower in PAE.[4]

We hypothesized that part of the reason for poor monitoring in this task is noise in estimation unassociated with people's knowledge of numbers, due in part to their spatial-localization skills (Gilligan et al., [29]; Jirout et al., [34]; Möhring et al., [45]). Unexpectedly, we did not find support for this hypothesis, potentially because spatial-scaling skills were only related to number-line estimation accuracy for small-component fractions (see Table 4). Even when limiting analyses to only small-component fractions, or to only the control condition, accounting for spatial-scaling skills did not increase the confidence-to-PAE slope in the mixed-effects models reported here. However, reducing noise in performance by dichotomizing estimation accuracy did increase indices of monitoring accuracy. This suggests that indices of monitoring accuracy may underestimate people's monitoring ability due to the way the task is measured and noise in performance. Performance on the number-line estimation task is measured as a degree of error, and adults may have struggled to distinguish among estimates that varied by only a few percentage points of error (e.g., to judge which of two estimates was more accurate when they were made with 2% and 3% error). The role of this noise in number-line estimation performance (i.e., small differences in error) is similar to how noise due to correct guessing on multiple-choice tests can negatively bias indices of people's monitoring ability (Vuorre & Metcalfe, [76]).

To our surprise, there were small differences in estimation accuracy and familiarity for small-component items between conditions, despite adults in both conditions estimating the exact same items. We reason that the difference in familiarity for small-component fractions between conditions highlights the role of anchoring and the comparative nature of metacognitive judgments. That is, 3/8 is perceived as more familiar in a list with highly unfamiliar fractions raised to a power of two or three than in a list that includes slightly less familiar items, such as 13/18. This relative-judgment effect is similar to adults rating medium-difficulty paired associates as easier to remember in the context of difficult word pairs than when the same medium-difficulty paired associates were presented in the context of easy word pairs (Laursen & Fiacconi, 2021). We are unsure why adults estimated small-component fractions more accurately in the non-diagnostic cue than in the diagnostic-cue condition. One possibility is that the perceived difficulty of estimating fractions raised to a power of two or three interfered with participants' ability to estimate the small-component fractions or decreased their motivation to process these items effortfully. However, we avoid overinterpreting this effect given that differences in PAE for small-component fractions were small and not central to our research questions or hypotheses.

We were also surprised that the relation between confidence and help seeking differed between the diagnostic- and non-diagnostic cue conditions (see Fig. 7 and Table 3). Although we are not sure why confidence was more strongly related to the probability of asking for help on an item in the diagnostic than in the non-diagnostic cue condition, we provide two speculative, and interrelated, explanations. First, we suspect this interaction might be a statistical artifact of the heterogeneity in help seeking between conditions (see Mood, [46], for a discussion of issues related to between-group effects and heterogeneity in logistic regression). Participants asked for help more frequently in the diagnostic than in the non-diagnostic cue condition, and participants' average confidence when they asked for help was lower in the diagnostic-cue than in the non-diagnostic cue condition.[5] We suspect that the stronger relation between confidence and the probability of help seeking in the diagnostic-cue condition is due to this greater frequency of help seeking and lower confidence.

Second, participants may have based their control decisions on information beyond their confidence judgments alone. For example, item type predicted the probability of help seeking, even when accounting for confidence judgments. Thus, participants may have decided to ask for help on items based on the features of the fractions (i.e., size of the components). Given that both confidence and help seeking were predicted by condition, item type, and the condition by item type interaction, it could be that confidence mediates the relation between experimental condition and help seeking. That is, condition predicted differences in confidence between item types, and in turn confidence differentially predicted the probability of help seeking.

Theoretical and practical implications of metacognitive monitoring and control

Metacognitive monitoring

Monitoring was better (i.e., there was a closer relation between confidence and performance) when one of the cues participants relied on (i.e., familiarity) did, in fact, predict performance, which is consistent with inferential-judgment models (Brunswik, [9]; Koriat, [36]) and other work evaluating the role of cues in monitoring accuracy (e.g., Koriat & Ackerman, [38]; Mueller & Dunlosky, [47]; Mueller et al., [48]; Roebers et al., [66]). When cues are not valid predictors of performance, monitoring accuracy is hindered. For example, the font size of words in paired-associate learning does not affect recall, yet when participants believed that it did and relied on this cue to make their judgments of learning, their monitoring accuracy was hindered (e.g., Mueller et al., [48]). In contrast to metacognitive illusions like the font-size effect, some cues, such as normed difficulty or response latency, are often valid predictors of performance and memory. For example, second graders increasingly relied on their response latency to judge their recall performance when learning the meanings of Japanese characters (Kanji), and as they came to rely more on this cue, their judgments of their memory became more accurate (Roebers et al., [66]).

We specifically designed our stimuli to align subjective familiarity with participants' estimation accuracy as a test of inferential-judgment models of monitoring. However, there are likely other cues that participants rely on when monitoring their performance that may covary with their subjective familiarity. For example, fluency (e.g., length of time spent studying word pairs; Koriat & Ackerman, [38]), item difficulty (e.g., Koriat, [37]; Koriat et al., [39]; Undorf & Erdfelder, [75]), and perceived effort or difficulty (e.g., Baars et al., [2]; Nussinson & Koriat, [53]) are common cues that relate to participants' metacognitive judgments. In the current study, familiarity may have covaried with participants' perceived effort in both conditions: large-component items, regardless of whether they were raised to a power of two or three, were likely perceived as more difficult and less familiar than small-component items. However, in the non-diagnostic cue condition, participants could use strategies to estimate large-component fractions accurately, whereas those in the diagnostic-cue condition could not. Thus, an important goal for researchers and learners is to identify cues that are predictive of performance and that can be used to better monitor learning and performance.

In addition to the role of cues in accurate monitoring, our exploratory analyses suggested that variability in estimation accuracy was important for monitoring accuracy. For example, adults' standard deviation in estimation accuracy was correlated with greater monitoring accuracy as measured by gamma correlations (r = 0.25, p = .003).[6] Additionally, monitoring accuracy, cue use, and control were similar between conditions when considering small- or large-component fractions only (Table 3). It is likely that participants struggle to judge small differences in their estimation accuracy; monitoring accuracy was likely greater in the diagnostic-cue condition because the difficulty of items was clearly dichotomized and covaried systematically with participants' familiarity. In other words, adults may easily judge that they estimated numbers like 2/9 more accurately than numbers like (12/15)², but less easily distinguish whether they estimated 2/9 more accurately than 5/9.
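
This exploratory correlation can be computed directly from per-participant summaries, reusing the hypothetical dat and the g_cont vector from the earlier sketch:

sd_pae <- tapply(dat$pae, dat$subject, sd)  # each participant's variability in error
cor.test(sd_pae, g_cont)                    # relate variability to monitoring (gamma)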

Although our study focused on number-line estimation, the role of variability in monitoring accuracy has implications for other tasks as well. For example, college students' judgments of their learning of cued paired associates were more accurate when considering a mixture of related (e.g., dog-cat) and unrelated (e.g., dog-spoon) pairs than when considering only related or only unrelated pairs (e.g., Dunlosky & Matvey, [15]). In other words, students could predict that they would recall related pairs better than unrelated pairs, but they were less able to distinguish which related and unrelated pairs they would better recall within a given list type. Similarly, relative monitoring accuracy in metacomprehension is often poor (see Yang et al., [79], for a meta-analysis). Although there are several hypotheses explaining why metacomprehension is poor, homogeneity in performance across text passages may limit it. For example, Weaver and Bryant ([78]) found that metacomprehension was greatest for medium-difficulty texts relative to easy or hard texts. Furthermore, Dunlosky and Lipko ([14]) noted that poor comprehension may be another factor that limits people's metacomprehension accuracy. In other words, it is hard for people to judge which texts they comprehend more or less well if they struggle to comprehend all of them. However, when easy and hard items are mixed (i.e., there is heterogeneity in the difficulty of content), people may better distinguish which content they know more and less well.

Adults' ability to recognize their difficulty understanding unfamiliar numbers, like fractions raised to a power (Braithwaite & Siegler, [6]), is important because they sometimes encounter and need to reason about uncommon numbers. For example, people may encounter especially large numbers, like the U.S. national debt (> $28,000,000,000,000) and geological time (e.g., the Paleozoic era began approximately 542,000,000 years ago; see Resnick et al., [63]), or especially small numbers, such as molecules as small as 1/1,000,000,000 of a meter. Our findings suggest that people have some awareness of when they know number magnitudes well (e.g., ½) compared to less well (e.g., (12/15)²). An open question is whether the same awareness of a lack of knowledge or ability is present in younger children who are just learning about fractions.

Metacognitive control

The current study aligns with models of self-regulated learning and prior work in which people asked for help when they were less confident (e.g., Dunlosky & Hertzog, [17]; Fyfe et al., [28]; Metcalfe, [42], [43]; Nelson & Fyfe, [51]; Nelson & Narens, [50]; Wall et al., [77]). However, surface-level characteristics of the problems (i.e., the fraction components or whether fractions were raised to a power of two or three) influenced people's confidence and help seeking: participants were more likely to seek help on large-component than on small-component items, even when controlling for estimation performance or confidence judgments. Consistent with prior work (e.g., Nelson & Fyfe, [51]; Wall et al., [77]), participants' help-seeking behaviors corresponded better with their confidence judgments than with their performance. For example, first, second, and fourth graders' likelihood of seeking help was better predicted by their confidence than by their estimation performance (Wall et al., [77]). In the current study, there was little difference in estimation accuracy between items on which participants in the non-diagnostic cue condition did and did not ask for help; these participants asked for help even when they estimated accurately. It is likely that the surface-level features of the problems led them to believe their estimates were less accurate than they actually were. It is also possible that demand characteristics of the task influenced help seeking in the non-diagnostic cue condition; that is, participants may not have asked for help on individual questions had they not been prompted to do so. In the diagnostic-cue condition, however, participants' confidence and accuracy differed between items that they did and did not ask for help on. Given that they were not very accurate when they asked for help, their help seeking could have benefited their performance and learning had help been provided.

Future work may examine whether help seeking leads to improved performance. To understand when control behaviors can lead to learning gains, Dunlosky and colleagues ([16]) recently outlined the contingent-efficacy hypothesis, according to which restudy choices lead to learning gains when three criteria are met: (1) restudy can produce large learning gains, (2) there is an achievable learning goal (i.e., prior knowledge is not at ceiling or floor), and (3) people can accurately identify when they are correct or not. According to this hypothesis, participants in the non-diagnostic cue condition may not have gained much by asking for help: their performance was already accurate, and their judgments were only moderately good at distinguishing when they were more or less accurate. Participants in the diagnostic-cue condition, in contrast, may have benefited substantially from help seeking (had help been provided): they asked for help when they were less accurate, and they likely had enough prior knowledge to benefit. The degree to which they would benefit from asking for help would thus depend on the effectiveness of the help provided. Furthermore, accurately asking for help when one misunderstands numbers in financial or health domains could be related to better financial and health decisions, a possibility that could be explored in future work.

Implications for number-magnitude estimation and spatial-scaling skills

To our knowledge, this is also the first study to evaluate whether adults can quickly estimate the magnitudes of fractions raised to a power, and the relation between spatial-scaling skills and fraction-magnitude estimation accuracy. Adults sometimes overextend properties of integer arithmetic to fractions (i.e., whole-number bias; Alibali & Sidney, [1]). For instance, adults sometimes think that multiplication makes larger and division makes smaller, which is generally true when the multiplier and divisor are integers, but not when they are fractions. Similarly, an integer raised to a power greater than one results in a larger magnitude (e.g., 5² = 25), whereas a fraction between 0 and 1 raised to a power greater than one results in a smaller magnitude. As can be seen in Fig. 3, adults did not seem to use this heuristic strategy: they failed to differentially estimate fractions raised to a power of two or three and instead placed these large-component fractions at approximately the halfway point of the line.

We also tested whether spatial skills (localization and scaling) played a role in fraction-magnitude estimation accuracy in adults, similar to how these skills are related to number-magnitude estimation in children (Gilligan et al., [29]; Jirout et al., [34]; Möhring et al., [45]). Similar to the effects observed with whole numbers in children, we found reliable relations between spatial-scaling estimation accuracy and number-line estimation accuracy for small-component fractions (see Table 3). It makes sense that there would be a relation between these tasks given that understanding a fraction magnitude requires understanding the proportional relation between the fraction magnitude (e.g., 1/2 = 0.5) and the endpoints of the number line. Additionally, spatial scaling is related to proportional-reasoning skills (e.g., Möhring et al., [45]), and proportional-reasoning skills are related to formal fraction understanding (e.g., Möhring et al., [44]).

However, scaling accuracy was unrelated to number-line estimation accuracy for large-component fractions in both the zero-order correlations and the mixed-effects models. It is unsurprising that estimation accuracy for fractions raised to a power was unrelated to spatial-scaling skills, given that participants did not know how to accurately estimate the magnitudes of these numbers within a three-second time constraint (see Fig. 3). However, spatial scaling was also unrelated to magnitude estimation of large-component fractions in the non-diagnostic cue condition, even though these participants estimated those numbers accurately. It is possible that adults used strategies to reason about these fractions without thinking about them spatially (e.g., Alibali & Sidney, [1]). For example, adults often transform large-component fractions into other forms, such as a simplified fraction or a percentage, and use of these transformation strategies is related to more accurate estimates (Fitzsimmons et al., [25]; Siegler et al., [70]).

Limitations and future directions

We were surprised that familiarity did not predict confidence judgments in the linear mixed-effects model for participants in the non-diagnostic cue condition, which is inconsistent with prior work (Fitzsimmons & Thompson, [23]; Fitzsimmons et al., [24]). One potential reason for this lack of a relation is that participants used adaptive strategies (e.g., transforming the number into a different form or segmenting the line) that led to greater confidence in their answers despite estimating numbers they were unfamiliar with. Participants in the diagnostic-cue condition, in contrast, likely failed to find an adaptive strategy for estimating unfamiliar fractions (i.e., fractions raised to a power of two or three) within the three-second-per-trial time constraint. Even though participants' familiarity was unrelated to their confidence in the non-diagnostic cue condition in the mixed-effects model, the average within-individual gamma correlation between confidence and familiarity was reliably different from zero (Table 3) and of medium size (Mγ = 0.43, SD = 0.21). This suggests an ordered relation between confidence and familiarity, such that adults were more confident in estimating numbers they were more familiar with.

An open question is whether an objective measure of familiarity would relate better to participants' confidence than the subjective measure used in the current study. For instance, Reder and Ritter ([62]) manipulated the number of times participants saw arithmetic problems (e.g., 23 + 42) and evaluated how the number of exposures influenced whether they felt they could recall the answer or needed to calculate it (i.e., feeling-of-knowing judgments). Although they did not measure familiarity in their study, they found that people were more likely to say they knew the answer (i.e., could recall it) on problems that had similar surface features (e.g., 42 × 23), even when they actually needed to calculate the answer. In this way, Reder and Ritter's manipulation changed participants' objective number of experiences with items. Similarly, Fitzsimmons and Thompson ([23]) found that children's confidence in estimating, comparing, and categorizing fractions increased after playing board games for two weeks, but that familiarity and confidence did not increase after a brief (5-minute) interaction with unfamiliar fractions. Thus, participants may need a substantial amount of experience with fractions for that experience to impact their familiarity and confidence. Future research could examine how the distribution of fractions across textbooks and classrooms covaries with participants' metacognitive judgments, and whether the distribution of problems biases participants' monitoring accuracy in ways similar to how textbook problem distributions can bias children's arithmetic accuracy (Braithwaite & Siegler, [6]).

Concluding remarks

In the current study, adults were better at monitoring their estimation accuracy when (a) there was systematic variability in their estimation performance, and (b) there was a cue that tracked this systematic variability. However, indices of monitoring accuracy suggested adults' monitoring was still limited in this task. We found that low variability in estimation accuracy may limit indices of monitoring accuracy. Additionally, in the same way that whole-number components often bias people's reasoning about fraction magnitudes, the components of the ratios also biased people's metacognitive judgments of familiarity and confidence, and their monitoring accuracy. This is problematic because people rely on their metacognitive judgments to ask for help on problems. Our data suggest that the way the target task is measured (i.e., estimation accuracy measured continuously vs. dichotomously as right/wrong) can impact indices of monitoring accuracy, that the same cue can be diagnostic (i.e., predictive) in some instances but misleading in others, and that monitoring accuracy is greater when a cue corresponds to systematic variability in performance.

Acknowledgements

Support for this research was provided in part by the U.S. Department of Education, Institute of Education Sciences Grants R305A160295 and R305U200004 to Clarissa A. Thompson. We would like to thank the members of the first author's dissertation committee, John Dunlosky, Nora Newcombe, Brad Morris, Chris Was, and Jeff Ciesla, for their valued input on this work.

Funding

This research was funded by Institute of Education Sciences (IES) grants R305A160295 and R305U200004.

Declarations

Conflicts of interest

The authors declare no conflicts of interest.

Ethical approval

This study was approved by Kent State University (protocol #: 21–289).

Informed consent

Adults provided informed consent via the online survey platform in line with the IRB protocol.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1 Alibali MW, Sidney PG. Variability in the natural number bias: Who, when, how, and why. Learning and Instruction. 2015; 37: 56-61. 10.1016/j.learninstruc.2015.01.003
2 Baars M, Wijnia L, de Bruin A, Paas F. The relation between students' effort and monitoring judgments during learning: A meta-analysis. Educational Psychology Review. 2020; 32: 979-1002. 10.1007/s10648-020-09569-3
3 Bates D, Maechler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software. 2015; 67; 1: 1-48. 10.18637/jss.v067.i01
4 Booth JL, Siegler RS. Developmental and individual differences in pure numerical estimation. Developmental Psychology. 2006; 42; 1: 189. 10.1037/0012-1649.41.6.189
5 Booth JL, Siegler RS. Numerical magnitude representations influence arithmetic learning. Child Development. 2008; 79; 4: 1016-1031. 10.1111/j.1467-8624.2008.01173.x
6 Braithwaite DW, Siegler RS. Children learn spurious associations in their math textbooks: Examples from fraction arithmetic. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2018; 44; 11: 1765.
7 Braithwaite DW, Siegler RS. Developmental changes in the whole number bias. Developmental Science. 2018; 21; 2. 10.1111/desc.12541
8 Bröder A, Undorf M. Metamemory viewed through the judgment lens. Acta Psychologica. 2019; 197: 153-165. 10.1016/j.actpsy.2019.04.011
9 Brunswik E. Representative design and probabilistic theory in a functional psychology. Psychological Review. 1955; 62; 3: 193-217. 10.1037/h0047470
10 Chesney DL, Matthews PG. Knowledge on the line: Manipulating beliefs about the magnitudes of symbolic numbers affects the linearity of line estimation tasks. Psychonomic Bulletin & Review. 2013; 20; 6: 1146-1153. 10.3758/s13423-013-0446-8
11 Clarke S, Beck J. The number sense represents (rational) numbers. Behavioral and Brain Sciences. 2021; 44: 1-62. 10.1017/S0140525X21000571
12 Dehaene S. Varieties of numerical abilities. Cognition. 1992; 44; 1–2: 1-42. 10.1016/0010-0277(92)90049-n
13 Dehaene, S. (2011). The number sense: How the mind creates mathematics (Rev. and updated ed.). Oxford University Press.
14 Dunlosky J, Lipko AR. Metacomprehension: A brief history and how to improve its accuracy. Current Directions in Psychological Science. 2007; 16; 4: 228-232. 10.1111/j.1467-8721.2007.00509.x
15 Dunlosky J, Matvey G. Empirical analysis of the intrinsic–extrinsic distinction of judgments of learning (JOLs): Effects of relatedness and serial position on JOLs. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2001; 27; 5: 1180.
16 Dunlosky J, Mueller ML, Morehead K, Tauber SK, Thiede KW, Metcalfe J. Why does excellent monitoring accuracy not always produce gains in memory performance?. Zeitschrift für Psychologie. 2021; 229; 2: 104-119. 10.1027/2151-2604/a000441
17 Dunlosky, J., & Hertzog, C. (1998). Training programs to improve learning in later adulthood: Helping older adults educate themselves (pp. 263–290). Routledge.
18 Dunlosky, J., & Metcalfe, J. (2009). Metacognition. Sage Publications.
19 Ebersbach M, Luwel K, Verschaffel L. The relationship between children's familiarity with numbers and their performance in bounded and unbounded number line estimations. Mathematical Thinking and Learning. 2015; 17; 2–3: 136-154. 10.1080/10986065.2015.1016813
20 Efklides A. Interactions of metacognition with motivation and affect in self-regulated learning: The MASRL model. Educational Psychologist. 2011; 46; 1: 6-25. 10.1080/00461520.2011.538645
21 Fazio LK, Bailey DH, Thompson CA, Siegler RS. Relations of different types of numerical magnitude representations to each other and to mathematics achievement. Journal of Experimental Child Psychology. 2014; 123: 53-72. 10.1016/j.jecp.2014.01.013
22 Fazio LK, DeWolf M, Siegler RS. Strategy use and strategy choice in fraction magnitude comparison. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2016; 42; 1: 1. 10.1037/xlm0000153
23 Fitzsimmons CJ, Thompson CA. Developmental differences in monitoring accuracy and cue use when estimating whole-number and fraction magnitudes. Cognitive Development. 2022; 61: 101148. 10.1016/j.cogdev.2021.101148
24 Fitzsimmons CJ, Thompson CA, Sidney PG. Confident or familiar? The role of familiarity ratings in adults' confidence judgments when estimating fraction magnitudes. Metacognition and Learning. 2020; 15; 2: 215-231. 10.1007/s11409-020-09225-9
25 Fitzsimmons CJ, Thompson CA, Sidney PG. Do adults treat equivalent fractions equally? Adults' strategies and errors during fraction reasoning. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2020. 10.1037/xlm0000839
26 Fitzsimmons, C., Morehead, K., Thompson, C. A., Buerke, M., & Dunlosky, J. (2021). Does studying worked examples improve numerical magnitude estimation? The Journal of Experimental Education, 1–26. https://doi.org/10.1080/00220973.2021.1891009
27 Frick A, Newcombe NS. Getting the big picture: Development of spatial scaling abilities. Cognitive Development. 2012; 27; 3: 270-282. 10.1016/j.cogdev.2012.05.004
28 Fyfe ER, Byers C, Nelson LJ. The benefits of a metacognitive lesson on children's understanding of mathematical equivalence, arithmetic, and place value. Journal of Educational Psychology. 2022; 114; 6: 1292. 10.1037/edu0000715
29 Gilligan KA, Hodgkiss A, Thomas MS, Farran EK. The developmental relations between spatial cognition and mathematics in primary school children. Developmental Science. 2019; 22; 4: e12786. 10.1111/desc.12786
30 Green P, MacLeod CJ. simr: An R package for power analysis of generalized linear mixed models by simulation. Methods in Ecology and Evolution. 2016; 7; 4: 493-498. 10.1111/2041-210X.12504
31 Hammer, J. (n.d.). Best laptop sizes: For which lifestyle does each one fit? Retrieved from: https://gizbuyerguide.com/average-laptop-screen-size-with-6-examples/. Accessed July 2021.
32 Hammond KR, Stewart TR. The essential Brunswik: Beginnings, explications, applications. 2001; Oxford University Press.
33 Jirout JJ, Newcombe NS. Mazes and maps: Can young children find their way?. Mind, Brain, and Education. 2014; 8; 2: 89-96. 10.1111/mbe.12048
34 Jirout JJ, Holmes CA, Ramsook KA, Newcombe NS. Scaling up spatial development: A closer look at children's scaling ability and its relation to number knowledge. Mind, Brain, and Education. 2018; 12; 3: 110-119. 10.1111/mbe.12182
35 Jirout, J., & Newcombe, N. S. (2018). How much as compared to what: Relative magnitude as a key idea in mathematics cognition. In Visualizing Mathematics (pp. 3–24). Springer, Cham.
36 Koriat A. Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General. 1997; 126; 4: 349. 10.1037/0096-3445.126.4.349
37 Koriat A. Easy comes, easy goes? The link between learning and remembering and its exploitation in metacognition. Memory & Cognition. 2008; 36: 416-428. 10.3758/MC.36.2.416
38 Koriat A, Ackerman R. Choice latency as a cue for children's subjective confidence in the correctness of their answers. Developmental Science. 2010; 13; 3: 441-453. 10.1111/j.1467-7687.2009.00907.x
39 Koriat A, Ackerman R, Lockl K, Schneider W. The easily learned, easily remembered heuristic in children. Cognitive Development. 2009; 24; 2: 169-182. 10.1016/j.cogdev.2009.01.001
40 Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software. 2017; 82; 13: 1-26. 10.18637/jss.v082.i13
41 Leibovich T, Katzin N, Harel M, Henik A. From "sense of number" to "sense of magnitude": The role of continuous magnitudes in numerical cognition. Behavioral and Brain Sciences. 2017; 40: 1-62. 10.1017/S0140525X16000960
42 Metcalfe J. Is study time allocated selectively to a region of proximal learning?. Journal of Experimental Psychology: General. 2002; 131; 3: 349. 10.1037/0096-3445.131.3.349
43 Metcalfe J. Metacognitive judgments and control of study. Current Directions in Psychological Science. 2009; 18; 3: 159-163. 10.1111/j.1467-8721.2009.01628.x
44 Möhring W, Newcombe NS, Levine SC, Frick A. Spatial proportional reasoning is associated with formal knowledge about fractions. Journal of Cognition and Development. 2016; 17; 1: 67-84. 10.1080/15248372.2014.996289
45 Möhring W, Frick A, Newcombe NS. Spatial scaling, proportional thinking, and numerical understanding in 5- to 7-year-old children. Cognitive Development. 2018; 45: 57-67. 10.1016/j.cogdev.2017.12.001
46 Mood C. Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review. 2010; 26; 1: 67-82. 10.1093/esr/jcp006
47 Mueller ML, Dunlosky J. How beliefs can impact judgments of learning: Evaluating analytic processing theory with beliefs about fluency. Journal of Memory and Language. 2017; 93: 245-258. 10.1016/j.jml.2016.10.008
48 Mueller ML, Dunlosky J, Tauber SK, Rhodes MG. The font-size effect on judgments of learning: Does it exemplify fluency effects or reflect people's beliefs about memory?. Journal of Memory and Language. 2014; 70: 1-12. 10.1016/j.jml.2013.09.007
49 Murayama K, Sakaki M, Yan VX, Smith GM. Type I error inflation in the traditional by-participant analysis to metamemory accuracy: A generalized mixed-effects model perspective. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2014; 40; 5: 1287. 10.1037/a0036914
50 Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In Psychology of Learning and Motivation (Vol. 26, pp. 125–173). Elsevier. https://doi.org/10.1016/S0079-7421(08)60053-5
51 Nelson LJ, Fyfe ER. Metacognitive monitoring and help-seeking decisions on mathematical equivalence problems. Metacognition and Learning. 2019; 14; 2: 167-187. 10.1007/s11409-019-09203-w
52 Newcombe NS, Levine SC, Mix KS. Thinking about quantity: The intertwined development of spatial and numerical cognition. Wiley Interdisciplinary Reviews: Cognitive Science. 2015; 6; 6: 491-505.
53 Nussinson R, Koriat A. Correcting experience-based judgments: The perseverance of subjective experience in the face of the correction of judgment. Metacognition and Learning. 2008; 3: 159-174. 10.1007/s11409-008-9024-2
54 O'Leary AP, Sloutsky VM. Carving metacognition at its joints: Protracted development of component processes. Child Development. 2017; 88; 3: 1015-1032. 10.1111/cdev.12644
55 O'Leary AP, Sloutsky VM. Components of metacognition can function independently across development. Developmental Psychology. 2018; 55; 2: 315-328. 10.1037/dev0000645
56 Opfer JE, DeVries JM. Representational change and magnitude estimation: Why young children can make more accurate salary comparisons than adults. Cognition. 2008; 108; 3: 843-849. 10.1016/j.cognition.2008.05.003
57 Opfer JE, Siegler RS, Young CJ. The powers of noise-fitting: Reply to Barth and Paladino. Developmental Science. 2011; 14; 5: 1194-1204. 10.1111/j.1467-7687.2011.01070.x
58 Opfer JE, Thompson CA, Kim D. Free versus anchored numerical estimation: A unified approach. Cognition. 2016; 149: 11-17. 10.1016/j.cognition.2015.11.015
59 Peeters D, Verschaffel L, Luwel K. Benchmark-based strategies in whole number line estimation. British Journal of Psychology. 2017; 108; 4: 668-686. 10.1111/bjop.12233
60 Peters E. Beyond comprehension: The role of numeracy in judgments and decisions. Current Directions in Psychological Science. 2012; 21; 1: 31-35. 10.1177/0963721411429960
61 R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
62 Reder LM, Ritter FE. What determines initial feeling of knowing? Familiarity with question terms, not with the answer. Journal of Experimental Psychology: Learning, Memory, and Cognition. 1992; 18; 3: 435-451. 10.1037/0278-7393.18.3.435
63 Resnick I, Davatzes A, Newcombe NS, Shipley TF. Using analogy to learn about phenomena at scales outside human perception. Cognitive Research: Principles and Implications. 2017; 2; 1: 21. 10.1186/s41235-017-0054-7
64 Rivers ML, Fitzsimmons CJ, Fisk SR, Dunlosky J, Thompson CA. Gender differences in confidence during number-line estimation. Metacognition and Learning. 2021; 16; 1: 157-178. 10.1007/s11409-020-09243-7
65 Roebers CM, Spiess M. The development of metacognitive monitoring and control in second graders: A short-term longitudinal study. Journal of Cognition and Development. 2017; 18; 1: 110-128. 10.1080/15248372.2016.1157079
66 Roebers CM, Mayer B, Steiner M, Bayard NS, van Loon MH. The role of children's metacognitive experiences for cue utilization and monitoring accuracy: A longitudinal study. Developmental Psychology. 2019; 55; 10: 2077. 10.1037/dev0000776
67 Scheibe, D., Fitzsimmons, C. J., Mielicki, M. K., Taber, J. M., Sidney, P. G., Coifman, K., & Thompson, C. A. (in press). Confidence in COVID problem solving: What factors predict adults' item-level metacognitive judgments on health-related math problems before and after an educational intervention? Metacognition and Learning.
68 Sidney PG, Thalluri R, Buerke ML, Thompson CA. Who uses more strategies? Linking mathematics anxiety to adults' strategy variability and performance on fraction magnitude tasks. Thinking & Reasoning. 2019; 25; 1: 94-131. 10.1080/13546783.2018.1475303
69 Siegler RS, Thompson CA. Numerical landmarks are useful—except when they're not. Journal of Experimental Child Psychology. 2014; 120: 39-58. 10.1016/j.jecp.2013.11.014
70 Siegler RS, Thompson CA, Schneider M. An integrated theory of whole number and fractions development. Cognitive Psychology. 2011; 62; 4: 273-296. 10.1016/j.cogpsych.2011.03.001
71 Siegler RS, Duncan GJ, Davis-Kean PE, Duckworth K, Claessens A, Engel M. Early predictors of high school mathematics achievement. Psychological Science. 2012; 23; 7: 691-697. 10.1177/0956797612440101
72 Sullivan JL, Juhasz BJ, Slattery TJ, Barth HC. Adults' number-line estimation strategies: Evidence from eye movements. Psychonomic Bulletin & Review. 2011; 18; 3: 557-563. 10.3758/s13423-011-0081-1
73 Thompson, C. A., Sidney, P. G., Fitzsimmons, C. J., Mielicki, M., Schiller, L. K., Scheibe, D. A., Opfer, J. E., & Siegler, R. S. (2022). Comments regarding Numerical estimation strategies are correlated with math ability in school-age children. Cognitive Development.
74 Thompson CA, Opfer JE. Costs and benefits of representational change: Effects of context on age and sex differences in symbolic magnitude estimation. Journal of Experimental Child Psychology. 2008; 101; 1: 20-51. 10.1016/j.jecp.2008.02.003
75 Undorf M, Erdfelder E. Separation of encoding fluency and item difficulty effects on judgements of learning. Quarterly Journal of Experimental Psychology. 2013; 66; 10: 2060-2072. 10.1080/17470218.2013.777751
76 Vuorre M, Metcalfe J. Measures of relative metacognitive accuracy are confounded with task performance in tasks that permit guessing. Metacognition and Learning. 2021; 17: 269-291. 10.1007/s11409-020-09257-1
77 Wall JL, Thompson CA, Dunlosky J, Merriman WE. Children can accurately monitor and control their number-line estimation performance. Developmental Psychology. 2016; 52; 10: 1493. 10.1037/dev0000180
78 Weaver CA, Bryant DS. Monitoring of comprehension: The role of text difficulty in metamemory for narrative and expository text. Memory and Cognition. 1995; 23: 12-22. 10.3758/BF03210553
79 Yang C, Zhao W, Yuan B, Luo L, Shanks DR. Mind the gap between comprehension and metacomprehension: Meta-analysis of metacomprehension accuracy and intervention effectiveness. Review of Educational Research. 2023; 93; 2: 143-194. 10.3102/00346543221094083

Footnotes

1 Note that screen resolution and CSS pixels often differ because screen resolution, which reflects screen size multiplied by pixel density, has increased in recent years. CSS pixel width and height is a more accurate reflection of screen size than resolution, or total number of pixels.
2 If participants exceeded the time constraint, the window flashed and prompted them to "Please respond within the time limit" but still permitted them to answer the item. Thus, there were no missing data on the number-line estimation task.
3 Recall that continuous predictors were rescaled by dividing the within-person mean centered variables by 100. We rescaled here for interpretation.
4 We rescaled confidence here to be on the same 100-point scale as PAE. Recall that in the primary analyses, we rescaled within-person mean centered confidence by dividing by 100.
5 We predicted confidence judgments from condition and help seeking (yes or no) in a mixed-effects model. Confidence was lower in the diagnostic than in the non-diagnostic cue condition, b = -10.53, p < .001, and when participants asked for help compared to when they did not, b = -12.09, p < .001. An interaction (b = -15.41, p < .001) revealed that confidence was similar for small-component items between conditions but lower when participants asked for help in the diagnostic-cue (M = 42.0, SE = 3.72) than in the non-diagnostic cue (M = 60.20, SE = 3.79) condition. And, in the non-diagnostic cue condition, confidence was similar whether participants asked for help or not (M = 64.6, SE = 3.59), p = .096.
6 Note that we multiplied gamma values by -1 because our measure of estimation accuracy was a degree of error. Thus, gamma values closer to 1 reflect more accurate monitoring with this coding.
