Factors affecting comprehension of contribution links in goal models: an experiment. (ER 2019)

Sotirios Liaskos, Wisal Tambosi

09 April, 2019

Abstract

This page accompanies the ER 2019 paper submission. The instruments used in this study can be found at the following links: Version 1 Students pretest, symbolic and numeric; Version 1 Mturk symbolic and numeric (pretest is embedded); and Version 2 Mturk pretest (CSI Index only) and symbolic and numeric. Readers/reviewers are free to use the instruments, but please specify “REVIEWER” when student or Mturk identification is requested. Data and PsyToolkit scripts will be available one year after presentation of the paper, or earlier upon request.

Data Descriptives

Demographics

A total of $102$ participants are included in the analysis: $27$ students ( $21$ males and $6$ females) and $75$ MT participants ( $41$ males and $34$ females).

A few collected responses are excluded because the participants did not attend the training videos properly, as established by the limited time spent on the corresponding screens.

##         MTurk Students
##                       
## Female     34        6
## Male       41       21

##         High-school Post-Secondary Masters PhD Other
##                                                     
## Female            6             28       3   3     0
## Male             15             37       6   1     3

##         STEM Business/Econ Health Social Humanities Arts Education Other
##                                                                         
## Female    16             7      1      7          4    3         2     0
## Male      33            15      2      3          4    3         1     1

The distribution of subjects across the two conditions is as follows.

##                 Group Symbolic Numeric
## Sample   Sex                          
## MTurk    Female             14      20
##          Male               23      18
## Students Female              2       4
##          Male               11      10

Individual Difference Metrics

Cognitive Style Index (CSI)

The overall CSI average was $47.91$, which is above the averages reported in the literature (about $44.53$ according to the CSI manual and Hmieleski & Corbett (2006), who studied US college students). For the specific samples it is as follows:

## Sample
##    MTurk Students 
## 48.42667 46.48148

Number of cases per high vs. low CSI index (with respect to the population average):

## csi_level
##  Low High 
##   41   61
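
The Low/High split used here (and for AMAS below) can be sketched as follows. This is a minimal illustration, assuming the cut-off is the normative CSI average of 44.53 quoted above and that scores strictly above it count as High. The `scores` vector is made up and `discretize_level` is a hypothetical helper, not the function used in the study.

```r
# Hypothetical helper: label each score relative to a normative average.
discretize_level <- function(score, normative_mean) {
  factor(ifelse(score > normative_mean, "High", "Low"),
         levels = c("Low", "High"))
}

scores <- c(39, 44, 45, 52, 61)               # made-up CSI scores
csi_level <- discretize_level(scores, 44.53)  # assumed cut-off
csi_counts <- table(csi_level)
print(csi_counts)
```

Keeping the result as a factor with explicit levels fixes the order in which the two categories appear in later tables and models.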

In general the Mechanical Turk sample appeared to be more “Analytical” (higher CSI index) than the Student sample. Both averages are higher than the normative average mentioned above, indicated here by a dashed line.

##                   csi_level Low High
## Sample   Group                      
## MTurk    Symbolic            13   24
##          Numeric             17   21
## Students Symbolic             4    9
##          Numeric              7    7

AMAS

The overall AMAS average is $20.28$, which is slightly below the averages reported in the literature (about $21.1$ according to D.R. Hopko et al. 2003). The score is lower among graduates than it is among current students. For the specific samples it is as follows:

## Sample
##    MTurk Students 
## 19.88000 21.40741

Again these are the numbers of cases above and below the population average.

## amas_level
## High  Low 
##   44   58

The dashed line represents the population average:

Accuracy Analysis

We now turn to the exploration of the various factors that affect accuracy. Accuracy is measured on a scale $[0,12]$ based on how many correct optimal solutions participants were able to identify out of the twelve ( $12$ ) models/problems they were exposed to. A response is correct when it agrees with the optimal solution that the authoritative semantics prescribe.

Main Effects

We first take a glimpse at the main effects of (a) AMAS level, (b) CSI score and (c) approach taken on overall correctness.

Looking at the first column, it becomes visible that:

The numeric group makes more accurate decisions than the symbolic group (comparing the blue versus red lines).
AMAS level appears to correlate negatively with accuracy for both groups.
CSI score appears to have a very mild negative correlation with accuracy.
There is an interaction between Approach and Representation (symbolic vs. numeric) with respect to accuracy, and also with respect to AMAS and CSI score.

A close-up can be seen below.

We can attempt to fit a first linear model as follows, noting that “Group” refers to the type of model representation.

library(car) # provides Anova() and leveneTest()
full <- lm(Correct ~ Group + amas_score + csi_score + Approach, data = dd)
Anova(full, type = 3) # Type 3 because cells are unbalanced

## Anova Table (Type III tests)
## 
## Response: Correct
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 209.11  1 44.0934 1.812e-09 ***
## Group       342.31  1 72.1802 2.345e-13 ***
## amas_score   26.81  1  5.6539   0.01938 *  
## csi_score    15.49  1  3.2662   0.07382 .  
## Approach     26.38  1  5.5625   0.02035 *  
## Residuals   460.02 97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We observe that the group difference (symbolic vs. numeric) is indeed highly unusual ( $p<0.001$ ) if we assume that the groups are equally accurate. Furthermore, the effect of AMAS score is also unusual ( $p<0.05$ ) under the null hypothesis that no such effect exists. The effect of CSI score is relatively more likely under the null hypothesis ( $p<0.1$ ) but still raises some suspicion. Finally, the approach the participants follow seems to have an effect, with working methodically generally leading to more accuracy ( $p<0.05$ ). However, approach seems to interact with other variables (and is also heteroscedastic), so we return to it below.
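
As a quick sanity check on how to read the table, each F value is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The sketch below recomputes two rows from the rounded sums of squares printed above, so small rounding differences from the tabulated values are to be expected.

```r
# Sums of squares and degrees of freedom copied from the Type III table above.
ss_group <- 342.31   # Sum Sq for Group
ss_amas  <- 26.81    # Sum Sq for amas_score
ss_resid <- 460.02   # residual Sum Sq
df_term  <- 1
df_resid <- 97

ms_resid <- ss_resid / df_resid            # residual mean square

f_group <- (ss_group / df_term) / ms_resid
f_amas  <- (ss_amas  / df_term) / ms_resid

# Upper-tail probabilities of the F distribution
p_group <- pf(f_group, df_term, df_resid, lower.tail = FALSE)
p_amas  <- pf(f_amas,  df_term, df_resid, lower.tail = FALSE)

round(c(F_group = f_group, F_amas = f_amas), 2)
c(p_group = p_group, p_amas = p_amas)
```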

Effect sizes are calculated and displayed below with respect to a discretization of the AMAS and CSI scores into Low and High based on their relationship to the population averages described above:

gd1<-effsize::cohen.d(data = dd,Correct ~ Group)
gd2<-effsize::cohen.d(data = dd,Correct ~ amas_level)
gd3<-effsize::cohen.d(data = dd,Correct ~ csi_level)
# For approach we need robust test due to violation of assumptions:
gd4<-akp.effect(Correct~Approach,data = dd,
           EQVAR = FALSE)

Representation (numeric vs. symbolic) explains a mean difference of $3.53$ correct decisions (out of $12$ maximum), meaning that those who used the symbolic models made $3.53$ more mistakes than those who used the numeric ones $(d=-1.51)$, which is a large effect.
AMAS Level explains a mean difference of $0.96$ correct decisions $(d=-0.33)$ which is a small effect.
CSI Score explains a mean difference of $0.93$ correct decisions $(d=0.32)$ which is a small effect.
Approach explains a mean difference of $1.32$ correct answers – robust $d$ between $-0.4$ and $-0.86$ .
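
For readers unfamiliar with the $d$ values quoted in this list, the textbook pooled-standard-deviation form of Cohen's $d$ can be sketched in base R as below. The function name `cohens_d` and the two score vectors are made up for illustration. The sketch is not the `effsize` implementation and omits confidence intervals and any small-sample correction.

```r
# Textbook Cohen's d for two independent groups, using the pooled SD.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s_pooled
}

acc_symbolic <- c(4, 5, 6, 5, 7)     # made-up accuracy scores
acc_numeric  <- c(8, 9, 10, 9, 11)   # made-up accuracy scores
d <- cohens_d(acc_symbolic, acc_numeric)
d
```

The sign depends on which group is listed first: with the symbolic group first and the numeric group scoring higher, $d$ is negative, as in the representation effect above. Read the same way, a mean difference of $3.53$ with $d=-1.51$ corresponds to a pooled standard deviation of roughly $2.3$ correct decisions.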

The following are some diagnostics of the main effects model, which do not seem to raise red flags except for a possible deviation from normality. The ANOVA models we develop here are generally considered to be fairly robust to slight deviations from normality, especially with sample sizes as large as ours ( $102$ ).

leveneTest(dd$Correct,dd$Group,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  1.4319 0.2343
##       100

leveneTest(dd$Correct,dd$csi_level,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1   0.024 0.8773
##       100

leveneTest(dd$Correct,dd$amas_level,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0635 0.8016
##       100

leveneTest(dd$Correct,dd$Approach,center=median) # Problem, hence the robust tests...

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   1  8.7893 0.003788 **
##       100                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
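
To make explicit what the median-centred Levene test above computes, here is a minimal base-R sketch of the same idea (often called the Brown-Forsythe variant): a one-way ANOVA on the absolute deviations of each observation from its group median. The helper `levene_median` and the data are made up for illustration, and the sketch is not meant to replace `leveneTest`.

```r
# One-way ANOVA on absolute deviations from the group medians.
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # |observation - group median|
  anova(lm(z ~ g))
}

y <- c(1, 2, 3, 10, 20, 30)              # made-up responses
g <- c("a", "a", "a", "b", "b", "b")     # made-up grouping
res <- levene_median(y, g)
res[["F value"]][1]                      # test statistic
res[["Pr(>F)"]][1]                       # p-value
```

A large F here means the typical spread around the median differs between groups, which is what motivates the robust tests used for Approach.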

To eliminate any suspicion of issues due to lack of normality, the following are some robust tests for representation and AMAS level, showing that both effects can be safely stated (the representation effect even when considering a Type I error correction to $p \leq 0.001$ ). The CSI level effect is not significant in these tests.

## Call:
## pbad2way(formula = Correct ~ Group + amas_level, data = dd, est = "median", 
##     nboot = 5000)
## 
##                  p.value
## Group             0.0000
## amas_level        0.0248
## Group:amas_level  0.8600

## Call:
## t2way(formula = Correct ~ Group + amas_level, data = dd)
## 
##                    value p.value
## Group            57.2130   0.001
## amas_level        7.2435   0.010
## Group:amas_level  0.0861   0.771

## Call:
## pbad2way(formula = Correct ~ Group + csi_level, data = dd, est = "median", 
##     nboot = 5000)
## 
##                 p.value
## Group            0.0000
## csi_level        0.5998
## Group:csi_level  0.3510

## Call:
## t2way(formula = Correct ~ Group + csi_level, data = dd)
## 
##                   value p.value
## Group           52.3929   0.001
## csi_level        0.7031   0.406
## Group:csi_level  0.0197   0.889

A conservative conclusion is therefore to disregard the effect of CSI.

Prior to moving on we also explore what would happen to the main effects model if model size (small versus large) were included as a factor. Small models have two alternatives and large models three. For that we resort to repeated-measures analysis via MANOVA as seen below.

m <- lm(cbind(Correct_Large, Correct_Small) ~ Group + amas_score + csi_score + Approach, data = dd)
Complexity <- ordered(c("Correct_Large", "Correct_Small"))
idata<-data.frame(Complexity)
modAn<-Manova(m,idata = idata, idesign  = ~ Complexity,type = "III")
print(modAn)

## 
## Type III Repeated Measures MANOVA Tests: Pillai test statistic
##                       Df test stat approx F num Df den Df    Pr(>F)    
## (Intercept)            1   0.31251   44.093      1     97 1.812e-09 ***
## Group                  1   0.42665   72.180      1     97 2.345e-13 ***
## amas_score             1   0.05508    5.654      1     97   0.01938 *  
## csi_score              1   0.03258    3.266      1     97   0.07382 .  
## Approach               1   0.05424    5.563      1     97   0.02035 *  
## Complexity             1   0.00051    0.050      1     97   0.82420    
## Group:Complexity       1   0.01306    1.284      1     97   0.26003    
## amas_score:Complexity  1   0.00295    0.287      1     97   0.59368    
## csi_score:Complexity   1   0.00754    0.737      1     97   0.39266    
## Approach:Complexity    1   0.02395    2.380      1     97   0.12616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We do not conclude any role of size or even interaction with the other terms; the effects discussed earlier re-appear as expected.

Interaction Effects

As noticed above the Approach taken by the participants seems to be interacting with group. The following graphs explore various interactions between Group, Approach and CSI Score/Level.

There are a few things to hypothesize:

When participants work Methodically (by their declaration) that seems to significantly improve their accuracy but only in numeric models.
Intuitive types have lots to gain when working methodically instead of intuitively – analytical types less so.
The CSI level does not seem to be a predictor of how the participants will work (Methodically vs. Intuitively)

At the same time, our hypothesis that AMAS interacts with representation type, with low AMAS types benefiting from the numeric representation, does not seem to be supported. Likewise, CSI does not seem to interact with representation type.

The interactions above between CSI score and Approach and between Group and Approach are visible, as is the absence of an interaction of CSI and AMAS score with representation. We may explore whether these observations are statistically notable.

## Anova Table (Type III tests)
## 
## Response: Correct
##                     Sum Sq Df F value   Pr(>F)   
## (Intercept)          30.87  1  7.1783 0.008758 **
## Sample                0.13  1  0.0309 0.860893   
## Group                 6.45  1  1.4999 0.223851   
## csi_score             4.27  1  0.9928 0.321700   
## amas_score           12.84  1  2.9869 0.087330 . 
## Approach              2.54  1  0.5912 0.443930   
## Group:csi_score       5.54  1  1.2875 0.259492   
## Group:amas_score      1.05  1  0.2449 0.621879   
## Group:Approach       28.67  1  6.6683 0.011410 * 
## csi_score:Approach   14.19  1  3.3001 0.072569 . 
## amas_score:Approach   3.48  1  0.8088 0.370850   
## Residuals           391.31 91                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction between representation and Approach seems to be significant. A test for the homogeneity of variance can be seen below.

leveneTest(dd$Correct,interaction(dd$Approach,dd$Group),center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  1.0118 0.3909
##       98

If we are still not confident about normality and about heteroscedasticity with respect to CSI level, robust tests are as follows.

## Call:
## pbad2way(formula = Correct ~ Group + Approach, data = dd, est = "median", 
##     nboot = 5000)
## 
##                p.value
## Group           0.0000
## Approach        0.0004
## Group:Approach  0.0480

The effect size is as follows:

# Cell means, SDs and sizes; rows are ordered with Group varying fastest within Approach
s<-aggregate(dd$Correct, by = list(dd$Group,dd$Approach),FUN = mean)
s2<-aggregate(dd$Correct, by = list(dd$Group,dd$Approach),FUN = sd)
s3<-aggregate(dd$Correct, by = list(dd$Group,dd$Approach),FUN = length)
# Pooled SD within each Approach level (rows 1-2 and rows 3-4)
spooled1 = sqrt(((s3[2,3]-1)*s2[2,3]^2 + (s3[1,3]-1)*s2[1,3]^2)/(s3[2,3] + s3[1,3]-2))
spooled2 = sqrt(((s3[4,3]-1)*s2[4,3]^2 + (s3[3,3]-1)*s2[3,3]^2)/(s3[4,3] + s3[3,3]-2))
sp1N = s3[2,3] + s3[1,3]
sp2N = s3[4,3] + s3[3,3]
# Pooled SD across the two Approach levels
spooled = sqrt(((sp1N-1)*spooled1^2 + (sp2N-1)*spooled2^2)/(sp1N+sp2N-2))
# Difference between the Group differences at the two Approach levels
meanDiff = ((s[4,3]-s[3,3]) - (s[2,3]-s[1,3]))
cohen.d_Interaction <- (meanDiff)/spooled
cohen.d.ci(cohen.d_Interaction,n1=sp1N,n2=sp2N)

##          lower   effect    upper
## [1,] 0.9928222 1.378529 1.758009

meanDiff

## [1] 3.377564

Thus, working methodically buys one 3.38 more correct answers in numeric models than in symbolic ones, a very large effect, and one that rests on more safely satisfied assumptions.
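
The mean difference reported here is a difference of differences between cell means. A minimal sketch of that computation is shown below with made-up data, indexing the cell means by name rather than by row position so that the result does not depend on how the aggregated rows happen to be ordered. The `toy` data frame is hypothetical and only mirrors the column names used above.

```r
# Made-up data: two observations per Group x Approach cell.
toy <- data.frame(
  Group    = factor(rep(c("Symbolic", "Numeric"), each = 4),
                    levels = c("Symbolic", "Numeric")),
  Approach = factor(rep(c("Intuitively", "Methodically"), times = 4),
                    levels = c("Intuitively", "Methodically")),
  Correct  = c(4, 5, 6, 5, 5, 9, 7, 11)
)

# Cell means as a Group x Approach matrix, indexed by name.
m <- tapply(toy$Correct, list(toy$Group, toy$Approach), mean)

# (Numeric - Symbolic | Methodically) - (Numeric - Symbolic | Intuitively)
mean_diff <- (m["Numeric", "Methodically"] - m["Symbolic", "Methodically"]) -
             (m["Numeric", "Intuitively"]  - m["Symbolic", "Intuitively"])
mean_diff
```

A value of zero would mean that the representation effect is the same under both approaches; the further it is from zero, the stronger the interaction.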

Simple Effects on Approach

The following tests can help us determine whether the representation effect is actually observed in both approaches.

full <- lm(Correct ~ Group ,dd[dd$Approach == "Intuitively",])
Anova(full,type="III")

## Anova Table (Type III tests)
## 
## Response: Correct
##             Sum Sq Df  F value    Pr(>F)    
## (Intercept) 409.60  1 108.2276 9.653e-10 ***
## Group         5.48  1   1.4479    0.2423    
## Residuals    79.48 21                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

full <- lm(Correct ~ Group ,dd[dd$Approach == "Methodically",])
Anova(full,type="III")

## Anova Table (Type III tests)
## 
## Response: Correct
##              Sum Sq Df F value    Pr(>F)    
## (Intercept) 1500.62  1 308.854 < 2.2e-16 ***
## Group        375.75  1  77.337 2.991e-13 ***
## Residuals    374.12 77                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Efficiency Analysis

Efficiency is operationalized as the number of correct answers per unit of time (per minute). As reliable response time data require supervision during administration we need to focus on our student sample, where the experiment happened in a lab.
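
As a small illustration of this operationalization, efficiency can be derived from a count of correct answers and a response time. The column name `time_sec` and the values below are made up, and whether the study timed the whole task or only parts of it is not stated here, so treat this as a sketch of the unit conversion only.

```r
# Made-up example: correct answers and total response time in seconds.
lab <- data.frame(Correct = c(9, 4, 12), time_sec = c(180, 240, 150))

# Correct answers per minute
lab$Efficiency <- lab$Correct / (lab$time_sec / 60)
lab
```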

We attempt a first exploration of CSI level, AMAS Level and Representation.

Approach (Intuitively vs. Methodically) is ignored due to its being highly unbalanced in this sample.

## 
##  Intuitively Methodically 
##            3           24

We can attempt a close-up at the seeming interaction between AMAS level and Efficiency.

No such effect seems to exist. In general, the best-fitting model is one that only accounts for Representation. Below is a stepwise model selection by the AIC criterion.

## Start:  AIC=80.99
## Efficiency ~ Group + csi_score + amas_score + Group:amas_score + 
##     Group:csi_score
## 
##                    Df Sum of Sq    RSS    AIC
## - Group:amas_score  1    0.0111 347.58 78.990
## - Group:csi_score   1   23.8252 371.40 80.779
## <none>                          347.57 80.989
## 
## Step:  AIC=78.99
## Efficiency ~ Group + csi_score + amas_score + Group:csi_score
## 
##                   Df Sum of Sq    RSS    AIC
## - amas_score       1    7.1734 354.76 77.541
## - Group:csi_score  1   25.8382 373.42 78.926
## <none>                         347.58 78.990
## 
## Step:  AIC=77.54
## Efficiency ~ Group + csi_score + Group:csi_score
## 
##                   Df Sum of Sq    RSS    AIC
## - Group:csi_score  1    23.992 378.75 77.308
## <none>                         354.76 77.541
## 
## Step:  AIC=77.31
## Efficiency ~ Group + csi_score
## 
##             Df Sum of Sq    RSS    AIC
## - csi_score  1    16.464 395.21 76.457
## <none>                   378.75 77.308
## - Group      1   131.516 510.27 83.356
## 
## Step:  AIC=76.46
## Efficiency ~ Group
## 
##         Df Sum of Sq    RSS    AIC
## <none>               395.21 76.457
## - Group  1    119.09 514.30 81.568
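
The AIC values in the trace above can be reproduced from the residual sums of squares. For a linear model, the stepwise procedure reports $n \log(\mathrm{RSS}/n) + 2k$, where $k$ is the number of estimated coefficients. The sketch below applies this to the last step, assuming $n = 27$ (the student sample) and using the rounded RSS values printed above; `step_aic` is a hypothetical helper name.

```r
# AIC as used in the stepwise trace for a linear model:
# n * log(RSS / n) + 2 * (number of estimated coefficients).
step_aic <- function(rss, n, n_coef) n * log(rss / n) + 2 * n_coef

n <- 27                                 # students in the lab sample
aic_group <- step_aic(395.21, n, 2)     # Efficiency ~ Group (intercept + Group)
aic_null  <- step_aic(514.30, n, 1)     # Group dropped (intercept only)

# Should be close to the 76.457 and 81.568 printed in the trace
round(c(with_group = aic_group, without_group = aic_null), 3)
```

Dropping Group raises the AIC by about five points, which is why the selection stops at the model that keeps Representation as the only predictor.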

## Anova Table (Type III tests)
## 
## Response: Efficiency
##             Sum Sq Df F value  Pr(>F)  
## (Intercept)  21.53  1  1.3617 0.25425  
## Group       119.09  1  7.5332 0.01105 *
## Residuals   395.21 25                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However:

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  5.4063 0.02848 *
##       25                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given the small sample size and the above heterogeneity results, it is preferable to consider a simple robust independent-samples test such as Yuen’s trimmed-means t-test, as seen below, followed by an exploration of effect sizes:

(y <- yuen(Efficiency~Group,data = stus))

## Call:
## yuen(formula = Efficiency ~ Group, data = stus)
## 
## Test statistic: 3.7775 (df = 9.41), p-value = 0.00403
## 
## Trimmed mean difference:  -3.07083 
## 95 percent confidence interval:
## -4.8976     -1.244

akp.effect(Efficiency~Group,data = stus,
           EQVAR = (leveneTest(stus$Efficiency,stus$Group)$`Pr(>F)`[1] > 0.05))

## [1] -6.6172781 -0.9320995

aggregate(Efficiency~Group,data = stus,FUN=mean)

##      Group Efficiency
## 1 Symbolic   1.286816
## 2  Numeric   5.490043

We then conclude that there is a statistically significant effect of representation format on efficiency. The effect is large: the trimmed mean difference is $3.07$ correct answers per minute.
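
Yuen's test compares trimmed means rather than ordinary means, which is why the difference reported by the test is smaller than the gap between the raw group means. The sketch below shows what trimming does, using base R only. It assumes a 20% trimming proportion, which we understand to be the default of `yuen`, and the data are made up.

```r
# 20% trimmed mean: drop the lowest and highest 20% of values, then average.
x <- c(1, 2, 3, 4, 100)                # made-up efficiencies with one outlier
plain_mean   <- mean(x)                # pulled upward by the outlier
trimmed_mean <- mean(x, trim = 0.2)    # averages the middle three values
c(plain = plain_mean, trimmed = trimmed_mean)
```

With few observations per group, a single very fast or very slow participant can dominate the ordinary mean, so the trimmed comparison is the more cautious summary.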

Confidence Analysis

Response Confidence

Confidence questions are, again, introduced for a subset of the data, namely $45$ data points from the Mechanical Turk population. As usual we begin with a scatter-plot matrix to explore possible interactions.

We may wish to explore effects by trying different models (using AIC step-wise model selection method):

## Start:  AIC=-27.65
## Response_Confidence ~ Group + csi_score + amas_score + Group:amas_score + 
##     Group:csi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - Group:csi_score   1   0.15266 18.799 -29.279
## <none>                          18.646 -27.645
## - Group:amas_score  1   1.91124 20.558 -25.254
## 
## Step:  AIC=-29.28
## Response_Confidence ~ Group + csi_score + amas_score + Group:amas_score
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          18.799 -29.279
## - Group:amas_score  1    1.9876 20.787 -26.756
## - csi_score         1    2.2469 21.046 -26.198

## Anova Table (Type III tests)
## 
## Response: Response_Confidence
##                   Sum Sq Df F value    Pr(>F)    
## (Intercept)      21.3851  1 45.5024 4.247e-08 ***
## Group             1.3756  1  2.9269   0.09486 .  
## csi_score         2.2469  1  4.7809   0.03469 *  
## amas_score        1.2842  1  2.7324   0.10616    
## Group:amas_score  1.9876  1  4.2291   0.04630 *  
## Residuals        18.7991 40                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To visually explore these effects:

Highly analytical respondents have lower response confidence (as expected by Hammond et al.) and, at the same time, mathematical anxiety seems to correlate with reduced confidence in quantitative models, but not in qualitative models.

Effect sizes assuming normality:

(egE<- effsize::cohen.d(data = d3,Response_Confidence ~ Group))

## 
## Cohen's d
## 
## d estimate: 0.2357723 (small)
## 95 percent confidence interval:
##      lower      upper 
## -0.3677225  0.8392671

(cE <- effsize::cohen.d(data = d3,Response_Confidence ~ csi_level))

## 
## Cohen's d
## 
## d estimate: -0.4175216 (small)
## 95 percent confidence interval:
##      lower      upper 
## -1.0290050  0.1939617

# Effect size of the interaction between Group and AMAS Level
# Cell means, SDs and sizes; rows are ordered with Group varying fastest within AMAS level
s<-aggregate(d3$Correct, by = list(d3$Group,d3$amas_level),FUN = mean)
s2<-aggregate(d3$Correct, by = list(d3$Group,d3$amas_level),FUN = sd)
s3<-aggregate(d3$Correct, by = list(d3$Group,d3$amas_level),FUN = length)
# Pooled SD within each AMAS level (rows 1-2 and rows 3-4)
spooled1 = sqrt(((s3[2,3]-1)*s2[2,3]^2 + (s3[1,3]-1)*s2[1,3]^2)/(s3[2,3] + s3[1,3]-2))
spooled2 = sqrt(((s3[4,3]-1)*s2[4,3]^2 + (s3[3,3]-1)*s2[3,3]^2)/(s3[4,3] + s3[3,3]-2))
sp1N = s3[2,3] + s3[1,3]
sp2N = s3[4,3] + s3[3,3]
# Pooled SD across the two AMAS levels
spooled = sqrt(((sp1N-1)*spooled1^2 + (sp2N-1)*spooled2^2)/(sp1N+sp2N-2))
# Difference between the Group differences at the two AMAS levels
meanDiff = ((s[2,3]-s[1,3]) - (s[4,3]-s[3,3]))
cohen.d_Interaction <- (meanDiff)/spooled
cohen.d.ci(cohen.d_Interaction,n1=sp1N,n2=sp2N)

##          lower      effect     upper
## [1,] -0.563163 -0.03927379 0.4853665

So the interaction effect is fairly negligible.

We test for homogeneity of variance:

leveneTest(d3$Response_Confidence,d3$amas_level,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.0077 0.3211
##       43

leveneTest(d3$Response_Confidence,d3$csi_level,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.2033 0.6543
##       43

leveneTest(d3$Response_Confidence,d3$Group,center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.3814 0.5401
##       43

leveneTest(d3$Response_Confidence,interaction(d3$Group,d3$amas_level),center=median)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  0.8108 0.4953
##       41

Given the patterns that the residuals seem to exhibit, it is probably advisable to also perform a robust ANOVA.

## Call:
## pbad2way(formula = Response_Confidence ~ Group + amas_level, 
##     data = d3, est = "median", nboot = 5000)
## 
##                  p.value
## Group             0.0062
## amas_level        0.1388
## Group:amas_level  0.0060

## Call:
## yuen(formula = Response_Confidence ~ csi_level, data = d3)
## 
## Test statistic: 0.2849 (df = 23.87), p-value = 0.77817
## 
## Trimmed mean difference:  -0.08333 
## 95 percent confidence interval:
## -0.6872     0.5205

The disappearance of the CSI effect is probably due to the discretization of scores into the binary “Low”/“High” scale necessary for performing robust ANOVA tests.

Method Confidence

We run similar procedures for method confidence but find no effects.

library(MASS) # provides stepAIC()
full <- lm(Method_Confidence ~ Group 
           + csi_score
           + amas_score
           + Group:amas_score
           + Group:csi_score
           , d3)
auto <-stepAIC(full)

## Start:  AIC=-24.93
## Method_Confidence ~ Group + csi_score + amas_score + Group:amas_score + 
##     Group:csi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - Group:amas_score  1   0.00894 19.815 -26.910
## - Group:csi_score   1   0.61527 20.421 -25.554
## <none>                          19.806 -24.930
## 
## Step:  AIC=-26.91
## Method_Confidence ~ Group + csi_score + amas_score + Group:csi_score
## 
##                   Df Sum of Sq    RSS     AIC
## - amas_score       1   0.26934 20.084 -28.302
## - Group:csi_score  1   0.60838 20.423 -27.549
## <none>                         19.815 -26.910
## 
## Step:  AIC=-28.3
## Method_Confidence ~ Group + csi_score + Group:csi_score
## 
##                   Df Sum of Sq    RSS     AIC
## - Group:csi_score  1   0.60994 20.694 -28.956
## <none>                         20.084 -28.302
## 
## Step:  AIC=-28.96
## Method_Confidence ~ Group + csi_score
## 
##             Df Sum of Sq    RSS     AIC
## - Group      1   0.04222 20.736 -30.864
## <none>                   20.694 -28.956
## - csi_score  1   1.21680 21.911 -28.385
## 
## Step:  AIC=-30.86
## Method_Confidence ~ csi_score
## 
##             Df Sum of Sq    RSS     AIC
## <none>                   20.736 -30.864
## - csi_score  1    1.2635 22.000 -30.203

Anova(auto,type=3)

## Anova Table (Type III tests)
## 
## Response: Method_Confidence
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 38.811  1 80.4793 2.094e-11 ***
## csi_score    1.264  1  2.6201    0.1128    
## Residuals   20.736 43                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Effect of Mathematical Tests

The analysis of the effect of the mathematical tests is more complex. There are two versions of the mathematical tests. All contain calculations between numbers in the interval [0.00,0.99] with two decimal digits precision. They are as follows:

Version 1 contains the following exercises:
- Four addition exercises, as in, 0.76 + 0.19 = ____ (user inputs result).
- Four multiplication exercises, as in, 0.28 x 0.27 = ____ (user inputs result).
- Four division exercises, as in, 0.46 / 0.62 = ____ (user inputs result).
- Four addition comparison exercises as in “Which one is larger” followed by a choice between 0.92 + 0.16 and 0.25 + 0.89 (~0.05 distance between options at all times).
- Four subtraction comparison exercises as in “Which one is larger” followed by a choice between 0.66 - 0.46 and 0.81 - 0.66 (~0.05 distance between options at all times).
- Four multiplication comparison exercises as in “Which one is larger” followed by a choice between 0.51 x 0.41 and 0.32 x 0.82 (~0.05 distance between options at all times).
- Four division comparison exercises as in “Which one is larger” followed by a choice between 0.54 / 0.73 and 0.69 / 0.88 (~0.05 distance between options at all times).

Version 2 contains the following exercises:
- Four multiplication exercises, as in, 0.28 x 0.27 = ____ (user inputs result).
- Four multiplication comparison exercises as in “Which one is larger” followed by a choice between 0.51 x 0.41 and 0.32 x 0.82 (~0.05 distance between options at all times).
- Two linear combination comparison exercises as in “Which one is larger” followed by a choice between 0.28 x 0.38 + 0.33 x 0.67 and 0.42 x 0.91 + 0.37 x 0.54 (~0.25 distance between options at all times).
- A second version of the above with (a) different numbers (b) 20sec or 15sec deadline.

Correct answers are summed up. Questions requiring direct input are scored in [0..10] based on the exponentially decayed distance of the given value from the correct value, as in the following function.

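decayRate <- 6
# Score for every distance on a 0.01 grid: 10 at distance 0, decaying exponentially
mathScoring <-
  data.frame(Score = round(exp(-decayRate * (0:99) / 100) * 10, 1),
             Distance = (0:99) / 100)

# Look up the score of the grid distance closest to |auth - given|;
# ties between grid points are averaged and a missing answer scores 0.
getDistanceScore <- function(auth, given, scores) {
  if (is.na(given)) {
    return(0)
  }
  abD <- abs(auth - given)
  vals <- (0:99) / 100
  mean(scores[which(abs(vals - abD) == min(abs(vals - abD)))])
}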

The following shows how [0..10] scoring relates to distance. An absolute distance of about 0.75 and above gets a near-zero score (0.1 or less).
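
As a usage illustration of the scoring scheme, the self-contained sketch below rebuilds the score table with the same decay rate and scores a few answers to the addition example 0.76 + 0.19 = 0.95. The helper `distance_score` is a renamed, hypothetical copy of the lookup logic above, and the given answers are made up.

```r
decay_rate  <- 6
distances   <- (0:99) / 100
score_table <- round(exp(-decay_rate * distances) * 10, 1)

# Score of the grid distance closest to |auth - given|; missing answers score 0.
distance_score <- function(auth, given, scores) {
  if (is.na(given)) return(0)
  gap <- abs(distances - abs(auth - given))
  mean(scores[which(gap == min(gap))])
}

exact <- distance_score(0.95, 0.95, score_table)  # correct answer
near  <- distance_score(0.95, 0.90, score_table)  # off by 0.05
far   <- distance_score(0.95, 0.45, score_table)  # off by 0.50
blank <- distance_score(0.95, NA,   score_table)  # no answer given
c(exact = exact, near = near, far = far, blank = blank)
```

The steep decay means that an answer has to be close to the correct value to earn more than a small fraction of the ten points.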

There are three analyses we can perform.

Version 1 – Correctness Tests

Focusing only on correctness on Version 1, we can include the student sample plus the initial Mechanical Turk sample that was exposed to the first version of the instrument, together amounting to $57$ cases (students are $27$ of them).

We may simply attempt to compare the performance in the mathematical tests with the accuracy in the tasks vis-a-vis also representation:

An effect does not appear to be visible for either of the two major correctness dimensions (calculations and comparisons).

## Anova Table (Type III tests)
## 
## Response: Correct
##                  Sum Sq Df F value    Pr(>F)    
## (Intercept)     130.913  1 22.2740 1.767e-05 ***
## Group             7.259  1  1.2351    0.2714    
## arith_abs         2.707  1  0.4606    0.5003    
## Group:arith_abs   0.945  1  0.1608    0.6900    
## Residuals       311.502 53                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Anova Table (Type III tests)
## 
## Response: Correct
##                 Sum Sq Df F value   Pr(>F)   
## (Intercept)     52.659  1  8.9529 0.004199 **
## Group            1.312  1  0.2231 0.638633   
## comp_abs         0.034  1  0.0058 0.939601   
## Group:comp_abs   1.396  1  0.2373 0.628176   
## Residuals      311.732 53                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Version 1 – Efficiency Tests

If we want to incorporate time in our mathematical ability tests, we need to restrict ourselves to the student sample.

There does not seem to be any important effect or notable interaction. Indeed:

## Start:  AIC=54.52
## Correct ~ Group + arith + comp + Group:arith + Group:comp
## 
##               Df Sum of Sq    RSS    AIC
## - Group:comp   1    0.6939 131.10 52.664
## <none>                     130.41 54.521
## - Group:arith  1   10.8718 141.28 54.683
## 
## Step:  AIC=52.66
## Correct ~ Group + arith + comp + Group:arith
## 
##               Df Sum of Sq    RSS    AIC
## - comp         1    0.0137 131.12 50.667
## <none>                     131.10 52.664
## - Group:arith  1   10.2133 141.32 52.690
## 
## Step:  AIC=50.67
## Correct ~ Group + arith + Group:arith
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     131.12 50.667
## - Group:arith  1      10.2 141.32 50.690

## Anova Table (Type III tests)
## 
## Response: Correct
##              Sum Sq Df F value    Pr(>F)    
## (Intercept) 100.781  1 17.6784 0.0003382 ***
## Group         1.293  1  0.2268 0.6384010    
## arith         0.464  1  0.0814 0.7779667    
## Group:arith  10.200  1  1.7892 0.1941060    
## Residuals   131.119 23                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Focusing strictly on response time does not appear to yield anything except for a possible interaction involving the calculation exercises.

It turns out to not be statistically significant.

## Start:  AIC=54.62
## Correct ~ Group + arith_time + Group:comp_time
## 
##                   Df Sum of Sq    RSS    AIC
## - Group:comp_time  2   1.40251 142.38 50.892
## - arith_time       1   0.44049 141.42 52.709
## <none>                         140.98 54.625
## 
## Step:  AIC=50.89
## Correct ~ Group + arith_time
## 
##              Df Sum of Sq    RSS    AIC
## - arith_time  1     0.856 143.24 49.054
## <none>                    142.38 50.892
## - Group       1   117.107 259.49 65.097
## 
## Step:  AIC=49.05
## Correct ~ Group
## 
##         Df Sum of Sq    RSS    AIC
## <none>               143.24 49.054
## - Group  1    124.76 268.00 63.969

## Anova Table (Type III tests)
## 
## Response: Correct
##                  Sum Sq Df F value   Pr(>F)   
## (Intercept)      52.024  1  8.1185 0.009328 **
## Group            22.251  1  3.4723 0.075807 . 
## arith_time        0.440  1  0.0687 0.795621   
## Group:comp_time   1.403  2  0.1094 0.896827   
## Residuals       140.978 22                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Version 2 – Correctness Tests

We can suspect that correctness of the untimed portion may have an effect without interaction and the timed correctness measure may present an interaction. Trying the statistical tests:

## Anova Table (Type III tests)
## 
## Response: Correct
##                  Sum Sq Df F value  Pr(>F)  
## (Intercept)      33.791  1  6.6636 0.01351 *
## Group            13.866  1  2.7344 0.10584  
## arith_abs        12.939  1  2.5517 0.11786  
## Group:arith_abs   0.222  1  0.0439 0.83515  
## Residuals       207.907 41                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Anova Table (Type III tests)
## 
## Response: Correct
##             Sum Sq Df F value   Pr(>F)   
## (Intercept)  79.43  1  9.6957 0.003281 **
## arith_abs    45.49  1  5.5520 0.023086 * 
## Residuals   352.29 43                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                  2.5 %    97.5 %
## (Intercept) 1.67777879 7.8459954
## arith_abs   0.05904135 0.7602961

## 
##  Shapiro-Wilk normality test
## 
## data:  full$residuals
## W = 0.97551, p-value = 0.4505

Only one main effect can be observed; statistical significance must be appreciated in the context of a possible experiment-wise error. The coefficient $0.41$ gives an idea of the effect size: a participant who scores $2.4/12$ more points may get one more correct answer in the decision exercises.

The important observation to be made here is the same as the one made for AMAS: neither anxiety nor ability seems to correlate with the choice of representation.
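
The arithmetic behind this reading of the coefficient is simply the reciprocal of the slope: if one extra test point is associated with $0.41$ more correct answers, then one extra correct answer corresponds to about $1/0.41$ test points. A minimal sketch, using the coefficient quoted in the text and the confidence interval printed above:

```r
ci <- c(lower = 0.05904135, upper = 0.7602961)  # 95% CI for arith_abs, from above
slope <- 0.41                                   # coefficient quoted in the text

# Test points associated with one additional correct answer
points_per_answer <- 1 / slope
round(points_per_answer, 1)

# The quoted coefficient lies inside the printed interval
slope > ci[["lower"]] && slope < ci[["upper"]]
```

Because the interval is wide, the same calculation at its two ends gives anything from about 1.3 to about 17 test points per additional correct answer, so the 2.4 figure should be read as a rough indication only.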

## Anova Table (Type III tests)
## 
## Response: Correct
##                    Sum Sq Df F value  Pr(>F)  
## (Intercept)        35.334  1  6.6177 0.01382 *
## Group              24.125  1  4.5185 0.03960 *
## arith_timed         7.556  1  1.4152 0.24104  
## Group:arith_timed   2.906  1  0.5443 0.46487  
## Residuals         218.911 41                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1