Study Results

Iris has been evaluated in three peer-reviewed studies, progressing from an initial survey-based assessment through a small mixed-methods study to a large randomized controlled trial. This page presents the findings from each study with exact numbers from the published papers.

ITiCSE 2024: Initial Survey Evaluation (N=121)

Bassner, Frankford & Krusche (2024). Survey-based evaluation conducted across three CS1-level courses at TUM (Management & Technology, Informatics, Information Engineering). Of 1,655 enrolled students, 221 engaged with Iris (10+ messages), and 121 completed the survey (55% response rate).

Perceived Effectiveness (RQ1)

Survey Item | Agreement
Iris comprehends my inquiries well (Q1) | 46%
Iris directly helps with exercises (Q2) | 44%
Iris enhanced understanding of programming concepts (Q3) | 50%
Interactions with Iris are engaging (Q4) | 60%

Comfort Compared to Human Tutors (RQ2)

Survey Item | Agreement
Comfortable asking questions without judgment (Q6) | 92%
Feel safe asking sensitive questions (Q7) | 62%

Subjective Reliance (RQ3)

43% of students reported they would find it challenging to solve exercises without Iris (Q10). Students predominantly viewed Iris as a complement to, not a replacement for, human tutors.

Limitations

  • Self-report data only; no objective learning measures
  • Selection bias: only students who used Iris substantially participated
  • Single institution (TUM), CS1 courses only

Koli Calling '25: Mixed-Methods Study (N=33)

Bassner, Lottner & Krusche (2025). Exploratory randomized between-subjects study in which 33 students implemented the Burrows-Wheeler Transform in Java under one of three conditions: Iris, ChatGPT, or No AI. Combined quantitative performance measures with systematic qualitative analysis of post-task interviews.

Quantitative Results

No statistically significant differences were found between conditions in:

  • Learning gains (pre-test to post-test)
  • Task completion time
  • Code accuracy
Note: The small sample size (N=33) limits statistical power. The study was designed as exploratory, with the qualitative component as its primary contribution.

Qualitative Themes

Five themes emerged from the interview analysis:

  1. Time pressure dominated tool selection. Students prioritized efficiency over learning when under time constraints, regardless of condition. This suggests that the study setting itself may influence how students interact with AI tools.

  2. Context-aware guidance was universally appreciated. Students in the Iris condition valued that the tutor already understood their exercise and code, eliminating the need to provide context manually.

  3. Polarized scaffolding preferences. Some students wanted more explicit hints from Iris, while others appreciated the restraint. This suggests that a one-size-fits-all scaffolding level may not serve all learners equally.

  4. ChatGPT users sought external verification more. Students using ChatGPT more frequently sought confirmation of AI-generated answers from other sources, suggesting lower trust in the correctness of responses compared to Iris users.

  5. Over-reliance concerns were prevalent. Students across AI conditions expressed worry about becoming dependent on AI assistance, with ChatGPT users expressing stronger concerns about this than Iris users.

Limitations

  • Small sample size (N=33) limits generalizability
  • Single programming task (Burrows-Wheeler Transform)
  • Lab setting with time pressure may not reflect naturalistic use

C&E:AI 2026: Randomized Controlled Trial (N=275)

Bassner, Lenk-Ostendorf, Beinstingel, Wasner & Krusche (2026). Three-arm RCT conducted in a CS1 course at TUM. Students completed a 90-minute concurrent programming exercise (parallel sum with threading). After quality filters, 275 participants remained: Iris (n=91), ChatGPT (n=88), No AI (n=96).

This is the largest and most rigorous evaluation of Iris to date.

Finding 1: AI as Performance Enhancer, Not Learning Enhancer

Both AI tools significantly boosted exercise performance compared to the No AI condition, but neither improved learning outcomes.

Exercise Performance:

Condition | Mean (%) | SD
ChatGPT | 71.84 | 39.65
Iris | 57.50 | 37.36
No AI | 29.85 | 36.17

ANOVA: p < .001, η² = .179

Comparison | Cohen's d | p
ChatGPT vs No AI | 1.10 | < .001
Iris vs No AI | 0.76 | < .001
ChatGPT vs Iris | 0.38 | .031
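
As a sanity check, these effect sizes can be approximately reproduced from the reported means, SDs, and group sizes. The sketch below assumes the standard pooled-SD form of Cohen's d (the exact variant used in the paper is not stated here); small deviations from the published values come from rounding in the summary statistics.

```python
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using a pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Exercise performance: ChatGPT (n=88), Iris (n=91), No AI (n=96)
print(round(cohens_d(71.84, 39.65, 88, 29.85, 36.17, 96), 2))  # ~1.11 (reported: 1.10)
print(round(cohens_d(57.50, 37.36, 91, 29.85, 36.17, 96), 2))  # ~0.75 (reported: 0.76)
print(round(cohens_d(71.84, 39.65, 88, 57.50, 37.36, 91), 2))  # ~0.37 (reported: 0.38)
```

All three values agree with the published table to within ±0.01.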

Knowledge Assessment (Learning): No significant group differences (p = .311). All three groups improved significantly from pre-test to post-test (p < .001). AI tools did not differentially affect knowledge acquisition.

Code Comprehension: No significant differences between conditions (p = .136).

Key Implication

Higher exercise scores with AI assistance do not necessarily indicate deeper learning. The dissociation between performance and learning is a central finding of this study.

Finding 2: Iris Balances Scaffolding and Cognitive Challenge

The distribution of exercise scores differed qualitatively between conditions:

  • ChatGPT scores clustered at the high end of the scale
  • No AI scores clustered at the low end
  • Iris scores spread across the full range

This pattern suggests that Iris preserved individual performance variation: students who understood the material performed well, while those who struggled still had to engage with the problem. ChatGPT, by contrast, compressed performance toward the top of the scale.

Finding 3: Iris Uniquely Improves Intrinsic Motivation

Frustration (lower is better):

Condition | Mean | SD
ChatGPT | 3.13 | 1.21
Iris | 3.21 | 1.18
No AI | 4.09 | 1.00

Comparison | Cohen's d | p
ChatGPT vs No AI | −0.87 | < .001
Iris vs No AI | −0.81 | < .001
ChatGPT vs Iris | −0.07 | .886

Both AI tools significantly reduced frustration compared to No AI. There was no meaningful difference in frustration between ChatGPT and Iris.

Intrinsic Motivation (higher is better):

Condition | Mean | SD
Iris | 2.82 | 0.70
ChatGPT | 2.64 | 0.65
No AI | 2.42 | 0.74

Comparison | Cohen's d | p
Iris vs No AI | 0.55 | < .001
ChatGPT vs No AI | 0.32 | .076
ChatGPT vs Iris | −0.25 | .234
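
The same pooled-SD computation (see the sketch under Finding 1) recovers the headline motivation effect size from the table's summary statistics:

```python
from math import sqrt

# Iris (n=91, M=2.82, SD=0.70) vs No AI (n=96, M=2.42, SD=0.74)
pooled_sd = sqrt((90 * 0.70**2 + 95 * 0.74**2) / (91 + 96 - 2))
print(round((2.82 - 2.42) / pooled_sd, 2))  # 0.55, matching the reported d
```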

Only Iris significantly increased intrinsic motivation compared to the No AI condition. ChatGPT did not reach significance (p = .076). This is notable because both tools reduced frustration equally, but only Iris additionally enhanced engagement.

Finding 4: ChatGPT as a "Comfort Trap"

Students rated ChatGPT more favorably on several perception measures:

Perception Item | ChatGPT | Iris | Cohen's d | p
Easy to use | 4.36 | 3.97 | 0.45 | .003
Feedback helpful | 3.98 | 3.54 | 0.44 | .004
Helped resolve exercise issues | 4.08 | 3.49 | 0.63 | < .001
General helpfulness | n.s. | n.s. | n/a | .351

Despite these more favorable perceptions, ChatGPT users did not achieve better learning outcomes than Iris users. The authors characterize this pattern as a "comfort trap": students preferred the tool that felt easier and more helpful, but these subjective preferences aligned with greater reductions in learning-related cognitive processing rather than with actual learning gains.

Limitations

  • Single institution, single course, single programming task (concurrent programming)
  • 90-minute lab setting may not reflect semester-long usage patterns
  • Attrition from 452 to 275 participants after quality filters
  • No long-term follow-up on retention or transfer

Summary Across Studies

Dimension | ITiCSE '24 | Koli '25 | C&E:AI '26
Design | Survey | Mixed-methods RCT | Three-arm RCT
N | 121 | 33 | 275
Performance | Self-reported benefit | No significant differences | Iris > No AI (d = 0.76)
Learning | Not measured | No significant differences | No significant differences
Motivation | 60% found interactions engaging | Not measured | Iris > No AI (d = 0.55)
Frustration | Not measured | Not measured | Iris < No AI (d = −0.81)
Comfort | 92% comfortable (vs human tutors) | Context awareness valued | ChatGPT rated easier to use

For full citations and BibTeX entries, see Publications.