2026/5/27
Hamid Varmazyari

Academic rank: Assistant Professor
ORCID: https://orcid.org/0000-0002-4694-8599
Education: PhD.
Faculty: Literature and Languages
E-mail: h-varmazyari [at] araku.ac.ir

Research

Title
A Comparative Analysis of AI and Human Raters in Writing Assessment of IELTS Task 2: A Case Study of ChatGPT, Deepseek, and Perplexity
Type
Thesis
Keywords
Artificial Intelligence, Automated Essay Evaluation (AEE), ChatGPT, DeepSeek, IELTS Writing Task 2, Large Language Models (LLM), Perplexity
Year
2026
Researchers
Hamid Varmazyari (Primary Advisor), Roya Kavehie (Student)

Abstract

Despite growing interest in using large language models (LLMs) for automated essay scoring, few studies have compared multiple LLMs, particularly DeepSeek and Perplexity, against human raters with differing certification levels in the context of Iranian EFL learners. This study examined the comparability of artificial intelligence (AI) and human raters in assessing IELTS Writing Task 2 essays, focusing on three LLMs (ChatGPT, DeepSeek, Perplexity) and three human raters with differing expertise: an official IELTS examiner, a CELTA-certified instructor, and an experienced non-certified teacher. A total of 199 essays written by Iranian EFL learners during a mock IELTS examination were rated according to the official IELTS band descriptors, and the ratings were analyzed using non-parametric statistical tests (Kruskal-Wallis and post-hoc pairwise comparisons). ChatGPT's overall and Task Response scores did not differ significantly from those of the human raters (p ≥ 0.484), indicating potential reliability for supplementary assessment; however, component-level divergences emerged (e.g., in Coherence and Cohesion, Grammatical Range and Accuracy, and Lexical Resource) when ChatGPT was compared with the certified raters. In contrast, DeepSeek and Perplexity produced significantly lower, more conservative scores across most components (p ≤ 0.001). Human raters displayed strong overall agreement but diverged in certain analytic components, particularly grammatical accuracy. The AI limitations identified included prompt dependency, batch-processing constraints, and external referencing, all of which reduced operational stability. The study highlights the potential of hybrid human–AI assessment models to enhance efficiency while preserving fairness and validity in L2 writing evaluation.
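The analysis named in the abstract (a Kruskal-Wallis test followed by post-hoc pairwise comparisons) can be illustrated with a minimal, standard-library Python sketch. Several things here are assumptions rather than details from the thesis: the abstract does not say which post-hoc procedure was used, so the sketch runs a two-group Kruskal-Wallis test on each pair with a Bonferroni adjustment; the rater names and band scores are invented toy values, not study data; and the function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`, `pairwise_bonferroni`) are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def kruskal_wallis(groups):
    """Return (H, p) with tie correction and the chi-square approximation."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all scores are identical; H is undefined")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


def pairwise_bonferroni(scores_by_rater):
    """Assumed post-hoc step: two-group tests with Bonferroni-adjusted p."""
    pairs = list(combinations(scores_by_rater, 2))
    adjusted = {}
    for a, b in pairs:
        _, p = kruskal_wallis([scores_by_rater[a], scores_by_rater[b]])
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return adjusted


if __name__ == "__main__":
    # Invented toy band scores for illustration only; NOT data from the thesis.
    toy_scores = {
        "rater_a": [6.0, 6.5, 7.0, 6.5, 6.0, 7.0],
        "rater_b": [6.0, 6.5, 6.5, 7.0, 6.0, 6.5],
        "rater_c": [5.0, 5.5, 5.5, 6.0, 5.0, 5.5],
    }
    h, p = kruskal_wallis(list(toy_scores.values()))
    print(f"H = {h:.3f}, p = {p:.4f}")
    for pair, p_adj in pairwise_bonferroni(toy_scores).items():
        print(pair, f"adjusted p = {p_adj:.4f}")
```

Because IELTS band scores come in half-band steps, ties are frequent, which is why the sketch applies the tie correction to H. In practice a statistics package would normally be used instead, and a dedicated post-hoc procedure such as Dunn's test may be a better fit than the pairwise approach assumed here.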