While teaching is wonderful, worthwhile and rewarding, it is also a highly demanding and stressful profession. It is therefore little surprise that staff turnover is high: nearly one in ten teachers leaves the profession in English schools each year, citing burnout, overwork and stress as the principal reasons (Department for Education). To improve teacher retention, a better work-life balance is needed. Indeed, reducing high workload was one of the motivations for the NEU's industrial action in 2023.
One area where large improvements in work-life balance can be made is the marking of student work. Teachers spend nine hours per week marking student work (EEF, 2016), so any reduction here would go a long way towards improving working conditions. Some efforts have been made to automate grading, for instance with self-marking online multiple-choice tools such as www.diagnosticquestions.com, or Microsoft Forms, which can mark keyword responses. But these tools cannot mark open, free-form questions that require higher levels of thinking. Multiple-choice and keyword responses only support learning and assessment at the lower end of Bloom's taxonomy. Higher-order thinking that requires analysis, evaluation and synthesis cannot be assessed with these approaches to marking, which are computationally very simple to automate.
With the advent of artificial intelligence (AI), we are approaching the day when free-form, open-ended answers can be marked automatically by a computer, alleviating some of the pressure on teachers. The introduction of ChatGPT in 2022 and Google's Bard in 2023 has brought the capabilities of natural language processing into public consciousness. Teachers have been experimenting with ChatGPT to help plan lessons, write reports and complete other administrative tasks, but this is tinkering around the edges. We believe AI has the capability to transform teaching and to drastically reduce the need for marking by teachers. Here we present an approach that uses machine learning to automate the grading of short-answer questions. It builds on the transformer architecture introduced in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017), which underpins the BERT model of Devlin et al. (2019). Reimers and Gurevych (2019) present Sentence-BERT (SBERT), a modification of the BERT model that produces sentence embeddings.
We use the sentence transformer model all-MiniLM-L6-v2, which has been trained on one billion sentence pairs. Sentence transformers greatly reduce the computational cost of using the unmodified BERT model for sentence comparison. The model transforms the student response into a 384-dimensional vector, which is compared with a corresponding set of embeddings for the anchor (mark scheme) responses. We use cosine similarity to measure the angle between the two vectors, which returns a value between 0 and 1. A value towards 1 indicates high semantic similarity between the vectors and therefore a good response, while a value towards 0 indicates low semantic similarity and therefore an incorrect response. In our scheme it is possible to have multiple anchors; the student's response is compared with each of them, and we take the result with the highest semantic similarity.
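The comparison step can be sketched as follows. This is a minimal illustration, not our production code: the toy three-dimensional vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 would actually produce, and the function names are our own.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_anchor_score(response_vec, anchor_vecs):
    """Compare a student response against every anchor (mark scheme)
    embedding and return the highest semantic similarity found."""
    return max(cosine_similarity(response_vec, a) for a in anchor_vecs)

# Toy example: the response matches the second anchor exactly.
response = [1.0, 0.0, 0.0]
anchors = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
score = best_anchor_score(response, anchors)  # 1.0: identical direction
```

In practice the embeddings would come from a call such as `SentenceTransformer("all-MiniLM-L6-v2").encode(text)`, and the library's own `util.cos_sim` could replace the hand-rolled function above.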
We can use the system in low-stakes environments, but the power of such a system to automatically mark public examinations would be transformative. For that, the system would need to read handwritten responses, so it would have to be coupled with optical character recognition technology, and it would need to be at least as accurate as human examiners.
We have tested the system on a Year 7 baseline test for computing. The test is out of 52 marks and includes multiple-choice, keyword-response and short open-answer questions of one sentence. Across nearly 200 students, the difference between teacher marking and autograding was within one mark in every case, and 85% of students achieved exactly the same mark under autograding as under teacher marking. Note that these figures are conflated with the responses to the multiple-choice and keyword-response questions; we are specifically interested in the effectiveness of the system in grading the open questions. For simplicity, students can be awarded either 0 or 1 mark for these questions. As mentioned, the semantic similarity gives a value between 0 and 1, so for each question a threshold needs to be identified above which we award 1 mark and below which we award 0 marks. This threshold is individual to each question and was identified on a small test sample of responses. For the three questions we looked at in detail, all achieved an accuracy of more than 90% (see table below).
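The per-question threshold can be chosen by a simple search over a small labelled sample, as sketched below. This is one plausible way to do it, assuming a handful of responses already marked 0 or 1 by a teacher; the function name and candidate grid are our own illustration, not the exact procedure used in the study.

```python
def best_threshold(scores, labels, candidates=None):
    """Pick the similarity cut-off that best reproduces the teacher's
    0/1 marks on a small labelled sample for one question.

    scores -- cosine similarities, one per student response
    labels -- teacher-awarded marks (0 or 1) for the same responses
    """
    if candidates is None:
        # Search the whole 0..1 range in steps of 0.01.
        candidates = [i / 100 for i in range(101)]

    def accuracy(t):
        preds = [1 if s >= t else 0 for s in scores]
        return sum(p == l for p, l in zip(preds, labels)) / len(labels)

    return max(candidates, key=accuracy)

# Toy sample: high-similarity responses were marked correct by the teacher.
sample_scores = [0.9, 0.8, 0.3, 0.2]
sample_labels = [1, 1, 0, 0]
t = best_threshold(sample_scores, sample_labels)
```

A threshold tuned on so few responses will overfit, which is one reason each question needs its own threshold and a larger validation sample before high-stakes use.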
This approach looks promising, and we have begun with a simple case: short-answer responses that do not require semantically dense tier 3 vocabulary and for which students can be awarded either 0 or 1 mark. There are therefore constraints to the system as it stands, but we will look to apply this approach, and modifications of it, to more challenging situations.
Our model is available for public use at www.mangolearning.academy/ai-auto-marking. This tool comes with the caveat that it is very much an experimental setup, but there is huge scope for development across all subjects by fine-tuning with additional datasets to improve accuracy. The model needs additional fine-tuning to deal effectively with a specialist lexicon, at A level for instance, where it has not been trained on extensive datasets. We will need to fine-tune the model on the more specialist language of each school subject.
References
EEF (2016). A Marked Improvement? A Review of the Evidence on Written Marking. https://educationendowmentfoundation.org.uk/evidence/evidence-on-marking/
Wilinato, D. and Girsang, A. S. (2023). Automatic Short Answer Grading on High School's E-Learning using Semantic Similarity Methods. TEM Journal, 12(1), pp. 297-302. ISSN 2217-8309. DOI: 10.18421/TEM121-37.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 pp. 4171-4186
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, pp. 3982-3992.
Vaswani, A., et al. (2017). Attention Is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), pp. 1-11.