
Automatic Marking and Grading

While teaching is wonderful, worthwhile and rewarding, it is also a highly demanding and stressful profession. It is little surprise, then, that staff turnover is high: nearly one in ten teachers leaves the profession in English schools each year, citing burnout, overwork and stress as the principal reasons (Department for Education). Improving teacher retention requires a better work-life balance; indeed, reducing high workload was one of the motivations for the NEU's industrial action in 2023. One area where large improvements in work-life balance can be made is the marking of student work. Teachers spend around 9 hours per week marking (EEF, 2016), so any reduction in this area would go a long way towards improving working conditions. Some efforts have been made to automate grading, for instance with self-marking online multiple-choice tools such as www.diagnosticquestions.com, or Microsoft Forms, which can mark keyword responses. But these tools cannot mark open, free-form questions that require higher levels of thinking. Multiple-choice and keyword responses only support learning and assessment at the lower end of Bloom's taxonomy; higher-order thinking that requires analysis, evaluation and synthesis cannot be assessed with approaches to marking that are computationally simple to automate.

With the advent of artificial intelligence (AI), we are getting closer to the day when free-form, open-ended answers can be marked automatically by a computer, alleviating some of the pressure on teachers. The introduction of ChatGPT in 2022 and Google's Bard in 2023 has brought the capabilities of natural language processing into the public consciousness. Teachers have been experimenting with ChatGPT, using it to help plan lessons, write reports and complete other administrative tasks, but this is tinkering around the edges. We believe AI has the capability to transform teaching and to drastically reduce the need for marking by teachers. Here we present an approach that uses machine learning to automate the grading of short-answer questions. It builds on the transformer architecture introduced in the seminal paper "Attention is All You Need" (Vaswani et al., 2017), which underpins the BERT model of Devlin et al. (2019). Reimers and Gurevych (2019) present SBERT, a sentence transformer that modifies the BERT model.

We use the sentence transformer model all-MiniLM-L6-v2, which has been trained on over 1 billion sentence pairs. Sentence transformers reduce the computational complexity of using the unmodified BERT model. The model transforms the student response into a 384-dimensional vector, which is compared with a corresponding set of embeddings for the anchor (mark scheme) responses. We use cosine similarity to measure the angle between the two vectors; in principle this returns a value between -1 and 1, though embeddings of natural-language answers rarely score below 0 in practice. A value towards 1 indicates high semantic similarity between the vectors and therefore a good response, while a value towards 0 indicates low semantic similarity and therefore an incorrect response. In our scheme it is possible to have multiple anchors; the student's response is compared with each of these and we take the result with the highest semantic similarity.
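The comparison step above can be sketched as follows. This is a minimal illustration using toy 4-dimensional vectors in place of the 384-dimensional embeddings that all-MiniLM-L6-v2 would produce (in practice the vectors would come from a call such as `model.encode(text)` in the sentence-transformers library); the function names and example vectors here are our own, not part of any library.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_anchor_score(response_vec: np.ndarray, anchor_vecs: list) -> float:
    """Compare a student-response embedding against every anchor
    (mark scheme) embedding and keep the highest similarity."""
    return max(cosine_similarity(response_vec, a) for a in anchor_vecs)

# Toy stand-ins for real sentence embeddings (illustrative values only).
student = np.array([0.9, 0.1, 0.0, 0.1])
anchors = [np.array([1.0, 0.0, 0.0, 0.0]),   # close paraphrase of mark scheme
           np.array([0.0, 1.0, 0.0, 0.0])]   # alternative acceptable wording
score = best_anchor_score(student, anchors)  # high: response matches anchor 1
```

Taking the maximum over anchors means a response only needs to match one acceptable phrasing of the mark scheme to score well.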


We can use the system in low-stakes environments, but the power of such a system to automatically mark public examinations would be transformative. For that, the system would need to read handwritten responses, so it would need to be coupled with optical character recognition technology, and it would need to be at least as accurate as human examiners.

We have tested the system on a Year 7 baseline test for computing. The test is out of 52 and includes multiple-choice, keyword-response and short open-answer questions of one sentence. Across nearly 200 students, the difference between teacher marking and autograding was 1 mark or less in all cases, with 85% of students achieving the same mark from the autograder as from the teacher. Note that this figure is conflated with the responses to the multiple-choice and keyword-response questions; we are specifically interested in the effectiveness of the system in grading the open questions. For simplicity, students can be awarded either 0 or 1 mark for these questions. As mentioned, the semantic similarity gives a value between 0 and 1, so for each question a threshold needs to be identified above which we award 1 mark and below which we award 0 marks. This threshold is individual to each question and was identified on a small test sample of responses. For the 3 questions we looked at in detail, all achieved an accuracy of more than 90% (see table below).
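The per-question thresholding described above can be sketched like this. The question identifiers and threshold values here are hypothetical placeholders; in our approach each threshold would be tuned on a small sample of teacher-marked responses for that question.

```python
def award_mark(similarity: float, threshold: float) -> int:
    """Binary marking: 1 mark if the cosine similarity between the
    student response and its best-matching anchor clears the
    per-question threshold, otherwise 0 marks."""
    return 1 if similarity >= threshold else 0

# Hypothetical thresholds, one per question, found on a marked sample.
thresholds = {"q1": 0.62, "q2": 0.55, "q3": 0.70}

mark = award_mark(0.81, thresholds["q1"])   # similarity well above threshold
```

Because the threshold is fitted per question, a question whose correct answers are phrased in many different ways can tolerate a lower cut-off than one with a single canonical answer.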
 

This approach looks promising, and we have begun with a simple case: short-answer responses to questions that do not require semantically dense tier 3 vocabulary, and where students can be awarded either 0 or 1 marks. So there are constraints to the system as it stands, but we will look to apply this approach, and modifications of it, to more challenging situations.

Our model is available for public use at www.mangolearning.academy/ai-auto-marking. This tool comes with the caveat that it is very much an experimental setup, but there is huge scope for development across all subjects by fine-tuning with additional datasets to improve accuracy. The model needs additional fine-tuning if it is to deal effectively with a specialist lexicon, for instance at A level, where it has not been trained on extensive datasets. We will need to fine-tune the model on the more specialist language of each school subject.

References


Wilianto, D., & Girsang, A. S. (2023). Automatic Short Answer Grading on High School's E-Learning Using Semantic Similarity Methods. TEM Journal, 12(1), 297-302. ISSN 2217-8309. DOI: 10.18421/TEM121-37.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 4171-4186.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, pp. 3982-3992.

Vaswani, A., et al. (2017). Attention is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), pp. 1-11.
