Model
Setting up the data: To break our data set into manageable components, we used two text feature extraction methods: Term Frequency–Inverse Document Frequency (TF-IDF) weighting and n-grams. These two approaches allowed us to extract characteristics of the essay and prompt text that we later use for prediction.
Term Frequency–Inverse Document Frequency: TF-IDF is a numerical statistic that is meant to reflect the relative importance of a word in a document or, in this case, a grouping of documents. It does this by following two pieces of intuition. First, it assumes that words that appear often in a document are important to that document. Second, it assumes that, controlling for number of appearances, words that appear in many documents are less important than words that appear in only a few documents. This captures the intuition that words like “the”, “a”, or “that” don’t convey much meaning because they are filler words. Mathematically, TF-IDF is the product of a term-frequency (tf) weight and an inverse-document-frequency (idf) weight.
The tf of a word is its number of appearances in a document, divided by the maximum number of appearances of any word in that document; this normalization avoids biasing toward longer documents. The idf down-weights words that appear in many documents across the corpus.
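In standard notation, writing $f_{t,d}$ for the count of term $t$ in document $d$, $D$ for the set of documents, and $N$ for the number of documents (the exact log base and smoothing may differ slightly in our implementation):

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}, \qquad \mathrm{idf}(t,D) = \log\frac{N}{\lvert\{d \in D : t \in d\}\rvert}, \qquad \text{tf-idf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$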
N-gram Models: N-gram models break up text into separate chunks for analysis. For example, a one-gram (a.k.a. unigram) model would break up the sentence “That is a cat” into four components: “That”, “is”, “a”, “cat”. Below you can see extensions of this into different-sized grams.
Throughout our modeling we kept in mind that we might capture different information using different sizes of n-grams. For example, if we used unigrams, we would break up the sentence “I do not love you.” into “I”, “do”, “not”, “love”, and “you”, whereas with two grams we would be able to capture the concept of “not love.”
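As a minimal sketch of this difference (using scikit-learn's CountVectorizer for illustration; see the repository for the actual pipeline code):

```python
# Minimal illustration of unigram vs. bigram feature extraction with
# scikit-learn's CountVectorizer (illustration only, not the full pipeline).
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["I do not love you"]

# token_pattern keeps one-character tokens like "I"; lowercase=False keeps casing
unigrams = CountVectorizer(ngram_range=(1, 1), lowercase=False, token_pattern=r"\b\w+\b")
bigrams = CountVectorizer(ngram_range=(2, 2), lowercase=False, token_pattern=r"\b\w+\b")

print(unigrams.fit(sentence).get_feature_names_out())
# ['I' 'do' 'love' 'not' 'you']
print(bigrams.fit(sentence).get_feature_names_out())
# ['I do' 'do not' 'love you' 'not love']
```

Only the bigram features retain the "not love" pairing that the unigram features lose.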
Latent Semantic Analysis: After breaking up our documents into TF-IDF weighted n-gram components, we then applied Latent Semantic Analysis to draw out different information about the relationship between words.
But first, what is Latent Semantic Analysis, also known as LSA? Well, LSA at a high level allows us to capture underlying similarities between words that we wouldn’t otherwise be able to capture. Using LSA, we can condense our features into components that reflect concepts, which is something that an n-gram approach could not do. Below is a visual that demonstrates how two words, dogs and cats, might combine into one concept of a pet.
What is LSA under the hood? Recall that LSA is trying to find underlying relationships between different words. One of the assumptions of LSA is that words that appear close to each other – in the same document, paragraph, etc. – will be more similar in meaning. In this case, the TF-IDF matrix is an appropriate input for LSA, since it is a document-word matrix that gives higher weight to rarer words.
LSA then finds a low-rank approximation of the occurrence matrix produced by the TF-IDF step. This is useful when we believe that our data set has too many dimensions (and thus needs to be reduced) or that it is noisy (i.e. there are anecdotal occurrences of words that should be ignored).
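In practice, this low-rank approximation is a truncated singular value decomposition of the TF-IDF matrix. A minimal sketch (scikit-learn, toy documents; the real analysis is in the notebook) looks like this:

```python
# A minimal sketch of LSA: TF-IDF n-gram features reduced with a truncated SVD
# (toy documents for illustration; not the actual essay corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "the dog chased the cat",
    "dogs and cats are common pets",
    "I do not love writing essays",
]

lsa = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # TF-IDF weighted uni- and bigrams
    TruncatedSVD(n_components=2),         # low-rank "concept" space
)
doc_concepts = lsa.fit_transform(docs)    # shape: (n_documents, n_components)
print(doc_concepts.shape)                 # (3, 2)
```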
Finally, words are compared with LSA by finding the cosine of the angle between their vectors (or the dot product between the normalizations of the two vectors). Values that are near 1 represent very similar words and values near 0 represent very dissimilar words.
Comparing two word vectors via the cosine of the angle between them:
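$$\cos(\theta) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$$

where $\mathbf{a}$ and $\mathbf{b}$ are the LSA vectors of the two words being compared.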
After applying LSA to the n-gram features, we extracted a couple of additional features about the essays, namely the number of words and the essay type. We then used these features as predictors in our model. The model that we used was a linear LASSO model. We decided to use LASSO as opposed to other regularization techniques because of its dimensionality-reduction characteristic: when we use an n-gram model over our essay corpus, we have over 50,000 predictors, and LASSO allows us to eliminate many of those predictors that are not predictive of essay score. We also performed K-fold cross-validation to show that LASSO performed better than a ridge approach.
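As a sketch of why LASSO helps here (hypothetical, mostly-irrelevant features rather than the actual essay predictors), the L1 penalty drives most coefficients to exactly zero:

```python
# Illustration of LASSO's built-in feature selection on a synthetic feature
# matrix where only a handful of the 500 predictors actually matter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                                       # 500 candidate predictors
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)   # only 5 are relevant

lasso = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```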
Cross-validation of the tuning parameter: We used 4-fold cross-validation to find the optimal tuning parameter for our LASSO model. We did this tuning for models without LSA (i.e. using only the n-gram features) and compared them to models that employed LSA. Finally, we tuned across various n-gram sizes to find our optimal model, which ended up using n-grams of size 15 (roughly the size of the average sentence).
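A minimal sketch of this tuning step (hypothetical penalty grid and synthetic data, for illustration only):

```python
# Choosing the LASSO penalty by 4-fold cross-validation over a small alpha grid.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=4,                          # 4-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)
```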
Our accuracy metric - Spearman’s rho: To evaluate the goodness of our models, we used Spearman’s rank correlation, which is the Pearson correlation between the ranks of two variables. It is more appropriate here than Pearson’s correlation because it captures monotonic relationships that need not be linear.
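Computing it is a one-liner with SciPy (toy scores shown for illustration):

```python
# Spearman's rank correlation between true and predicted essay scores
# (toy numbers for illustration).
from scipy.stats import spearmanr

y_true = [2, 4, 4, 6, 8, 10]
y_pred = [2.1, 3.7, 4.4, 5.9, 8.5, 9.2]
rho, p_value = spearmanr(y_true, y_pred)
print(round(rho, 3))
```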
Results: Ultimately, we found that a 15-gram linear LASSO with LSA applied (100 components) produced the best model, yielding a Spearman’s rho of about 0.91. We learned two main things in the model-building phase: non-reduced unigram models perform better when LSA is not employed, and LSA increases performance for models with bigrams and above. This is probably because bigram (and larger) models encode information that unigram models do not (e.g. they pick up on the difference between “not good” and “good”).
Note that all of this analysis can be found in a GitHub repository. The final notebook can be found at `analysis/Automatic Essay Grading.ipynb`. It will use datasets found in `analysis/datasets/`.