Model
Setting up the data: To break our data set into manageable components, we used two text feature extraction methods: Term Frequency–Inverse Document Frequency (TF-IDF) weighting and n-grams. These two approaches allowed us to extract characteristics of the essay and prompt text that we later use for prediction.
Term Frequency–Inverse Document Frequency: TF-IDF is a numerical statistic that is meant to reflect the relative importance of a word in a document or, in this case, a grouping of documents. It does this by following two pieces of intuition. First, it assumes that words that appear often in a document are important to that document. Second, it assumes that, controlling for number of appearances, words that appear in many documents are less important than words that appear in only a few documents. This captures the intuition that words like “the”, “a”, or “that” don’t convey much meaning because they are filler words. Mathematically, TF-IDF is the product of a term-frequency (tf) weight and an inverse-document-frequency (idf) weight.
The tf of a word is its number of appearances in a document, divided by the maximum number of appearances of any word in that document; this normalization avoids biasing toward longer documents. The idf down-weights words that appear in many documents across the corpus.
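In standard notation, writing $f_{t,d}$ for the count of term $t$ in document $d$, $D$ for the set of documents, and $N$ for the number of documents (the exact log base and smoothing may differ slightly in our implementation):

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}, \qquad \mathrm{idf}(t,D) = \log\frac{N}{\lvert\{d \in D : t \in d\}\rvert}, \qquad \text{tf-idf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$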
N-gram Models: N-gram models break up text into separate chunks for analysis. For example, a one-gram (a.k.a. unigram) model would break up the sentence “That is a cat” into four components: “That”, “is”, “a”, “cat”. Below you can see extensions of this into different-sized grams.
Throughout our modeling we kept in mind that we might capture different information using different sizes of n-grams. For example, if we used unigrams, we would break up the sentence “I do not love you.” into “I”, “do”, “not”, “love”, and “you”, whereas with two grams we would be able to capture the concept of “not love.”
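As a minimal sketch of this difference (using scikit-learn's CountVectorizer for illustration; see the repository for the actual pipeline code):

```python
# Minimal illustration of unigram vs. bigram feature extraction with
# scikit-learn's CountVectorizer (illustration only, not the full pipeline).
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["I do not love you"]

# token_pattern keeps one-character tokens like "I"; lowercase=False keeps casing
unigrams = CountVectorizer(ngram_range=(1, 1), lowercase=False, token_pattern=r"\b\w+\b")
bigrams = CountVectorizer(ngram_range=(2, 2), lowercase=False, token_pattern=r"\b\w+\b")

print(unigrams.fit(sentence).get_feature_names_out())
# ['I' 'do' 'love' 'not' 'you']
print(bigrams.fit(sentence).get_feature_names_out())
# ['I do' 'do not' 'love you' 'not love']
```

Only the bigram features retain the "not love" pairing that the unigram features lose.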
Latent Semantic Analysis: After breaking up our documents into TF-IDF weighted n-gram components, we then applied Latent Semantic Analysis to draw out different information about the relationship between words.
But first, what is Latent Semantic Analysis, also known as LSA? Well, LSA at a high level allows us to capture underlying similarities between words that we wouldn’t otherwise be able to capture. Using LSA, we can condense our features into components that reflect concepts, which is something that an n-gram approach could not do. Below is a visual that demonstrates how two words, dogs and cats, might combine into one concept of a pet.
What is LSA under the hood? Recall that LSA is trying to find underlying relationships between different words. One of the assumptions of LSA is that words that appear close to each other – in the same document, paragraph, etc. – will be more similar in meaning. In this case, the TF-IDF matrix is an appropriate input for LSA, since it is a document-word matrix that gives higher weight to rarer words.
LSA then finds a low-rank approximation of the occurrence matrix produced by the TF-IDF step. This is useful when we believe that our data set has too many dimensions (and thus needs to be reduced) or that it is noisy (i.e. there are anecdotal occurrences of words that should be ignored).
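In practice, this low-rank approximation is a truncated singular value decomposition of the TF-IDF matrix. A minimal sketch (scikit-learn, toy documents; the real analysis is in the notebook) looks like this:

```python
# A minimal sketch of LSA: TF-IDF n-gram features reduced with a truncated SVD
# (toy documents for illustration; not the actual essay corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "the dog chased the cat",
    "dogs and cats are common pets",
    "I do not love writing essays",
]

lsa = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # TF-IDF weighted uni- and bigrams
    TruncatedSVD(n_components=2),         # low-rank "concept" space
)
doc_concepts = lsa.fit_transform(docs)    # shape: (n_documents, n_components)
print(doc_concepts.shape)                 # (3, 2)
```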
Finally, words are compared with LSA by finding the cosine of the angle between their vectors (or the dot product between the normalizations of the two vectors). Values that are near 1 represent very similar words and values near 0 represent very dissimilar words.
Comparing two word vectors via the cosine of the angle between them:
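$$\cos(\theta) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$$

where $\mathbf{a}$ and $\mathbf{b}$ are the LSA vectors of the two words being compared.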
After applying LSA to the n-gram features, we extracted a couple of additional features about the essays, namely the number of words and the essay type. We then used these features as predictors in our model. The model that we used was a linear LASSO model. We decided to use LASSO as opposed to other regularization techniques because of its dimensionality-reduction characteristic: when we use an n-gram model over our essay corpus, we have over 50,000 predictors, and LASSO allows us to eliminate many of those predictors that are not predictive of essay score. We also performed K-fold cross-validation to show that LASSO performed better than a ridge approach.
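As a sketch of why LASSO helps here (hypothetical, mostly-irrelevant features rather than the actual essay predictors), the L1 penalty drives most coefficients to exactly zero:

```python
# Illustration of LASSO's built-in feature selection on a synthetic feature
# matrix where only a handful of the 500 predictors actually matter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                                       # 500 candidate predictors
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)   # only 5 are relevant

lasso = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```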
Cross-validation of the tuning parameter: We used 4-fold cross-validation to find the optimal tuning parameter for our LASSO model. We did this tuning for models without LSA (i.e. using only the n-gram features) and compared them to models that employed LSA. Finally, we tuned across various n-gram sizes to find our optimal model, which ended up using n-grams of size 15 (roughly the size of the average sentence).
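A minimal sketch of this tuning step (hypothetical penalty grid and synthetic data, for illustration only):

```python
# Choosing the LASSO penalty by 4-fold cross-validation over a small alpha grid.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=4,                          # 4-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)
```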
Our accuracy metric - Spearman’s rho: To evaluate the goodness of our models, we used Spearman’s rank correlation, which is the Pearson correlation between the ranks of two variables. It is more appropriate here than Pearson’s correlation because it captures monotonic relationships that need not be linear.
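Computing it is a one-liner with SciPy (toy scores shown for illustration):

```python
# Spearman's rank correlation between true and predicted essay scores
# (toy numbers for illustration).
from scipy.stats import spearmanr

y_true = [2, 4, 4, 6, 8, 10]
y_pred = [2.1, 3.7, 4.4, 5.9, 8.5, 9.2]
rho, p_value = spearmanr(y_true, y_pred)
print(round(rho, 3))
```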
Results: Ultimately, we found that a 15-gram linear LASSO with LSA applied (100 components) produced the best model, yielding a Spearman’s rho of about 0.91. We learned two main things in the model-building phase: non-reduced unigram models perform better when LSA is not employed, and LSA increases performance for models with bigrams and above. This is probably because bigram (and larger) models encode information that unigram models do not (e.g. they pick up on the difference between “not good” and “good”).
Note that all of this analysis can be found in a GitHub repository. The final notebook can be found at `analysis/Automatic Essay Grading.ipynb`. It will use datasets found in `analysis/datasets/`.