Cosine Similarity

Long article this week

We’ve got two things to discuss in this post: design changes and a new rating function.

Design Change

After a discussion with my adviser, Dr. Charles Romney, it seems that the rating function has too much responsibility in the current design.

Conceptually, the rating function must evaluate both the sentence structure and the imitation of the original corpus. Sentences with 15 verb phrases are not typical English, so they should be rated poorly. But sentences with a normal number of verb phrase which have words about hot-air balloons should also be rated poorly.

This means the rating function is doing too much.

Partition the Evaluation

To combat this issue, I have offloaded the sentence structure part of the evaluation process back to the generation model. I did this by simplifying the language generation model: Rather than expanding from the ‘ROOT’ nonterminal via a random walk to terminal symbols, the generation system now expands only the lowest level of nonterminal symbols to terminal symbols. The figure below shows the new starting point circled in red.

Expanding from these nonterminal symbols has the effect of taking sentence structures directly from the original corpus but limits the flexibility of the CFG generation model. That tradeoff results in more realistic sentences before the imitation rating ever takes place.

New Rating Method

Now that we have sentences with better grammar to evaluate, we can rate them.

To avoid the length bias with the previous rating method (sum of bigram frequency values), I implemented a cosine-similarity-based rating function for generated sentences.

The text of all sentences in the corpus is tokenized as unigrams, bigrams, trigrams, or even quadrigrams. That’s a document. Then each generated sentence is tokenized in the same way. Those are the other documents.

The tokens are transformed into a term-document matrix, and cosine similarity comparison can be done between the whole corpus (document 1) and each generated sentence (documents 1 - N). (Scikit-learn has a great package for Python.)

Corpus · Doc 1 = score 1
Corpus · Doc 2 = score 2
...
Corpus · Doc N = score N

It seems that quadrigrams are too infrequent between a generated sentence and the corpus (as we’d expect), and trigram scores tend to cluster at the same few values. That seems to indicate that the same few trigrams are being scored in each sentence. Bigrams have a wider range of score values, so I think that bigrams may be the most specific token which is broadly applicable.

Maybe we can tie all this together next week.

Sample Generated Sentences

Unigram-Based Cosine Similarity

Score	Sentence
0.384333	anti-Trump and @usairforce think kneeling more If The manner.The , and my parents demand YOU ? .@POTUS
0.379415	GREAT But A.M. are taking larger FOR the Rocket , plus our clips am you ? Year

Bigram-Based Cosine Similarity

Score	Sentence
0.065988	careful & research Am bringing stronger of The Administration , and his echoes kneel I ! relief
0.060698	Little or PR continue noticing more of the honor , and their odds ‘RE He ! help

Trigram-Based Cosine Similarity

Score	Sentence
0.008517	Courageous & economy demand getting stronger By The @USCGSoutheast , plus our crews tune him ! player
0.008517	good & luncheon commend analyzing tougher by that #FEMA , and our @KellyannePolls am you ! tomorrow

Quadrigram-Based Cosine Similarity

Score	Sentence
0.004711	Courageous & economy demand getting stronger By The @USCGSoutheast , plus our crews tune him ! player
0.000000	Great But recruitment INSPIRE analyzing longer despite no @GOPChairwoman , and My Democrats demand She ! COS