NLP research regularly uses lexical resources to impart some linguistic intelligence to software. See Ignatow and Mihalcea, 2017 (chapter 4) for a basic introduction. One very important resource is WordNet and its derived resources, such as WordNet-Affect and SentiWordNet (GitHub repo here).

Text processing in NLP builds on raw text to create a more information-rich corpus. This enhanced corpus serves as a better foundation for analysis. See Ignatow and Mihalcea, 2017 (chapter 5) for an introduction.

WordNet

WordNet and its related resources classify words in the English language according to synsets and relations. Each synset, or synonym set, is related to other synsets by relations. A few selected WordNet relations are listed below for a relation Y of a word X, with a short NLTK sketch after the table:

| Relation | Part of Speech | Definition |
|----------|----------------|------------|
| Hypernym | Noun & Verb    | Y is a superset of X |
| Hyponym  | Noun & Verb    | Y is a subset of X |
| Troponym | Verb           | Y is a manner of doing X |
| Antonym  | Adjective      | Y has the opposite meaning of X |
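These lookups are easy to experiment with through NLTK's WordNet interface. A minimal sketch, assuming nltk is installed and the WordNet data has been downloaded; the example words are arbitrary:

```python
# Minimal sketch of synset and relation lookups with NLTK's WordNet interface.
# Assumes the 'wordnet' corpus has been downloaded; example words are arbitrary.
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog", pos=wn.NOUN)[0]       # Synset('dog.n.01')
print(dog.definition())

# Hypernyms: Y is a superset of X
print(dog.hypernyms())

# Hyponyms: Y is a subset of X
# (verb troponyms are also exposed through hyponyms() in NLTK)
print(dog.hyponyms()[:3])

# Antonyms are stored on lemmas rather than on synsets
good = wn.synsets("good", pos=wn.ADJ)[0]
print(good.lemmas()[0].antonyms())            # e.g. [Lemma('bad.a.01.bad')]
```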

Text Processing

There are several potentially useful resources for enhancing a corpus of tweets. Part of speech tagging, syntactic parsing, and sentiment analysis top the list.

There are many schemes for part-of-speech (POS) tagging. One of the more detailed tag sets comes from the Penn Treebank. A POS tagger steps through a corpus and assigns a tag from the tool's tag set to each token.
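As a concrete example, NLTK's default tagger produces Penn Treebank tags. A minimal sketch, assuming the tokenizer and tagger models are downloaded and using a made-up sentence:

```python
# Minimal sketch of Penn Treebank-style tagging with NLTK's default POS tagger.
# Assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded.
import nltk

tokens = nltk.word_tokenize("The tagger assigns a Penn Treebank tag to each token.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('tagger', 'NN'), ('assigns', 'VBZ'), ...]
```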

Syntactic parsing analyzes each sentence in a corpus to identify its structure, either as dependencies between words or as a phrase-structure tree.
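For instance, a dependency parse can be produced with spaCy (my choice here, not the only option). A small sketch, assuming the en_core_web_sm model is installed:

```python
# Rough sketch of a dependency parse with spaCy (one possible parser, my choice).
# Assumes `python -m spacy download en_core_web_sm` has been run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The parser identifies dependencies between words.")

for token in doc:
    # dep_ is the dependency label; head is the word this token attaches to
    print(f"{token.text:<12} {token.dep_:<8} head={token.head.text}")
```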

Sentiment analysis (semantic classification) identifies the sentiment of a piece of text, for example whether a tweet is happy, sad, or angry.
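For tweet-length text, NLTK's VADER analyzer is one option (the choice is my assumption, not a requirement). A quick sketch, assuming the vader_lexicon data is downloaded and using invented examples:

```python
# Quick sketch of sentiment scoring with NLTK's VADER analyzer, which was
# built with social-media text in mind. Assumes 'vader_lexicon' is downloaded.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this!"))        # positive compound score
print(sia.polarity_scores("This is the worst."))  # negative compound score
```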

These three tools will be my research focus for this week (17-23 Sept.).

Markov Probabilities

WordNet

If it is possible to calculate state-transition probabilities from a given bigram or trigram to any word in the corpus, it should also be possible to compute those transition probabilities at the hypernym level or the whole-synset level. Essentially, we could create a probabilistic representation of the kinds of things a corpus's author writes about.
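A speculative sketch of what this might look like: lift each token to a hypernym label via WordNet, then count bigram-to-next-label transitions. The first-synset lookup and the toy corpus are simplifying assumptions, not a finished method:

```python
# Speculative sketch: map tokens to hypernym labels via WordNet, then estimate
# bigram -> next-label transition probabilities from raw counts.
from collections import Counter, defaultdict
from nltk.corpus import wordnet as wn

def hypernym_label(word):
    """Label a word with its first synset's first hypernym, falling back to the word."""
    synsets = wn.synsets(word)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].name()
    return word

def transition_probs(tokens, n=2):
    """Estimate P(next label | previous n labels) from raw counts."""
    labels = [hypernym_label(t) for t in tokens]
    counts = defaultdict(Counter)
    for i in range(len(labels) - n):
        counts[tuple(labels[i:i + n])][labels[i + n]] += 1
    return {state: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for state, nxt in counts.items()}

tokens = "the dog chased the cat across the yard".split()
for state, probs in transition_probs(tokens).items():
    print(state, probs)
```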

Text Processing

As with the WordNet-based word classification above, we should be able to describe an individual's sentiment, syntactic structure, and part-of-speech trends probabilistically using Markov chains.
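For example, part-of-speech trends could be captured as a first-order Markov chain over Penn Treebank tags. A rough sketch, with the tagger choice (NLTK) and the toy text as assumptions:

```python
# Rough sketch of part-of-speech trends as a first-order Markov chain over
# Penn Treebank tags, estimated from raw transition counts.
from collections import Counter, defaultdict
import nltk

text = "The dog chased the cat. The cat ran up a tree."
tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]

transitions = defaultdict(Counter)
for prev, nxt in zip(tags, tags[1:]):
    transitions[prev][nxt] += 1

for prev, nxt_counts in transitions.items():
    total = sum(nxt_counts.values())
    print(prev, {t: round(c / total, 2) for t, c in nxt_counts.items()})
```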

Taken together, Markov chains may be able to describe more than the transition properties between an n-gram window and the following word in a corpus. If we can capture both semantic and syntactic information in the corpus, we may be able to enforce grammatical structure on generated text while maintaining a sense of authorship.

References

Gabriel Ignatow and Rada Mihalcea. 2017. Text mining: a guidebook for the social sciences. SAGE, Los Angeles, 1st edition.