Spring 2018 will be the implementation period for this M.S. Project. To recap, I’ll be developing a text generation system based on a Context-Free Grammar (CFG), where the grammar itself is derived from a publicly available corpus of tweets.

System Components

  1. Twitter API Interface – Downloads the tweet corpus
  2. Constituency Parser – Parses each tweet into a syntax tree; Stanford CoreNLP will probably do the trick
  3. CFG Production Generator – Turns the parse trees into production “rules” for legal speech
  4. Quasi-Random Sentence Generator – Expands the rules into candidate sentences
  5. Markov Probability Function – Scores candidates to make sure the tweet sounds like the corpus
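
To make component 3 concrete, here is a minimal sketch of how productions might be read off a parse tree. It assumes the parser emits Penn Treebank style bracketed strings, which is an assumption on my part about the output format I'll end up using. The helper names (tokenize, parse, productions) and the example sentence are hypothetical, not part of any library.

```python
from collections import Counter


def tokenize(bracketed):
    """Split a bracketed parse string into '(', ')' and symbol tokens."""
    return bracketed.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build a nested (label, children) tree; leaf words are plain strings."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(node())
            pos += 1                      # consume ")"
            return (label, children)
        word = tokens[pos]                # a leaf word
        pos += 1
        return word

    return node()


def productions(tree, counts=None):
    """Count one rule (LHS, RHS) for every internal node of the tree."""
    if counts is None:
        counts = Counter()
    label, children = tree
    # The right-hand side is the child labels, or the word itself for a leaf.
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            productions(child, counts)
    return counts


# Hypothetical parse of a three-word tweet.
example = "(S (NP (PRP I)) (VP (VBP love) (NP (NN coffee))))"
rules = productions(parse(tokenize(example)))
for (lhs, rhs), n in sorted(rules.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{n}]")
```

Keeping the counts, rather than just the set of rules, leaves the door open to weighting productions by how often they occur in the corpus.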

Issues

  1. The Stanford CoreNLP library can do constituency parsing, but it’s written in Java, so it’ll have to be interfaced with Python.
  2. Since productions in a CFG can be defined recursively, an unchecked expansion could recurse forever. I’ll need to write an algorithm that guarantees the Generator terminates.
  3. The Markov probability function will need tuning. A machine learning classifier will be investigated if time permits.
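
Issues 2 and 3 can be sketched together. What follows is one possible approach, not a settled design: a depth budget forces the generator to terminate, and an add-one smoothed bigram model stands in for the Markov probability function. The toy grammar, the two-sentence corpus, the budget of 6, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy grammar; NP is recursive through PP.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["NP", "PP"]],
    "PP": [["P", "NP"]],
    "VP": [["V", "NP"]],
    "N": [["cats"], ["coffee"]],
    "V": [["love"]],
    "P": [["with"]],
}


def alt_depth(rhs, grammar, depth):
    """Depth needed to finish one alternative; terminals cost nothing."""
    return 1 + max(
        (depth.get(s, math.inf) if s in grammar else 0 for s in rhs), default=0
    )


def min_depths(grammar):
    """Shallowest possible derivation per symbol, by fixed-point iteration."""
    depth = {}
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            best = min(alt_depth(rhs, grammar, depth) for rhs in alternatives)
            if best < depth.get(lhs, math.inf):
                depth[lhs] = best
                changed = True
    return depth


def generate(symbol, grammar, depth, budget, rng):
    """Expand symbol at random, using only alternatives that fit the budget."""
    if symbol not in grammar:
        return [symbol]                   # terminal word
    fitting = [r for r in grammar[symbol] if alt_depth(r, grammar, depth) <= budget]
    if not fitting:
        raise ValueError(f"budget too small to expand {symbol}")
    words = []
    for s in rng.choice(fitting):
        words.extend(generate(s, grammar, depth, budget - 1, rng))
    return words


def train_bigrams(sentences):
    """Count bigrams and their left-hand contexts over padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


rng = random.Random(0)
depth = min_depths(GRAMMAR)
corpus = [["cats", "love", "coffee"], ["cats", "love", "cats"]]
unigrams, bigrams = train_bigrams(corpus)
vocab_size = len({w for s in corpus for w in s} | {"</s>"})
candidates = [generate("S", GRAMMAR, depth, 6, rng) for _ in range(20)]
best = max(candidates, key=lambda w: log_prob(w, unigrams, bigrams, vocab_size))
print(" ".join(best))
```

Because every chosen alternative fits the remaining budget and the budget shrinks by one per level, the recursion can never go deeper than the starting budget. The raw log-probability favours short sentences, which is one of the things the tuning in issue 3 would have to address.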

I’ll start by finding a constituency parser and then work down the list. The Twitter API interface should be straightforward, so that’ll be last.