Implementation Phase
Spring 2018 will be the implementation period for this M.S. Project. To recap, I’ll be developing a text generation system based on a Context-Free Grammar (CFG) derived from a publicly available corpus of tweets.
System Components
- Twitter API Interface – Downloads the tweet corpus.
- Constituency Parser – Parses each tweet into a phrase-structure tree. Stanford CoreNLP will probably do the trick.
- CFG Production Generator – Extracts the production “rules” for legal speech from the parse trees.
- Quasi-Random Sentence Generator – Expands the productions into candidate sentences.
- Markov Probability Function – Scores candidates to make sure the tweet sounds like the corpus.
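To make the CFG Production Generator step more concrete, here is a minimal sketch of how productions could be read off a parser's output. It assumes the parser returns Penn Treebank-style bracketed strings (as CoreNLP does for constituency parses); the helper names `parse_bracketed` and `productions` and the sample parse are made up for illustration and are not part of any library.

```python
from collections import Counter

def parse_bracketed(text):
    """Turn a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def productions(tree):
    """Yield one (lhs, rhs) pair for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

# Hypothetical parser output for the sentence "I love coffee".
parse = "(ROOT (S (NP (PRP I)) (VP (VBP love) (NP (NN coffee)))))"
rule_counts = Counter(productions(parse_bracketed(parse)))
for (lhs, rhs), count in rule_counts.items():
    print(lhs, "->", " ".join(rhs), count)
```

Counting the productions rather than just collecting them keeps the door open for weighting rules by how often they occur in the corpus. In this sketch a symbol is treated as a nonterminal only if it appears on the left-hand side of some rule.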
Issues
- The Stanford CoreNLP library can do constituency parsing, but it’s written in Java, so it’ll have to be interfaced with Python.
- Since productions in a CFG can be defined recursively, the Generator could keep expanding a rule forever, so I’ll need to write an algorithm that guarantees it terminates.
- The Markov probability function will need tuning. A machine learning classifier will be investigated if time permits.
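One possible way to handle the recursion and scoring issues is sketched below, under several assumptions. Termination is enforced with a depth budget: the shortest derivation depth of each nonterminal is precomputed, and only rules that can still finish within the remaining budget are eligible. The Markov function is stood in for by a bigram model with add-one smoothing. The toy grammar, the tiny corpus, and every function name here are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Toy grammar (hypothetical). Keys are nonterminals; NP -> NP PP is recursive.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("coffee",), ("mornings",), ("NP", "PP")],
    "VP": [("fuels", "NP"), ("helps",)],
    "PP": [("in", "NP")],
}

def min_depths(grammar):
    """Shortest derivation depth per nonterminal, by fixed-point iteration."""
    depth = {nt: math.inf for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                # Terminals are not in `depth` and count as depth 0.
                d = 1 + max(depth.get(s, 0) for s in rhs)
                if d < depth[nt]:
                    depth[nt] = d
                    changed = True
    return depth

def generate(symbol, budget, grammar, depth, rng):
    """Expand `symbol` at random, never recursing deeper than `budget`."""
    if symbol not in grammar:
        return [symbol]  # terminal word
    allowed = [rhs for rhs in grammar[symbol]
               if all(depth.get(s, 0) <= budget - 1 for s in rhs)]
    if not allowed:
        raise ValueError(f"budget too small to expand {symbol}")
    words = []
    for s in rng.choice(allowed):
        words.extend(generate(s, budget - 1, grammar, depth, rng))
    return words

def train_bigrams(sentences):
    """Count word-to-word transitions, with sentence boundary markers."""
    counts, vocab = defaultdict(Counter), {"</s>"}
    for words in sentences:
        vocab.update(words)
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts, vocab

def log_prob(words, counts, vocab):
    """Average add-one-smoothed log probability per transition."""
    padded = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        seen = counts.get(prev, Counter())
        total += math.log((seen[cur] + 1) / (sum(seen.values()) + len(vocab)))
    return total / (len(padded) - 1)

rng = random.Random(0)
depth = min_depths(GRAMMAR)
corpus = [["coffee", "fuels", "mornings"], ["coffee", "helps"]]  # stand-in tweets
counts, vocab = train_bigrams(corpus)
candidates = [generate("S", 6, GRAMMAR, depth, rng) for _ in range(20)]
best = max(candidates, key=lambda w: log_prob(w, counts, vocab))
print(" ".join(best))
```

Averaging the log probability per transition keeps longer candidates from being penalized purely for their length. The smoothing constant and the budget of 6 are arbitrary choices here, and are exactly the kind of thing that would need tuning.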
I’ll start by finding a constituency parser and then work down the list. The Twitter API interface should be straightforward, so that’ll be last.
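Since the parser comes first, here is one way the Java/Python gap might be bridged: run CoreNLP as its bundled HTTP server and call it from Python using only the standard library. This is a sketch, not a settled design. It assumes a server is already running at `http://localhost:9000`, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

def build_request(text, server="http://localhost:9000"):
    """Build a POST request asking the CoreNLP server for constituency parses."""
    props = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return urllib.request.Request(
        server + "/?" + query, data=text.encode("utf-8"), method="POST"
    )

def extract_parses(response_json):
    """Pull the bracketed parse string out of each sentence in the response."""
    return [sentence["parse"] for sentence in response_json["sentences"]]

def parse_text(text, server="http://localhost:9000"):
    """Send text to the server and return one parse string per sentence."""
    with urllib.request.urlopen(build_request(text, server), timeout=30) as resp:
        return extract_parses(json.load(resp))
```

With the server up, `parse_text("I love coffee.")` should return a list of bracketed parse strings, one per sentence. Keeping the network call separate from the request building and response handling means the latter two can be tested without a running server.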