Twitter API and Parsing
The last week began the process of exploratory analysis, to borrow a term from the machine learning discipline. I learned a bit about the Twitter API (actually more about a Python client for it), and started the parsing process. There are several types of parsing in NLP:
- Part of Speech tagging
- Syntactic Parsing
- Semantic Parsing
We can attempt to use all these methods to construct slots for a Tweet-generating dialog agent’s frame.
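To make the idea of a frame a little more concrete, here is a minimal sketch of one as a Python dataclass. The slot names (topic, hashtags, mentions) and the fill_from_tagged helper are purely illustrative assumptions; the real slots will depend on what the parsers turn out to expose. The hand-written input imitates the (token, tag) pairs a part-of-speech tagger returns, using Penn Treebank tag names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TweetFrame:
    """Hypothetical frame; every slot name here is illustrative."""
    topic: Optional[str] = None
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)

    def unfilled_slots(self) -> List[str]:
        """Names of slots that are still empty (None or an empty list)."""
        return [name for name, value in vars(self).items() if not value]


def fill_from_tagged(tagged: List[Tuple[str, str]]) -> TweetFrame:
    """Fill slots from (token, POS tag) pairs with a few naive rules."""
    frame = TweetFrame()
    for token, tag in tagged:
        if token.startswith('#'):
            frame.hashtags.append(token)
        elif token.startswith('@'):
            frame.mentions.append(token)
        elif tag in ('NNP', 'NNPS') and frame.topic is None:
            # Assumption: treat the first proper noun as the topic.
            frame.topic = token
    return frame


# Hand-tagged example input, not real tagger output.
tagged = [('Visiting', 'VBG'), ('Ohio', 'NNP'),
          ('tomorrow', 'NN'), ('#rally', 'NN')]
frame = fill_from_tagged(tagged)
```

A dialog agent could then check frame.unfilled_slots() to decide what it still needs before generating a tweet.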
Twitter API with Python Clients
python-twitter, from GitHub user bear, is an Apache 2.0 licensed Python client for communicating with the Twitter API. Another Python client is tweepy, from the GitHub organization of the same name, which is MIT-licensed. Both of these clients seem to be good options for acquiring tweet data from Twitter’s API programmatically.
Since they seem to be equal, I will go with python-twitter unless something changes my mind. (Maybe the MIT license would be more attractive in a production or commercial system. I’ll think on that.)
Twitter API Limits
Twitter exposes limited data through its API to standard users. My impression is that they do not maintain the infrastructure to provide more than a sampling of tweets to API users.
However, Twitter does partner with other organizations to provide full-archive search for Enterprise users.
Basically, search is limited to an unspecified sample of tweets from the past 7 days. Luckily for us, each user’s most recent 3,200 tweets are directly retrievable through what is called the user_timeline. Responses are JSON formatted, and rate limits apply:
- Requests / 15-min window (user auth): 900
- Requests / 15-min window (app auth): 1500
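To put those limits in perspective, here is a rough back-of-the-envelope sketch. It assumes pages of up to 200 tweets per request (the count used in the script later in this post) together with the 3,200-tweet cap; the constant and function names are my own and not part of any library.

```python
import math

WINDOW_SECONDS = 15 * 60      # length of one rate-limit window
USER_AUTH_LIMIT = 900         # requests per window with user auth
APP_AUTH_LIMIT = 1500         # requests per window with app auth
MAX_TIMELINE_TWEETS = 3200    # cap on retrievable tweets per user
PAGE_SIZE = 200               # assumed maximum tweets per request


def requests_per_timeline(max_tweets=MAX_TIMELINE_TWEETS,
                          page_size=PAGE_SIZE):
    """Requests needed to page through one full user timeline."""
    return math.ceil(max_tweets / page_size)


def timelines_per_window(limit):
    """Complete timelines that fit into one window at a given limit."""
    return limit // requests_per_timeline()


def min_seconds_between_requests(limit, window=WINDOW_SECONDS):
    """Even spacing between requests that never exceeds the limit."""
    return window / limit
```

Under these assumptions a full timeline costs 16 requests, so a single account's history fits comfortably inside one window with either kind of authentication.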
I don’t know how python-twitter handles the OAuth authentication, so I will need to check that out. The authentication process seems to be what determines the user/app auth status.
Twitter OAuth Docs & python-twitter Docs
Tweet Data
Using a simple script, I was able to download the 3,200 max tweets from @realDonaldTrump’s user_timeline and pickle them for later analysis.
import pickle

import twitter

# Replace the placeholders with your own application credentials.
api = twitter.Api(consumer_key='XXXXX',
                  consumer_secret='XXXXX',
                  access_token_key='XXXXX',
                  access_token_secret='XXXXX')

trump = 'realDonaldTrump'


def get_timeline_page(screen_name, max_id=None):
    """Fetch one page (up to 200 tweets) with ids no greater than max_id."""
    return api.GetUserTimeline(screen_name=screen_name, count=200,
                               trim_user=True, max_id=max_id)


# First page: no max_id, so we start from the newest tweet.
trump_tweets = get_timeline_page(trump)

# 16 pages of 200 cover the 3,200-tweet cap; the first page is already loaded.
for _ in range(15):
    # max_id is inclusive, so subtract 1 to avoid re-fetching the last tweet.
    new_max_id = trump_tweets[-1].id - 1
    page = get_timeline_page(trump, new_max_id)
    if not page:
        break  # no older tweets available
    trump_tweets += page

# Save the tweets for later analysis.
with open('trump_tweets.pkl', 'wb') as f:
    pickle.dump(trump_tweets, f)

# Reload them to check that the pickle round-trips.
with open('trump_tweets.pkl', 'rb') as f:
    tt = pickle.load(f)
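The paging logic can also be tried out without credentials. The sketch below swaps the real API for a stand-in: fetch_page is a hypothetical callable that plays the role of GetUserTimeline, the tweets are reduced to bare integer ids, and fake_ids is invented test data. It assumes that max_id is an inclusive bound, which is why the cursor is moved to one below the oldest id seen so far.

```python
def collect_timeline(fetch_page, limit=3200, page_size=200):
    """Page backwards through a timeline using max_id cursoring.

    fetch_page(max_id, count) is a hypothetical stand-in for the real API
    call. It must return ids newest-first, none greater than max_id.
    """
    collected = []
    max_id = None  # None means "start from the newest tweet"
    while len(collected) < limit:
        count = min(page_size, limit - len(collected))
        page = fetch_page(max_id, count)
        if not page:
            break  # ran out of older tweets
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest id we hold.
        max_id = page[-1] - 1
    return collected


def make_fake_fetch(ids):
    """Build an in-memory fetch_page over a list of invented tweet ids."""
    ordered = sorted(ids, reverse=True)

    def fetch_page(max_id, count):
        eligible = [i for i in ordered if max_id is None or i <= max_id]
        return eligible[:count]

    return fetch_page


fake_ids = list(range(1000, 1450))  # 450 made-up ids
tweets = collect_timeline(make_fake_fetch(fake_ids))
```

With 450 fake ids the loop makes three non-empty requests and stops on the empty fourth page. Each id is collected exactly once, which is the property that breaks if the same max_id is reused without subtracting 1.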
Parsing
There are two major Python packages for NLP available to me: Natural Language Toolkit (NLTK) and spaCy. Both have benefits. I will attempt to learn both and decide which serves my needs better.
NLTK
NLTK is primarily a pedagogical tool, developed alongside the Natural Language Processing with Python textbook by Steven Bird, Ewan Klein, and Edward Loper, which is available online. The standard Anaconda installation of Python 3 also includes the nltk package, so it is readily accessible.
spaCy
spaCy is an open-source, commercially-focused Python package from a former academic researcher. It seems to outperform the NLTK package, especially in syntactic parsing speed, so it might be a better fit.
Next up: learn the APIs and parse some tweets (and read more... and more...).