I am trying to scrape latitude and longitude of user from Twitter with respect to user names. The user name list is a csv file with more than 50 names in one input file. The below are two trials that I have made by far. Neither of them seems to be working. Corrections in any one of the program or an entirely new approach is welcome.
I have list of User_names
and I am trying to lookup user profile and pull the geolocation
from the profile or timeline. I could not find much of samples anywhere over Internet.
I am looking for a better approach to get geolocations of users from Twitter. I could not even find a single example that shows harvesting User location with reference to User_name or user_id. Is It even possible in first place?
Input: The input files have more than 50k rows
AfsarTamannaah,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#weneverletyoudownstr
ndtv,6.80E+17,12/24/2015,#chennaifloods
1andonlyharsha,6.79E+17,12/21/2015,#chennaifloods
Shashkya,6.79E+17,12/21/2015,#moneyonmobile
Shashkya,6.79E+17,12/21/2015,#chennaifloods
timesofindia,6.79E+17,12/20/2015,#chennaifloods
ANI_news,6.78E+17,12/20/2015,#chennaifloods
DrAnbumaniPMK,6.78E+17,12/19/2015,#chennaifloods
timesofindia,6.78E+17,12/18/2015,#chennaifloods
SRKCHENNAIFC,6.78E+17,12/18/2015,#dilwalefdfs
SRKCHENNAIFC,6.78E+17,12/18/2015,#chennaifloods
AmeriCares,6.77E+17,12/16/2015,#india
AmeriCares,6.77E+17,12/16/2015,#chennaifloods
ChennaiRainsH,6.77E+17,12/15/2015,#chennairainshelp
ChennaiRainsH,6.77E+17,12/15/2015,#chennaifloods
AkkiPritam,6.77E+17,12/15/2015,#chennaifloods
Code:
import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
import pandas as pd
import json
import csv
import sys
import time
CONSUMER_KEY = 'XYZ'
CONSUMER_SECRET = 'XYZ'
ACCESS_KEY = 'XYZ'
ACCESS_SECRET = 'XYZ'
auth = OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
api = tweepy.API(auth)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
data = pd.read_csv('user_keyword.csv')
df = ['user_name', 'user_id', 'date', 'keyword']
test = api.lookup_users(user_ids=['user_name'])
for user in test:
print user.user_name
print user.user_id
print user.date
print user.keyword
print user.geolocation
Error:
Traceback (most recent call last):
File "user_profile_location.py", line 24, in <module>
test = api.lookup_users(user_ids=['user_name'])
File "/usr/lib/python2.7/dist-packages/tweepy/api.py", line 150, in lookup_users
return self._lookup_users(list_to_csv(user_ids), list_to_csv(screen_names))
File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 197, in _call
return method.execute()
File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 173, in execute
raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{'message': 'No user matches for specified terms.', 'code': 17}]
I understand every user does not share the geolocation, but those who keep the profile publicly open from the if I can get geolocation shall be great.
User locations as name and/or lat lon is what I am looking for.
If this approach isn't correct then I am open to alternatives also.
Update One: After some deep search I found this website that provides a very close solution, But I am getting error while trying to read the userName
from the input file.
This says only 100 user's information can be grabbed what is the better way to lift that limitation ?
Code:
import sys
import string
import simplejson
from twython import Twython
import csv
import pandas as pd
#WE WILL USE THE VARIABLES DAY, MONTH, AND YEAR FOR OUR OUTPUT FILE NAME
import datetime
now = datetime.datetime.now()
day=int(now.day)
month=int(now.month)
year=int(now.year)
#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key='ABC',
app_secret='ABC',
oauth_token='ABC',
oauth_token_secret='ABC')
#INPUT HAS NO HEADER NO INDEX
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])
#ACCESS THE LOOKUP_USER METHOD OF THE TWITTER API -- GRAB INFO ON UP TO 100 IDS WITH EACH API CALL
users = t.lookup_user(user_id = ids)
#NAME OUR OUTPUT FILE - %i WILL BE REPLACED BY CURRENT MONTH, DAY, AND YEAR
outfn = "twitter_user_data_%i.%i.%i.csv" % (now.month, now.day, now.year)
#NAMES FOR HEADER ROW IN OUTPUT FILE
fields = "id, screen_name, name, created_at, url, followers_count, friends_count, statuses_count, \
favourites_count, listed_count, \
contributors_enabled, description, protected, location, lang, expanded_url".split()
#INITIALIZE OUTPUT FILE AND WRITE HEADER ROW
outfp = open(outfn, "w")
outfp.write(string.join(fields, "\t") + "\n") # header
#THE VARIABLE 'USERS' CONTAINS INFORMATION OF THE 32 TWITTER USER IDS LISTED ABOVE
#THIS BLOCK WILL LOOP OVER EACH OF THESE IDS, CREATE VARIABLES, AND OUTPUT TO FILE
for entry in users:
#CREATE EMPTY DICTIONARY
r = {}
for f in fields:
r[f] = ""
#ASSIGN VALUE OF 'ID' FIELD IN JSON TO 'ID' FIELD IN OUR DICTIONARY
r['id'] = entry['id']
#SAME WITH 'SCREEN_NAME' HERE, AND FOR REST OF THE VARIABLES
r['screen_name'] = entry['screen_name']
r['name'] = entry['name']
r['created_at'] = entry['created_at']
r['url'] = entry['url']
r['followers_count'] = entry['followers_count']
r['friends_count'] = entry['friends_count']
r['statuses_count'] = entry['statuses_count']
r['favourites_count'] = entry['favourites_count']
r['listed_count'] = entry['listed_count']
r['contributors_enabled'] = entry['contributors_enabled']
r['description'] = entry['description']
r['protected'] = entry['protected']
r['location'] = entry['location']
r['lang'] = entry['lang']
#NOT EVERY ID WILL HAVE A 'URL' KEY, SO CHECK FOR ITS EXISTENCE WITH IF CLAUSE
if 'url' in entry['entities']:
r['expanded_url'] = entry['entities']['url']['urls'][0]['expanded_url']
else:
r['expanded_url'] = ''
print r
#CREATE EMPTY LIST
lst = []
#ADD DATA FOR EACH VARIABLE
for f in fields:
lst.append(unicode(r[f]).replace("\/", "/"))
#WRITE ROW WITH DATA IN LIST
outfp.write(string.join(lst, "\t").encode("utf-8") + "\n")
outfp.close()
Error:
File "user_profile_location.py", line 35, in <module>
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1202, in __init__
ParserBase.__init__(self, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 918, in __init__
raise ValueError("cannot specify usecols when "
ValueError: cannot specify usecols when specifying a multi-index header
Tweepy
? Do you not know how to handle errors? – Sontaguser_ids=['user_name']
, which most likely fails since there is no twitter user nameduser_name
. – Familiar