Scrape User Location from Twitter
Asked Answered
F

1

7

I am trying to scrape latitude and longitude of user from Twitter with respect to user names. The user name list is a csv file with more than 50 names in one input file. The below are two trials that I have made by far. Neither of them seems to be working. Corrections in any one of the program or an entirely new approach is welcome.

I have list of User_names and I am trying to lookup user profile and pull the geolocation from the profile or timeline. I could not find much of samples anywhere over Internet.

I am looking for a better approach to get geolocations of users from Twitter. I could not even find a single example that shows harvesting User location with reference to User_name or user_id. Is It even possible in first place?

Input: The input files have more than 50k rows

AfsarTamannaah,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#weneverletyoudownstr
ndtv,6.80E+17,12/24/2015,#chennaifloods
1andonlyharsha,6.79E+17,12/21/2015,#chennaifloods
Shashkya,6.79E+17,12/21/2015,#moneyonmobile
Shashkya,6.79E+17,12/21/2015,#chennaifloods
timesofindia,6.79E+17,12/20/2015,#chennaifloods
ANI_news,6.78E+17,12/20/2015,#chennaifloods
DrAnbumaniPMK,6.78E+17,12/19/2015,#chennaifloods
timesofindia,6.78E+17,12/18/2015,#chennaifloods
SRKCHENNAIFC,6.78E+17,12/18/2015,#dilwalefdfs
SRKCHENNAIFC,6.78E+17,12/18/2015,#chennaifloods
AmeriCares,6.77E+17,12/16/2015,#india
AmeriCares,6.77E+17,12/16/2015,#chennaifloods
ChennaiRainsH,6.77E+17,12/15/2015,#chennairainshelp
ChennaiRainsH,6.77E+17,12/15/2015,#chennaifloods
AkkiPritam,6.77E+17,12/15/2015,#chennaifloods

Code:

import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
import pandas as pd
import json
import csv
import sys
import time

CONSUMER_KEY = 'XYZ'
CONSUMER_SECRET = 'XYZ'
ACCESS_KEY = 'XYZ'
ACCESS_SECRET = 'XYZ'

auth = OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
api = tweepy.API(auth)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

data = pd.read_csv('user_keyword.csv')

df = ['user_name', 'user_id', 'date', 'keyword']

test = api.lookup_users(user_ids=['user_name'])

for user in test:
    print user.user_name
    print user.user_id
    print user.date
    print user.keyword
    print user.geolocation

Error:

Traceback (most recent call last):
  File "user_profile_location.py", line 24, in <module>
    test = api.lookup_users(user_ids=['user_name'])
  File "/usr/lib/python2.7/dist-packages/tweepy/api.py", line 150, in lookup_users
    return self._lookup_users(list_to_csv(user_ids), list_to_csv(screen_names))
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 197, in _call
    return method.execute()
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 173, in execute
    raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{'message': 'No user matches for specified terms.', 'code': 17}]

I understand every user does not share the geolocation, but those who keep the profile publicly open from the if I can get geolocation shall be great.

User locations as name and/or lat lon is what I am looking for.

If this approach isn't correct then I am open to alternatives also.

Update One: After some deep search I found this website that provides a very close solution, But I am getting error while trying to read the userName from the input file.

This says only 100 user's information can be grabbed what is the better way to lift that limitation ?

Code:

import sys
import string
import simplejson
from twython import Twython
import csv
import pandas as pd

#WE WILL USE THE VARIABLES DAY, MONTH, AND YEAR FOR OUR OUTPUT FILE NAME
import datetime
now = datetime.datetime.now()
day=int(now.day)
month=int(now.month)
year=int(now.year)


#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key='ABC', 
    app_secret='ABC',
    oauth_token='ABC',
    oauth_token_secret='ABC')

#INPUT HAS NO HEADER NO INDEX
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])

#ACCESS THE LOOKUP_USER METHOD OF THE TWITTER API -- GRAB INFO ON UP TO 100 IDS WITH EACH API CALL

users = t.lookup_user(user_id = ids)

#NAME OUR OUTPUT FILE - %i WILL BE REPLACED BY CURRENT MONTH, DAY, AND YEAR
outfn = "twitter_user_data_%i.%i.%i.csv" % (now.month, now.day, now.year)

#NAMES FOR HEADER ROW IN OUTPUT FILE
fields = "id, screen_name, name, created_at, url, followers_count, friends_count, statuses_count, \
    favourites_count, listed_count, \
    contributors_enabled, description, protected, location, lang, expanded_url".split()

#INITIALIZE OUTPUT FILE AND WRITE HEADER ROW   
outfp = open(outfn, "w")
outfp.write(string.join(fields, "\t") + "\n")  # header

#THE VARIABLE 'USERS' CONTAINS INFORMATION OF THE 32 TWITTER USER IDS LISTED ABOVE
#THIS BLOCK WILL LOOP OVER EACH OF THESE IDS, CREATE VARIABLES, AND OUTPUT TO FILE
for entry in users:
    #CREATE EMPTY DICTIONARY
    r = {}
    for f in fields:
        r[f] = ""
    #ASSIGN VALUE OF 'ID' FIELD IN JSON TO 'ID' FIELD IN OUR DICTIONARY
    r['id'] = entry['id']
    #SAME WITH 'SCREEN_NAME' HERE, AND FOR REST OF THE VARIABLES
    r['screen_name'] = entry['screen_name']
    r['name'] = entry['name']
    r['created_at'] = entry['created_at']
    r['url'] = entry['url']
    r['followers_count'] = entry['followers_count']
    r['friends_count'] = entry['friends_count']
    r['statuses_count'] = entry['statuses_count']
    r['favourites_count'] = entry['favourites_count']
    r['listed_count'] = entry['listed_count']
    r['contributors_enabled'] = entry['contributors_enabled']
    r['description'] = entry['description']
    r['protected'] = entry['protected']
    r['location'] = entry['location']
    r['lang'] = entry['lang']
    #NOT EVERY ID WILL HAVE A 'URL' KEY, SO CHECK FOR ITS EXISTENCE WITH IF CLAUSE
    if 'url' in entry['entities']:
        r['expanded_url'] = entry['entities']['url']['urls'][0]['expanded_url']
    else:
        r['expanded_url'] = ''
    print r
    #CREATE EMPTY LIST
    lst = []
    #ADD DATA FOR EACH VARIABLE
    for f in fields:
        lst.append(unicode(r[f]).replace("\/", "/"))
    #WRITE ROW WITH DATA IN LIST
    outfp.write(string.join(lst, "\t").encode("utf-8") + "\n")

outfp.close()    

Error:

File "user_profile_location.py", line 35, in <module>
    ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1202, in __init__
    ParserBase.__init__(self, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 918, in __init__
    raise ValueError("cannot specify usecols when "
ValueError: cannot specify usecols when specifying a multi-index header
Faery answered 26/6, 2016 at 15:5 Comment(6)
What are you asking? Do you not understand the error you're getting from Tweepy? Do you not know how to handle errors?Sontag
Please read your code, you are asking for user_ids=['user_name'], which most likely fails since there is no twitter user named user_name.Familiar
@Familiar seriously would appreciate some help with the code to get the locations with username.Faery
Your question is completely unclear. Edit your code and state clearly what is your requirement. Then only peeps will try to help.Bolan
@Bolan Thank you for the comment.. Please check I did include the direct problem in the beginning of the content and coming to the code if I knew what is the problem with code I wouldn't have posted here.Faery
so, you've a list of "screen_names" and you want to know the geolocation of the tweets that they tweet?Bolan
B
5

Assuming that you just want to get the location of the user that is put up in his/her profile page, you can just use the API.get_user from Tweepy. Below is the working code.

#!/usr/bin/env python
from __future__ import print_function

#Import the necessary methods from tweepy library
import tweepy
from tweepy import OAuthHandler


#user credentials to access Twitter API 
access_token = "your access token here"
access_token_secret = "your access token secret key here"
consumer_key = "your consumer key here"
consumer_secret = "your consumer secret key here"


def get_user_details(username):
        userobj = api.get_user(username)
        return userobj


if __name__ == '__main__':
    #authenticating the app (https://apps.twitter.com/)
    auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    #for list of usernames, put them in iterable and call the function
    username = 'thinkgeek'
    userOBJ = get_user_details(username)
    print(userOBJ.location)

Note: This is a crude implementation. Write a proper sleeper function to obey Twitter API access limits.

Bolan answered 2/7, 2016 at 22:56 Comment(2)
Exactly the same output along with all the input columns .. Let me check with full data. The list of usernames are more than 50k in one input. I hope it works for that. Will reply as soon as I check..Faery
The problem with the above two programs I wrote in question is about reading the users from file which has about 50K usernames. May I please request you to help me with that part.Faery

© 2022 - 2024 — McMap. All rights reserved.