Filtering of tweets received from statuses/filter (streaming API)
I have N different keywords that I am tracking (for the sake of simplicity, let N=3). So in GET statuses/filter, I will give 3 keywords in the "track" argument.

The tweets that I receive can match ANY of the 3 keywords. The problem is that I want to work out which tweet corresponds to which keyword, i.e. build a mapping between each tweet and the keyword(s) from the "track" argument that it matched.

Apparently, there is no way to do this without doing any processing on the tweets received.

So I was wondering what the best way to do this processing is. Search for the keywords in the text of the tweet? What about case-insensitivity? What about a keyword that consists of multiple words, e.g. "Katrina Kaif"?

I am currently trying to formulate some regular expression...

I was thinking the BEST way would be to use the same logic (regular expressions etc.) that the statuses/filter API itself uses. How can I find out what logic the Twitter statuses/filter API uses to match tweets to keywords?

Advice? Help?

P.S.: I am using Python, Tweepy, regex, MongoDB/Apache S4 (for distributed computing)

Chelsea answered 17/5, 2013 at 6:5 Comment(4)
For larger N, a regular expression might be quite a pain. The simplest way would be to convert the text to lower case and check the tweet for each keyword. If you want exact matching, you could tokenize your tweets and take the intersection of your keyword set and the token set. The intersection will be the keywords matching the tweet.Holifield
@Holifield : Currently, I have N = 100. It is preferable to search for the keywords only in the "text" part of the tweet, right?Chelsea
Yeah, as far as I know Twitter matches the text part of the tweet only, so checking the text part will be more suitable for you.Holifield
@Chelsea I have the same use case. Did you settle on a solution? If so, do you mind sharing your approach?Denitadenitrate
B
2

The first thing that comes to mind is to create a separate stream for every keyword and start each one in its own thread, like this:

from threading import Thread
import tweepy


# Written against the Tweepy API of the time (tweepy.StreamListener).
class StreamListener(tweepy.StreamListener):
    def __init__(self, keyword, api=None):
        super(StreamListener, self).__init__(api)
        self.keyword = keyword  # the single keyword this stream tracks

    def on_status(self, tweet):
        print('Ran on_status')

    def on_error(self, status_code):
        print('Error: ' + repr(status_code))
        return False  # stop the stream on error

    def on_data(self, data):
        # Every raw message arriving on this stream belongs to self.keyword.
        print(self.keyword, data)
        print('Ok, this is actually running')


def start_stream(auth, track):
    tweepy.Stream(auth=auth, listener=StreamListener(track)).filter(track=[track])


# Replace the placeholders with your own credentials.
auth = tweepy.OAuthHandler('<consumer_key>', '<consumer_secret>')
auth.set_access_token('<key>', '<secret>')

track = ['obama', 'cats', 'python']
for item in track:
    # One stream per keyword, each running in its own thread.
    thread = Thread(target=start_stream, args=(auth, item))
    thread.start()

If you still want to map tweets to keywords yourself within a single stream, look at Twitter's documentation on how the track request parameter is matched. There are some edge cases that could cause problems.
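As a rough sketch of the single-stream approach, the snippet below maps a tweet's text to the track phrases it matches. It assumes, as I understand the track documentation, that matching ignores case and that a multi-word phrase matches when all of its terms appear in the tweet, in any order. The helper names `tokenize` and `matched_keywords` are made up for this example, and only the tweet text is checked here, whereas Twitter may also consider URLs, hashtags and mentions.

```python
import re


def tokenize(text):
    """Lower-case the text and return its word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def matched_keywords(tweet_text, track):
    """Return the track phrases whose terms all occur as tokens in the tweet."""
    tokens = tokenize(tweet_text)
    matches = []
    for phrase in track:
        terms = tokenize(phrase)
        # Skip empty phrases; otherwise require every term to be present.
        if terms and terms <= tokens:
            matches.append(phrase)
    return matches


track = ["obama", "cats", "Katrina Kaif"]
print(matched_keywords("Kaif and Katrina love #cats", track))
# prints ['cats', 'Katrina Kaif']
```

Because the comparison is done on whole tokens, a keyword such as "cat" would not match a tweet that only contains "concatenate".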

Hope that helps.

Burthen answered 17/5, 2013 at 11:41 Comment(3)
The thing is that the Twitter API documentation recommends keeping the number of INDIVIDUAL streams as low as possible, because if there are too many stream connections from the same IP/account, it will get blacklisted. See this: dev.twitter.com/discussions/921Chelsea
Yeah, right, this is not an option generally, thanks for sharing.Burthen
Hmm... well, I guess for now I will just have to stick to matching EACH keyword (case-insensitively) against the text of EACH tweet, so as to form the mapping between tweet and keyword(s).Chelsea
B
0

Return list of any/all 'triggered' track terms

I had a closely related issue and solved it with a list comprehension. I had my track filter terms in a list, 'listoftermstofind', and the raw tweet texts in another list, 'rawtweetlist'. You can then run the following to get, for each tweet, a list of all the track terms that were found in it.

j = [x.upper() for x in listoftermstofind]  # your track filters, upper-cased for case-insensitive matching
ListOfTweets = [x.upper() for x in rawtweetlist]  # converting case to upper for all tweets
triggers = [[term for term in j if term in tweet] for tweet in ListOfTweets]  # terms found in each tweet

This works well because the track filters in the API are specific (down to the character level) rather than relying on any natural-language search processing. I recommend reading the API docs on filtering in detail; they go through the usage quite well: https://dev.twitter.com/streaming/overview/request-parameters
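A small worked example may help here. The sample terms and tweets are made up. It shows what the substring test returns and, as an optional variant that is not part of the approach above, a regular expression with word boundaries for cases where a term should not match inside a longer word.

```python
import re

# Made-up sample data for illustration.
listoftermstofind = ["cat", "python"]
rawtweetlist = ["My CAT hates Python", "How to concatenate strings"]

j = [x.upper() for x in listoftermstofind]
ListOfTweets = [x.upper() for x in rawtweetlist]

# Substring test: "CAT" is also found inside "CONCATENATE".
triggers = [[term for term in j if term in tweet] for tweet in ListOfTweets]
print(triggers)  # prints [['CAT', 'PYTHON'], ['CAT']]

# Variant: only accept a term when it appears as a whole word.
whole_word_triggers = [
    [term for term in j if re.search(r"\b" + re.escape(term) + r"\b", tweet)]
    for tweet in ListOfTweets
]
print(whole_word_triggers)  # prints [['CAT', 'PYTHON'], []]
```

Which variant is closer to what you need depends on how strictly you want to mirror the stream's own matching, so it is worth checking a few real tweets against both.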

Bremen answered 27/4, 2017 at 16:2 Comment(0)
