I've set up a public stream via AWS to collect tweets and now want to do some preliminary analysis. All my data is stored in an S3 bucket (in 5 MB files).
I downloaded everything and merged all the files into one. Each tweet is stored as a standard JSON object as per Twitter specifications.
Basically, the consolidated file contains multiple JSON objects. I added opening and closing square brackets ([]) so that it reads into Python as a list of dictionaries. So the structure is roughly like this (I'm not sure if I can just post Twitter data here):
[{"created_at":"Mon Sep 19 23:58:50 +000 2016", "id":<num>, "id_str":"<num>","text":"<tweet message>", etc.},
{same as above},
{same as above}]
After deleting the very first tweet, I put everything into www.jsonlint.com and confirmed that it is a valid JSON data structure.
Now, I'm trying to load this data into Python and hoping to do some basic counts of different terms in tweets (e.g. how many times is @HillaryClinton mentioned in the text of a tweet, etc.).
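For context, the kind of count I'm after is roughly this (just a sketch on a couple of made-up tweets; the real data would of course come from the file):

tweets = [
    {"text": "I'm with @HillaryClinton tonight"},
    {"text": "debate recap"},
]
mention_count = sum(1 for tweet in tweets
                    if '@HillaryClinton' in tweet.get('text', ''))
print(mention_count)  # 1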
Previously with smaller datasets, I was able to get away with code like this:
import json
import csv
import io

data_json = open('fulldata.txt', 'r', encoding='utf-8')
data_python = json.load(data_json)  # loads the entire file into memory at once
I then wrote the data for respective fields into a CSV file and performed my analyses that way. This worked for a 2GB file.
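The CSV step looked roughly like this, continuing from data_python above (a reconstructed sketch; the exact fields I pulled are from memory, so treat them as assumptions):

import csv

with open('parsed_data_small.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['created_at', 'text', 'screen_name'])
    for tweet in data_python:
        writer.writerow([tweet.get('created_at'),
                         tweet.get('text'),
                         tweet.get('user', {}).get('screen_name')])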
Now that I have a 7GB file, I am noticing that if I use this method, Python throws an error on the "json.load(data_json)" line saying "OSError: [Errno 22] Invalid argument".
I'm not sure why this is happening but I anticipate that it might be because it's trying to load the entire file at once into memory. Is this correct?
So I tried to use ijson, which apparently lets you parse through the JSON file incrementally. I wrote the following code:
import ijson

f = open('fulldata.txt', 'r', encoding='utf-8')
content = ijson.items(f, 'item')
for item in content:
    <do stuff here>
With this implementation, I get an error on the "for item in content" line saying "ijson.backends.python.UnexpectedSymbol: unexpected symbol '\u201c' at 1".
I also tried going through the data file line by line and treating it as JSON Lines format. So, assuming each line was a JSON object, I wrote:
raw_tweets = []
with open('fulldata.txt', 'r', encoding='utf-8') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))

print(len(raw_tweets))  # this worked -- got around 2 million, as expected!
But here, each entry into the list was a string and not a dictionary which made it really hard to parse the data I needed out of it. Is there a way to modify this last code to make it work as I need? But even then, wouldn't loading that whole dataset into a list make it still hard for future analyses given memory constraints?
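To make the problem concrete, this is what the entries look like right after that loop:

# Continuing from the loop above: every entry is a string, not a dict.
first = raw_tweets[0]
print(type(first))   # <class 'str'>
# first.get('text')  # would fail: 'str' object has no attribute 'get'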
I'm a little stuck about the best way to proceed with this. I really want to do this in Python because I'm trying to learn how to use Python tools for these kinds of analyses.
Does anyone have any experience with this? Am I being really stupid or misunderstanding something really basic?
EDIT:
So, I first went to www.jsonlint.com and pasted my entire dataset and found that after removing the first tweet, it was in valid JSON format. So for now I just excluded that one file.
I basically have a dataset in the format mentioned above ([{json1}, {json2}]), where each entity in the {} represents a tweet.
Now that I've confirmed it is valid JSON, my goal is to get it into Python with each JSON object represented as a dictionary (so I can easily manipulate the tweets). Can someone correct my thought process here if it's inefficient?
To do so, I did:
raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))
# This successfully wrote each line of my file into a list. Confirmed by checking the length, as described previously.

# Now I want to write this out to a CSV file.
csv_out = io.open("parsed_data.csv", mode='w', encoding='ISO-8859-1')
fields = u'created_at,text,screen_name,followers,friends,rt,fav'
csv_out.write(fields)  # Write the column headers out.
csv_out.write(u'\n')

# Now, iterate through the list. Get each JSON object as a dictionary and pull out the relevant information.
for tweet in raw_tweets:
    # Each "tweet" is '{json#},\n'
    current_tweet = json.loads(tweet)  # Right now each entry is just a string in {} format, not a dictionary. If I parse it as JSON, I should get a dictionary form of the data, right?
    row = [current_tweet.get('created_at'),
           '"' + current_tweet.get('text').replace('"', '""') + '"',
           current_tweet.get('user').get('screen_name')]
    # ...and I continue this for all relevant headers
The problem is that the last part, where I call current_tweet.get, isn't working because it keeps saying that 'str' object has no attribute 'get', so I'm not sure why json.loads() isn't giving me a dictionary...
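To show what I think is happening, here's a tiny self-contained reproduction (the fake line is made up, but the behavior matches what I see):

import json

line = '{"text": "hello"}\n'   # pretend this is one raw line from my file
stored = json.dumps(line)      # this is what I append in the loop above
print(stored)                  # "{\"text\": \"hello\"}\n" -- a JSON string literal
parsed = json.loads(stored)
print(type(parsed))            # <class 'str'>, not a dict, so .get() fails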
EDIT#2
A user recommended I remove the [ and ] and also the trailing commas so that each line has valid JSON. That way I could just json.loads() each line. I removed the brackets as suggested. For the commas, I did this:
raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        no_comma = line[:-2]  # Printed this to confirm that the final comma was removed
        raw_tweets.append(json.loads(no_comma))
This is giving an error saying ValueError: Expecting ':' delimiter: line 1 column 2305 (char 2304).
To debug this, I printed the first line (i.e. I just said print(no_comma)) and I noticed that what Python printed actually had multiple tweets inside... When I open it in an editor like "UltraEdit" I notice that each tweet is a distinct line so I assumed that each JSON object was separated by a newline character. But here, when I print the results after iterating by line, I see that it's pulling in multiple tweets at once.
Should I be iterating differently? Is my method of removing the commas appropriate or should I be pre-processing the file separately?
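In case it helps, this is what I have in mind by "pre-processing the file separately" (a rough sketch; it assumes every tweet object really does sit on its own line, which I haven't confirmed yet, and the cleaned filename is just a placeholder):

with open('fulldata.txt', 'r', encoding='ISO-8859-1') as src, \
     open('fulldata_cleaned.txt', 'w', encoding='ISO-8859-1') as dst:
    for line in src:
        line = line.strip()
        if line.startswith('['):
            line = line[1:]         # strip the opening bracket from the first line
        if line.endswith(']'):
            line = line[:-1]        # strip the closing bracket from the last line
        if line.endswith(','):
            line = line[:-1]        # drop the trailing comma
        if line:
            dst.write(line + '\n')  # one JSON object per line in the cleaned file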
I'm pretty sure that my JSON is formatted poorly but I'm not sure why and how to go about fixing it. Here is a sample of my JSON data. If this isn't allowed, I'll remove it...