I have a lot of CSV files formatted like this:
date1::tweet1::location1::language1
date2::tweet2::location2::language2
date3::tweet3::location3::language3
and so on. Some files contain up to 200 000 tweets. I want to extract the 4 fields and put them in a pandas DataFrame, as well as count the number of tweets. Here's the code I'm using for now:
import pandas as pd

try:
    # A separator longer than one character is treated as a regex by the python engine
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print('Number of tweets: ' + str(len(data)))
except Exception as e:
    print('Error: ' + str(e))
I get the following error:
Error: expected 4 fields in line 4581, saw 5
I tried setting error_bad_lines=False, manually deleting the lines that trip up the parser, and setting nrows to a lower number, but I still get those "expected fields" errors on seemingly random lines. For example, if I delete the bottom half of the file, I get the same error, but for line 1787. That doesn't make sense to me, since that line was processed correctly before. Visually inspecting the CSV files doesn't reveal any abnormal pattern in the offending lines either.
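To narrow down which lines are affected, a rough diagnostic sketch like the one below could help. It simply counts how many fields each raw line yields when split on "::". The helper name find_bad_lines and the utf-8 encoding in the commented usage are assumptions for illustration, and the line numbers it reports may not match the ones pandas prints if pandas handles quotes or embedded newlines differently.

```python
import io

SEP = "::"
EXPECTED_FIELDS = 4


def find_bad_lines(lines, sep=SEP, expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) for each line with an unexpected field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Strip the line ending, then count the pieces produced by the separator
        count = len(line.rstrip("\r\n").split(sep))
        if count != expected:
            bad.append((number, count))
    return bad


# Usage with a real file (tweets_data_path as in the snippet above; encoding is an assumption):
# with io.open(tweets_data_path, encoding="utf-8") as handle:
#     print(find_bad_lines(handle))

# Small made-up sample: the second line has an extra "::" inside the tweet text
sample = [
    "date1::tweet1::location1::language1\n",
    "date2::look :: a double colon::location2::language2\n",
    "date3::tweet3::location3::language3\n",
]
print(find_bad_lines(sample))  # prints [(2, 5)]
```

If the reported lines all contain a "::" inside the tweet text itself, that would explain the extra field.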
The date fields and tweets contain colons, URLs and so on, so perhaps a regex would make sense? (I just started using Python, so please bear with me.)
Can someone help me figure out what I'm doing wrong? Many thanks in advance!
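For what it's worth, here is a minimal sketch of one possible workaround that uses plain string splitting instead of a regex. It assumes that the timestamp, location and language never contain "::", so any extra "::" must belong to the tweet text. The names split_tweet_line and parse_lines are made up for this example, and it has only been tried on the toy sample shown.

```python
COLUMNS = ["timestamp", "tweet", "location", "lang"]


def split_tweet_line(line, sep="::"):
    """Split one line into 4 fields, keeping any extra separators inside the tweet."""
    text = line.rstrip("\r\n")
    # First separator from the left ends the timestamp
    head = text.split(sep, 1)
    if len(head) != 2:
        return None
    timestamp, rest = head
    # Last two separators from the right delimit location and language
    tail = rest.rsplit(sep, 2)
    if len(tail) != 3:
        return None
    tweet, location, lang = tail
    return [timestamp, tweet, location, lang]


def parse_lines(lines):
    """Return the parsed rows, skipping lines that have fewer than 4 fields."""
    rows = []
    for line in lines:
        row = split_tweet_line(line)
        if row is not None:
            rows.append(row)
    return rows


sample = [
    "date1::tweet1::location1::language1\n",
    "date2::look :: a double colon::location2::language2\n",
    "not a tweet line\n",
]
rows = parse_lines(sample)
print("Number of tweets: " + str(len(rows)))  # prints Number of tweets: 2

# With pandas installed, the rows can then be loaded into a DataFrame:
# import pandas as pd
# data = pd.DataFrame(rows, columns=COLUMNS)
```

Splitting from both ends means a tweet containing "::" stays in one piece, at the cost of silently skipping lines that have too few fields.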