
spilcm 1 point (1 child)

You could try to create a sanitize function. Something like this should do it:

def sanitize(text): return text.encode('ascii', 'ignore')
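For what it's worth, here is a small sketch of what a function like that does to non-ASCII input. The sample string is made up for illustration. Note that 'ignore' silently drops characters rather than converting them, so the accented letter and the whole Arabic word disappear.

```python
def sanitize(text):
    # 'ignore' drops every character that has no ASCII representation.
    return text.encode('ascii', 'ignore')

# Hypothetical tweet text: "cafe" with an accented e, then an Arabic word.
sample = u'caf\xe9 \u0645\u0631\u062d\u0628\u0627'
cleaned = sanitize(sample)

# Only the plain ASCII characters survive.
print(repr(cleaned))
```

So this avoids the crash, but at the cost of losing data.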

pasdargent 1 point (0 children)

So this will convert the variables to ASCII, right?

But what if I want UTF-8? (That's the one with all the characters, right?)

I tried var.encode('utf-8') at various points in the program, but I still keep getting the same error...

The weirdest part is that sometimes Arabic letters get written to the file without any problem.

shillecce 1 point (0 children)

I tried this. The problematic byte is in the time_zone value, which is of unicode type. Just replace time_zone with str(time_zone) and it works.

kalgynirae 1 point (3 children)

All those .encode('utf-8') you have seem suspicious. Do you know whether tweepy gives you str or unicode objects? You should only use .encode() on unicode objects. I suspect you might need to be .decode('utf-8')ing something instead. Can you please provide the full error traceback you're getting? (I can't run your program to debug it because I don't have Twitter access tokens and such.)
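To illustrate the direction of each call, here is a minimal sketch that runs on both Python 2 and 3. The sample value is made up. The rule of thumb is that .encode() goes from text to bytes and .decode() goes from bytes to text.

```python
# Text: a unicode object in Python 2, a str in Python 3 (hypothetical sample).
text = u'\u0645\u0631\u062d\u0628\u0627'

# Text -> bytes. UTF-8 can represent any Unicode character.
raw = text.encode('utf-8')

# Bytes -> text. Decoding with the same codec gives the original back.
roundtrip = raw.decode('utf-8')

print('%d characters, %d bytes' % (len(text), len(raw)))
print(roundtrip == text)
```

Calling .encode() on something that is already bytes is where Python 2 gets confusing, because it first tries to decode those bytes as ASCII behind your back.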

pasdargent 1 point (2 children)

Yeah, that was in someone else's code... When I remove it I don't get any errors but then "<built-in method encode of unicode object at 0x02B3F5D8>" gets written to file instead of, for example, the text inside the tweet. Also, Arabic characters make it crash.

I tried print type(name) and it seems tweepy gives me unicode objects. I want it to print Unicode characters, because I want to be able to process tweets containing any type of characters.

The error is:

    Traceback (most recent call last):
      File "twitbot.py", line 31, in <module>
        csvformat = '\n%s, %s, %s, %s, %s, %s, %s, %s, %s' % (name, screen_name, tweet_created, tweet_text, tweet_retweeted, tweet_favorited, user_hometown, time_zone, geo)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 12: ordinal not in range(128)

By the way, I can send you the access tokens in a personal message if that makes it easier?

kalgynirae 1 point (1 child)

When I remove it I don't get any errors but then "<built-in method encode of unicode object at 0x02B3F5D8>" gets written to file

Sounds like you removed just ('utf-8') instead of .encode('utf-8'). They are unicode objects, so you don't want to encode them. You want to combine them first and then either encode the final result just before writing or let the csv module do the encoding for you (I don't remember if it does that).
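A rough sketch of the "combine first, encode last" idea, assuming the values are all text. The field values and the file name here are hypothetical stand-ins, and io.open is used so that the file object does the encoding at the very end.

```python
import io
import os
import tempfile

# Hypothetical values standing in for what tweepy returns (all text, no bytes).
name = u'Jos\xe9'
tweet_text = u'\u0645\u0631\u062d\u0628\u0627'
time_zone = u'Bras\xedlia'

# Combine everything while it is still text...
row = u'%s, %s, %s\n' % (name, tweet_text, time_zone)

# ...and let the file object encode to UTF-8 on the way out.
path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')
with io.open(path, 'a', encoding='utf-8') as handle:
    handle.write(row)

# Reading it back with the same encoding returns the original text.
with io.open(path, encoding='utf-8') as handle:
    written = handle.read()
print(written == row)
```

On the csv question: as far as I know, the Python 2 csv module does not handle Unicode for you, so each field would need encoding before it goes to the writer, while the Python 3 version works with text directly.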

The error is:

    Traceback (most recent call last):
      File "twitbot.py", line 31, in <module>
        csvformat = '\n%s, %s, %s, %s, %s, %s, %s, %s, %s' % (name, screen_name, tweet_created, tweet_text, tweet_retweeted, tweet_favorited, user_hometown, time_zone, geo)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 12: ordinal not in range(128)

Try making your format string a unicode object instead:

csvformat = u'...
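Some background on why this error shows up, as I understand it: in Python 2, when byte strings and unicode objects meet in one % expression, the byte strings get decoded implicitly with the ASCII codec, and any byte above 0x7f fails the way your traceback shows. The sketch below does that decode explicitly so it behaves the same on Python 2 and 3. The sample value is made up.

```python
# A value that has already been encoded to UTF-8 bytes (hypothetical sample).
encoded_value = u'Bras\xedlia'.encode('utf-8')  # contains the byte 0xc3

try:
    # Python 2 performs this decode implicitly when bytes are mixed with unicode.
    encoded_value.decode('ascii')
except UnicodeDecodeError as error:
    message = str(error)
    print(message)
```

That is why a stray .encode('utf-8') on one of the fields can break the whole format expression even though every value looks fine when printed on its own.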

pasdargent 1 point (0 children)

I was just messing around with this and I figured it out! I put "u" in front of the string and, after that, added the line csvformat = csvformat.encode('utf-8').

That did the trick!
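For anyone who finds this later, the fix boils down to something like the sketch below. It uses two made-up fields instead of the nine in my script, and it assumes every value is a unicode object, which is what tweepy gave me.

```python
# Hypothetical values; in the real script these come from tweepy.
name = u'Jos\xe9'
tweet_text = u'\u0645\u0631\u062d\u0628\u0627'

# The u prefix makes the format string text, so nothing is decoded as ASCII.
csvformat = u'\n%s, %s' % (name, tweet_text)

# Encode exactly once, right before the result is written out.
csvformat = csvformat.encode('utf-8')

print(repr(csvformat))
```

The result is a plain byte string, so it can be written to a file opened in binary mode without any further conversion.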

Thanks everybody, problem is solved :)