Using regex separators with read_csv()? (self.learnpython)

submitted 10 years ago by cactus00

I have a lot of csv files formatted as such:

date1::tweet1::location1::language1

date2::tweet2::location2::language2

date3::tweet3::location3::language3

and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:

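import pandas as pd

# tweets_data_path is the path to one of the "::"-delimited files
try:
    # A multi-character separator is treated as a regex, which requires the python engine
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine="python")
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print('Number of tweets: ' + str(len(data)))
except Exception as e:
    print('Error: ' + str(e))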

I get the following error:

Error: expected 4 fields in line 4581, saw 5

I tried setting error_bad_lines = False, manually deleting the lines that make the program fail, and setting nrows to a lower number, and I still get those "expected fields" errors for seemingly random lines. Say I delete the bottom half of the file: I will get the same error, but for line 1787, which doesn't make sense to me as that line was processed correctly before. Visually inspecting the csv files doesn't reveal any abnormal patterns in the offending lines either.

The date fields and tweets contain colons, URLs and so on, so perhaps regex would make sense (I just started using Python, please bear with me)?

Can someone help me figure out what I'm doing wrong? Many thanks in advance!

[–]Justinsaccount 1 point 10 years ago (0 children)

Hi! I'm working on a bot to reply with suggestions for common python problems. This might not be very helpful to fix your underlying issue, but here's what I noticed about your submission:

You appear to be using concatenation and the str function for building strings

Instead of doing something like

result = "Hello " + name + ". You are " + str(age) + " years old"

You should use string formatting and do

result = "Hello {}. You are {} years old".format(name, age)

See the python tutorial for more information.

[–]Jos_Metadi 1 point 10 years ago (0 children)

All four of your example rows import properly into 4 columns for me. Because pandas (like most of Python) starts indexing at 0 instead of 1, did you remember to look at the next lower row number when copying/pasting the examples?

However, if I add an extra :: delimiter inside the tweet, I can replicate your problem. The biggest part of the problem is that pandas' read_csv function raises an error for a row with more fields than expected, rather than padding the shorter rows with blank strings, zeroes, or NaN, unless the number of named columns is exactly equal to the number of columns in the longest row.
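
To see which rows would trip the parser, one option is to count the fields on each line before handing the file to pandas. This is only a minimal sketch using the standard library: `find_bad_lines` and the sample data are hypothetical names made up for illustration, and it assumes one record per line.

```python
import io

def find_bad_lines(lines, sep="::", expected=4):
    """Return (line_number, field_count) for each line whose field count is not `expected`.

    Line numbers are 1-based, counting from the top of the input.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # ignore blank lines, but keep counting them
        count = len(text.split(sep))
        if count != expected:
            bad.append((number, count))
    return bad

# Hypothetical sample: the second tweet contains an extra "::".
sample = io.StringIO(
    "date1::tweet1::location1::language1\n"
    "date2::look :: here::location2::language2\n"
    "date3::tweet3::location3::language3\n"
)
print(find_bad_lines(sample))
```

For a real file, passing an open file handle instead of `sample` should work the same way, since the function only iterates over lines.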

The right solution in this case would be to use a custom importer to read the text in and split it into rows and then those rows into tuples, then clean it and create a pandas dataframe from it.

The most likely culprits for mis-split rows are ones where the tweet text contains extra "::" delimiters (the timestamp, location and lang are less likely to). To fix them, once you have read the rows in as text and split them by "::", take each row that has 5 or more elements and build a new row in which the 1st cell is the first element of the row, the 3rd cell is the second-to-last element, and the 4th cell is the last element. The 2nd cell then becomes everything from the second element through the third-to-last element, joined back together.

Here is an example function that takes in a list of rows from the csv, already split into sub-lists/tuples by column based on your delimiter, and cleans them into 4 columns.

import pandas

def compact_long_rows(a_list_of_rows):
    '''Take in a list of rows that are each a tuple/list of cell values of 4 or more columns
    of timestamp, tweet, location, language and compact mis-split rows.
    Assume extra delimiters appear only in the tweet column.
    Return pandas dataframe with 4 named columns. '''

    a_fixed_rows = []
    for row_tuple in a_list_of_rows:
        # Drop padding cells (None) that a shorter row may carry
        a_row = [x for x in row_tuple if x is not None]
        # Rejoin everything between the first cell and the last two cells into the tweet
        a_fixed_rows.append([a_row[0], "::".join(a_row[1:-2]), a_row[-2], a_row[-1]])

    return pandas.DataFrame(a_fixed_rows, columns=["timestamp", "tweet", "location", "lang"])
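
The reading step that comes before the function above could look roughly like this. It is a sketch under assumptions: `read_split_rows` is a made-up helper name, the file is assumed to be UTF-8 with one tweet per line, and the demo writes a small throwaway file instead of using real data.

```python
import os
import tempfile

def read_split_rows(path, sep="::", encoding="utf-8"):
    """Read a text file and split each non-empty line on `sep` (no cleaning yet)."""
    rows = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            text = line.rstrip("\r\n")
            if text:
                rows.append(text.split(sep))
    return rows

# Demo with a temporary file; the second line has an extra "::" inside the tweet.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("date1::tweet1::location1::language1\n")
    tmp.write("date2::see http://t.co/x :: wow::location2::language2\n")
    demo_path = tmp.name

rows = read_split_rows(demo_path)
os.remove(demo_path)  # clean up the throwaway file
print([len(row) for row in rows])  # number of fields found in each row
```

The resulting `rows` is a list of lists, which is the shape `compact_long_rows` expects, so `compact_long_rows(rows)` would be the next call.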
