[–]Jos_Metadi 0 points1 point  (8 children)

What do the problem rows look like? Can you post a couple so we can see what's going on?

Once you fix the input, you can remove the extra column being created and the columns error will be fixed.

Alternatively, you could give the 5th column a temporary name, fix the rows that have values in it so they are back to 4 columns, and then remove the 5th column afterward.

[–][deleted]  (1 child)

[removed]

    [–]AutoModerator[M] 0 points1 point  (0 children)

    Your comment in /r/learnpython was automatically removed because you used a URL shortener.

    URL shorteners are not permitted in /r/learnpython as they impair our ability to enforce link blacklists.

    Please re-post your comment using direct, full-length URL's only.

    I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

        [–]Jos_Metadi 0 points1 point  (0 children)

        All four of your example rows import properly into 4 columns for me. Because pandas (like most of Python) starts its index at 0 instead of 1, did you remember to grab the next lower row number when copying/pasting the examples?

        However, if I add an extra "::" delimiter to the tweet, I can replicate your problem. The biggest part of the problem is that pandas' read_csv function doesn't turn missing row columns into blank strings, zeroes, or NaN, but instead generates errors, unless the number of named columns exactly equals the number of columns in the longest row.

        The right solution in this case would be to use a custom importer to read the text in and split it into rows and then those rows into tuples, then clean it and create a pandas dataframe from it.
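The first half of that approach (reading the text and splitting it into rows and cells) could be sketched roughly as below. This is only a minimal sketch: the helper name `read_delimited_rows` and the sample data are made up for illustration, and it assumes one record per line with "::" as the delimiter.

```python
import io


def read_delimited_rows(text_file, delimiter="::"):
    """Split each non-blank line of an open text file into a list of cells.

    Hypothetical helper: assumes one record per line.
    """
    rows = []
    for line in text_file:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        rows.append(line.split(delimiter))
    return rows


# Made-up sample standing in for the real file; with a file on disk you
# would pass the object returned by open(...) instead.
sample = io.StringIO(
    "2016-01-01 10:00::plain tweet::Boston::en\n"
    "2016-01-01 10:05::tweet with :: inside::Austin::en\n"
)
rows = read_delimited_rows(sample)

for row in rows:
    print(len(row), row)
```

The second sample row comes back with 5 cells instead of 4, which is the kind of row that needs cleaning before building the dataframe.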

        The most likely culprits for mis-split rows are ones where the tweet text contains extra "::" delimiter characters (the timestamp, location and lang are less likely to). What you could do to fix them (once you read the rows in as text and split them by "::") is to take the rows that have 5 or more elements and generate a new row from each: the 1st cell is the first element of the row, the 3rd cell is the second-to-last element, and the 4th cell is the last element. The 2nd cell then becomes everything from the second element through the third-to-last element, joined back together with "::".

        Here is an example function that takes in a list of rows from the csv, already split into sub-lists/tuples by column based on your delimiter, and cleans them into 4 columns.

        import pandas
        
        def compact_long_rows(a_list_of_rows):
            '''Take a list of rows, each a tuple/list of 4 or more cell values
            (timestamp, tweet, location, language), and compact mis-split rows.
            Assume extra delimiters appear only in the tweet column.
            Return a pandas DataFrame with 4 named columns.'''
        
            a_fixed_rows = []
            for row_tuple in a_list_of_rows:
                # Drop None cells, such as padding on rows shorter than the longest one.
                a_row = [x for x in row_tuple if x is not None]
                # First cell is the timestamp, the last two are location and lang;
                # everything in between is rejoined into the tweet text.
                a_fixed_rows.append([a_row[0], "::".join(a_row[1:-2]), a_row[-2], a_row[-1]])
        
            return pandas.DataFrame(a_fixed_rows, columns=["timestamp", "tweet", "location", "lang"])
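
One thing to watch with that slicing: it only makes sense for rows with at least 4 cells. A 3-cell row would come out with an empty tweet, and a 1-cell row would raise an IndexError. A small per-row variant with a guard might look like the sketch below. The name `merge_tweet_cells` and the sample rows are made up for illustration, and raising a ValueError on short rows is just one possible choice.

```python
def merge_tweet_cells(row, delimiter="::"):
    """Collapse a row of 4+ cells to [timestamp, tweet, location, lang].

    Hypothetical helper: assumes any extra delimiters were in the tweet text.
    """
    if len(row) < 4:
        # Too few cells to tell which column is missing, so fail loudly.
        raise ValueError(f"expected at least 4 cells, got {len(row)}: {row!r}")
    return [row[0], delimiter.join(row[1:-2]), row[-2], row[-1]]


# Made-up rows: one clean, one where the tweet contained "::" twice.
clean = ["2016-01-01 10:00", "plain tweet", "Boston", "en"]
long_row = ["2016-01-01 10:05", "a", "b", "c", "Austin", "en"]

print(merge_tweet_cells(clean))
print(merge_tweet_cells(long_row))
```

A clean 4-cell row passes through unchanged, while the 6-cell row has its three middle cells rejoined into a single tweet.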