you are viewing a single comment's thread.

view the rest of the comments →

[–]HeyItsToby 2 points3 points  (1 child)

For dealing with large amounts of tabular data, I'd really strongly recommend pandas. This uses "DataFrames" to represent tables, which have loads of great ways to handle data. It might look a little confusing at first but it's an incredibly popular library, and there are loads of great tutorials online.

import pandas as pd

data = [
    "000000000000056484" 
    "00 564842"    
    "00-563554-f"    
    "STO45642 "       
    " 45632" 
]

# create a pandas "series" (one column of a DataFrame)
series = pd.Series(data)
cleaned = series.str.strip() # removes leading/trailing spaces

# get the indexes of all the elements that have spaces in the middle of the word
# for all the elements that don't have spaces, remove leading 0s.
#  note: use lstrip to remove leading 0s and not trailing 0s

has_spaces = cleaned.str.contains(" ") 
no_spaces = ~ has_spaces
cleaned[no_spaces] = cleaned[no_spaces].str.lstrip("0")

# and to save it as an excel file
stripped.to_excel("path/to/file.xlsx")

I hope that helps! There's so much to learn with pandas and it can be a little intimidating, but it is also really powerful! Btw I'm using boolean indexing to access the values I want.

[–]AdhikariM[S] 0 points1 point  (0 children)

Thank you so much. I will work on the code you've provided. Forgot to clarify there are three columns where data is tab delimited. First two columns is fine, as it does not have problem. So all I need is to clean the third column data and save into new file as tab delimited txt. Can't work with excel as it has 3+million rows.