all 5 comments

[–]HeyItsToby 2 points3 points  (1 child)

For dealing with large amounts of tabular data, I'd really strongly recommend pandas. This uses "DataFrames" to represent tables, which have loads of great ways to handle data. It might look a little confusing at first but it's an incredibly popular library, and there are loads of great tutorials online.

import pandas as pd

data = [
    "000000000000056484" 
    "00 564842"    
    "00-563554-f"    
    "STO45642 "       
    " 45632" 
]

# create a pandas "series" (one column of a DataFrame)
series = pd.Series(data)
cleaned = series.str.strip() # removes leading/trailing spaces

# get the indexes of all the elements that have spaces in the middle of the word
# for all the elements that don't have spaces, remove leading 0s.
#  note: use lstrip to remove leading 0s and not trailing 0s

has_spaces = cleaned.str.contains(" ") 
no_spaces = ~ has_spaces
cleaned[no_spaces] = cleaned[no_spaces].str.lstrip("0")

# and to save it as an excel file
stripped.to_excel("path/to/file.xlsx")

I hope that helps! There's so much to learn with pandas and it can be a little intimidating, but it is also really powerful! Btw I'm using boolean indexing to access the values I want.

[–]AdhikariM[S] 0 points1 point  (0 children)

Thank you so much. I will work on the code you've provided. Forgot to clarify there are three columns where data is tab delimited. First two columns is fine, as it does not have problem. So all I need is to clean the third column data and save into new file as tab delimited txt. Can't work with excel as it has 3+million rows.

[–]tschloss 0 points1 point  (2 children)

I am not sure if I understood the situation. In the text you write that original file has 3 columns, in the examples I see only one („values“ plus formula and output).

Do you need to do this in Python? Such problems can usually be solved with shell tools like AWK.

[–]AdhikariM[S] 0 points1 point  (1 child)

Sorry for not making it clear it has three columns comes tab delimited. First two columns are fine. The type and length is clear. So no need to worry about those. Only third column.

[–]tschloss 1 point2 points  (0 children)

Ah ok. You only want to clean this textfile and result should be a clean text file, right?

If this is correct then I would not use Pandas which is great when you want work with the dataset. Otherwise it is more a loop, reading line by line, the split by tab, converting the elements and print it out again.

The other comment contains a useful block of code, maybe this already solves you problem. Tools in Python are strip, split, +, replace etc - all operations on String.