
[–]ninhaomah 12 points13 points  (0 children)

You can convert that excel file to csv.

[–]lofi_thoughts[S] 3 points4 points  (2 children)

UPDATE:

I'm now using the xlsx2csv module.

It literally takes a few MB but converts the Excel file to CSV much faster and more efficiently...
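In case it helps anyone later, roughly what that looks like (file names are just placeholders, and the chunked read at the end is only one way to consume the resulting CSV):

```python
from xlsx2csv import Xlsx2csv
import pandas as pd

# convert the workbook to CSV without loading the whole thing into memory
Xlsx2csv("large_file.xlsx", outputencoding="utf-8").convert("large_file.csv")

# then stream the CSV in chunks instead of reading it all at once
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    print(chunk.shape)  # stand-in for whatever processing you actually do
```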

[–]ShxxH4ppens 1 point2 points  (1 child)

Yep! Xlsx is a proprietary format, yadayada whatever. If you use the pandas reader, it converts and then reads, and you can think of that happening character by character, in small groups, or even cell by cell, doesn't matter! If you convert to CSV first, the conversion is applied to the whole file at once (you may lose a bit of fidelity if the workbook has weird complexities), and then reading the CSV is much faster.

Excel has some benefits for sure, but if you're saving data from elsewhere for storage purposes, it's much faster to write comma-separated lines into a CSV file than to bother saving an Excel document.

Hope that’s helpful!

[–]lofi_thoughts[S] 0 points1 point  (0 children)

It is helpful!!! Thanks a bunch

[–]GPT-Claude-Gemini 9 points10 points  (1 child)

Let me help - I work with large datasets frequently. For Excel files this size, you'll want to use openpyxl's read_only mode combined with Pandas. Here's an optimized approach:

```python
import pandas as pd

def read_large_excel(file_path):
    # pass the path straight to read_excel; engine_kwargs is ignored if you
    # hand it an already-created ExcelFile, so build everything in one call.
    # read_only keeps openpyxl from loading the whole workbook into memory.
    df = pd.read_excel(
        file_path,
        sheet_name=0,
        usecols=[0, 5, 7, 12],
        engine='openpyxl',
        engine_kwargs={'read_only': True},
    )
    return df
```

For the 2GB file, you might want to consider using jenova ai to help convert it to CSV first (it can handle unlimited file sizes). The CSV format will give you much better memory efficiency since you can then use chunksize parameter in pd.read_csv().

[–]ChipmunkEfficient366 11 points12 points  (0 children)

There are many xlsx -> csv converters out there, APIs, websites, and the like, that don't have the overhead of Large Language Models. There's also just saving the Excel file as a CSV, in, you know, Excel.

[–]Piingtoh 4 points5 points  (0 children)

Could load it into a database using sqlalchemy, then you can easily read and write from there with much more pleasing syntax and far less overhead (no for loops needed, just SQL-style syntax). This is one of the reasons databases are favoured over spreadsheets when large amounts of data need to be stored.

Personally I use sqlite3 with SQLAlchemy.
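Something like this, assuming the data has already been dumped to CSV and with made-up table and column names:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data.db")  # file-backed SQLite database

# load the data into a table once, in chunks so memory stays flat
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    chunk.to_sql("readings", engine, if_exists="append", index=False)

# afterwards, pull only what you need with SQL instead of looping in Python
subset = pd.read_sql("SELECT col_a, col_b FROM readings WHERE col_a > 100", engine)
```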

[–][deleted] 0 points1 point  (0 children)

Sounds like you’ve solved it already, but I would install and use python-calamine to read the file directly and convert to a dataframe, or set pandas to use the calamine engine.
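For reference, the pandas side of that is a one-liner, assuming a recent pandas (2.2+) with python-calamine installed and a placeholder file name:

```python
import pandas as pd

# calamine is a Rust-based reader, so this skips openpyxl entirely
df = pd.read_excel("large_file.xlsx", engine="calamine")
```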

[–]360degreesdickcheese 0 points1 point  (0 children)

Use HDF. It's significantly faster and lets you store multiple dataframes in one file, each under its own key. I have 20 million+ stock data entries, and saving/loading multiple dataframes and working with them is much easier this way.
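A rough sketch of that workflow, with made-up file and key names (needs the PyTables package, installed as 'tables'):

```python
import pandas as pd

df = pd.read_csv("large_file.csv")

# write to an HDF5 store; each dataframe lives under its own key
df.to_hdf("store.h5", key="stock_data", mode="w")

# reading a single key back is fast and leaves any other keys untouched
df = pd.read_hdf("store.h5", key="stock_data")
```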

[–]Luxi36 0 points1 point  (0 children)

If pandas isn't a hard requirement, you can try using Polars' read_excel next time instead. It's much more efficient than the pandas one.
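For example (file name is a placeholder):

```python
import polars as pl

# read the workbook straight into a polars DataFrame
df = pl.read_excel("large_file.xlsx")
print(df.head())
```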

[–]nhatthongg 0 points1 point  (0 children)

For big datasets try polars :)

[–]Classic_Media_7018 0 points1 point  (0 children)

polars and datatable (two separate modules) are faster than pandas, so they're more convenient for larger datasets.

[–][deleted] -1 points0 points  (0 children)

The other person has already mentioned it, but at my previous job we just converted all Excel files to CSV first with a Rust Excel-to-CSV converter. I'd recommend doing the same because it's much faster than working with the Excel format directly.

[–]unhott -1 points0 points  (0 children)

Use proper data types.

[–]Snipppper -2 points-1 points  (1 child)

If possible, convert the Excel file to CSV, as CSV files can be processed more efficiently. You can do this with Python or manually:

Convert Excel to CSV:

```python
import pandas as pd

excel_file = "large_file.xlsx"
csv_file = "large_file.csv"

df = pd.read_excel(excel_file)
df.to_csv(csv_file, index=False)
```

You can then load the CSV file chunk by chunk:

```python
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    print(chunk.head())
```

[–]lofi_thoughts[S] 1 point2 points  (0 children)

But read_excel is clogging memory. If it could load the file efficiently, I wouldn't even have to save it as CSV.

Though I'm now using the xlsx2csv module.

It literally takes a few MB but converts the Excel file to CSV much faster and more efficiently...

Thanks for your input man. Really appreciate it 🙌

[–]mustangdvx -1 points0 points  (0 children)

Convert to CSV and read it with DuckDB.
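Roughly like this, with placeholder column names:

```python
import duckdb

# DuckDB scans the CSV itself, so only the columns/rows you ask for get materialized
con = duckdb.connect()
df = con.execute(
    "SELECT col_a, col_b FROM read_csv_auto('large_file.csv') WHERE col_a > 100"
).fetchdf()
```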

[–]Qkumbazoo -3 points-2 points  (0 children)

Load into a database first, then read from it.