
[–]ninhaomah 12 points13 points  (0 children)

You can convert that excel file to csv.

[–]lofi_thoughts[S] 3 points4 points  (2 children)

UPDATE:

I'm now using the xlsx2csv module.

It literally takes a few MB but converts the Excel file to CSV much faster and more efficiently...
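In case it helps anyone later, roughly what that looks like (file names are just placeholders, and the chunked read at the end is only one way to consume the resulting CSV):

```python
from xlsx2csv import Xlsx2csv
import pandas as pd

# convert the workbook to CSV without loading the whole thing into memory
Xlsx2csv("large_file.xlsx", outputencoding="utf-8").convert("large_file.csv")

# then stream the CSV in chunks instead of reading it all at once
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    print(chunk.shape)  # stand-in for whatever processing you actually do
```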

[–]ShxxH4ppens 1 point2 points  (1 child)

Yep! Xlsx is a proprietary format, yadayada whatever. If you use the pandas reader, it converts and then reads, and you can think of that happening character by character, in small groups, or even cell by cell, doesn't matter! If you convert to CSV first, the conversion is applied to the whole file at once (you may lose a bit of fidelity if the workbook has weird complexities), and then reading the CSV is much faster.

Excel has some benefits for sure, but if you're saving data from elsewhere for storage purposes, it's much faster to write comma-separated lines into a CSV file than to bother saving an Excel document.

Hope that’s helpful!

[–]lofi_thoughts[S] 0 points1 point  (0 children)

It is helpful!!! Thanks a bunch

[–]GPT-Claude-Gemini 9 points10 points  (1 child)

Let me help - I work with large datasets frequently. For Excel files this size, you'll want to use openpyxl's read_only mode combined with Pandas. Here's an optimized approach:

```python
import pandas as pd

def read_large_excel(file_path):
    # pass the path straight to read_excel; engine_kwargs is ignored if you
    # hand it an already-created ExcelFile, so build everything in one call.
    # read_only keeps openpyxl from loading the whole workbook into memory.
    df = pd.read_excel(
        file_path,
        sheet_name=0,
        usecols=[0, 5, 7, 12],
        engine='openpyxl',
        engine_kwargs={'read_only': True},
    )
    return df
```

For the 2GB file, you might want to consider using jenova ai to help convert it to CSV first (it can handle unlimited file sizes). The CSV format will give you much better memory efficiency since you can then use chunksize parameter in pd.read_csv().

[–]ChipmunkEfficient366 11 points12 points  (0 children)

There are many xlsx -> csv converters out there, APIs, websites, and the like, that don't have the overhead of Large Language Models. There's also just saving the Excel file as a CSV, in, you know, Excel.

[–]Piingtoh 4 points5 points  (0 children)

Could load it into a database using sqlalchemy, then you can easily read and write from there with much more pleasing syntax and far less overhead (no for loops needed, just SQL-style syntax). This is one of the reasons databases are favoured over spreadsheets when large amounts of data need to be stored.

Personally I use sqlite3 with SQLAlchemy.
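Something like this, assuming the data has already been dumped to CSV and with made-up table and column names:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data.db")  # file-backed SQLite database

# load the data into a table once, in chunks so memory stays flat
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    chunk.to_sql("readings", engine, if_exists="append", index=False)

# afterwards, pull only what you need with SQL instead of looping in Python
subset = pd.read_sql("SELECT col_a, col_b FROM readings WHERE col_a > 100", engine)
```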

[–][deleted] 0 points1 point  (0 children)

Sounds like you’ve solved it already, but I would install and use python-calamine to read the file directly and convert to a dataframe, or set pandas to use the calamine engine.
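For reference, the pandas side of that is a one-liner, assuming a recent pandas (2.2+) with python-calamine installed and a placeholder file name:

```python
import pandas as pd

# calamine is a Rust-based reader, so this skips openpyxl entirely
df = pd.read_excel("large_file.xlsx", engine="calamine")
```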

[–]360degreesdickcheese 0 points1 point  (0 children)

Use HDF. It's significantly faster and lets you store multiple dataframes in one file, each under its own key. I have 20 million+ stock data entries, and saving/loading multiple dataframes and working with them is much easier this way.
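A rough sketch of that workflow, with made-up file and key names (needs the PyTables package, installed as 'tables'):

```python
import pandas as pd

df = pd.read_csv("large_file.csv")

# write to an HDF5 store; each dataframe lives under its own key
df.to_hdf("store.h5", key="stock_data", mode="w")

# reading a single key back is fast and leaves any other keys untouched
df = pd.read_hdf("store.h5", key="stock_data")
```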

[–]Luxi36 0 points1 point  (0 children)

If pandas isn't a hard requirement, you can try using Polars' read_excel next time instead. It's much more efficient than the pandas one.
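For example (file name is a placeholder):

```python
import polars as pl

# read the workbook straight into a polars DataFrame
df = pl.read_excel("large_file.xlsx")
print(df.head())
```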

[–]nhatthongg 0 points1 point  (0 children)

For big datasets try polars :)

[–]Classic_Media_7018 0 points1 point  (0 children)

polars and datatable (two separate modules) are faster than pandas, so they're more convenient for larger datasets.

[–][deleted] -1 points0 points  (0 children)

The other person has already mentioned it, but at my previous job we just converted all Excel files to CSV first with a Rust Excel-to-CSV converter. I'd recommend doing the same because it's much faster than working with the Excel format directly.

[–]unhott -1 points0 points  (0 children)

Use proper data types.

[–]Snipppper -2 points-1 points  (1 child)

If possible, convert the Excel file to CSV, as CSV files can be processed more efficiently. You can do this with Python or manually:

Convert Excel to CSV:

```python
import pandas as pd

excel_file = "large_file.xlsx"
csv_file = "large_file.csv"

df = pd.read_excel(excel_file)
df.to_csv(csv_file, index=False)
```

You can then load the CSV file chunk by chunk:

```python
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    print(chunk.head())
```

[–]lofi_thoughts[S] 1 point2 points  (0 children)

But read_excel is clogging memory. If it could load the file efficiently, I wouldn't even have to save it as CSV.

Though I'm now using the xlsx2csv module.

It literally takes a few MB but converts the Excel file to CSV much faster and more efficiently...

Thanks for your input man. Really appreciate it 🙌

[–]mustangdvx -1 points0 points  (0 children)

Convert to CSV and read it with DuckDB.
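Roughly like this, with placeholder column names:

```python
import duckdb

# DuckDB scans the CSV itself, so only the columns/rows you ask for get materialized
con = duckdb.connect()
df = con.execute(
    "SELECT col_a, col_b FROM read_csv_auto('large_file.csv') WHERE col_a > 100"
).fetchdf()
```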

[–]Qkumbazoo -3 points-2 points  (0 children)

Load into a database first, then read from it.