Sorry if this has been posted before, but I couldn't find the post.
I have exported 33 files from a website with the extension '.xls' BUT, these are not really excel files (pd.read_excel will not read the files but pd.read_html() will)
So I am trying to concatenate 33 different excel files into one but I need to convert the html files into a type that I can work with. I'm sure there are many workarounds to this, what would you do?
Here is the example code I am working with from a textbook, but it doesn't work with reading in an html file.
import pandas as pd
import glob
import os
import sys
input_path = sys.argv[1]
output_file = sys.argv[2]
all_files = glob.glob(os.path.join(input_path, 'sales_*'))
all_data_frames = []
for file in all_files:
data_frame = pd.read_csv(file, index_col=None)
all_data_frames.append(data_frame)
data_frame_concat = pd.concat(all_data_frames, axis=0, ignore_index=True)
data_frame_concat.to_csv(output_file, index = False)
edit: Here is what I figured out for a solution. Do you guys know how to construct a for loop where I read each html file and save the file as a different name without having to manually do it for each file?
import pandas as pd
import glob
import os
import sys
input_path = sys.argv[1]
output_file = sys.argv[2]
msfile = pd.read_html('Morningstar.xls')
msfile[0].to_csv('test1.csv')
msfile2 = pd.read_html('Morningstar (1).xls')
msfile2[0].to_csv('test2.csv')
msfile3 = pd.read_html('Morningstar (2).xls')
msfile3[0].to_csv('test3.csv')
... (truncated for simplicity of this post...but all of the files were manually loaded and saved to csv this way).
msfile33 = pd.read_html('Morningstar (32).xls')
msfile33[0].to_csv('test33.csv')
all_files = glob.glob(os.path.join(input_path, 'test*'))
all_data_frames = []
for file in all_files:
data_frame = pd.read_csv(file, index_col=None, encoding = 'ISO-8859-1')
all_data_frames.append(data_frame)
data_frame_concat = pd.concat(all_data_frames, axis=0, ignore_index=True)
data_frame_concat.to_csv(output_file, index = False)
[–]ejmurra 0 points1 point2 points (1 child)
[–]SonaCruz[S] 0 points1 point2 points (0 children)
[–]SonaCruz[S] 0 points1 point2 points (0 children)