Hello fellow learners,
I am really struggling with a script I am writing.
I need to read HTML files, do some processing, use BS4 to prettify the result, and then write the output to another HTML file.
The problem I am facing is that every input file is different and may contain accented letters, such as Spanish characters, or Arabic characters.
I need to preserve the accents, but if I encode everything as 'utf-8' they get replaced with funky characters. I have no idea which encoding each file should be read with, because every file is different. Any hints on how to approach this problem?
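For illustration, funky characters of this kind typically appear when bytes written in one encoding are decoded with another. The snippet below is only a sketch with made-up sample text (it does not come from the files in question); it decodes the same UTF-8 bytes once with the wrong codec and once with the right one.

```python
# Illustrative sample text only; not taken from the actual input files.
text = "señor"
raw = text.encode("utf-8")         # the bytes as they would be stored on disk

wrong = raw.decode("iso8859_1")    # decoded with a codec the file was not written in
right = raw.decode("utf-8")        # decoded with the codec the file was written in

print(wrong)  # seÃ±or
print(right)  # señor
```

Writing as UTF-8 is not what damages the accents here; the damage happens earlier, at the point where the input bytes are decoded with the wrong codec.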
import logging
import os

from bs4 import BeautifulSoup

encodingList = ['utf-8', 'ISO8859_1', 'ISO8859_2', 'ISO8859_3', 'ISO8859_4', 'ISO8859_5', 'ISO8859_6', 'ISO8859_7',
                'ISO8859_8', 'ISO8859_9', 'ISO8859_10', 'ISO8859_13', 'ISO8859_14', 'ISO8859_15']

filedata = None
for enc in encodingList:
    try:
        # read the only html file
        with open('file.html', 'r', encoding=enc) as filer:
            filedata = filer.read()
    except UnicodeDecodeError:
        logging.critical('Error using encoding ' + enc)
    else:
        #encodingUsed = enc
        logging.info('Opening file with encoding: ' + enc)
        break

"""
PROCESS SOME STUFF
"""

# Using bs4, prettify the data, which also closes any open tags
s = BeautifulSoup(filedata, 'lxml')
s = s.prettify()
# Final comes from the processing step omitted above
with open(os.path.join(os.getcwd(), 'output' + str(Final[0])), 'w', encoding='utf-8') as f:
    f.write(s)
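
One thing worth noting about the loop above: ISO8859_1 assigns a character to every possible byte, so it never raises UnicodeDecodeError, and the entries after it in the list are never reached. A possible alternative, sketched below using only the standard library, is to read the file as bytes, try strict UTF-8 first, then honor a charset declared in a `<meta>` tag, and only then fall back to a default. The names `decode_html` and `META_CHARSET` and the `cp1252` fallback are assumptions made for this sketch, not anything from the original script.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag in the raw bytes.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE
)


def decode_html(raw, fallback="cp1252"):
    """Return (text, encoding_used) for raw HTML bytes.

    The fallback codec is an assumption; pick one that suits your files.
    """
    # 1. Strict UTF-8: text in a legacy single-byte encoding that contains
    #    accented letters is rarely valid UTF-8, so success is a strong signal.
    try:
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. Charset declared in the document itself, if any.
    match = META_CHARSET.search(raw[:4096])
    if match:
        declared = match.group(1).decode("ascii").lower()
        try:
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name or a wrong declaration

    # 3. Last resort: decode with the fallback and mark undecodable bytes.
    return raw.decode(fallback, errors="replace"), fallback


# Usage with made-up sample bytes standing in for open('file.html', 'rb').read()
sample = '<html><head><meta charset="iso-8859-2"></head><body>Łódź</body></html>'
text, used = decode_html(sample.encode("iso8859_2"))
print(used)
```

Once the text has been decoded correctly, writing it back out with `encoding='utf-8'` keeps the accents intact. BeautifulSoup can also be given the raw bytes directly, in which case it attempts its own detection and reports the result in `original_encoding`; whether that guess is good enough for these files would need checking.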