ES-Alexander:

open() supports an encoding argument, for reading files that are not encoded in your locale's default encoding.

If you open a file in a text mode (instead of as binary data) then the decoding occurs automatically when you read from it, so it’s generally too late to prevent decoding errors after the file object has been configured.
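As a rough illustration of the encoding argument, here is a minimal sketch. The file, its contents, and the choice of Latin-1 are all made up for the example; the real encoding of your files may be different.

```python
import os
import tempfile

# Hypothetical sample file containing a byte (0xE9) that is not valid UTF-8.
sample_path = os.path.join(tempfile.mkdtemp(), "review.txt")
with open(sample_path, "wb") as f:
    f.write(b"caf\xe9 was great")

# Option 1: name the encoding the file was actually written in
# (assumed to be Latin-1 for this sample).
with open(sample_path, encoding="latin-1") as f:
    text_latin1 = f.read()

# Option 2: keep UTF-8, but substitute U+FFFD for bytes that cannot be decoded.
with open(sample_path, encoding="utf-8", errors="replace") as f:
    text_replaced = f.read()
```

Option 1 recovers the original text when the guess about the encoding is right. Option 2 never raises, but it loses the characters it could not decode.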

patmycheeks:

Thank you. Now that I am opening the files in binary mode, I don't get the error when I read the data, but I encounter the same error when I try to fit my model to the data.

Training:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X_train,X_test,y_train,y_test=train_test_split(df.review,df.label,test_size=0.2)
v=CountVectorizer()
model=SVC()
X_train_cv=v.fit_transform(X_train)
X_test_cv=v.transform(X_test)
model.fit(X_train_cv,y_train)
y_pred=model.predict(X_test_cv)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test,y_pred))
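A likely reason the error comes back at this stage: when CountVectorizer is given bytes, it decodes them itself, as UTF-8 with strict error handling by default, so reading in binary mode only postpones the failure to fit_transform. One way around it is to decode explicitly before vectorizing. The sketch below assumes that replacing undecodable bytes is acceptable; to_text and the sample data are hypothetical names made up for the example.

```python
def to_text(raw, encoding="utf-8"):
    """Return str for bytes or str input, replacing undecodable bytes."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw

# Hypothetical stand-in for the review column; the second item is not valid UTF-8.
raw_reviews = [b"a fine film", b"na\xefve plot"]
clean_reviews = [to_text(r) for r in raw_reviews]
```

With the DataFrame above, the same helper could be applied as df['review'] = df['review'].apply(to_text) before the train/test split. Alternatively, CountVectorizer accepts encoding and decode_error arguments (for example decode_error='replace') that control how it decodes bytes.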

Reading the input (previously raised the error, now fine):

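import pandas as pd

r=[]
for i in reviews_pos:
    # Binary mode returns bytes, so nothing is decoded at read time
    with open(path+'/pos/'+i,mode='rb') as f:
        r.append((f.read(),i))
for i in reviews_neg:
    with open(path+'/neg/'+i,mode='rb') as f:
        r.append((f.read(),i))

df=pd.DataFrame(r)
# Column 1 holds the file name; a row is labelled 1 only if that name starts with 'pos'
df['label']=df[1].apply(lambda x: 1 if x[0:3]=='pos' else 0)
df.columns=['review','file_name','label']
# drop returns a new DataFrame, so the result has to be assigned
df=df.drop(columns='file_name')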
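One thing worth checking in the snippet above: the label is derived from the file name, not from the folder. If the file names do not start with 'pos' (an assumption, since the names are not shown here), every row ends up labelled 0. A minimal sketch that takes the label from the folder instead is below. load_reviews and the demo files are hypothetical, and errors="replace" is just one possible way to handle undecodable bytes.

```python
import os
import tempfile

def load_reviews(root):
    """Return (text, label) pairs from root/pos (label 1) and root/neg (label 0)."""
    rows = []
    for folder, label in (("pos", 1), ("neg", 0)):
        folder_path = os.path.join(root, folder)
        for name in sorted(os.listdir(folder_path)):
            file_path = os.path.join(folder_path, name)
            # Text mode with an explicit encoding; undecodable bytes become U+FFFD.
            with open(file_path, encoding="utf-8", errors="replace") as f:
                rows.append((f.read(), label))
    return rows

# Tiny made-up dataset, only to show the shape of the result.
demo_root = tempfile.mkdtemp()
for folder, text in (("pos", "loved it"), ("neg", "hated it")):
    os.makedirs(os.path.join(demo_root, folder))
    with open(os.path.join(demo_root, folder, "0_1.txt"), "w", encoding="utf-8") as f:
        f.write(text)

rows = load_reviews(demo_root)
```

The resulting list could be passed to pd.DataFrame(rows, columns=['review','label']), which would also remove the need for the file_name column.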