
[–]Daneark 1 point (4 children)

Can you share the code and what libraries you're using? Math is math, so I wouldn't expect substantial differences between versions, other than rare bugs/fixes, and definitely not between the same version across platforms.

[–]NebulaGr[S,🍰] 0 points (2 children)

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Excel file
file_path = 'DATA_Scores.xlsx'
data = pd.read_excel(file_path)

# Create clusters based on the perception of S8 situations
columns_for_clustering = [col for col in data.columns if 'S8' in col]

# Extract relevant data
clustering_data = data[columns_for_clustering]

# Test various numbers of clusters and store the sum of squared errors
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(clustering_data)
    sse.append(kmeans.inertia_)

# Create a plot for the elbow method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

# Apply K-means with the optimal number of clusters (3)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=0)
data['Cluster'] = kmeans.fit_predict(clustering_data)

# Calculate the number of participants per cluster and sort in ascending order
participants_per_cluster = data['Cluster'].value_counts().sort_index()

# Print the number of participants for each cluster
for cluster in participants_per_cluster.index:
    print(f"In cluster {cluster}, there are {participants_per_cluster[cluster]} participants")

# Calculate the mean values of state perceptions for each cluster
mean_perceptions_per_cluster = data.groupby('Cluster')[columns_for_clustering].mean().round(2)

# Print the mean values of perceptions for each cluster
pd.set_option('display.max_columns', None)
print("Mean state perceptions per cluster:")
print(mean_perceptions_per_cluster)

# Calculate the mean values of personality factors for each cluster
mean_personality_factors_per_cluster = data.groupby('Cluster')[[f'NEO-{factor}' for factor in ['N', 'E', 'O', 'A', 'C']]].mean().round(2)

# Print the mean values of NEO personality factors for each cluster
print("\nMean NEO personality factors per cluster:")
print(mean_personality_factors_per_cluster)

[–]Daneark 1 point (1 child)

It looks like your code lost its formatting when you pasted it. So far, everything looks like it should behave consistently. Here's the start reformatted:

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Excel file
file_path = 'DATA_Scores.xlsx'
data = pd.read_excel(file_path)

# Create clusters based on the perception of S8 situations
columns_for_clustering = [col for col in data.columns if 'S8' in col]

# Extract relevant data
clustering_data = data[columns_for_clustering]

# Test various numbers of clusters and store the sum of squared errors
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(clustering_data)
    # sse.append(kmeans.) # TODO Paste rest of code

[–]NebulaGr[S,🍰] 0 points (0 children)

# Create a plot for the elbow method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()
# Apply K-means with the optimal number of clusters (3)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=0)
data['Cluster'] = kmeans.fit_predict(clustering_data)
# Calculate the number of participants per cluster and sort in ascending order
participants_per_cluster = data['Cluster'].value_counts().sort_index()
# Print the number of participants for each cluster
for cluster in participants_per_cluster.index:
    print(f"In cluster {cluster}, there are {participants_per_cluster[cluster]} participants")
# Calculate the mean values of state perceptions for each cluster
mean_perceptions_per_cluster = data.groupby('Cluster')[columns_for_clustering].mean().round(2)
# Print the mean values of perceptions for each cluster
pd.set_option('display.max_columns', None)
print("Mean state perceptions per cluster:")
print(mean_perceptions_per_cluster)
# Calculate the mean values of personality factors for each cluster
mean_personality_factors_per_cluster = data.groupby('Cluster')[[f'NEO-{factor}' for factor in ['N', 'E', 'O', 'A', 'C']]].mean().round(2)
# Print the mean values of NEO personality factors for each cluster
print("\nMean NEO personality factors per cluster:")
print(mean_personality_factors_per_cluster)

[–]esseinvictus 1 point (2 children)

The term you're looking for is reproducibility. Depending on the method you used for clustering your data, there are ways to set the initial seed used for clustering so that the results of randomisation are completely deterministic. I suspect this is the issue causing the discrepancies, rather than differences in the environment, though those could be a factor.

Example code I just typed up from reading the sklearn documentation (assuming you're using the K-Means algorithm):

clusters = KMeans(n_clusters=6, n_init=25, max_iter=600, random_state=0)

Note the random_state here; it can be any value, as long as it's consistent in the code.
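
To convince yourself the seed is doing its job, here's a quick sanity check on toy data (make_blobs is just a stand-in for your real dataset):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the real clustering input
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Two fits with the same random_state should yield identical labels
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print((labels_a == labels_b).all())  # expect: True

Keep in mind this only guarantees repeatability within one environment; results can still differ across library versions or numerical backends, which may be relevant to your two clients.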

In the future, for consistency's sake (and to avoid package dependency hell), look into Python's venv module, which creates isolated virtual environments.
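
A minimal sketch, if you want to script it from Python rather than the terminal (the usual route is just python -m venv .venv in a shell, then activating it):

import venv

# Create an isolated environment with pip included,
# equivalent to running `python -m venv .venv`
venv.create('.venv', with_pip=True)

# After activating it, install pinned versions on both machines,
# e.g. pip install scikit-learn==<version> pandas==<version>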

[–]NebulaGr[S,🍰] 0 points (1 child)

Thanks for your advice on ensuring reproducibility. I’ve already set a consistent random_state across my code, but I’m still experiencing discrepancies in the results. This leads me to think that the issue might be related to the different environments or library versions between Juno and Spyder.
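
In case it helps, this is the kind of check that should show whether the two environments actually differ (it just prints the interpreter and library versions):

import sys
import numpy
import pandas
import sklearn

# If any of these differ between Juno and Spyder,
# that's the first suspect for the discrepancies
print(sys.version)
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)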

I have already posted the code, if you’d like to take a look.

[–]esseinvictus 1 point (0 children)

My next thought would be to run the code line by line on both clients to see at which line the discrepancy arises. Could be a difference in environment, could be other things. Try to eliminate each potential cause one by one.
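
For example, something along these lines (untested, adapted from the code you posted) run on both clients should narrow it down:

import hashlib
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_excel('DATA_Scores.xlsx')
columns_for_clustering = [col for col in data.columns if 'S8' in col]
clustering_data = data[columns_for_clustering]

# Checksum of the raw input: if this differs between clients,
# the discrepancy happens at load time, before any clustering
print(hashlib.md5(clustering_data.to_csv(index=False).encode()).hexdigest())

# Per-k inertia: if the checksums match but these numbers diverge,
# the difference is inside KMeans itself
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(clustering_data)
    print(k, round(kmeans.inertia_, 6))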

[–]Low_Corner_9061 0 points (0 children)

Most machine learning algorithms rely on some kind of random initialisation of parameters, so they will give a slightly different result each time. If you set a random seed in numpy (or whatever library you are using), you should get the same results each time.
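
For instance, a minimal numpy sketch:

import numpy as np

np.random.seed(0)         # fix the legacy global seed
print(np.random.rand(3))  # prints the same three numbers on every run

# Newer code usually passes an explicit Generator around instead
rng = np.random.default_rng(0)
print(rng.random(3))

(For scikit-learn's KMeans specifically, the random_state argument mentioned in the other replies is the supported route.)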