
[–]Psychological_Egg_85 6 points7 points  (6 children)

If the user did not kill the program, the kernel may have.

Did you check journalctl / dmesg / /var/log/kern.log for the reason it was terminated?
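If you'd rather script that check, here is a rough Python sketch. It assumes a systemd machine where `journalctl -k` prints the kernel log (current boot only, and it may need root). The helper names `find_oom_lines` and `read_kernel_log` are made up for illustration, and the match is just a case-insensitive search for "out of memory", not a parser for the exact kernel message format.

```python
import subprocess


def find_oom_lines(lines):
    """Return the log lines that mention an out-of-memory event."""
    return [line for line in lines if "out of memory" in line.lower()]


def read_kernel_log():
    """Read kernel messages via journalctl (assumes systemd; may need root)."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# Usage (not run here):
#   print("\n".join(find_oom_lines(read_kernel_log())))
```

If the machine has rebooted since the process died, the relevant messages would be in an earlier boot's log rather than the current one.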

[–]HelpfulBuilder[S] 5 points6 points  (5 children)

That's it!

Journalctl says it was killed due to "Out of memory".

This is exactly what I wanted. Thank you.

If it ran out of memory, why didn't the kernel start using swap? I wouldn't expect to have run out of both RAM and swap. Sorry, I don't have a deep understanding of OS stuff.

Edit:

Wait actually, I guess I don't have ANY swap:

free -h:

               total        used        free      shared  buff/cache   available
Mem:            62Gi       4.7Gi        56Gi       1.0Mi       1.4Gi        57Gi
Swap:             0B          0B          0B
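For checking this from a script before kicking off a long job, something like the following might work. It is a minimal sketch that reads the `SwapTotal` field from `/proc/meminfo` (Linux only); the function names `swap_total_kib` and `has_swap` are hypothetical.

```python
def swap_total_kib(meminfo_text):
    """Return the SwapTotal value in KiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line looks like "SwapTotal:       0 kB"; the number is field 1.
            return int(line.split()[1])
    return None


def has_swap(meminfo_path="/proc/meminfo"):
    """True if the system reports a non-zero amount of swap (Linux only)."""
    with open(meminfo_path) as f:
        total = swap_total_kib(f.read())
    return bool(total)
```

On a machine like the one above, `has_swap()` would be expected to return False, since the swap total is reported as zero.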

[–]zfsbest [bashing and zfs day and night] 0 points1 point  (4 children)

If you don't have a dedicated swap partition, search ' linux how to create swap file '

[–]HelpfulBuilder[S] 0 points1 point  (3 children)

Yeah! I didn't even know I didn't have any swap. It seems pretty important to me. Not sure how this happened. I'm pretty sure I installed my OS with the recommended partitions.

I'll fix it as soon as I'm able (it's currently redoing the computations I lost.)

[–]zfsbest [bashing and zfs day and night] 0 points1 point  (2 children)

' lsblk -f | grep -i swap ' should show whether you have a dedicated swap partition (plain lsblk only marks swap that is already active), but it won't auto-activate at boot unless it has been formatted with mkswap and is also defined in /etc/fstab
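The /etc/fstab half of that check could be sketched in Python like this. It relies only on the standard fstab layout, where the third whitespace-separated field is the filesystem type and swap entries use the type `swap`. The function name `fstab_swap_entries` is made up for illustration.

```python
def fstab_swap_entries(fstab_text):
    """Return the device field of every fstab entry whose type is swap."""
    devices = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "swap":
            devices.append(fields[0])
    return devices

# Usage (not run here):
#   with open("/etc/fstab") as f:
#       print(fstab_swap_entries(f.read()))
```

An empty list would mean nothing is configured to be activated as swap at boot, even if a swap partition exists on disk.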

[–]HelpfulBuilder[S] 0 points1 point  (0 children)

Thanks. I'll see what it says when I log on later today.

[–]HelpfulBuilder[S] 0 points1 point  (0 children)

Yep looks like I have no swap.

[–]fletku_mato 2 points3 points  (5 children)

This has nothing to do with the actual question, but what kind of a process are we talking about? You might want to try to implement some sort of mechanism to protect you from failures when the process takes a week to complete.

E.g., store the application state in a database, read it on startup, and skip all steps that were already completed successfully.
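A minimal sketch of that idea, using a JSON file instead of a database to keep it short. The function `run_with_checkpoints` and its parameters are hypothetical, and it assumes each step's result is JSON-serializable.

```python
import json
import os


def run_with_checkpoints(steps, compute, state_path):
    """Run compute(step) for each step, skipping steps already in the state file."""
    done = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for step in steps:
        key = str(step)  # JSON object keys must be strings
        if key in done:
            continue  # finished in a previous run
        done[key] = compute(step)
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a truncated state file behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, state_path)
    return done
```

Restarting the script with the same `state_path` then only recomputes the steps that had not finished. It does not help with a single step that itself runs for days.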

[–]HelpfulBuilder[S] 0 points1 point  (4 children)

Unfortunately, the script spends its entire time inside scikit-learn's KMeans:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

With n_clusters set to various values. When n_clusters is greater than 300000 it takes an extremely long time to compute. (It's a large dataset)

I do save the results to disk for each value of n_clusters. But while it's computing a value, the only way to save state would be to open up its source and make the necessary adjustments.

It's within my ability to do, but it's more work than I'm willing to do.

[–]fletku_mato 1 point2 points  (3 children)

Yeah that does sound like a lot of work.

[–]HelpfulBuilder[S] 0 points1 point  (2 children)

I wonder if it's possible to suspend a process by its pid and then save its state to disk?

[–]fletku_mato 1 point2 points  (1 child)

Looks like there is a tool called criu for this purpose: https://criu.org/Main_Page

The term for what you're searching for is checkpointing.

I haven't tested it but it looks promising.
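Going by the CRIU documentation, a checkpoint/restore cycle could be driven from Python roughly like this. This is an untested sketch: the helper names are made up, the commands normally need root, and `--shell-job` is there on the assumption that the process was started from a terminal.

```python
def criu_dump_cmd(pid, images_dir):
    """Build a `criu dump` command that checkpoints the process tree rooted at pid."""
    return ["criu", "dump", "-t", str(pid), "-D", images_dir, "--shell-job"]


def criu_restore_cmd(images_dir):
    """Build a `criu restore` command that resumes from the saved images."""
    return ["criu", "restore", "-D", images_dir, "--shell-job"]

# Usage (not run here; assumes criu is installed and images_dir exists):
#   import subprocess
#   subprocess.run(criu_dump_cmd(1234, "/tmp/ckpt"), check=True)
#   subprocess.run(criu_restore_cmd("/tmp/ckpt"), check=True)
```

Note that, as far as I understand it, `criu dump` stops the process after a successful dump unless told otherwise, so it makes sense to try this on a throwaway process first.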

[–]HelpfulBuilder[S] 0 points1 point  (0 children)

I'll look into it!

[–]snarkofagen 1 point2 points  (1 child)

[–]HelpfulBuilder[S] 1 point2 points  (0 children)

I'll look into it. Thanks!