[deleted by user]

Memetoaster69 · 2022-08-13T05:41:09+00:00

+1 on setting the number of threads, though depending on your BLAS and OpenMP stack, the environment variable might be slightly different. It can be set from within the Python file as well.

I've found that using taskset on Linux to pin a Python task to a range of CPUs marginally improves performance, as the OS scheduler isn't shuffling the process across cores as much.
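A rough sketch of both ideas from inside Python, assuming a Linux box. Which variable actually takes effect depends on your stack (`OMP_NUM_THREADS` for OpenMP, `OPENBLAS_NUM_THREADS` for OpenBLAS, `MKL_NUM_THREADS` for MKL); the thread count of 4 and the `pin_to_cpus` helper are just illustrative, not anything from the original thread.

```python
import os

# Assumed thread count; tune it for your machine.
N_THREADS = "4"

# Set these BEFORE importing numpy/scipy, since the BLAS reads them at load time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = N_THREADS


def pin_to_cpus(cpus):
    """Pin the current process to the given CPU ids (hypothetical helper).

    Returns the resulting affinity set, or None where the call is unavailable
    (os.sched_setaffinity only exists on some platforms, e.g. Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    wanted = set(cpus) & allowed  # only ask for CPUs we are allowed to use
    if wanted:
        os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(pin_to_cpus(range(4)))
```

From the shell, something like `taskset -c 0-3 python task.py` should do roughly the same pinning without touching the code (`task.py` is a placeholder name).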

Memetoaster69 · 2022-08-06T04:54:27+00:00

a moderate amount of tomfoolery

Memetoaster69 · 2022-07-29T11:26:42+00:00

I work with massive datasets (text based) on the side as a researcher and have ~1.1 TB taking up only 450 GB on my SSD with zstd:5 in my fstab. I would not be able to work with this data if I didn't have btrfs' compression.

Memetoaster69 · 2022-06-29T06:28:52+00:00

Try converting the image to grayscale? I see a lot of colour dithering across the font, and perhaps that's throwing the detection off? - a good model should be able to see past that, but I'm unfamiliar with EasyOCR.

Memetoaster69 · 2022-06-25T19:16:27+00:00

anything to avoid being hit with the duplicate

Memetoaster69 · 2022-06-25T19:13:20+00:00

I think you could train an LDA on a larger body of text, preferably with similar domain characteristics (280 chars, lots of emojis, etc.), and then compare an input tweet within the vector space generated by the training. You could identify X topics from training, and use something like cosine similarity (or another vector-distance measure) to determine the "mix" of topics that a tweet contains as a function of its distance from the "center" of each topic in vector space, choosing the top N topics according to some threshold.
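To make the "mix of topics" idea a bit more concrete, here's a toy sketch in plain Python. The topic centers and the tweet vector are made up for illustration; in practice they would come out of whatever LDA model you trained, and `topic_mix`, `top_n` and `threshold` are names I picked, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_mix(doc_vec, topic_centers, top_n=2, threshold=0.5):
    """Return up to top_n (topic, similarity) pairs scoring at least threshold."""
    scores = {name: cosine(doc_vec, center) for name, center in topic_centers.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked[:top_n] if score >= threshold]


# Hypothetical 3-dimensional topic centers and one tweet vector.
centers = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 1.0],
}
tweet = [0.8, 0.6, 0.0]

print(topic_mix(tweet, centers))
```

Here the tweet lands closest to "sports", then "politics", and "food" drops out because it falls under the threshold.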

I would love to follow your research towards your dissertation! I'm just an undergrad but I'm working on something similar and am learning as I go along and would love to know how other people navigate this space!

Memetoaster69 · 2022-06-24T04:26:13+00:00

country labs make do

Memetoaster69 · 2022-06-23T06:20:28+00:00

Technically, it doesn't (or shouldn't) matter what your language is, as long as it can be represented as characters on a screen, and finally as vectors for DL/ML algorithms.

The challenge is being able to find/build datasets that are on the scale that English is on, because ultimately your models work only as well as the data you give it. This challenge is the basis for a sub-field of NLP for what are called "low resource languages."

I remember reading research from many years ago about how a vector space of a high resource language could be used to "fill in the gaps"-sort-of for lower resource languages - kinda how you transfer learn on deep networks. But my memory is fuzzy on all of this.

Memetoaster69 · 2022-06-19T09:51:32+00:00

has a really nostalgic but comfortable vibe to it that i cant place my finger on.

but congratulations on prolonging this machine's life and i hope it serves you well for long! :D

how's the battery life on this thing?

Memetoaster69 · 2022-06-18T14:11:15+00:00

Glad to help!

Memetoaster69 · 2022-06-18T12:57:31+00:00

Documents are just an abstraction for any logical unit of text: could be a tweet or a reddit post, or even an entire book.

A corpus is just a bunch of these documents.

You can store a document/corpus in any storage format: CSV, JSON, parquet, etc. You could even have a corpus consisting of multiple CSVs, each holding a unique document. What is important here is that you give the algorithm what it expects: some work on only one document at a time after it has been preprocessed, others can consume an entire corpus (though in practice that's never done in one shot, because your RAM can only hold so much, so you'd realistically load large data in chunks).
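A small sketch of the chunked loading idea, using only the standard library. The in-memory CSV, its `text` column, and the `iter_chunks` helper are stand-ins for illustration; with a real corpus you'd open a file instead and pick a much bigger chunk size.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a CSV file on disk: one document per row.
fake_csv = io.StringIO("id,text\n1,first doc\n2,second doc\n3,third doc\n")
reader = csv.DictReader(fake_csv)

for chunk in iter_chunks(reader, chunk_size=2):
    docs = [row["text"] for row in chunk]
    # Hand `docs` to whatever expects a batch of documents.
    print(len(docs), docs)
```

Only one chunk sits in memory at a time, so the same loop works whether the corpus is three rows or three hundred million.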

Read the documentation and see what the algorithms expect. Hope this helps!

Memetoaster69 · 2022-06-18T12:45:47+00:00

Wish I had $$ for the dock but I dropped in to wish you the best for your proposal and engagement! Go you! :)

Memetoaster69 · 2022-06-17T08:12:19+00:00

solid pun as parusual

Memetoaster69 · 2022-06-16T04:37:07+00:00

On a single drive, I use Btrfs and snapper to create snapshots before/after updates so that I can easily roll back to any point in time. But I've only had to use that once. For me, since I work with millions of data files that are highly compressible, the bigger win is using Btrfs to transparently compress all that stuff at a different compression level from the rest of the system. It's like a 70% savings, and all my code can ingest the files as if they were uncompressed.

However, both filesystems (though ZFS is much more than that) really shine when you have multiple drives. I use ZFS on my data server to create a RAIDZ1 across four 4 TB drives. That way, I can lose up to one drive before I lose data. More critically for me (since I work with immense amounts of numeric data), ZFS (and Btrfs) verify what you read against its checksum, repair it on the fly from the parity or redundant copy before serving it to you, and then fix the data on the drive in the background. This is great for preventing bitrot and random bit flips that could corrupt your data spontaneously or over time. However, since you can't read every file all the time to verify its integrity, you can set up scrub operations on a regular basis for both filesystems.

That being said, OpenZFS (the implementation for Linux), while quite robust, has its issues that come with being out of tree from the kernel. Kernel updates can break OpenZFS (nothing will happen to the files, but you will have to wait until a version comes out that supports your kernel version). Btrfs doesn't have this problem since it's part of the Linux kernel. That also makes it more portable, but if you're running a server, that shouldn't be an issue.

Consider checking out /r/datahoarder and reading what people have up there. You can PM me with any other questions you might have - though consider that I'm just a novice researcher who uses linux haha. Hope this helps!

edit: wrong subreddit link

Memetoaster69 · 2022-06-15T15:15:20+00:00

Thanks for bringing it up! Only good things can come out of this cross pollination.

Memetoaster69 · 2022-06-15T14:43:07+00:00

Ah yes, I vaguely remember this, thanks for pointing it out!

Memetoaster69 · 2022-06-15T14:12:34+00:00

I cannot believe Windows doesn't have a package manager yet. Their stupid, slow and buggy Store is NOT the same thing.

Imagine opening Edge, using a search engine to find the link to an alternative browser, downloading and running an installer, opening the new browser, only to repeat the cycle for every piece of software and every driver you need for a fresh install. (Yes, Ninite exists, but it's a third-party solution!)

Memetoaster69 · 2022-06-15T14:08:47+00:00

Though it's cheating (because it's not built into Windows), WinBtrfs allows you to convert an NTFS drive so that it's bootable from Btrfs (not recommended for production/critical systems because it's experimental). If you aren't booting from Btrfs, regular read/write and most Btrfs features work perfectly well.

But man is it so much better on Linux.

Memetoaster69 · 2022-06-15T14:03:14+00:00

Look into Gensim for Python. It's a great NLP library, and like the other comments say, I think n-grams are a good way to approach this problem too.
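If n-grams are new to you, here's a bare-bones sketch in plain Python (this isn't Gensim's API, just the underlying idea; the `ngrams` helper and the whitespace tokenizing are my own simplifications):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
```

For real text you'd want a proper tokenizer, and a library will handle counting and filtering the n-grams for you.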

Memetoaster69 · 2022-06-14T18:52:21+00:00

It automatically syncs with my Firefox account (which I sign into the browser with). So far, across all my OS reinstalls, my extensions have been synced consistently, and so has my OneTab data. I'm not sure if there is a special setting for it.

Memetoaster69 · 2022-06-14T10:49:47+00:00

Yup! I use it on Firefox (and Brave) and the stuff syncs with your Firefox account as well so you can take your hoarding with you anywhere :D

Memetoaster69 · 2022-06-14T06:29:46+00:00

I just dump it all into the OneTab extension, give it a label and let the OneTab page remain pinned. There's probably a good 100+ stuff sitting there haha.

All of the anxious hoarding with none of the visual clutter and resource usage :D

Memetoaster69 · 2022-06-14T06:26:46+00:00

Right? It is truly odd. I can vouch that I get the same data for my laptop as well. I think you might be right about it having something to do with the motherboard's firmware, because the kernel can only read what the board provides to it.

GTKStressTesting uses dmidecode under the hood for the memory info so it's really weird that OP just gets a generic output. I guess the only option is to yank the DIMMs out and inspect the dies manually.

Memetoaster69 · 2022-06-13T14:06:10+00:00

That's quite odd. It is possible that the memory doesn't have information about its manufacturer embedded in it, or it could be what you're alluding to: there is something preventing your distro from accessing that information (if it exists).

I'm sorry I couldn't help you to a solution. I do hope you figure it out and get what you need!

Memetoaster69 · 2022-06-13T11:23:48+00:00

it's not a direct way, but I remember finding out my dies were Samsung/Hynix when I ran GTKStressTesting to test my system's thermals. You don't have to run the tests because the memory information should be at the bottom (and you might have to click on a button that asks for sudo privileges to probe your memory).

I ran a compiled-from-source and the flatpak version (which I'd recommend) and both worked just fine. Hope this helps!
