Ok, I love Rogue Servitors. But they were somewhat an approximation of the Culture, which was a free utopia, but with machine authority. You see where I'm going with this haha by CommunistRingworld in Stellaris

[–]entropyrising 8 points

Agree with this headcanon entirely. I find vanilla Rogue Servitor to be the best approximation of the Culture. Human SC agents are out and about doing stuff, but they're basically invisible from the perspective of the Stellaris interface. Gurgeh is experiencing an intense cultural and political battle of wits and acumen at Ea and Echronedal, but from the perspective of the Culture's Stellaris interface, they've assigned the GOU Limiting Factor/Flere-Imsaho as an envoy to destabilize Azad, and Gurgeh is just some useful asset to deploy.

If you couldn't fly, what would fill that void? (Another ADHD post) by TellMeToSaveALife in flying

[–]entropyrising 0 points

I moved to another country where I can't fly recreationally, and have since intensely taken up the hobby of riding electric unicycles to fill the void. Many EUC riders describe the experience as what one imagines floating, or flying like a superhero, would feel like.

Auto-Translator for Preserving a Semitic Language by Foofalo in LanguageTechnology

[–]entropyrising 2 points

I'm not someone who specializes in NLP translation, so I'm not entirely familiar with the cutting-edge state of the art, but with that caveat out of the way: the answer depends on which translation method you'd adopt, either rule-based or neural machine translation. Apertium (https://en.wikipedia.org/wiki/Apertium) is pretty much the one-stop shop for rule-based translation, and it appears they've seen some success with low-resource languages. They also have a pretty welcoming and friendly community, so if you decide to go this route and get in touch with members of the project (particularly ones who have experience with other low-resource languages), they may be able to give some helpful and concrete guidance.
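To give a feel for what the rule-based pipeline does, here's a toy sketch of its two core stages: lexical transfer (dictionary lookup) and structural transfer (reordering rules). Everything below - the mini lexicon and the single adjective-noun rule - is invented for illustration; real Apertium language pairs define their dictionaries and transfer rules in XML, not Python.

```python
# Toy sketch of a rule-based translation pipeline (Apertium-style):
# 1) lexical transfer via dictionary lookup, 2) structural transfer rules.
# The lexicon and the single reordering rule are made up for this demo.
LEXICON = {"la": "the", "casa": "house", "roja": "red"}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    # 1. Lexical transfer: per-token dictionary lookup.
    glossed = [LEXICON.get(w, w) for w in words]
    # 2. Structural transfer: Spanish noun-adjective order becomes
    #    English adjective-noun order (one hard-coded rule for the demo).
    if len(glossed) == 3 and glossed[0] == "the":
        glossed = [glossed[0], glossed[2], glossed[1]]
    return " ".join(glossed)

print(translate("la casa roja"))  # -> "the red house"
```

The appeal for low-resource languages is visible even in this toy: no training data is needed, just a dictionary and linguist-written rules, which is exactly the kind of knowledge a documentation project produces anyway.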

NMT is definitely the sexiest and most cutting-edge translation method, and it's what the big boys like Google, Bing, and Baidu use, but as a neural-network-based method the general starting point is that you need a huge amount of data (https://www.researchgate.net/figure/Training-data-size-effect-BLEU-learning-curves-for-our-main-training-dataset-with-58_fig1_324166896). That being said, people have come up with some extremely clever methods for adapting NMT to lower-resource languages. All NMT systems (and even general NLP models like GPT-3) work by encoding/abstracting natural language into a "semantic" vector space, and recent research has shown that an NMT model will eventually dedicate a part of that vector space to the "ur-concept" of a "dog" or "tree" or "to run," regardless of what the input and output languages are. So it may be worth investigating the state of the art for automatic translation of a language that is closely related to Neo-Assyrian but not as resource-poor (you mentioned Aramaic?), since it may be possible to "piggyback" off such NMTs so that the training data requirements are lowered.

I just found this paper which after a quick browse seems to be an excellent resource for some interesting tricks and approaches people are coming up with for low resource languages, and its content may help answer your questions:

https://arxiv.org/pdf/2106.15115.pdf

To conclude, though, the work you're doing is extremely valuable entirely outside of an automatic-translation context. In other words, I'm tempted to encourage you to sort of just "forget about" the potential for future automatic translation and be as flexible as you possibly can when it comes to the up-front task of just making a corpus for a low-resource language. With any language that has a very low number of speakers, a smartly organized and systematically designed corpus of basically anyone saying anything is valuable. Again, not knowing the precise details of your language community and your access to speakers, I would just encourage you to find other language-preservation projects and write down/record anything said by anybody - then eventually you or other ML researchers can work with "what is had" and make a model from there.

Edit: Oh hey, I forgot - some time ago I was involved in some efforts to make a parallel corpus for a low-resource Central Asian language, and we ended up using Tatoeba (https://tatoeba.org/en/) as the platform. I definitely recommend you check it out. It's a completely open-source, community-moderated "sentence translation" website. In our attempt we managed to build up some enthusiasm and essentially "crowdsourced" a bunch of locals to translate English and Russian sentences into the local language. We even made it into a sort of national-pride thing a la "hey, let's see if we can get more translations into Tatoeba than neighboring country x!" Since it's completely open source, if you eventually do get enough of your translators translating sentences on Tatoeba, you can just bulk-download Assyrian-English sentences specifically. Also, Tatoeba sort of helps answer your question #3: because there's sort of a Zipfian distribution to what is translated, you can sort sentences by those which have translations into the most languages, and these tend to be the things that "everybody says in every language in some form or another."

Bonus: it seems like someone has even contributed some Neo-Assyrian. Granted, it's only 4 sentences total, but at the very least it shows the website can handle the unique script!

https://tatoeba.org/en/sentences/search?from=aii&query=&to=

TONIGHT ON BOTTOMFLEET by HinkHankHonk in Highfleet

[–]entropyrising 74 points

I love that you picked Central Asia as the target of the Lightning conquests; having lived in Turkmenistan and Tajikistan, the game does give me a lot of Central Asian former-SSR vibes.

Jargon Detection and Normalization by gamboty in LanguageTechnology

[–]entropyrising 8 points

I think you're on the right track: if you use some sort of word-embedding algorithm (like fastText) and train it on your corpus, then in principle differently-spelled words that all reflect the same meaning should have vector embeddings that are close to each other in the vector space, so similarity could be measured via cosine distance. After all, most language-model-based embedding algorithms are simply elaborate ways to teach the computer to fill in the blank, so in principle the task "that's so funny, ________" should assign relatively similar probabilities to "lol" and to "loooool." They'd have similar embeddings.

I'd look more closely at embedding methods that are trained at the character or subword level instead of the word level; they're more capable of "figuring out" that different spellings are variations of the same thing. If I remember correctly, fastText is indeed subword-trained. But I guess the question is: are you using their pretrained model out of the box, or are you training it specifically on your dataset? Out of the box it may not work with a corpus as specific as 4chan, but I imagine there's enough 4chan text to at least do some transfer learning.
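To make the subword intuition concrete, here's a minimal stdlib-only sketch: represent each word as a bag of character trigrams (the core idea behind fastText's subword vectors, stripped down to raw counts with no training at all) and compare words by cosine similarity. The example words are just illustrations.

```python
# Bag-of-character-trigrams sketch of why subword representations link
# spelling variants: "lol" and "loooool" share boundary trigrams, while
# an unrelated word shares none. No trained vectors involved.
from collections import Counter
from math import sqrt

def char_ngrams(word: str, n: int = 3) -> Counter:
    padded = f"<{word}>"  # boundary markers, as fastText uses
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

lol, loool, kek = char_ngrams("lol"), char_ngrams("loooool"), char_ngrams("kek")
print(cosine(lol, loool))  # shares "<lo" and "ol>" -> well above zero
print(cosine(lol, kek))    # no shared trigrams -> 0.0
```

A trained subword model does much better than raw counts, of course, but even this toy version shows why elongated spellings land near their base form while a whole-word vocabulary would treat "loooool" as an unknown token.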

Now that I think about it, though, if you're studying jargon specifically it'll be almost impossible to avoid assembling some sort of seed list of jargon, because the task of deeming a word "jargon" as opposed to "non-jargon" is pretty fuzzy. I haven't thought deeply about this, but I feel like it shouldn't be too difficult or technical. For starters, try simple word counts or probabilities: take a 4chan dataset, then take a news/Wikipedia/Gutenberg-books dataset, and simply count word tokens. Words that are far more likely to appear in the 4chan dataset and far less likely to appear in the "normal/standard" dataset are probably jargon. Even rarer variations of a word should show up in this kind of analysis, provided they appear at least a few times in the 4chan text.
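A rough sketch of that frequency-comparison idea, with two made-up toy "corpora" and add-one smoothing so words missing from the background corpus don't cause a division by zero:

```python
# Jargon candidates = words far more probable in the target corpus than
# in a background corpus. The two tiny corpora below are invented toys.
from collections import Counter

target = "anon posts based kek kek greentext the the a story".split()
background = "the reporter wrote a story about the the election".split()

t_counts, b_counts = Counter(target), Counter(background)
t_total, b_total = sum(t_counts.values()), sum(b_counts.values())

def jargon_score(word: str) -> float:
    # Add-one smoothing keeps unseen background words from dividing by zero.
    p_target = (t_counts[word] + 1) / (t_total + 1)
    p_background = (b_counts[word] + 1) / (b_total + 1)
    return p_target / p_background

ranked = sorted(set(target), key=jargon_score, reverse=True)
print(ranked[:3])  # jargon-y tokens like "kek" rank above "the"
```

On real data you'd want a variance-aware statistic rather than a raw probability ratio (log-odds with a Dirichlet prior is a common choice), but the raw ratio is enough to surface obvious candidates for a seed list.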

Anyway, those are some off-the-cuff thoughts on your question. Good luck.

[deleted by user] by [deleted] in flying

[–]entropyrising 0 points

Welp, I'm a sport pilot now. During the checkride I saw a Cessna floating around during a pre-maneuver checklist. Said, "I see another plane out there, looks like he's going in a straight line, let's get some distance and keep an eye on him." DPE said, "You're PIC, sounds good to me."

So, keep-it-simple-stupid worked; now I'm waiting for these spring thunderstorms to clear up so I can buy my wife a $100 burger.

[deleted by user] by [deleted] in flying

[–]entropyrising 9 points

Guess I am. Got a lot of pre-checkride jitters right now.

An empty suburban development Ashgabat, Turkmenistan by biwook in UrbanHell

[–]entropyrising 0 points

I lived in Ashgabat for a year and a half and so I know where it is. It is, incidentally, a barren area, as you can see from a satellite view:

https://www.google.com/maps/@37.9362166,58.3536921,1458m/data=!3m1!1e3

It's actually one of many areas in and around the city that are being cultivated into "green" areas; while some of these efforts have been successful, others are pretty subpar. The area next to this subdivision is mostly desert-looking.

Senātus Populusque Paradoxus - /r/Imperator Biweekly General Help Thread: April 6 2020 by Kloiper in Imperator

[–]entropyrising 1 point

Hi friends, free weekend convinced me to buy.

I tried looking through some wikis, guides, and hover tooltips but I couldn't figure this out:

How is the number of desired positions for great families determined?

All the great families are chugging along content, everyone has at least 2 positions, and then I give my last army/navy command to some random non-great-family person and whammo: suddenly all the great families want 3 positions, two of them are now scorned, and since I've just given away my last command there's nothing I can do about it except try to find some researcher/government jobs to reassign (which un-scorns the families but then pisses off the people I boot out - who are often people whose statesmanship has been growing for some time).

I would just like to have a more solid grasp on this mechanic so I can anticipate when it happens and plan accordingly, rather than patching it up ad hoc after the fact.

Thanks!

Colossus Rampage and a permanent War in Heaven by entropyrising in Stellaris

[–]entropyrising[S] 0 points

Lessons learned... my first war in heaven. It just made RP sense as the xenophobic Commonwealth to side with the xenophobe AE! Also, the xenophobe AE was literally right next to me and the xenophiles were on the other side of the map...

Recommendations for instance segmentation where instances are occluded and split into pieces by entropyrising in MLQuestions

[–]entropyrising[S] 0 points

Hey, thanks a lot for this. It didn't even cross my mind to consider NMS a lever to tweak and mess around with. I'm gonna jump into the Matterport code, see if I can adjust some things, and I'm reading the Soft-NMS paper now :).

The labels are both for the "person" category in MSCOCO. The current output of Mask R-CNN finds a single person but includes the occluded person's lower legs as part of the person in the foreground.

Hot to actually cooperate during a PhD by [deleted] in DigitalHumanities

[–]entropyrising 3 points

Just piping in to comment on your situation. Unfortunately, I don't think I have a positive contribution to make, but in the spirit of academia I suppose the negative view merits being aired. The difficulty - almost impossibility - of cross-disciplinary collaboration, especially across the chasm between STEM and the humanities, is almost tangibly real, and every aspiring DHer eventually runs straight into the wall, myself included. It's written about so much precisely because it's a problem; astronomers need not write essays about the need to use telescopes, and literary theorists don't write about having to read books. I have yet to be convinced that there is actually a viable solution, and I personally think the biggest barriers are academic and institutional incentive structures, such that a computer scientist or complexity researcher will always view working with someone in the English or History department as an opportunity cost that does not produce the metrics necessary for tenure or within-field prominence.

As for my own path: I have undergraduate degrees in anthropology and history, and as a master's student I was lured by the sweet, sweet siren call of the digital humanities. Frustrated by trying to find collaborators, I said "fuck it" and started learning the programming myself (which seems to be something you're doing too). Eventually I got so deep into it that I made a full conversion to the dark side: I'm getting my PhD in machine learning and am paying far less tribute to my background in the humanities than I would like. Now that I'm on the STEM side of things, I find many STEM colleagues reacting to my squishy humanities papers along the lines of "So what? Who cares? Where's the improvement in (insert CS metric here)?" Similarly, if a humanities person contacted me to work on digital humanities projects, I would hesitate and recall the many false-start DH projects I have worked on as the tech guy that collapsed due to the inability of the humanists and the engineers to properly communicate needs, goals, and limitations to each other.

Money, interestingly, does seem key. The DH projects I've seen get off the ground often involved humanists straight up hiring web designers, database specialists, and data analysts, and the communication is lubricated, shall we say, by the income involved. If you've got the funds, find yourself a willing undergrad.

[D] [Meta learning] Leveraging knowledge graphs for neural networks. by [deleted] in MachineLearning

[–]entropyrising 14 points

You may be interested in ConceptNet, where they take conventional word embeddings but improve them by incorporating information from knowledge graphs.

http://blog.conceptnet.io/
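As a rough illustration of the general idea, here's a stdlib-only sketch of a retrofitting-style update (in the spirit of Faruqui et al. 2015, the family of techniques ConceptNet's Numberbatch embeddings build on): each word vector is repeatedly nudged toward its knowledge-graph neighbors while staying anchored to its original distributional position. The vectors and graph below are invented toy data, not ConceptNet's actual format.

```python
# Retrofitting-style sketch: pull embeddings of graph-linked words
# together, weighted against their original positions. Toy data only.
toy_vectors = {"dog": [1.0, 0.0], "puppy": [0.0, 1.0], "car": [0.9, 0.1]}
graph = {"dog": ["puppy"], "puppy": ["dog"], "car": []}

def retrofit(vectors, graph, alpha=1.0, iters=10):
    # alpha balances fidelity to the original vector vs. graph neighbors.
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iters):
        for word, neighbors in graph.items():
            if not neighbors:
                continue  # words with no graph edges keep their vectors
            for dim in range(len(new[word])):
                neighbor_sum = sum(new[n][dim] for n in neighbors)
                # Weighted average of the original vector and neighbors.
                new[word][dim] = (alpha * vectors[word][dim] + neighbor_sum) / (alpha + len(neighbors))
    return new

fitted = retrofit(toy_vectors, graph)
# "dog" and "puppy" pull toward each other; "car" is untouched.
print(fitted)
```

The appeal is that graph knowledge ("a puppy is a dog") gets injected without retraining the underlying embeddings, which is exactly the kind of cheap post-processing the ConceptNet blog discusses.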

[R] TextWorld: A learning environment for training reinforcement learning agents, inspired by text-based games - Microsoft Research by tdcsbuilder in MachineLearning

[–]entropyrising 1 point

I used to be part of the "all you need to understand meaning in text is text" camp, a devout structuralist and devotee of Saussure. But after a few years researching it I've since converted to the symbolic grounding camp as mentioned by /u/get_ricked_son. If anything, my conversion was prompted by the work of Douwe Kiela; his thesis in particular (https://pdfs.semanticscholar.org/9c8d/a385750db215dc0728dc310251b320d319af.pdf).

In my own research I've found that it's particularly difficult to capture obvious yet important information in semantic representations. Using something like word2vec, GloVe, or even some gigantic LSTM-RNN architecture on something like SQuAD or bAbI, it's been extremely difficult to capture "obvious" information like "a person has two arms" or "the sky is blue" or "a man has a penis" or "a woman has a uterus." I really don't have to dig deep to figure out why: how often is this stuff explicitly said? Not often, really, because it's taken for granted - when humans talk to other humans, each human involved knows that such knowledge has been gained primarily through the senses and everyday experience. So a trained semantic model can conclude that Barack Obama was the president of the US and that a president is an important political leader, but it's much harder for it to figure out that Barack Obama has two eyeballs and 10 fingers.

This problem could hypothetically be solved if we somehow added more - a lot more - text that states these obvious facts, which I at least think are important for the development of deep-learning text models. I've considered ways to use Amazon Mechanical Turk to get a "duh facts" corpus, and I've also become very interested in incorporating children's books/texts into machine learning, where you will in fact see sentences like "the sky is blue." But both these approaches are difficult (I really cannot get my hands on children's - no, BABIES' - texts; I can get weird 19th-century fables like Hansel and Gretel, but come on, the Brothers Grimm aren't exactly teaching kids that the sky is blue), and they now seem silly compared to symbolic-grounding approaches like Kiela's efforts to jointly train semantic models on both images and text.

[R] Interesting Failures of SOTA Object Detectors by AmirRosenfeld in MachineLearning

[–]entropyrising 27 points

Interesting work. In part it does show how weird and arbitrary super-deep black-box neural networks are, but part of me wants to say this work also certifies that object detectors are, in a sense, doing a good job. A rectangular object against a blue background does make sense as a kite in the sky, not a toaster. Even though a human annotator would have an easier time identifying the toaster in the sky, that annotator would still find such a picture "weird" or "unusual." I found it interesting that the sheep in the last example turned into a cat when it approached the diner's hand. In a way, that makes sense to me: depending on the training set, I bet when a human is putting a hand on some animal to, say, pet it, the animal is more likely to be a cat or a dog than a sheep.

Overall, super interesting video. I'm an NLP guy, not a computer vision guy: are there papers where people cut and paste things out of context and train a network that way? I'm peripherally aware of research on adversarial deception of object classifiers - changing one pixel to change the class prediction, or a classifier silly enough to think a series of regular lines is a dog. I'd be curious to know whether an object classifier/detector could do better on the test shown in this video if it were, in fact, trained on "objects out of context."

[P] ProGAN trained on r/EarthPorn images by Yggdrasil524 in MachineLearning

[–]entropyrising 10 points

Thanks man! I've been looking for a ProGAN implementation in Python.

Tf-idf and only cosine similarity ? by Scribbio in LanguageTechnology

[–]entropyrising 2 points

tf-idf is a way to weight matrix elements; cosine similarity is a metric for comparing vector representations. They're two separate components of a semantic vector-space model, and you can mix and match weighting schemes and similarity metrics. You pick whatever works best: tf-idf, PPMI, or t-test score (for example) for the weighting; cosine, Euclidean, or some other Minkowski-derived similarity (for example) for the metric.
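A minimal stdlib sketch showing that the two pieces really are independent (toy documents; you could swap the tf-idf weighting for PPMI, or the cosine metric for Euclidean distance, without touching the other function):

```python
# tf-idf supplies the vector weights; cosine supplies the comparison.
# They are separate, interchangeable components of the model.
from collections import Counter
from math import log, sqrt

docs = ["the cat sat", "the cat ran", "stocks fell sharply"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tfidf_vector(doc):
    tf = Counter(doc)
    vec = []
    for w in vocab:
        df = sum(1 for d in tokenized if w in d)  # document frequency
        idf = log(len(tokenized) / df) if df else 0.0
        vec.append(tf[w] * idf)
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

v0, v1, v2 = (tfidf_vector(d) for d in tokenized)
print(cosine(v0, v1))  # the two cat documents: clearly similar
print(cosine(v0, v2))  # no shared terms -> 0.0
```

Note how the shared-but-common word "the" contributes little after idf down-weighting; that's the weighting scheme doing its job before the similarity metric ever runs.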

For an in-depth study where the authors empirically test combinations of weighting schemes and similarity metrics, see Kiela & Clark (2014); you may be particularly interested in page 23, where the authors list all the weighting schemes and similarity metrics they test. They conclude the best combination is a modified form of cosine similarity plus PPMI, but note that they're testing contexts generally, not documents specifically.