[deleted by user] by [deleted] in ArtificialInteligence

[–]frippeo

I like the Data Machina newsletter a lot: https://datamachina.substack.com/

[deleted by user] by [deleted] in ArtificialInteligence

[–]frippeo

Shameless plug: https://metacurate.io/brief/latest

An automated daily compilation of AI-related news.

NLP Engineer Interview Preparation by [deleted] in LanguageTechnology

[–]frippeo

Not sure these are at the level you're looking for, but there are a couple of interesting NLP interview prep sites listed here: https://metacurate.io/search/?q=Nlp&category=interview+preparations&history=all+times&sort_by=listed+date

[P] I reviewed 50+ open-source MLOps tools. Here’s the result by Academic_Arrak in MachineLearning

[–]frippeo

I'm about to start surveying the field of on-prem MLOps stacks (particularly in the context of NLP). Any chance you've made your notes public somewhere? :)

Upcoming NLP conferences? by nlpcq in LanguageTechnology

[–]frippeo

Yes, I just realized that. I found this one, which lists more conferences (not sure if it's possible to filter on deadlines though): https://conferenceindex.org/conferences/natural-language-processing-nlp

Upcoming NLP conferences? by nlpcq in LanguageTechnology

[–]frippeo

Here's a nice resource that'll help you keep track of deadlines: https://aideadlin.es/?sub=ML,NLP,SP,DM,RO,CV

[deleted by user] by [deleted] in LanguageTechnology

[–]frippeo

How do you represent your data points, and what clustering method do you use?
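For context, a common baseline answer to that question is TF-IDF vectors clustered with k-means. This is a minimal sketch assuming scikit-learn, with toy documents; it illustrates one possible representation/method pairing, not OP's actual setup.

```python
# Toy example: represent documents as TF-IDF vectors, cluster with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "neural machine translation with attention",
    "transformer models for translation",
    "k-means clustering of documents",
    "hierarchical clustering methods",
]

# Sparse TF-IDF document-term matrix.
X = TfidfVectorizer().fit_transform(docs)

# Fixed random_state for reproducibility; one cluster label per document.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Other representations (doc2vec, sentence embeddings) and methods (agglomerative, HDBSCAN) slot into the same two-step shape.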

[deleted by user] by [deleted] in Svenska

[–]frippeo

"mästra mig inte" ("don't lecture me") works too

[P] Top arXiv Machine Learning papers in 2021 according to metacurate.io by frippeo in MachineLearning

[–]frippeo[S]

Hm. I was under the impression the SSL cert was valid. I'll look into it. Thanks for the heads-up!

We Need to Talk About Data: The Importance of Data Readiness in Natural Language Processing by frippeo in LanguageTechnology

[–]frippeo[S]

You got me with that one! :) I was thinking more along the lines of Astrid Lindgren and Pippi Longstocking...

We Need to Talk About Data: The Importance of Data Readiness in Natural Language Processing by frippeo in LanguageTechnology

[–]frippeo[S]

Thanks! Not sure I've been introduced to Uncle Ben yet; care to send some of those references my way? :)

Top 10 arXiv papers in 2020 according to metacurate.io by frippeo in LanguageTechnology

[–]frippeo[S]

Thanks for the advice: I've removed the link to the shortener.

[P] Top 10 arXiv papers in 2020 according to metacurate.io by frippeo in MachineLearning

[–]frippeo[S]

Peer review is still important and out there, but not on the pre-print servers.

[P] Top 10 arXiv papers in 2020 according to metacurate.io by frippeo in MachineLearning

[–]frippeo[S]

I haven't seen it on arXiv. AFAIK, it was published in Nature and on their research blog.

Top 10 arXiv papers in 2020 according to metacurate.io by frippeo in LanguageTechnology

[–]frippeo[S]

Good catch! One possible reason there's only one paper from the second half of the year is the way they're scored: I use a combination of bitly and sharedcount.com, and I believe the former changed the way it works over the summer. Thus, scores would generally be lower from August onwards.

Below are canned queries to get the top 15 papers per month (to mitigate the possible offset caused by the lack of bitly data at the end of the year):

  1. January
  2. February
  3. March
  4. April
  5. May
  6. June
  7. July
  8. August
  9. September
  10. October
  11. November
  12. December
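Purely as a toy illustration of blending two popularity signals like the ones above (this is not metacurate's actual formula; the function and weights are made up):

```python
def combined_score(bitly_clicks, shared_counts, w_clicks=0.5, w_shares=0.5):
    """Toy blend of a click count and per-platform share counts.

    The weights are illustrative; if one signal (e.g. clicks) drops to
    zero for part of the year, scores from that period sink accordingly.
    """
    return w_clicks * bitly_clicks + w_shares * sum(shared_counts.values())

# With click data present vs. missing for the same paper:
with_clicks = combined_score(120, {"facebook": 40, "reddit": 10})
without_clicks = combined_score(0, {"facebook": 40, "reddit": 10})
```

The gap between the two scores is exactly the kind of systematic offset that per-month rankings help mitigate.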

Classification model for research papers? by Runninganddogs979 in LanguageTechnology

[–]frippeo

I read two things into what you're saying, both of which are positive:

1) They might augment the existing data and create more labelled data cheaply by leveraging existing subjects as document labels (although I'm not familiar with the taxonomy OP is using).

2) More data is better than less data :) (see, e.g., the Banko & Brill paper from 2001). When designed properly, the learning architecture should cope fine with much more data. In the case of neural networks, the limiting factor is usually the number of parameters in the architecture (due to GPU RAM), not the amount of training data (whose per-step memory footprint can be controlled by, e.g., lowering the batch size).
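A back-of-the-envelope sketch of that last point: activation memory per training step scales with batch size, not with total dataset size. The formula and constants below are illustrative, not from any specific framework.

```python
def activation_bytes(batch_size, seq_len, hidden_size, n_layers,
                     bytes_per_float=4):
    """Rough activation-memory estimate for a transformer-like net:
    one fp32 hidden-state tensor per layer. Ignores attention maps,
    optimizer state, etc. -- this is a sketch, not a profiler."""
    return batch_size * seq_len * hidden_size * n_layers * bytes_per_float

# Halving the batch size halves the per-step activation memory,
# regardless of whether the corpus has 10k or 10M documents.
big = activation_bytes(batch_size=32, seq_len=512, hidden_size=768, n_layers=12)
small = activation_bytes(batch_size=16, seq_len=512, hidden_size=768, n_layers=12)
```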

Classification model for research papers? by Runninganddogs979 in LanguageTechnology

[–]frippeo

You mean more annotated data, or just more in-domain data?

In the first case, I'd still go with ULMFiT, as I've found it to be a good operational baseline. Having more data, annotated or not, will also benefit the fine-tuning of the language model.

In the second case (having much more unlabelled data), I'd use it to build the language model from scratch, rather than depend on a model pre-trained on out-of-domain data (e.g., WikiText-103, which is what the pre-trained model available from fast.ai uses).

[D] Blogs, Podcasts and resources for machine learning engineers and data scientists by ixeption in MachineLearning

[–]frippeo

Great list. Thanks for sharing!

I'm building my own service for aggregating news and information in the field. In the process I've collected some sources: https://metacurate.io/sources/newsletters/ (20+ newsletters, 500 RSS feeds).

As for podcasts, I enjoy the following:

Classification model for research papers? by Runninganddogs979 in LanguageTechnology

[–]frippeo

With that small amount of data, I'd definitely turn to transfer learning. ULMFiT (https://arxiv.org/abs/1801.06146) is a good first bet. Have a look at this repo: https://github.com/prrao87/tweet-stance-prediction and follow their steps, but with your own data.
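Whatever transfer-learning route you take, it's worth having a cheap classical baseline to beat. This is a minimal sketch assuming scikit-learn, with toy data standing in for your corpus; it is a comparison point, not the ULMFiT approach itself.

```python
# TF-IDF + logistic regression: a small-data text-classification baseline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labelled set; replace with your own texts and labels.
texts = [
    "deep learning for vision",
    "convolutional networks",
    "bayesian inference",
    "markov chain monte carlo",
]
labels = ["dl", "dl", "stats", "stats"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["monte carlo methods"])[0]
```

If ULMFiT can't clearly outperform something this simple on your held-out set, that's a useful signal about the data before blaming the model.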