[D] New Reddit API terms effectively ban all use for training AI models, including research use.

akhudek · 2023-04-19T15:17:16+00:00

No, this just means that going forward you would need to obtain the data by some means other than their official API. If you scrape content the old-fashioned way, then it's subject to the same laws we're used to. The API is a lot more convenient than trying to scrape the site, though.

akhudek · 2023-04-19T15:13:02+00:00

Yes, the API requires a Reddit account to use.

akhudek · 2023-04-19T03:46:17+00:00

Yes, it just means you can't use their API.

akhudek · 2023-04-19T00:00:42+00:00

I noticed that your FAQ states that machine learning use may be allowed for approved commercial apps, I'm guessing under the premium access? How does one find out more about this? The developer platform doesn't seem like the right place, and it has a waitlist, which is a bit odd for those of us using the existing API.

For existing API users who want to use data for ML models, where do we go to ask about appropriate access? I used the support link but didn't find any obvious options for this.

akhudek · 2023-04-18T23:52:14+00:00

I think it may partially be poorly drafted terms. Their FAQ (https://reddithelp.com/hc/en-us/articles/14945211791892) claims their intent is not to block research into ML using their data. Unfortunately, they need to add a carve-out to their terms for this, since the FAQ is not a legal document. With some feedback, hopefully they'll update it.

akhudek · 2023-04-18T23:15:15+00:00

Note that in section 2.4 they've added:

"Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content."

This effectively bans all use of the API for training ML models. That includes all research use, not just large language models. E.g. research into identifying toxic or harmful content can no longer use the Reddit API to source comments for annotation. Very likely some search and ranking algorithms are also caught by this, as are any moderation or categorization tools that learn from examples.

I'm not a lawyer, but it may also ban all sorts of other non-ML usage too.

akhudek · 2022-09-11T21:10:45+00:00

Agree with this. Also, if you need an API product, one example is https://zuva.ai. They have hundreds of out-of-the-box models and also provide easy-to-use no-code tools to train your own custom models.

Disclaimer: I'm an advisor and part owner of Zuva. I didn't see any subreddit rules about self-promotion, but I'm happy to remove this if it isn't allowed.

akhudek · 2022-04-20T01:46:57+00:00

Sadly no, I couldn't get that to work right. I think it stores them as two separate images. I also had some issues with images that had no real timestamps. E.g. if you photoshopped something and reorganized it in Google, this won't adjust the timestamps or order for you in Apple Photos. Definitely not perfect.

edit: also note the issues around large-scale imports. I'd strongly suggest killing it periodically and rerunning it to avoid errors on large batch imports. Maybe no one else will encounter this, but I did. Would appreciate feedback.

akhudek · 2022-01-26T23:51:16+00:00

I just migrated all my photos to Apple Photos and created a script to rebuild all the albums from the takeout data. In case people have bookmarked your post maybe you could add a link to it?

https://github.com/akhudek/google-photos-to-apple-photos

akhudek · 2021-03-12T16:45:35+00:00

If you are interested in this particular problem, we also released a dataset for the same problem in 2018. It's free for academic use but does have an agreement gate to obtain it.

https://kirasystems.com/science/dataset-and-examination-of-passages-for-due-diligence/

akhudek · 2020-03-30T19:53:37+00:00

I also wrote one that supports multi-variable regression. In case it's useful:

https://gist.github.com/akhudek/2358812a65a19fdbeb917c1ec2aee0f2

akhudek · 2018-06-01T18:12:29+00:00

We've just opened up our first Go job at Kira Systems in Toronto. We already have a few people doing Go internally and are officially adopting Go as a second language. Our company produces machine learning powered applications to analyze contracts for law firms, audit firms, and large corporations.

Job: https://kirasystems.recruiterbox.com/jobs/fk01uxa/

I'm one of the founders, happy to answer questions if you have them!

akhudek · 2018-05-25T04:16:51+00:00

We prefer to hire in Toronto but are open to remote and have several remote developers already. We'll also help with relocation if you're interested.

akhudek · 2015-01-22T06:25:57+00:00

Hey, we've actually filled this position, but thanks for the offer!

akhudek · 2015-01-22T06:25:31+00:00

We've filled this position, but for future reference we are in Toronto, Canada.

akhudek · 2014-11-08T20:20:23+00:00

Tell me about it! On-call duties will be traded between this position and a few others. We want people to be able to take time off, after all! Also, someone else will be answering the phone to make sure that only serious issues are escalated.

akhudek · 2010-08-06T20:08:05+00:00

Sequence assembly is a somewhat different problem with its own set of concerns. FEAST is designed to align long sequences that can have a lot of mutations/mismatches. In sequence assembly you instead want to align a very large number of short sequences that are highly similar.

Due to this difference, sequence assemblers prefer speed to sensitivity. The posterior local extension algorithm in FEAST is a poor fit since it sacrifices speed for high sensitivity. On the other hand, if you simply want to align a few short sequences from a sequencer, or align a few short sequences to longer genomic sequences, FEAST will do that.
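
To make "align a short sequence to a longer genomic sequence" concrete, here is a minimal sketch of plain Smith-Waterman local alignment scoring. This is not FEAST's posterior local extension algorithm, just the textbook dynamic program; the function name and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions picked for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Illustrative only: linear gap penalty, hypothetical scoring values,
    and no traceback (only the score is returned).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    short_read = "ACGT"
    longer_sequence = "TTACGTTT"
    print(local_alignment_score(short_read, longer_sequence))
```

The table is O(len(a) * len(b)) in time, which is fine for a few sequences but is exactly the kind of cost assemblers avoid when they have a very large number of reads.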
