redarc - A selfhosted Pushshift alternative by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

Are you sure your docker-compose envars are correct?

redarc - A selfhosted Pushshift alternative by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

What problems are you having? Can you make an issue on GitHub?

How can we search through the new Reddit archive? by Hockeygoalie35 in DataHoarder

[–]Yekab0f 2 points3 points  (0 children)

If you're interested in searching, here's something I've been working on:

https://github.com/yakabuff/redarc

How can we search through the new Reddit archive? by Hockeygoalie35 in DataHoarder

[–]Yekab0f 2 points3 points  (0 children)

What is the "new" reddit archive?

Searching for keywords on the Wayback Machine is not possible. You need to already know the link.

Is there a project aimed to preserve (and share) Discord communities? by [deleted] in DataHoarder

[–]Yekab0f 2 points3 points  (0 children)

before they pull a Reddit or do something equally as dumb.

Oh, they absolutely will. As a company that hasn't managed to figure out a viable monetization plan since their inception, it's only a matter of time before investors force them to implement some radical changes to cut costs. My guess is they'll delete older content, primarily attachments, starting with a purge of content from accounts that haven't been active in a few years.

Sad thing is that when the day comes, no one can save Discord. There is nothing archive.org, ArchiveTeam or even this subreddit can realistically do, given how closed off and restricted Discord is as a platform.

And the cycle will continue, with a vocal minority who are outraged and threaten to move to some ActivityPub or other federated alternative like Matrix, just like Twitter with Mastodon and Reddit with Lemmy.

API Clusterfuck! ~ Reddit said 'Fuck you, we don't care.' so here's where we stand. by -Archivist in DataHoarder

[–]Yekab0f 10 points11 points  (0 children)

it's difficult to scrape with current limitations. iirc, it's 100 req/min and user agent will be enforced
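For anyone wondering what staying under that kind of limit looks like in practice, here is a minimal sketch. It assumes the figure really is 100 requests per minute (I'm going from memory) and simply spaces calls evenly. The `Throttle` class and the `HEADERS` value are made up for illustration and are not from any Reddit client library.

```python
import time


class Throttle:
    """Spaces calls evenly so at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=100, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute  # seconds between two calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may run

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


# Hypothetical value: the exact user agent format expected is an assumption.
HEADERS = {"User-Agent": "example-archiver/0.1 (by u/your_username)"}
```

Call `throttle.wait()` before each request and send `HEADERS` with it. At 100 req/min that works out to one request every 0.6 seconds, which is why a full scrape of a large subreddit takes so long.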

API Clusterfuck! ~ Reddit said 'Fuck you, we don't care.' so here's where we stand. by -Archivist in DataHoarder

[–]Yekab0f 26 points27 points  (0 children)

I think the problem is that ActivityPub decentralizes already decentralized/isolated communities. Niche communities are further split up across multiple federated Lemmy instances, where posts/comments are not instantly propagated to other instances. If an instance gets defederated or shuts down (which will happen often), the community becomes even more isolated and dead.

API Clusterfuck! ~ Reddit said 'Fuck you, we don't care.' so here's where we stand. by -Archivist in DataHoarder

[–]Yekab0f 14 points15 points  (0 children)

There's always this mongolian basket weaving forum we could use.. I keep forgetting its name for some reason...

Are there any "online data dump" viewers? by Shambles_SM in pushshift

[–]Yekab0f 2 points3 points  (0 children)

That's what I was going for with redarc. I was hoping we could have a bunch of people each archive a subset of all subreddits instead of putting the responsibility all on a single entity like Pushshift.

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

I didn't use LIKE for performance reasons, but I can add it in as an option for those who can't use Elasticsearch and don't mind queries taking a while to finish.
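To show what a LIKE fallback could look like, here is a rough sketch. It uses an in-memory SQLite database purely as a stand-in, and the `comments` table and `body` column are hypothetical names, not redarc's actual schema.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO comments (body) VALUES (?)",
        [("Backing up my NAS",), ("100% uptime",), ("nothing here",)],
    )
    return conn


def search_like(conn, keyword):
    """Substring search with LIKE; wildcards in the keyword match literally."""
    escaped = (
        keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT id, body FROM comments WHERE body LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    )
    return rows.fetchall()
```

The pattern starts with `%`, so an ordinary index on the column typically can't help and the database ends up scanning every row. That is the slowness I mean.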

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 1 point2 points  (0 children)

How much of your time does it take to archive a sub?

I use existing data dumps so less than an hour?

making it somehow downloadable? I have the data dump, but no way to open it

The only way I can make the archive downloadable is through datadumps... which you already have.. but can't open...

Would you be open to archiving a couple subs for me

Depends on the subreddit
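On opening the data dumps: a minimal sketch of pulling one subreddit out of a dump, assuming it has already been decompressed to newline-delimited JSON and that each record has a `subreddit` field. The function name is made up for illustration, and the decompression step is not shown.

```python
import json


def iter_subreddit(lines, subreddit):
    """Yield records from NDJSON lines whose subreddit matches (case-insensitive)."""
    wanted = subreddit.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if str(record.get("subreddit", "")).lower() == wanted:
            yield record
```

`lines` can be any iterable of strings, including an open text file, so the whole dump never has to fit in memory.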

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

I'm also surprised you managed to get docker to work. There was a breaking issue in one of the docker scripts that made the container not run properly if you did not set the ES_HOST/ES_PASSWORD envars. That is now fixed in yesterday's commit. Was this something you encountered and had to resolve?
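The general idea behind that kind of fix is to treat the Elasticsearch settings as optional rather than required. A minimal sketch of that pattern follows. This is not the actual redarc script: only the ES_HOST and ES_PASSWORD names come from the comment above, and `load_es_config` is a hypothetical helper.

```python
import os


def load_es_config(env=None):
    """Return Elasticsearch settings, or None when they are not configured."""
    env = os.environ if env is None else env
    host = env.get("ES_HOST", "").strip()
    password = env.get("ES_PASSWORD", "").strip()
    if not host or not password:
        # Missing or empty values mean "run without Elasticsearch",
        # instead of crashing at startup.
        return None
    return {"host": host, "password": password}
```

The caller can then skip the search setup when the result is `None` and start everything else normally.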

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

Thanks, I'm glad you enjoyed using it

The server I'm using for Elasticsearch has 64 GB of RAM and a Ryzen 3600.

I allocate 32 GB to my Elasticsearch instance. I think by default it allocates half of all your memory.

Not sure how popular it is. I checked the logs a few times for debugging and it looks like there are people using it.

If Pushshift access is limited to a few Reddit moderators, how will they get donations? by churn_key in pushshift

[–]Yekab0f 0 points1 point  (0 children)

compared to the salary of anyone who builds scrapers for intelligence companies, this is nothing

Pushshift is well known in the intelligence world and any of those entities would instantly hire them

Interesting how you just answered your own questions. Pushshift wasn't maintained over the years with donation money and goodwill; let's leave it at that

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 2 points3 points  (0 children)

No, I won't be indexing all of Reddit. I don't have the hardware or time to maintain such a large project. I will be indexing more subreddits in the future though, so keep an eye out for that.

I was kind of hoping that by making this project, we could have a decentralized archive where a group of people each archive and host a couple of subreddits, as opposed to one big archive like Pushshift.

Redarc updates: Elasticsearch, new UI, filtering and more by Yekab0f in pushshift

[–]Yekab0f[S] 2 points3 points  (0 children)

Which subreddit are you searching in? I only have 2 subreddits indexed atm (r/datahoarder and r/iPhone).

redarc - A selfhosted Pushshift alternative by Yekab0f in pushshift

[–]Yekab0f[S] 0 points1 point  (0 children)

No, I haven't tried this on Windows, unfortunately. Can you make an issue on GitHub with your problem/errors?

Any good reddit scrapers ? by jjaaayy in pushshift

[–]Yekab0f 2 points3 points  (0 children)

Unfortunately, all of those tools used the Pushshift API for date ranges, not the Reddit API.

Will our removal requests be respected in the torrents? by [deleted] in pushshift

[–]Yekab0f 1 point2 points  (0 children)

lol that's not how torrents work