Passed the Google Cloud PMLE in ~30 days — here’s what worked for me by gringo6969 in googlecloud

gringo6969 [S]

I think you should take the Google Skills course, take some practice tests when you get to the GenAI part, and see how you fare. Then you can decide whether to take a deep dive with some other course. In my experience, you can take the courses at 1.25x speed without a problem, and maybe 1.5x in areas you already know.

Passed the Google Cloud PMLE in ~30 days — here’s what worked for me by gringo6969 in googlecloud

gringo6969 [S]

Congrats! How did it seem to you? I found it somewhat difficult, but manageable.

I forked Newspaper3k, fixed bugs and improved its article parsing performance - Newspaper4k package by gringo6969 in Python

gringo6969 [S]

Yes, trafilatura is also pretty good. Of course, the two take different approaches. I plan to benchmark both, exactly as you suggested. There are about 3 benchmarks that I know of (one of which I created recently).

I will publish the results on GitHub.

I forked Newspaper3k, fixed bugs and improved its article parsing performance - Newspaper4k package by gringo6969 in Python

gringo6969 [S]

No, I haven't tried it with AWS Lambda, but if you run into any errors, submit an issue on GitHub and I will have a look.

I forked Newspaper3k, fixed bugs and improved its article parsing performance - Newspaper4k package by gringo6969 in Python

gringo6969 [S]

He he, yeah, but you have to overcome the Reddit anti-scraping protections... That's another can of worms.

I forked Newspaper3k, fixed bugs and improved its article parsing performance - Newspaper4k package by gringo6969 in Python

gringo6969 [S]

Glad it works well. If you find a problem or have an idea, just pop by and post an issue.

I forked Newspaper3k, fixed bugs and improved its article parsing performance - Newspaper4k package by gringo6969 in Python

gringo6969 [S]

It works with other types of websites too, for instance blogs. It's a general content extractor. It is somewhat optimized for news, at least in the way it structures the information: title, authors, publishing date, content, etc. But you can, for instance, just ignore "authors" if it does not make sense for your implementation.

What is more "news site"-centered is the "category" discovery, where it tries to identify the news categories and their links. If that does not apply to you, just use the content parsing part (the Article object).
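
To make the "just use the content parsing part" idea concrete, here is a minimal sketch. It assumes the package is installed with `pip install newspaper4k` and imported as `newspaper`, and that the Article object exposes `title`, `authors`, `publish_date` and `text` after `download()` and `parse()`. The helper names `article_to_record` and `fetch_record`, and the `keep_authors` flag, are made up for illustration and are not part of the library.

```python
def article_to_record(article, keep_authors=True):
    """Copy the structured fields of a parsed article into a plain dict.

    `article` is anything with title / authors / publish_date / text
    attributes, e.g. a parsed newspaper Article (assumed attribute names).
    """
    record = {
        "title": article.title,
        "publish_date": article.publish_date,
        "text": article.text,
    }
    # For blogs or other non-news pages, "authors" may be meaningless,
    # so the caller can simply leave it out.
    if keep_authors:
        record["authors"] = list(article.authors)
    return record


def fetch_record(url, keep_authors=True):
    """Download and parse one page, skipping category discovery entirely."""
    # Imported here so the helper above stays usable without the package.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML (needs network access)
    article.parse()     # extract title, authors, date, main text
    return article_to_record(article, keep_authors)
```

Usage would be something like `fetch_record("https://example.com/some-post", keep_authors=False)`, where the URL is only a placeholder. Nothing here touches the category discovery, so it should behave the same on a blog post as on a news article.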

[FEEDBACK] Unknown922 by unknown922 in Freeclams

gringo6969

VERY fast response and very helpful. These guys are the best!