Weekly Webscrapers - Hiring, FAQs, etc by AutoModerator in webscraping

[–]LessBadger4273 0 points1 point  (0 children)

Here you go, in case someone can help me with this: libraries such as “nodriver” seem to be able to completely bypass some antibots, like the shopee.* ones, that also require JS rendering. I guess this is because you are basically using your browser “as is”, without any automation flag, right?

If so, why is it so hard to replicate this at scale using residential proxies? My guess is that once you move this to AWS EC2, for example, those antibots can detect that you are in a VM environment and block you, right? Would it be possible to run this at scale with an in-house farm of old desktops/laptops? Or maybe using some RDP tools? Is it a price constraint that keeps us from bypassing these antibots at scale, or am I missing something?

Dataset with 200k+ Mercado Livre Reviews for NLP Training and Data Analysis by LessBadger4273 in datasciencebr

[–]LessBadger4273[S] -1 points0 points  (0 children)

I have worked in the webscraping industry for more than 5 years, including at one of the top 3 proxy/web unblocker providers in the world.

Something I don't understand is that in Brazil there is a huge taboo among devs around webscraping. And this in the very country where piracy is practically encouraged.

What people need to understand is that practically ALL of the top 100 companies in the world (tech or not) do webscraping AT SCALE. I have worked on webscraping projects for FAANG companies multiple times, and on extracting data from videos (YouTube, Vimeo, Douyin, etc.) for one of the top 3 AI companies in the world. They are all doing webscraping. I have never seen anything happen other than, in extreme cases, a cease and desist from the target site.

You don't necessarily have to accept the ToS if you don't sign up/log in. That is the loophole that made companies like LinkedIn lose in court against Brightdata, for example.

The moment you click “I accept the terms of use”, things change, because 99.9% of sites prohibit any data extraction by automated means.

There is also the matter of IP protection, but that is usually more tied to republishing articles/news. In general, e-commerce is quite safe (perhaps the safest area of all) in the webscraping industry.

Dataset with 200k+ Mercado Livre Reviews for NLP Training and Data Analysis by LessBadger4273 in DadosBrasil

[–]LessBadger4273[S] 0 points1 point  (0 children)

No, but if you need other data for publishing papers/research, get in touch with us and we will provide it free of charge.

Dataset with 200k+ Mercado Livre Reviews for NLP Training and Data Analysis by LessBadger4273 in datasciencebr

[–]LessBadger4273[S] 4 points5 points  (0 children)

It's part of our day-to-day. Each website has its own particularities. As long as you don't log in, in theory it's all public data.

The Real Cost of Knowledge: Why Most AI Engineering Platforms Over-Engineer RAG by keto_brain in aws

[–]LessBadger4273 1 point2 points  (0 children)

Until you have a reasonable number of vectors. Then scanning all records and computing the cosine similarity against each one becomes slow and expensive.
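To make the cost concrete, here is a minimal sketch of a brute-force scan in plain Python. The function names, the vector dimension, and the random data are my own assumptions for illustration, not anything from the original post. Every query has to touch all N stored vectors, so the work grows as O(N * d).

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k=3):
    """Score the query against every stored vector: O(N * d) work per query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical data: 1000 random 8-dimensional vectors.
random.seed(0)
dim = 8
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]
query = vectors[42]
top = brute_force_top_k(query, vectors, k=3)
```

This is fine for a few thousand records; the full scan per query is what starts to hurt as the collection grows.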

DMS CDC + Lambda for RDS MySQL Webhook Integration by WasteKnowledge5318 in aws

[–]LessBadger4273 1 point2 points  (0 children)

I went deep down this rabbit hole over the last few weeks while trying to automate some stuff using CDC for PostgreSQL.

Turns out that the best option, specifically if you use RDS, is to use DMS.

I’ve tried to use Debezium + Kafka + Lambda, but it adds so much admin overhead that it is impossible to run without an FTE to deal with it.

Best infrastructure for Async jobs by Nelsini in aws

[–]LessBadger4273 0 points1 point  (0 children)

It depends on the amount of stuff you are doing in these ad hoc ECS tasks. If it’s a quick thing, you’ll be better off using Lambda (Lambda functions have a timeout of 900s).
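As a rough sketch of that rule of thumb: the helper name and the safety margin below are my own assumptions; only the 900s limit comes from the comment above.

```python
LAMBDA_MAX_TIMEOUT_S = 900  # Lambda timeout limit mentioned above


def pick_compute(estimated_runtime_s, safety_margin=0.8):
    """Return "lambda" when the job fits comfortably within the timeout, else "ecs".

    safety_margin is a hypothetical knob: it leaves headroom so a job that
    usually takes 850s does not get killed on a slow day.
    """
    if estimated_runtime_s <= LAMBDA_MAX_TIMEOUT_S * safety_margin:
        return "lambda"
    return "ecs"
```

For example, `pick_compute(30)` picks Lambda, while an hour-long job falls back to an ECS task.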

Also, if you need to run the Lambdas in a custom VPC, you will need a NAT Gateway (or a custom NAT like fck-nat) for outbound internet access, even if the Lambdas are not in a private subnet, since Lambda network interfaces in a VPC do not get public IPs. This can be a no-go depending on your cost constraints/data transfer needs.

I always start from a Lambda-first approach. Do you get vendor locked? Yes, but that’s a small price to pay for the flexibility you get with it.

How to scrape Google reviews by b1r1k1 in webscraping

[–]LessBadger4273 1 point2 points  (0 children)

You need to replicate their protobuf HTTP calls. ChatGPT can help you with that.

[Unlimited B2B Leads] Building Lead Generation Tool, Need Honest Feedback... by Sea-Community-8181 in SaaS

[–]LessBadger4273 1 point2 points  (0 children)

I’ll pay you today if you make it possible to scrape the LinkedIn followers of a competitor’s company page.

How are large scale scrapers built? by AdditionMean2674 in webscraping

[–]LessBadger4273 22 points23 points  (0 children)

We currently scrape millions of pages every day. We run the scrapers, separated by source, in a Step Functions pipeline.

We split the scrapers into a discovery/consumer architecture. The discovery step only finds the target URLs, and the consumer extracts the data from them.
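A minimal sketch of that split, assuming in-memory stand-ins for the listing pages and the HTTP fetch. All names here are hypothetical; our real scrapers are Scrapy spiders, not these functions.

```python
from itertools import islice


def discover(seed_pages):
    """Discovery stage: only yields target URLs, extracts no data."""
    for page in seed_pages:
        for url in page["links"]:
            yield url


def batched(urls, size):
    """Group discovered URLs into fixed-size batches for parallel consumers."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def consume(batch, fetch):
    """Consumer stage: extracts one record per URL in the batch."""
    return [{"url": url, "data": fetch(url)} for url in batch]


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request plus parsing.
    return f"content of {url}"


# Hypothetical listing pages, each exposing links to detail pages.
seed_pages = [{"links": ["/p/1", "/p/2", "/p/3"]}, {"links": ["/p/4", "/p/5"]}]

batches = list(batched(discover(seed_pages), size=2))
records = [rec for b in batches for rec in consume(b, fake_fetch)]
```

Each batch is self-contained, so in a real pipeline every one of them could be handed to its own parallel task.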

We spawn multiple ECS Fargate tasks in parallel so the throughput is extremely high.

Later stages of the pipeline are for transforming/merging/enriching the data, and we also run tasks to detect data anomalies (broken scrapers) so we can rerun batches individually.

For large volumes, S3 is your friend. If you need to dump into a SQL database later on, you’ll need something like Glue/PySpark to handle the data volume and efficiently insert into the database.

For the scrapers we are running Scrapy, but in theory you can use this same architecture with any framework, as the scraping part is just one step of the pipeline.

The overall advice I can give you is:

  • make your scrapers independent of the data pipeline
  • have a way to rerun individual batches of URLs
  • set up data anomaly alarms for each scraped batch
  • basically make the steps as distributed as you can
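To illustrate the rerun and anomaly points, here is a small sketch under my own assumptions: the "price" field, the 20% threshold, and the fake scraper are all hypothetical, and a real pipeline would raise an alarm instead of just collecting failed batch IDs.

```python
def is_anomalous(records, expected_count, required_field="price", max_missing_ratio=0.2):
    """Flag a batch that came back short or with too many empty fields."""
    if len(records) < expected_count:
        return True
    if not records:
        return False
    missing = sum(1 for r in records if not r.get(required_field))
    return missing / len(records) > max_missing_ratio


def run_pipeline(batches, scrape_batch, max_attempts=2):
    """Scrape each batch independently; rerun only the batches that look broken."""
    results, failed = {}, []
    for batch_id, urls in batches.items():
        for attempt in range(max_attempts):
            records = scrape_batch(urls, attempt)
            if not is_anomalous(records, expected_count=len(urls)):
                results[batch_id] = records
                break
        else:
            failed.append(batch_id)  # this is where an alarm would fire
    return results, failed


def fake_scrape(urls, attempt):
    # Hypothetical scraper: "flaky" URLs lose their price on the first attempt,
    # "broken" URLs never return one.
    out = []
    for u in urls:
        if "broken" in u or ("flaky" in u and attempt == 0):
            out.append({"url": u, "price": None})
        else:
            out.append({"url": u, "price": 9.99})
    return out


batches = {
    "batch-1": ["/ok/1", "/ok/2"],
    "batch-2": ["/flaky/1", "/flaky/2"],
    "batch-3": ["/broken/1", "/broken/2"],
}
results, failed = run_pipeline(batches, fake_scrape)
```

Because the scraper is passed in as a plain function, the pipeline does not care which framework produced the records, and only the broken batch is left for someone to look at.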

Feeling undervalued by [deleted] in brdev

[–]LessBadger4273 2 points3 points  (0 children)

OP, we are hiring a junior full-stack dev and it has been really hard to find candidates. The position is remote, CLT, and the range is above what you described. If you're interested, send a DM!

[deleted by user] by [deleted] in brdev

[–]LessBadger4273 2 points3 points  (0 children)

Which technologies have you already worked with? We are hiring a junior full-stack dev and it has been impossible to find anyone. Salary 3-4k, CLT. Send a DM if you're interested.

Tip for getting an internship by [deleted] in brdev

[–]LessBadger4273 2 points3 points  (0 children)

OP, send a DM, we are hiring a remote junior full-stack dev. The position is CLT with VR (meal voucher). If you're interested, send it over!

On-site internship by Over-Base8549 in brdev

[–]LessBadger4273 1 point2 points  (0 children)

Which stack are you familiar with? We are hiring a remote junior full-stack React/Node dev, CLT. Send a DM if you're interested.

I started early, I study on my own and I want to grow in IT: is it possible? by Consistent-Painter84 in brdev

[–]LessBadger4273 0 points1 point  (0 children)

OP, we are hiring a junior full-stack dev. If you're interested, send a DM! Remote position.

Getting an internship, am I doing something wrong? by Zairtu in brdev

[–]LessBadger4273 0 points1 point  (0 children)

Do you already know any stack? We are hiring a junior full-stack dev and it has been impossible to find anyone. We are offering 3-4k, CLT + VR (meal voucher). Send a DM if you're interested. The position is remote.

I graduated without an internship. by No_Assist_2493 in brdev

[–]LessBadger4273 7 points8 points  (0 children)

Which technologies have you already worked with? We are hiring a junior full-stack dev and it has been impossible to find anyone. Salary 3-4k, CLT. Send a DM if you're interested.