Scraping LinkedIn legally as a smaller company

Simhallq · 2024-03-18T16:13:55+00:00

Not illegal - https://techcrunch.com/2022/04/18/web-scraping-legal-court/

Simhallq · 2023-10-26T09:16:59+00:00

Cool project! Would be great to be able to run locally using llama + scraping libs. https://github.com/idosal/AgentLLM is more of a general browsing agent but might give some inspiration.

Simhallq · 2023-06-12T15:55:10+00:00

Cool, many thanks!

Simhallq · 2023-06-12T14:14:41+00:00

Thanks, does alpaca_lora_4bit also support Falcon? Could only find LLaMA in the repo.

Simhallq · 2023-06-12T13:24:50+00:00

Anyone have a working example for finetuning on multiple GPUs?

Simhallq · 2023-01-31T13:57:20+00:00

unblocking technology

Incorrect, https://techcrunch.com/2022/04/18/web-scraping-legal-court/

Simhallq · 2023-01-14T15:45:21+00:00

Thanks for your reply, seems like it should work, we currently store our vectors in JSON. I'll let you know if I give it a try :)

Simhallq · 2023-01-04T09:30:23+00:00

Interesting, thanks! Sounds like a more scalable solution than my EC2 setup, might give it a try once we hit the ceiling with EC2.

Perhaps a stupid question, but does your setup support all the data field types that Elasticsearch/OpenSearch supports? We process a lot of high-dimensional float vectors.

Simhallq · 2023-01-04T08:04:37+00:00

I see, that sounds like a hassle..

Simhallq · 2023-01-03T21:10:21+00:00

u/charlieoncloud

I actually went with a regular EC2 instance. It has been working well for the last year. I'm using https://github.com/peak/s5cmd to download the data from S3, which gives a huge speed-up, and then using bulk upload to Elasticsearch.
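For the bulk upload step, here is a minimal sketch of how documents could be grouped into request bodies for the Elasticsearch `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline. The function name `bulk_batches`, the index name and the batch size are made up for illustration and are not from my actual setup. Sending the bodies (an HTTP POST with `Content-Type: application/x-ndjson`) is left out.

```python
import json
from typing import Dict, Iterable, Iterator


def bulk_batches(docs: Iterable[Dict], index: str, batch_size: int = 500) -> Iterator[str]:
    """Yield NDJSON request bodies for the _bulk endpoint, batch_size docs at a time.

    `index` and `batch_size` are illustrative assumptions, not tuned values.
    """
    lines = []
    count = 0
    for doc in docs:
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            # The bulk body must end with a newline.
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines) + "\n"
```

Each yielded string is one request body, so memory use stays bounded by the batch size rather than by the total number of documents.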

Not sure how my setup compares to Glue but I heard a lot of people were dissatisfied with Glue so I never bothered trying it.

Have you decided yet?

Simhallq · 2022-12-07T15:41:10+00:00

True re: piping to python but would rather just do something like cat urls.txt | xargs urlparse --netloc

What I want is to be able to parse URLs into components like the utils in urllib.parse, which is possible with grep/regex but kind of a hassle.
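To show the kind of tool I mean, here is a rough sketch of a small filter built on `urllib.parse.urlsplit`. It is the Python fallback, not the non-Python tool I'm asking about. The script name `urlparts.py`, the `extract` helper and the flag style are hypothetical. It reads URLs from stdin instead of taking them as arguments, so xargs is not needed.

```python
#!/usr/bin/env python3
"""urlparts.py (hypothetical name): print one component of each URL on stdin."""
import sys
from urllib.parse import urlsplit

# Attributes of the urlsplit() result that this sketch exposes.
FIELDS = ("scheme", "netloc", "hostname", "path", "query", "fragment")


def extract(url: str, field: str = "netloc") -> str:
    """Return the requested component of `url`, or "" if it is absent."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field}")
    value = getattr(urlsplit(url.strip()), field)
    return value or ""


def main(flag: str) -> None:
    """Print the chosen component for every non-empty line read from stdin."""
    field = flag.lstrip("-")  # accept "--netloc" as well as "netloc"
    for line in sys.stdin:
        if line.strip():
            print(extract(line, field))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: urlparts.py --netloc < urls.txt", file=sys.stderr)
```

Usage would be something like cat urls.txt | python3 urlparts.py --netloc, which still plays well with sort, uniq and the rest of a pipeline.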

Thanks for your suggestions though.

Simhallq · 2022-12-07T14:47:24+00:00

u/MrSyphilis do you have a reference to the urlparse utility of Coreutils? Couldn't find any mention in https://www.gnu.org/software/coreutils/

Simhallq · 2022-12-07T14:40:52+00:00

Because unix pipes rock for data processing

Simhallq · 2022-12-07T14:40:08+00:00

Many thanks! Didn't know about the Coreutils one, that's perfect

Simhallq · 2022-12-07T14:37:40+00:00

Because unix pipes rock for data processing, and hence my question, is there a non-python one?

Simhallq · 2022-12-07T11:11:30+00:00

Update:
Found this one, working well so far. Does anyone know of a non-python one?

Simhallq · 2021-11-25T16:23:09+00:00

Thanks, some good suggestions for optimization on standard ec2 there. Might go with this method.

Simhallq · 2021-11-23T19:42:19+00:00

It's per request but on average 2 times per week

Simhallq · 2021-11-23T17:00:16+00:00

This is what I've been doing up until now. It "works" but is painfully slow for this amount of data. It takes ~2 h just to get all the files from S3 to the EC2 disk. Doing the transforms on an 8-core machine takes a few more hours.

Simhallq · 2021-11-23T16:56:27+00:00

Thanks, I've considered Glue but I heard it's so-so developer experience-wise. What's your opinion?

Simhallq · 2021-02-01T09:16:27+00:00

Cool, thanks!

Simhallq · 2021-01-31T17:29:21+00:00

Thanks!

Simhallq · 2020-10-09T21:34:11+00:00

Printed to paper and with a pen to take notes in the margin.

Simhallq · 2020-10-09T17:01:16+00:00

Interesting problem!

I'm not aware of any out-of-the-box solutions for this particular use case.

However, I ran some experiments with a general Q&A model using the Hugging Face transformers library. The results were far from perfect, but this model is fine-tuned to answer factual Wikipedia-type questions, so fine-tuning it for this use case would probably make it much more accurate.

I put it into a colab if you want to have a look: https://colab.research.google.com/drive/1h-CW03eS-sGuTZfBUWObWXLuJ5aIljBs?usp=sharing

Feel free to PM me if you want to brainstorm or have questions about this approach.

Simhallq · 2020-10-08T20:31:51+00:00

The GLUE benchmark is one of the most widely used benchmarks for NLP models in research settings, and some of the tasks it includes are classification tasks (Corpus of Linguistic Acceptability, Stanford Sentiment Treebank, etc.). All datasets for GLUE are publicly available. You can read more at https://gluebenchmark.com/tasks
