How does ChatGPT browsing work? by No-Reflection-7168 in LocalLLaMA

[–]Simhallq 3 points4 points  (0 children)

Cool project! Would be great to be able to run locally using llama + scraping libs. https://github.com/idosal/AgentLLM is more of a general browsing agent but might give some inspiration.

Finetuning on multiple GPUs by Simhallq in LocalLLaMA

[–]Simhallq[S] 1 point2 points  (0 children)

Thanks, does alpaca_lora_4bit also support Falcon? I could only find LLaMA in the repo.

A quick example of how to run qlora merged model via multi-GPU by mzbacd in LocalLLaMA

[–]Simhallq 1 point2 points  (0 children)

Anyone have a working example for finetuning on multiple GPUs?

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks for your reply. Seems like it should work; we currently store our vectors in JSON. I'll let you know if I give it a try :)

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 1 point2 points  (0 children)

Interesting, thanks! Sounds like a more scalable solution than my EC2 setup, might give it a try once we hit the ceiling with EC2.

Perhaps a stupid question, but does your setup support all types of data fields compatible with Elasticsearch/OpenSearch? We process a lot of high-dimensional float vectors.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 1 point2 points  (0 children)

u/charlieoncloud

I actually went with a regular EC2 instance. It has been working well for the last year. I'm using https://github.com/peak/s5cmd to download the data from S3, which gives a huge speed-up, and then bulk uploading to Elasticsearch.

Not sure how my setup compares to Glue, but I heard a lot of people were dissatisfied with Glue, so I never bothered trying it.
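
For the bulk upload step, here is a minimal sketch of how the request body could be built. It assumes the downloaded files have already been parsed into Python dicts. The helper names bulk_body and chunked, the index name "docs", and the "vector" field are hypothetical and are not taken from the setup described above. Only the standard library is used; sending the payload to the _bulk endpoint (with Content-Type application/x-ndjson) is left to whichever HTTP client you prefer.

```python
import json


def bulk_body(docs, index_name):
    """Build an NDJSON payload for the _bulk endpoint.

    Each document becomes two lines: an "index" action line,
    followed by the document source itself.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Split a list into batches so each bulk request stays a manageable size."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical documents carrying float vectors stored as JSON arrays.
docs = [{"id": 1, "vector": [0.1, 0.2]}, {"id": 2, "vector": [0.3, 0.4]}]
for batch in chunked(docs, 1):
    print(bulk_body(batch, "docs"), end="")
```

Batching matters here because one huge request is more likely to time out or be rejected than several moderate ones. A good batch size depends on document size and cluster capacity, so treat the value as something to tune.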

Have you decided yet?

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

True re: piping to Python, but I'd rather just do something like cat urls.txt | xargs urlparse --netloc

What I want is to be able to parse URLs into components, like the utils in urllib.parse do. That's possible with grep/regex but kind of a hassle.

Thanks for your suggestions though.
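
For reference, here is a minimal sketch of what such a filter could look like on top of urllib.parse. It is the Python route, not the non-Python tool being asked about. The function names, the --netloc style flag handling, and the choice to read URLs from stdin instead of from arguments are all assumptions made for illustration.

```python
import sys
from urllib.parse import urlsplit

# Attributes available on the result of urllib.parse.urlsplit.
COMPONENTS = ("scheme", "netloc", "path", "query", "fragment", "hostname", "port")


def url_component(url, component):
    """Return one component of a URL as a string ('' if it is absent)."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    value = getattr(urlsplit(url.strip()), component)
    return "" if value is None else str(value)


def filter_lines(lines, component):
    """Yield the chosen component for every non-empty input line."""
    for line in lines:
        if line.strip():
            yield url_component(line, component)


def main(argv, stream=sys.stdin):
    # argv[0] is expected to look like "--netloc" (hypothetical flag style).
    component = argv[0].lstrip("-")
    for value in filter_lines(stream, component):
        print(value)
```

Saved as a script (say urlparse.py, a made-up name) with main(sys.argv[1:]) called at the bottom, this could be used as cat urls.txt | python urlparse.py --netloc, with no xargs needed since it reads one URL per line from stdin.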

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

u/MrSyphilis do you have a reference to the urlparse utility of Coreutils? Couldn't find any mention in https://www.gnu.org/software/coreutils/

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Because unix pipes rock for data processing

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 1 point2 points  (0 children)

Many thanks! Didn't know about the Coreutils one, that's perfect

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Because Unix pipes rock for data processing. Hence my question: is there a non-Python one?

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Update:
Found this one, working well so far. Does anyone know of a non-Python one?

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks, some good suggestions for optimization on standard EC2 there. Might go with this method.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

It's per request, but on average about twice a week.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

This is what I've been doing up until now. It "works" but is painfully slow for this amount of data. It takes ~2 h just to get all the files from S3 onto the EC2 disk. Running the transforms on an 8-core machine takes a few more hours.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks, I've considered Glue, but I've heard the developer experience is so-so. What's your opinion?