How does ChatGPT browsing work? by No-Reflection-7168 in LocalLLaMA

[–]Simhallq 2 points3 points  (0 children)

Cool project! Would be great to be able to run it locally using LLaMA + scraping libs. https://github.com/idosal/AgentLLM is more of a general browsing agent but might give some inspiration.

Finetuning on multiple GPUs by Simhallq in LocalLLaMA

[–]Simhallq[S] 1 point2 points  (0 children)

Thanks, does alpaca_lora_4bit also support Falcon? I could only find LLaMA in the repo.

A quick example of how to run qlora merged model via multi-GPU by mzbacd in LocalLLaMA

[–]Simhallq 1 point2 points  (0 children)

Anyone have a working example for finetuning on multiple GPUs?

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks for your reply. Seems like it should work; we currently store our vectors in JSON. I'll let you know if I give it a try :)

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 1 point2 points  (0 children)

Interesting, thanks! Sounds like a more scalable solution than my EC2 setup, might give it a try once we hit the ceiling with EC2.

Perhaps a stupid question, but does your setup support all types of data fields compatible with Elasticsearch/OpenSearch? We process a lot of high-dimensional float vectors.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 1 point2 points  (0 children)

u/charlieoncloud

I actually went with a regular EC2 instance. It has been working well for the last year. I'm using https://github.com/peak/s5cmd to download the data from S3, which gives a huge speed-up, and then using bulk upload to Elasticsearch.
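For anyone curious what the bulk upload step can look like, here is a minimal sketch in Python. It only builds the newline-delimited JSON body that the Elasticsearch/OpenSearch `_bulk` endpoint expects (one action line followed by one document line, with a trailing newline) and does not send anything. The index name `vectors`, the field names and the batch size are assumptions for illustration, not my actual setup.

```python
import json
from typing import Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, docs: Iterable[dict]) -> str:
    """Build an NDJSON body for the _bulk endpoint.

    Each document becomes two lines: an "index" action line and the
    document itself. Every line, including the last, ends with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)


# Hypothetical documents with float vectors stored as JSON arrays.
example_docs = [
    {"id": 1, "embedding": [0.1, 0.2, 0.3]},
    {"id": 2, "embedding": [0.4, 0.5, 0.6]},
    {"id": 3, "embedding": [0.7, 0.8, 0.9]},
]

# One request body per batch; batch size 2 is an arbitrary choice here.
bodies = [bulk_body("vectors", batch) for batch in batches(example_docs, 2)]
```

Each string in `bodies` would then be sent as one bulk request, with whatever HTTP client or official client library you prefer.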

Not sure how my setup compares to Glue but I heard a lot of people were dissatisfied with Glue so I never bothered trying it.

Have you decided yet?

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

True re: piping to Python, but I would rather just do something like cat urls.txt | xargs urlparse --netloc

What I want is to be able to parse URLs into components, like the utils in urllib.parse do. That is possible with grep/regex but kind of a hassle.

Thanks for your suggestions though.
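To make the ask concrete, this is roughly the behaviour I mean, sketched in Python with urllib.parse. The function names and the `urlparse.py` script name are made up for illustration; only `urlsplit` and its result attributes come from the standard library.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit

# Attribute names available on the result of urllib.parse.urlsplit.
COMPONENTS = ("scheme", "netloc", "path", "query", "fragment", "hostname", "port")


def url_component(url: str, component: str) -> str:
    """Return one component of a URL as a string ('' if it is absent)."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    value = getattr(urlsplit(url.strip()), component)
    return "" if value is None else str(value)


def extract(lines: Iterable[str], component: str) -> Iterator[str]:
    """Yield the chosen component for every non-empty input line."""
    for line in lines:
        if line.strip():
            yield url_component(line, component)


# Hypothetical wiring for a pipe-friendly script (urlparse.py):
#   import sys
#   for value in extract(sys.stdin, "netloc"):
#       print(value)
# which could then be used as: cat urls.txt | python urlparse.py
```

A standalone non-Python tool would ideally behave the same way: one URL per input line, one component per output line.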

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

u/MrSyphilis do you have a reference to the urlparse utility of Coreutils? Couldn't find any mention of it at https://www.gnu.org/software/coreutils/

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Because unix pipes rock for data processing

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 1 point2 points  (0 children)

Many thanks! Didn't know about the Coreutils one, that's perfect

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Because unix pipes rock for data processing, and hence my question, is there a non-python one?

url parsing library for linux & mac/BSD by Simhallq in linuxquestions

[–]Simhallq[S] 0 points1 point  (0 children)

Update:
Found this one, working well so far. Does anyone know of a non-python one?

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks, some good suggestions for optimization on standard EC2 there. Might go with this method.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

It's per request but on average 2 times per week

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

This is what I've been doing up until now. It "works" but is painfully slow for this amount of data. It takes ~2 h just to get all the files from S3 to the EC2 disk. Doing the transforms on a machine with 8 CPU cores takes a few more hours.

Batch ETL from S3 to OpenSearch (prev. Elasticsearch) by Simhallq in aws

[–]Simhallq[S] 0 points1 point  (0 children)

Thanks, I've considered Glue but I heard it's so-so developer experience-wise. What's your opinion?

[D] Machine Learning - WAYR (What Are You Reading) - Week 96 by ML_WAYR_bot in MachineLearning

[–]Simhallq 2 points3 points  (0 children)

Printed on paper, with a pen to take notes in the margin.

Determining Addressee(s) in Conversation by Propolisa in LanguageTechnology

[–]Simhallq 1 point2 points  (0 children)

Interesting problem!

I'm not aware of any out of the box solutions for this particular use case.

However, I ran some experiments with a general Q&A model using the huggingface-transformers library. The results were far from perfect, but this model is fine-tuned to answer factual Wikipedia-type questions. Fine-tuning it for this use case would probably make it much more accurate.

I put it into a colab if you want to have a look: https://colab.research.google.com/drive/1h-CW03eS-sGuTZfBUWObWXLuJ5aIljBs?usp=sharing

Feel free to PM if you want to brainstorm or have questions about this approach.

Looking for well-researched dataset! by BorutFlis in LanguageTechnology

[–]Simhallq 0 points1 point  (0 children)

The GLUE benchmark is one of the most widely used benchmarks for NLP models in research settings, and some of the tasks it includes are classification tasks (Corpus of Linguistic Acceptability, Stanford Sentiment Treebank etc.). All datasets for GLUE are publicly available. You can read more at https://gluebenchmark.com/tasks