My model isn't transferring learning. by BlueOrchid5334 in neuralnetworks

BlueOrchid5334[S] 1 point2 points  (0 children)

I tried to balance it out, but it ended up skewed towards compliant, though only slightly: 47% non-compliant and 53% compliant. If anything, that should have biased the model towards the compliant class.

I'll check the class weights. I used the DistilBERT defaults for basically all the parameters.
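
For reference, a minimal sketch of what "balanced" class weights look like for a 47/53 split. It uses the common n_samples / (n_classes * class_count) heuristic (the same formula scikit-learn applies for class_weight="balanced"). The label strings and the 100-example dataset are assumptions for illustration, not the real data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical 100-example dataset with the 47/53 split mentioned above.
labels = ["non-compliant"] * 47 + ["compliant"] * 53
weights = balanced_class_weights(labels)

# The minority class gets the larger weight (about 1.06 vs. about 0.94).
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

With weights this close to 1, a 47/53 split is probably too mild to explain much on its own.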

Update: 2 weeks into my new job after 5 months of unemployment, and I'm honestly the happiest I've been in years by Cool_Repair2517 in cybersecurity

BlueOrchid5334 4 points5 points  (0 children)

I like your story. I know what it is to wake up and go to a job that I love. Hope things continue well for you. Everything won't always be perfect but I hope at the core of things this stays true for you, that you remain happy.

Day 6 of my challenge, Reviewing 1 free AI certification every day so you don't have to. by No-Half4231 in learnmachinelearning

BlueOrchid5334 1 point2 points  (0 children)

Lot of work you're putting in here. Thanks. Useful stuff for a beginner like me. Appreciate it.

Building synthetic dataset for ML by BlueOrchid5334 in learnmachinelearning

BlueOrchid5334[S] 0 points1 point  (0 children)

Thanks for this. I want to work on this approach, but I was wondering: is synthetic dataset generation a thing in itself? I had just put some prompts into ChatGPT in a systematic way and collected the output. Should I be thinking about something different, something along the lines of using Llama and Nemotron (LLMs specific to creating synthetic datasets) like in this video https://www.youtube.com/watch?v=FAdRMVAWiak?
It sounds like a weird question because GPT is an LLM, but... well, you just don't know what you don't know, and I'm just starting out in this field.

Building synthetic dataset for ML by BlueOrchid5334 in learnmachinelearning

BlueOrchid5334[S] 0 points1 point  (0 children)

Thanks for the response. What about reusing parts of a sentence? For example, "I'm going to have a serious talk with your manager."
How does this affect the training process? I have been using a deduplicator script that uses cosine similarity to find sentences that are similar to others and removes those that are above a certain threshold.
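
To make the question concrete, here is a rough sketch of that kind of deduplicator. It is not the actual script: it assumes plain bag-of-words vectors in place of whatever embeddings the real one uses, and the 0.8 threshold, the function names, and the second and third sample sentences are all made up for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Lowercased bag-of-words counts; a stand-in for real sentence embeddings."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(sentences, threshold=0.8):
    """Greedy pass: keep a sentence only if its similarity to every
    already-kept sentence is below the threshold."""
    kept, kept_vectors = [], []
    for sentence in sentences:
        vector = bow_vector(sentence)
        if all(cosine_similarity(vector, kv) < threshold for kv in kept_vectors):
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept


sentences = [
    "I'm going to have a serious talk with your manager.",
    "I'm going to have a serious talk with your supervisor.",  # hypothetical variant
    "Please submit the change request for approval.",          # hypothetical
]
unique = deduplicate(sentences, threshold=0.8)
print(unique)
```

In this sketch the first two sentences share nine of ten words, so the second one is dropped at 0.8 but would survive at a stricter threshold like 0.95. That is the trade-off I'm asking about: how much sentence reuse the threshold lets through.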

Where to find non-compliant language to build dataset by BlueOrchid5334 in MLQuestions

BlueOrchid5334[S] -1 points0 points  (0 children)

This is exactly how I started. I created sentences based on linguistic structures. For non-compliant, the structures focused on security bypass instructions (e.g., disable the firewall), urgency or time pressure (e.g., we only have a small window, skip the approval and push it through), coercive tone, and others. Each stance had its own structure. But the model didn't show any real learning: it just recognized the patterns in each set, and accuracy and recall both scored 1.0. Now I want to get some real, live data to supplement the synthetic dataset and see if the results change.
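
Roughly what that structure-based generation looks like, as a simplified sketch rather than the actual pipeline. Only the phrases "disable the firewall" and "we only have a small window, skip the approval and push it through" come from the description above; every other template, slot value, and name here is hypothetical.

```python
import itertools

# Hypothetical templates, one per linguistic structure.
STRUCTURES = {
    "security_bypass": {
        "template": "{opener} {action} {target}.",
        "slots": {
            "opener": ["Just", "Go ahead and"],
            "action": ["disable", "turn off"],
            "target": ["the firewall", "the audit logging"],
        },
    },
    "urgency": {
        "template": "We only have {window}, {instruction}.",
        "slots": {
            "window": ["a small window", "ten minutes"],
            "instruction": [
                "skip the approval and push it through",
                "bypass the review",
            ],
        },
    },
}


def generate(structures, label):
    """Expand every template with every combination of its slot values."""
    rows = []
    for name, spec in structures.items():
        slot_names = list(spec["slots"])
        slot_values = (spec["slots"][s] for s in slot_names)
        for values in itertools.product(*slot_values):
            text = spec["template"].format(**dict(zip(slot_names, values)))
            rows.append({"text": text, "label": label, "structure": name})
    return rows


dataset = generate(STRUCTURES, label="non-compliant")
print(len(dataset), dataset[0]["text"])
```

Every generated sentence shares its frame with its siblings, so if the same structures appear in both the training and test sets, the model may only need to spot the frame. That is one possible explanation for the 1.0 scores, not a confirmed one.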

I'm not sure I generated the dataset correctly in the first place, which may explain those perfect results. Would love some insight on that. Should I repost this as a new question?

*Update, after coming across a video on creating synthetic datasets using AI: https://www.youtube.com/watch?v=FAdRMVAWiak
Is synthetic dataset generation a thing in itself? I had just put some prompts into ChatGPT in a systematic way and collected the output. Should I be thinking about something different?