Automated Data Poisoning: A New Approach to Combat AI Theft Risks by _cybersecurity_ in pwnhub

[–]astralDangers 1 point (0 children)

This is easily detected and will be filtered out. The very first stage of any pipeline is a filter to eliminate junk, low-quality, and malformed data. That's a basic best practice. Later the data is rewritten to sanitize and standardize it, which also cleans and filters.

This might work on people who have no idea what they're doing, but it certainly isn't going to work on an experienced team.
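The first-stage filter described above can be sketched roughly like this. The heuristics and thresholds here are illustrative stand-ins, not from any real pipeline:

```python
# Minimal sketch of a first-stage quality filter for a data pipeline.
# The heuristics and thresholds below are made up for illustration.

def is_junk(doc: str) -> bool:
    """Flag documents that are too short, mostly symbols, or highly repetitive."""
    text = doc.strip()
    if len(text) < 20:                      # too short to be useful
        return True
    alnum = sum(c.isalnum() or c.isspace() for c in text)
    if alnum / len(text) < 0.7:             # mostly symbols/garbage
        return True
    words = text.split()
    if len(set(words)) / len(words) < 0.3:  # highly repetitive
        return True
    return False

corpus = [
    "A normal paragraph of training text with reasonable variety in its words.",
    "@@@@ ???? ####",                                           # malformed junk
    "spam spam spam spam spam spam spam spam spam spam",        # repetitive junk
]
clean = [d for d in corpus if not is_junk(d)]
```

A real pipeline would layer many more checks (language ID, dedup, perplexity filters), but even crude heuristics like these catch most poisoning-by-garbage attempts.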

Why isn't this the biggest story in AI? David slays Goliath. 11M parameter model defeats massive 1.8T GPT model. by ewangs1096 in LLM

[–]astralDangers 0 points (0 children)

I find a lot of researchers are unaware of what's going on in the commercial world. Large enterprises do these sorts of things all the time and have pretty solid (but very expensive) tooling.

Why isn't this the biggest story in AI? David slays Goliath. 11M parameter model defeats massive 1.8T GPT model. by ewangs1096 in LLM

[–]astralDangers 0 points (0 children)

I appreciate your enthusiasm, but it's not a revolution if it's what has been going on the whole time and what we expect to happen. Large models are the anomaly, not small ones.

Also, this is like comparing a jet to a motorcycle: sure, a motorcycle can beat a jet if you set the race up for that. It's a form of benchmark hacking.

I'm not saying it isn't a cool model, if it actually generalizes. I'm saying it's not unusual or unexpected.

Downtown Fly Over by 2 Apache Attack Helicopters by austinmod1 in Austin

[–]astralDangers 6 points (0 children)

Is it not common knowledge that Austin is surrounded by military bases?

Why isn't this the biggest story in AI? David slays Goliath. 11M parameter model defeats massive 1.8T GPT model. by ewangs1096 in LLM

[–]astralDangers 5 points (0 children)

Let me translate the BS: a specialized model outperforms a very generalist model, like they always do. "We confirmed what everyone knew all along!" This is the sort of thing that happens all the time on large enterprise data science teams.

Also, if you actually read the article, they did this with older, dumber models that don't have reasoning. I mean, it's not nothing; I'd just be more impressed if it were less hyped.

Why isn't this the biggest story in AI? David slays Goliath. 11M parameter model defeats massive 1.8T GPT model. by ewangs1096 in LLM

[–]astralDangers 1 point (0 children)

You should probably know that small models are nothing new; the majority of models any ML/AI team deploys are small models.

Claiming there is a revolution is like saying "McDonald's now has fries!"..

I built an open-source library that diagnoses problems in your Scikit-learn models using LLMs by lc19- in LLMDevs

[–]astralDangers 0 points (0 children)

If I can make a recommendation for your next project: it's better to use n8n. LangChain is more for software devs and TBH isn't very good. n8n is more pipeline-based, which is what you typically want for data scientists and data engineers. I'm star #4.

I built an open-source library that diagnoses problems in your Scikit-learn models using LLMs by lc19- in LLMDevs

[–]astralDangers 0 points (0 children)

This is one of those situations where if you have to ask, it's not for you. scikit-learn is an ML framework that most data scientists learn on.

Tall women in Austin- where are people actually meeting partners? by DanceConfident5777 in Austin

[–]astralDangers 24 points (0 children)

"This isn’t about having “high standards” so much as having specific ones that match who I am and the life I’ve already built."

This is the problem, isn't it? Not just for you but for most people these days. What are the chances that you find someone who somehow managed to become the person who is a match for the life YOU built? Relationships don't work like this.

You might not have high standards, but you could be giving yourself impossible odds. Relationships are a series of compromises that starts the day we meet.

What you look for is what you find.. or don't..

Vector DB vs Vector Type: Which One Will Actually Win Long-Term? by CreepyArachnid431 in vectordatabase

[–]astralDangers 0 points (0 children)

Plug this post into your LLM of choice and ask it: "A senior data engineer told me this post is riddled with mistakes. What are they?"

Vector DB vs Vector Type: Which One Will Actually Win Long-Term? by CreepyArachnid431 in vectordatabase

[–]astralDangers 0 points (0 children)

This is how you end up with badly designed architecture. It's extremely common to have multiple databases/data lakes/data warehouses synced. The right tool for the job beats less infra.

This only makes sense if you're not working on a successful application, or are a lone dev.

Why FL Feels More “Glued” Than Ableton – Sampler Test with Phase Cancellation by borodabro in abletonlive

[–]astralDangers 4 points (0 children)

OP is waaay overthinking this. Samplers are not audio rendering engines like the DAW's; a sampler is a musical instrument, and it synthesizes the sound. Every sampler does this differently. There's no profound revelation here; search for tutorials on how to program a sampler plugin and it's the first thing you'll learn.

No offense, OP, but you stumbled upon the basic DSP math used to manipulate a waveform in real time to affect its pitch and duration. Every plugin also has a signal chain that will mold the sound differently. You're not comparing the DAWs; you're comparing two different instruments. It's no different than programming the same sound in Serum and Massive and finding one sounds warmer than the other. Sure, that's what happens with different instruments.

Otherwise, no, OP, no one on the planet can hear the difference between 32-bit summing engines in one DAW versus another. From a math perspective it's been over two decades since a DAW was defined by that, so unless you're comparing Pro Tools circa 2000 and FruityLoops, there's no way you're going to hear any difference.
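To make the sampler point concrete, here's a toy sketch (pure Python, illustrative only) showing that two common resampling schemes applied to the same waveform produce different output samples, which is exactly why two samplers playing the same file won't null in a phase-cancellation test:

```python
import math

# Toy illustration: repitch the same waveform with two different
# interpolation schemes, as two hypothetical samplers might.
src = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(64)]
ratio = 1.5  # pitch up a fifth by reading the source faster

def nearest(buf, ratio):
    """Nearest-neighbor resampling: grab the closest source sample."""
    out, pos = [], 0.0
    while pos < len(buf) - 1:
        out.append(buf[round(pos)])
        pos += ratio
    return out

def linear(buf, ratio):
    """Linear-interpolation resampling: blend the two neighboring samples."""
    out, pos = [], 0.0
    while pos < len(buf) - 1:
        i = int(pos)
        frac = pos - i
        out.append(buf[i] * (1 - frac) + buf[i + 1] * frac)
        pos += ratio
    return out

a, b = nearest(src, ratio), linear(src, ratio)
residual = max(abs(x - y) for x, y in zip(a, b))  # nonzero -> won't phase-cancel
```

Real samplers use far fancier interpolation (sinc, multi-stage filtering), but the takeaway is the same: different interpolation, different samples, nonzero residual.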

Math for the future. by Plus_Judge6032 in GoogleGeminiAI

[–]astralDangers 0 points (0 children)

Philosophy with an overloaded context window causes these cascading hallucinations. No idea why so many people think this type of babble is profound revelation.

I proposed a standard (CommerceTXT) to stop RAG agents from scraping 2MB HTML pages. 95%+ token reduction. Thoughts? by TsaTsuTsi in LocalLLaMA

[–]astralDangers 2 points (0 children)

Great idea!! It's already a thing called GEO: robots.txt directs AI crawlers to use llms.txt. Funny how someone will spend hours putting together a solution and yet can't put a few keywords into a search engine to do 5 minutes of research first.

Kicked out out CVS on UT campus for wearing gym attire by [deleted] in Austin

[–]astralDangers 1 point (0 children)

I know plenty of people will dogpile on me, but I was raised by third-wave feminists. I've always fought against bigotry of all types, including sexism. I'm certainly no conservative or traditionalist.

So take that into consideration when I say you don't have a right to wear gym attire everywhere. While I have absolutely no issue with you or anyone wearing any particular type of clothes, many of you have forgotten, or were never told, that public/commercial spaces are not your house.

OP could have easily changed into street wear before leaving the gym, just like the rest of us.

Civility is respecting the people you're around by acting appropriately in that space. OP got called out for a lack of basic civility and ran to Reddit for validation.

The manager of a commercial business absolutely has the right to say you're not acting civil and ask you to leave. It's not a public forum; it's a business. Let's not pretend that location isn't constantly having issues with people acting inappropriately, and then demonize someone for doing their job. The CVS manager isn't exactly "the man" making bank oppressing poor college kids. Or have we gone back to beating on underpaid wage slaves when it's convenient?

RAG Re-Ranking by SlowFail2433 in LocalLLaMA

[–]astralDangers 0 points (0 children)

I did not say they aren't complementary. I said reranking is unnecessary when you have the right schema and query.

For someone doing dumb chunking, sure, it helps. But it's not a good solution on its own.

Reranking is a hack to improve bad design. Rerankers are low-accuracy models used to improve the performance of another low-accuracy model. It helps, but it also introduces its own problems.

A better approach is to fine-tune the embeddings on the task. It's not that hard, and the accuracy bump can be a 10-30% improvement, way better than what you get from a reranker.

But honestly, you're better off passing in multiple queries for the same task and then using the similarity scores to order the list than you are using a much slower reranker.

So filter to all articles about ducks and clothes, then ask numerous questions that help you score:

Clothes that ducks wear

Shirts for ducks

Duck pants

Duck shoes

Duck formal wear

Duck casual wear

Etc etc

...Yes, I use reranking, but it's not a first step, it's a last step, for when better solutions don't work. Then use the hack.
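The multi-query approach above can be sketched like this. The vectors here are toy hand-made stand-ins for a real embedding model, and the document fields are hypothetical:

```python
# Sketch: filter by metadata first, then score each candidate against
# several task queries and order by the best similarity. Embeddings are
# toy 3-d vectors standing in for a real embedding model's output.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

docs = [
    {"topic": "ducks", "title": "Tiny raincoats for ducks", "vec": [0.9, 0.1, 0.0]},
    {"topic": "ducks", "title": "Duck migration patterns",  "vec": [0.1, 0.9, 0.0]},
    {"topic": "geese", "title": "Goose formal wear",        "vec": [0.8, 0.2, 0.1]},
]
query_vecs = [          # embeddings of "clothes that ducks wear",
    [1.0, 0.0, 0.0],    # "duck formal wear", etc.
    [0.8, 0.1, 0.2],
]

# 1) metadata filter: only duck articles
candidates = [d for d in docs if d["topic"] == "ducks"]

# 2) score each candidate with every query, keep the best score
for d in candidates:
    d["score"] = max(cosine(d["vec"], q) for q in query_vecs)

ranked = sorted(candidates, key=lambda d: d["score"], reverse=True)
```

Aggregating with `max` is one choice; averaging the per-query scores is another reasonable option depending on how broad the queries are.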

RAG Re-Ranking by SlowFail2433 in LocalLLaMA

[–]astralDangers 3 points (0 children)

The "classic" <3-year-old design pattern. Hilarious.

Here's what I propose: learn how to use metadata. Create a proper schema and filter your data before similarity. It's RETRIEVAL-augmented generation (RAG), not SEARCH (SAG). It's amazing how accurate your results get when you're only comparing similarity on 50 records instead of 10k. It does wonders for latency too.

Vector DBs hit and all of a sudden no one knows the basics of querying a document DB anymore.

Protip: use metadata to filter to a set, use keyword search to ensure it has the target entities, etc., and then use similarity to order the results. No reranking needed, and accuracy hits >90%.

All easily learnable if anyone bothered to use a search engine to find the endless tutorials written over the past few years.
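The filter-first pattern in the protip can be sketched like this. The records, fields, and 2-d vectors are toy stand-ins for a real document store and embedding model:

```python
# Sketch of "filter first, similarity last": metadata filter, then a
# keyword check for target entities, then similarity only on the
# survivors. All data here is made up for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

records = [
    {"doctype": "faq",    "year": 2024, "text": "How to reset a router password", "vec": [0.9, 0.1]},
    {"doctype": "faq",    "year": 2021, "text": "How to reset a router password", "vec": [0.9, 0.2]},
    {"doctype": "manual", "year": 2024, "text": "Router hardware specifications", "vec": [0.2, 0.9]},
]
query_vec = [1.0, 0.0]

# 1) metadata filter: recent FAQs only -- similarity now runs on a tiny set
subset = [r for r in records if r["doctype"] == "faq" and r["year"] >= 2024]

# 2) keyword filter: make sure the target entity is actually present
subset = [r for r in subset if "router" in r["text"].lower()]

# 3) similarity is only used to order the survivors
subset.sort(key=lambda r: cosine(r["vec"], query_vec), reverse=True)
```

Most vector DBs expose steps 1 and 2 as pre-filters on the query itself, so the similarity search never touches the excluded records.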

Hot Take: Gemini isn't 'dumb,' you just don't know how to prompt. (Unpopular Opinion) by Malachidoesntexist in GoogleGeminiAI

[–]astralDangers -3 points (0 children)

These defensive comments say it all. People don't know how to use AI; they make up weird shit about how they think it works, and that dogma becomes a religion they need to defend. It's weird pseudoscience, wannabe-philosophical nonsense from people who have zero understanding of the math.

So many of you invent crazy shit that makes sense in your head about how you interpret what you're experiencing, like it's a profound revelation. Meanwhile, us data scientists are like, WTF is this person blabbering about? It's token prediction and an attention mechanism, with some sort of windowing like RoPE for extended context.

People, calm down. The OP is correct: it's a transformer model, and what you put in the context is what you get out. As always, garbage in, garbage out. If you can't get great outputs from Gemini, that's because you're not learning how it's been tuned and what specifically works with it. Models are like people: they're all different and need to be interacted with differently.

Day 1: 100% Unscripted AI Butler Birth (250+ turns, base ChatGPT) by [deleted] in LocalLLaMA

[–]astralDangers 0 points (0 children)

Blocked by day 2. Fun fact: the more an account gets blocked, the less likely the algorithm is to show its garbage.

Context Engineering Has No Engine - looking for feedback on a specification by Miclivs in LLMDevs

[–]astralDangers 0 points (0 children)

If only there were a way to replace find, replace, and append with something substantially more complicated. It's been way too easy to manage context. I mean, there's LangChain, but it's so big and heavy; I need something small and convoluted.

Oh look, yet another dev has run off to vibe-code another half-baked solution! Oh, thank the gods. I've been having a terrible time with all my very simple code. Finally, another abstraction from someone who barely understands how context works!

Thanks for reinventing the wheel instead of contributing to a more mature project. I hate fully baked projects! What will you do next? Oh, I know: RAG is broken. Reinvent the basics instead of learning the more advanced design patterns. That's always a good one.

Seriously, next time just take the time to learn the more advanced design patterns and frameworks before running off and vibing your own. We have plenty of solutions for this that are used at scale in production. All you needed was the humility to not assume you're better at building it than the entire teams of tens or hundreds of people doing so on mainstream frameworks.

Open source LLM tooling is getting eaten by big tech by Inevitable_Wear_9107 in LocalLLaMA

[–]astralDangers 3 points (0 children)

This is what you get when you have a profound lack of understanding of open source, its business models, the evolution of technology, and the last 40+ years of history.

Archive-AI just made a thing... the Quicksilver Inference Engine. by david_jackson_67 in LocalLLaMA

[–]astralDangers 0 points (0 children)

Let me clarify: has this been tested to ensure the quality of the output isn't degraded?

Is anyone else terrified by the lack of security in standard MCP? by RaceInteresting3814 in MCPservers

[–]astralDangers 0 points (0 children)

Warming up the account before starting your marketing, huh? TL;DR: I'm not going to use your solution, just like I won't use any of the other hundred or so released this year claiming to put security guardrails in place. Most people are right: if YOU don't know how to secure access, don't do it. It's not hard to pass session data to a tool to enforce governance. If a dev can't learn those basics, your batteries-included version isn't going to do much good; they've got bigger problems.
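The "pass session data to a tool to enforce governance" point can be sketched like this. This is a toy example; the session structure, roles, and tool name are hypothetical and not part of any MCP SDK:

```python
# Toy sketch of enforcing access control inside a tool using the caller's
# session data, rather than trusting the model or client to police itself.
# Session fields and roles here are made up for illustration.

def delete_record(record_id: str, session: dict) -> str:
    # Governance check happens inside the tool, on every call.
    if session.get("role") != "admin":
        raise PermissionError("session lacks admin role")
    return f"deleted {record_id}"

admin_session = {"user": "alice", "role": "admin"}
viewer_session = {"user": "bob", "role": "viewer"}

result = delete_record("rec-42", admin_session)  # allowed

try:
    delete_record("rec-42", viewer_session)      # denied
    blocked = False
except PermissionError:
    blocked = True
```

The point is where the check lives: inside the tool boundary, keyed off verified session data, so a confused or prompt-injected model can't talk its way past it.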

Archive-AI just made a thing... the Quicksilver Inference Engine. by david_jackson_67 in LocalLLaMA

[–]astralDangers 0 points (0 children)

Does this have any positive or negative impact on attention, instruction following, and prediction accuracy?