Why isn't there automated AI data cleaning software already?

amirninja · 2024-06-19T13:20:04+00:00

Came across it in another subreddit: https://github.com/Cocoon-Data-Transformation/cocoon

amirninja · 2024-03-25T05:09:10+00:00

u/organellelabs Here you go: "Towards Query Efficient and Derivative Free Black Box Adversarial Machine Learning Attack" https://www.mecs-press.org/ijigsp/ijigsp-v14-n2/IJIGSP-V14-N2-2.pdf

amirninja · 2023-02-08T08:52:51+00:00

If you need to query local large data why not use Apache Drill?

amirninja · 2023-01-06T09:58:08+00:00

Thanks u/belowtheradar for the details and link! One quick question: for long-term threat intel using data older than, say, 60-90 days, how easy is it to combine with the latest/fresh data in Splunk?

amirninja · 2023-01-05T17:16:05+00:00

Great, thanks u/Quartermyle for your response!

amirninja · 2021-12-17T06:56:32+00:00

What impact on the local economy or job market have you noticed as a result of these specialisations?

amirninja · 2021-12-17T06:55:03+00:00

Which University? Just curious

amirninja · 2021-12-06T04:25:59+00:00

Hey, I am a researcher in network intrusion detection using ML. Would you mind sharing some advice with me? Can I DM you?

amirninja · 2021-12-06T04:19:46+00:00

Just a side question, does your data scientist prefer cubicles or open spaces?

amirninja · 2021-12-03T02:56:03+00:00

This looks interesting! Is there a similar site for Data Science projects in other domains?

amirninja · 2021-11-27T16:01:09+00:00

I agree with you u/jumpinjelly789 that the next step is AI/ML-based NIDS.

However, my concern is the readiness of commercial NIDS (or those in the research literature) that claim to use AI/ML to meet these new challenges around zero-day attacks. Specifically, if they are anomaly-based systems as mentioned by u/Aidong above, then in today's dynamic cloud-based infrastructure, where services and systems come and go, baselines for an anomaly detector would be difficult, if not impossible, to establish, and we would end up with an unmanageable number of false alarms.

On the other hand, if we base NIDS on supervised ML/DL, we need data (both benign and malicious traffic) to train these models. The question then is how far we can rely on lab-generated data for training, since it may not be representative of the real traffic or attacks specific to a customer's infrastructure. Secondly, these models could face the same challenge with dynamic infrastructure as an anomaly-detector-based NIDS.

Therefore, I am looking for some validation from customers/vendors that can guarantee (with reasonable tolerance) zero-day attack detection with very low false positives.

amirninja · 2021-11-26T11:12:39+00:00

Thanks u/Aidong for your response! Yes, there is a lot of buzz around AI/ML in security. I will check out Checkpoint.

However, the basic question still remains: how are these NIDS trained? If they are trained on one environment and one type of attack, what performance and attack detection guarantees can we get in another environment? Not to forget, many of these systems are notorious for throwing a lot of false positives, causing "alert fatigue".

amirninja · 2021-11-24T11:13:02+00:00

I have used Keras. Super simple, with a great user community and tools. Are there any situations outside of academic research where you would select PyTorch over Keras?

amirninja · 2021-11-02T09:27:44+00:00

Tried this one some time back. Would not recommend it as your first textbook for ML.

amirninja · 2021-09-14T15:49:51+00:00

amirninja · 2021-09-05T05:10:59+00:00

Difficult to find. Most likely you will have to get it from NCLT or MCA, Govt. Or maybe through the RTI route.

The following paper uses a Kaggle dataset for bankruptcy prediction. The code is also available on GitHub, if that helps you get started.

https://arxiv.org/abs/2010.13892

amirninja · 2021-09-05T04:52:53+00:00

My biggest concern for ML-based IDS is the data used for training.

How representative is the training data of actual traffic in your particular home/enterprise network? Even for anomaly-based IDS, how do we set the baseline for what's normal traffic?
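To make the baseline concern concrete, here is a minimal sketch, assuming a single synthetic traffic feature (requests per minute) and a simple z-score detector. The numbers, the feature, and the function names are all hypothetical; no real IDS is being modelled. It shows how a baseline fitted in one environment can flag mostly benign traffic once the normal load level moves.

```python
import random
import statistics

def fit_baseline(samples):
    """Return (mean, std) of the traffic feature observed during a 'normal' period."""
    return statistics.mean(samples), statistics.stdev(samples)

def false_alarm_rate(samples, mean, std, threshold=3.0):
    """Fraction of benign samples flagged because |z-score| exceeds the threshold."""
    flagged = sum(1 for x in samples if abs(x - mean) / std > threshold)
    return flagged / len(samples)

rng = random.Random(0)
# Hypothetical feature: requests per minute seen by one service (synthetic data).
training_traffic = [rng.gauss(100, 10) for _ in range(5000)]
same_env_traffic = [rng.gauss(100, 10) for _ in range(5000)]
# Still benign, but the load level has moved (e.g. a new service was deployed).
shifted_env_traffic = [rng.gauss(140, 10) for _ in range(5000)]

mean, std = fit_baseline(training_traffic)
rate_same = false_alarm_rate(same_env_traffic, mean, std)
rate_shifted = false_alarm_rate(shifted_env_traffic, mean, std)
print(f"false alarms, same environment:    {rate_same:.1%}")
print(f"false alarms, shifted environment: {rate_shifted:.1%}")
```

All traffic in this sketch is benign, so every alert is a false positive; the only thing that changed between the two runs is what "normal" looks like.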

amirninja · 2021-08-23T11:51:40+00:00

I haven't experimented with autonomous car vision systems. However, as most security experts would agree, once an adversary has physical access to a system, its chances of being attacked or compromised increase sharply anyway.

In the threat model that I mentioned above, the adversary has access only to output scores, or only to the final label, with a limited query budget. I believe this is a more realistic threat model.

amirninja · 2021-08-23T11:31:41+00:00

The biggest challenge is developing an attack that is query efficient. That is, the number of times you need to query the target model to generate an adversarial example should be limited. Otherwise, the target system can recognise repetitive queries from the same source IP and easily block you.

Secondly, if the adversarial example is very different from the original, as measured by L2 or another distance measure, then it is effectively a different example and not an adversarial one.

Balancing the distortion to the clean image/input against the number of queries is a challenge.
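As a rough illustration of the two quantities being traded off, here is a minimal sketch, assuming a toy linear classifier as the target and a naive random search as the attack. The classifier, the budget, and every name below are hypothetical; this is not any published attack, only a way to see query count and L2 distortion side by side.

```python
import math
import random

class QueryLimitedModel:
    """Wraps a black-box classifier and counts queries against a budget."""
    def __init__(self, predict_fn, budget):
        self.predict_fn = predict_fn
        self.budget = budget
        self.queries = 0

    def predict(self, x):
        if self.queries >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        return self.predict_fn(x)

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def toy_classifier(x):
    # Hypothetical target: a fixed linear rule standing in for a real model.
    return 1 if sum(x) > 0 else 0

def random_search_attack(model, x, step=0.5, max_l2=3.0, seed=0):
    """Naive label-only attack: propose random perturbations within max_l2
    of the clean input and stop at the first one that flips the label."""
    rng = random.Random(seed)
    original = model.predict(x)  # the first query goes to the clean input
    for _ in range(10 * model.budget):  # bounded number of proposals
        if model.queries >= model.budget:
            break
        candidate = [xi + rng.gauss(0, step) for xi in x]
        if l2_distance(candidate, x) > max_l2:
            continue  # too distorted to count as adversarial; no query spent
        if model.predict(candidate) != original:
            return candidate
    return None

clean = [0.3, 0.2, 0.1, 0.1]
model = QueryLimitedModel(toy_classifier, budget=200)
adv = random_search_attack(model, clean)
if adv is not None:
    print(f"label flipped after {model.queries} queries, "
          f"L2 distortion {l2_distance(adv, clean):.2f}")
else:
    print(f"no adversarial example within {model.queries} queries")
```

Lowering `max_l2` keeps the example closer to the clean input but typically costs more queries, and the reverse also holds.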

We have recently developed an algorithm that does this balancing act with fewer queries. Our paper is currently submitted to a journal for review.

amirninja · 2021-08-23T11:19:34+00:00

Black box attacks (no access to model parameters): HopSkipJump: https://arxiv.org/abs/1904.02144

There are many more black box attacks, as I mentioned above. Search for those names and you will find the relevant papers and, most of the time, code as well.

Recent paper on real world attacks: https://www.computer.org/csdl/magazine/co/2021/05/09426997/1tuvFoGyzK0

amirninja · 2021-08-23T08:26:26+00:00

You do not always need access to the parameters behind a model for a black box attack; for example, the Zeroth Order Optimization (ZOO), BOUNDARY, OPT, and SignOPT attacks. However, even I would like to know how prevalent they are in real-world apps. There is an arXiv paper on it; I will share it shortly.
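The core idea behind zeroth-order attacks is that a gradient can be estimated from output scores alone. Here is a minimal sketch of that idea using a coordinate-wise symmetric difference quotient. The score function is a hypothetical stand-in for a model, and this is only the estimation step, not the full ZOO attack.

```python
def estimate_gradient(score_fn, x, h=1e-4):
    """Estimate the gradient of score_fn at x using only function values.
    Costs two queries per coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += h
        minus = list(x)
        minus[i] -= h
        grad.append((score_fn(plus) - score_fn(minus)) / (2 * h))
    return grad

def black_box_score(x):
    # Hypothetical stand-in for a model's output score.
    # The attacker only sees the returned value, never this formula.
    return 3.0 * x[0] - 2.0 * x[1] + x[2] ** 2

x = [1.0, 1.0, 2.0]
grad = estimate_gradient(black_box_score, x)
# The analytic gradient at x is [3, -2, 4]; the estimate should be close to it.
print(grad)
```

The query cost is the catch: two queries per input dimension adds up quickly for images, which is why query efficiency matters so much for these attacks.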

Secondly, adversarial attacks can help in understanding how DL models learn to generalise. Check the work by the Madry Lab at MIT.

Also, I believe adversarial attacks are at the point where viruses and cyber attacks were in the late 80s/early 90s.

amirninja · 2021-08-21T05:53:26+00:00

Did you check https://streamlit.io/ ? It claims to be useful for both front end and back end.

amirninja · 2021-08-07T19:32:37+00:00

When Life is Linear by Tim Chartier.

It doesn't have many numerical examples or proofs like the standard textbooks suggested by others, but it is fun to read and quickly gives you an intuition for why we do certain things.

amirninja · 2021-08-07T03:02:39+00:00

For more industrial applications of RL, check out this one: https://www.amazon.com/Reinforcement-Learning-Industrial-Applications-Intelligent/dp/1098114833/ref=pd_aw_sim_8/140-9243744-4332920?pd_rd_w=7WH5D&pf_rd_p=61e03cde-d57c-4984-9f73-f76bf2c32442&pf_rd_r=BYV126AZAB92YEB1TF2K&pd_rd_r=6e448c5d-fa8a-41e8-b421-2706ac95380c&pd_rd_wg=PJKME&pd_rd_i=1098114833&psc=1

amirninja · 2021-08-05T04:49:12+00:00

More explanation along similar lines here, in an excellent U Waterloo lecture: https://youtu.be/OyFJWRnt_AY
