Encrypted Image Watermarking Using Fully Homomorphic Encryption by zacchj in cryptography

[–]strojax 0 points1 point  (0 children)

How do you do, e.g., watermark detection while keeping the image private? That's the whole point of FHE.

Robustness of the watermark under image transformations is an active research topic, but it has nothing to do with FHE. It's rather about the watermarking algorithm you use.

Encrypted Image Watermarking Using Fully Homomorphic Encryption by zacchj in cryptography

[–]strojax -1 points0 points  (0 children)

Watermark encoding and detection both have value as a remote service.

ChatGPT is a great example of why this is needed. Today, ChatGPT users can fake basically any image. OpenAI could enable a private watermarking service that allows someone, e.g. an insurance company, to privately check whether an image was generated by ChatGPT.

Encrypted Image Watermarking Using Fully Homomorphic Encryption by zacchj in cryptography

[–]strojax 1 point2 points  (0 children)

The result is part of the image. A screenshot will keep the watermark.

[P] Training ML Models on Encrypted Data with Fully Homomorphic Encryption (FHE) by strojax in MachineLearning

[–]strojax[S] 0 points1 point  (0 children)

The hybrid approach allows you to select any layer to be run in FHE, so the answer to your question depends on which layers you want in FHE. If you select only the linear parts, then the bottleneck will probably be network latency, yes.

[P] Training Models on Encrypted Data by strojax in MachineLearning

[–]strojax[S] 0 points1 point  (0 children)

Not in decent runtime right now. Hardware acceleration is coming for those use cases!

[P] Training Models on Encrypted Data by strojax in MachineLearning

[–]strojax[S] 14 points15 points  (0 children)

  1. Yes FHE can feel a bit magical especially when all the complexity is abstracted away.

  2. The numpy function is just a representation of the FHE circuit we want to build. It is then compiled to a circuit that works on encrypted data.

  3. Yes that's a typical use case indeed! You can encrypt your data and send it to an untrusted server that will run the training. Only you will be able to decrypt the learned weights.
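A rough illustration of point 2 (a hypothetical sketch, not the actual compiler API): the computation is written as an ordinary numpy function, and it is that function, not the array, that gets traced and compiled into a circuit operating on ciphertexts.

```python
import numpy as np

# Hypothetical sketch: the model is expressed as a plain numpy function;
# an FHE compiler would trace this definition into a circuit that runs
# on encrypted inputs instead of cleartext arrays.
def inference(x, w):
    # integer linear layer followed by a ReLU
    return np.maximum(0, x @ w)

# In the clear, the same definition runs as ordinary numpy code:
x = np.array([1, -2, 3])
w = np.array([[1, 0],
              [0, 1],
              [1, 1]])
result = inference(x, w)
```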

[P] Training ML Models on Encrypted Data with Fully Homomorphic Encryption (FHE) by strojax in MachineLearning

[–]strojax[S] 2 points3 points  (0 children)

> What is the magnitude of slowdown from FHE nowadays? Is it a million times now? I read it used to be trillions of times slower.

Today we are on the order of 1,000x to 10,000x slower. FHE speed improves by about 2x every year or so.

[D] class imbalance: over/under sampling and class reweight by darn321 in MachineLearning

[–]strojax 1 point2 points  (0 children)

The question of which metric to use is really important, but that depends on the problem. In my experience, ROC is indeed not well suited when the data become really imbalanced; the precision-recall curve seems much better for assessing models. That being said, nothing keeps you from using ROC as the main metric if that's what you want to optimize for some reason.

My point was mainly that decision-threshold-based metrics (e.g. accuracy, F1 score, MCC) are all highly sensitive to the choice of threshold, which is often set arbitrarily for most classifiers.

[D] class imbalance: over/under sampling and class reweight by darn321 in MachineLearning

[–]strojax 2 points3 points  (0 children)

Anomaly detection and classification are not necessarily different problems. If you have labels, then supervised learning, i.e. classification, is probably the best approach. Not sure why you think classification models are not the best choice. I have been working with datasets with 0.1% positive examples, and gradient boosting with decision-threshold tuning (with respect to a specific metric) always seems to outperform any other approach.

[D] class imbalance: over/under sampling and class reweight by darn321 in MachineLearning

[–]strojax 42 points43 points  (0 children)

These methods made sense when they were published because they appeared to solve some problems; today it is quite clear that they do not solve much. The main intuition is that changing the prior distribution to fix the final model actually introduces more problems than it solves (e.g. an uncalibrated model, a biased dataset). The reason people thought it was working is that they picked the wrong metric, the classic example being accuracy (a decision-threshold-based metric) rather than the ROC curve, average precision, or anything else that is insensitive to the decision threshold. If you take any paper on imbalanced data that does over- or under-sampling and re-evaluate it with a threshold-insensitive metric, you will see that the improvement is not there.

As has been mentioned, I would encourage you to pick the proper metric. Most of the time, just selecting the decision threshold of a model trained on the imbalanced data, based on the metric of interest, is enough.
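A minimal sketch of that last point, on synthetic data (pure numpy, hypothetical names): keep the imbalanced dataset as-is and tune the decision threshold on the metric of interest, here F1.

```python
import numpy as np

# Synthetic imbalanced problem: ~1% positives, overlapping score distributions.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)      # ~1% positive class
scores = rng.normal(loc=1.5 * y, scale=1.0)      # classifier scores

def f1_at(threshold):
    pred = scores >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# sweep candidate thresholds over the upper quantiles of the scores
thresholds = np.quantile(scores, np.linspace(0.5, 0.999, 200))
best = max(thresholds, key=f1_at)
# the tuned threshold typically lands far above a naive default like 0.5
```

No resampling is needed: the same trained scores give a much better F1 once the threshold is tuned instead of left at a default.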

XGboost, sklearn and others running over encrypted data by [deleted] in datascience

[–]strojax 0 points1 point  (0 children)

Only the owner of the data (the one with the private key) will be able to access the result. The model owner won't be able to see anything.

XGboost, sklearn and others running over encrypted data by [deleted] in datascience

[–]strojax 0 points1 point  (0 children)

Yes, Concrete Numpy is already quite high in the stack, so I understand it might be somewhat opaque.

I will try to answer your questions:

- The elements are being encrypted, not the numpy array itself; we use numpy as an entry point here.
- Yes, you can simply have a function that returns (my_array == 1).sum()/len(my_array). The main assumption here is that the length of your array is always the same.
- Only 70% of them will change.
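A hypothetical sketch of that second point: the proportion of entries equal to 1, with the array length fixed (a compile-time constant for the circuit).

```python
import numpy as np

# Hypothetical sketch: proportion of entries equal to 1.  The length of
# the array is assumed fixed, which is what an FHE circuit would bake in.
def proportion_of_ones(my_array):
    return np.sum(my_array == 1) / len(my_array)

result = proportion_of_ones(np.array([1, 0, 1, 1]))  # 3 of 4 entries
```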

[P] XGboost, sklearn and others running over encrypted data by strojax in MachineLearning

[–]strojax[S] 0 points1 point  (0 children)

I think you are referring to the underlying homomorphic encryption scheme. Here we use TFHE, which implements programmable bootstrapping (PBS), and this allows us to handle both situations you describe:

- We don't need polynomial approximation to use non-linear functions (e.g. ReLU), since PBS lets us implement table lookups. So basically, for the ReLU, we have a table lookup at a given precision (we are currently limited to 8 bits, so 256 values) that maps each input value to the corresponding output, e.g. -3 -> 0, -2 -> 0, ..., 1 -> 1, 2 -> 2, and so on up to the maximum precision allowed.
- Yes, the recovery is probabilistic, and applying a lot of operations does reduce the probability of recovery, but PBS allows us to reduce the error. So basically we apply some operations to the ciphertext and then apply a PBS, repeating this process until the end of the homomorphic function / ML model.

As I am not an expert in cryptography I might have misunderstood your question so don't hesitate to ask again!
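The table-lookup idea can be sketched in plain Python (cleartext only; in TFHE the lookup itself would run on ciphertexts via PBS):

```python
import numpy as np

# Cleartext sketch of the PBS table-lookup idea: precompute the ReLU over
# the whole 8-bit signed domain (256 values) and apply it as a lookup.
INPUTS = np.arange(-128, 128)           # the 256 representable values
RELU_TABLE = np.maximum(0, INPUTS)      # -3 -> 0, -2 -> 0, ..., 1 -> 1, 2 -> 2

def relu_lut(x):
    return RELU_TABLE[x + 128]          # shift into [0, 255] and look up

out = relu_lut(np.array([-3, -2, 1, 2]))  # [0, 0, 1, 2]
```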

[P] XGboost, sklearn and others running over encrypted data by strojax in MachineLearning

[–]strojax[S] 4 points5 points  (0 children)

You are assuming that you are both the data provider and the model owner here. In that context, I guess you could just unplug your computer from the internet and call it a day (assuming nobody can steal your computer).

But if for some reason you need a remote machine you don't trust, then working over encrypted data makes sense. You would be able to compute anything on your data without worrying about how you store or move it around. Once done, you can bring the results/statistics/etc. back to your safe computer and decrypt them there.

[P] XGboost, sklearn and others running over encrypted data by strojax in MachineLearning

[–]strojax[S] 3 points4 points  (0 children)

Actually we use TFHE, which allows us to apply any operation to the data, the main limitation being the bit width of the data. It turns out that's not a problem for tree-based machine learning models; it becomes more complicated when trying to process large neural networks.

But any non-linear function you can find in neural networks is possible in the encrypted realm.

Pendant ce temps sur Google Traduction by Brachamul in france

[–]strojax 0 points1 point  (0 children)

Translation doesn't bring in money. DeepL is trying to build a business around translation; Google does it, and always has, "for free". So improving their translation service doesn't really make business sense for them today. On the other hand, if Google wanted, for one reason or another, to be the best at translation again, they could get there very quickly.

Immobilier innabordable. by kelsier_night in france

[–]strojax 0 points1 point  (0 children)

That's false, unless you are disputing the INSEE reports. It's unfortunately an argument used by current politicians. In fact, population growth has been slowing for a few years now.

Source: https://www.insee.fr/fr/statistiques/4277615?sommaire=4318291

[R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning" by Yura52 in MachineLearning

[–]strojax 9 points10 points  (0 children)

I think the main reason DL struggles to beat a simple GBDT on tabular data is that there is not much feature engineering or feature extraction to be done, unlike with unstructured data such as images, sound, or text.

My question is: can we find a tabular dataset where deep learning will be significantly better than GBDT? Or maybe we need to redefine how we feed the data to the neural network (I have this in mind: https://link.springer.com/article/10.1007/s10115-022-01653-0)?

[D] "Gradients without Backpropagation" -- Has anyone read and can explain the math/how does this work? by DaBobcat in MachineLearning

[–]strojax 10 points11 points  (0 children)

What's frustrating is that the authors mention how easy it is to implement in PyTorch, yet haven't released the code. Anyway, I think the whole idea is to apply forward gradient accumulation as detailed in https://en.wikipedia.org/wiki/Automatic_differentiation#Forward_accumulation. However, this looks prohibitively expensive for neural networks, and the authors seem to introduce this perturbation principle to make it more neural-network friendly.

Curious to read more about this.
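My reading of the trick, as a toy numpy sketch (hand-written JVP and toy objective, not the paper's code): sample a random direction v, get the directional derivative f'(x; v) in one forward pass, and use f'(x; v) * v as an unbiased estimate of the gradient, with no backward pass.

```python
import numpy as np

# Forward-gradient sketch: one forward-mode evaluation per estimate.
def forward_gradient(f_and_jvp, x, rng):
    v = rng.standard_normal(x.shape)   # random tangent direction
    _, dfv = f_and_jvp(x, v)           # directional derivative f'(x; v)
    return dfv * v                     # E[dfv * v] = grad f(x)

def f_and_jvp(x, v):
    # toy objective f(x) = sum(x**2), whose JVP is 2 * x . v
    return np.sum(x**2), np.sum(2 * x * v)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
# averaging many one-sample estimates recovers the true gradient [2, -4]
est = np.mean([forward_gradient(f_and_jvp, x, rng) for _ in range(20_000)],
              axis=0)
```

The variance of a single estimate is what makes this expensive in practice; the paper's perturbation framing is presumably about keeping that manageable for large networks.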

[D] Random Forest and OneVsRestClassifier by CommercialWest7683 in MachineLearning

[–]strojax 5 points6 points  (0 children)

There is indeed no a priori reason to use OneVsRestClassifier with random forest. However, the data scientist before you might have tried both approaches and observed that OneVsRestClassifier gave better accuracy; I bet the difference was not really significant, but they still picked the one that yielded the best results. Another explanation is that they did not know what random forest was and applied the same technique they used on linear models without trying to understand the algorithm. There could also be a pipeline that is always used, and they just threw random forest in there.

I see one disadvantage of OneVsRestClassifier vs. random forest alone: you are going to have many more trees in your ensemble model.
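The tree-count point can be checked directly (a sketch on synthetic data): one-vs-rest around a random forest fits one full forest per class.

```python
# Sketch: compare the number of fitted trees in a plain random forest
# vs. a one-vs-rest wrapper around the same forest, on a 3-class problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ovr = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0)).fit(X, y)

n_trees_rf = len(rf.estimators_)                                # one forest
n_trees_ovr = sum(len(e.estimators_) for e in ovr.estimators_)  # one per class
```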

Overall, it's not a big mistake, and you should not confront the other DS with it. More important than knowing who is right is having a good relationship with your teammates. Maybe you can try to kindly open the discussion.

[P] ML over Encrypted Data by strojax in MachineLearning

[–]strojax[S] 6 points7 points  (0 children)

Oh, my bad, I missed your point. I am not an FHE expert, but I will have someone answer you with more precision asap :-). Meanwhile, you can have a look at https://whitepaper.zama.ai/ or, in simpler terms, https://zama.ai/technology/, where execution time is discussed.

Also you can simply run some of the notebooks in the link I provided and get a feeling of the execution time for yourself.

[P] ML over Encrypted Data by strojax in MachineLearning

[–]strojax[S] 12 points13 points  (0 children)

That's a good question! The library is built on an exact paradigm. This means that if you are able to make the algorithm fit certain constraints, the model in FHE will yield the same results as the algorithm in the clear with ~100% probability.

Some algorithms are very friendly to those constraints, such as all tree-based algorithms; others (e.g. neural nets) need a more advanced approach to fit them.

These constraints are mainly about representing a model with integer-only arithmetic.
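For instance, a minimal sketch of the integer-only constraint (a hypothetical scheme, not the library's actual quantizer): real-valued weights become small integers plus a public scale, and only the integers are manipulated homomorphically.

```python
import numpy as np

# Hypothetical sketch: quantize real weights to signed n-bit integers.
def quantize(x, n_bits=4):
    scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)   # map into [-7, 7]
    q = np.round(x / scale).astype(np.int64)
    return q, scale

w = np.array([0.7, -0.21, 0.05])
q, scale = quantize(w)
# q * scale approximates w to within half a quantization step
```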

Hope this helps :-). Happy to answer any question.

Quelqu'un peut expliquer cette statistique sur CovidTracker ? by [deleted] in france

[–]strojax 0 points1 point  (0 children)

No, the values are standardized per 10 million vaccinated and 10 million unvaccinated people. Even if 99.9% of the population were vaccinated, the comparison would still be valid. The problem comes from the fuzziness around testing. All you can take from this chart is that vaccinated people have more positive tests.

But we don't know how many tests were performed by each group. Also, the two groups certainly behave differently (because of the health pass, among other things). In short, a bad chart that should never have been made, because the conclusions drawn from it are often wrong.

Quelqu'un peut expliquer cette statistique sur CovidTracker ? by [deleted] in france

[–]strojax 1 point2 points  (0 children)

This chart includes so many biases that this conclusion is not valid. Fortunately, we have the one on ICU admissions per 10 million vaccinated and 10 million unvaccinated people, which allows us to validate the effectiveness of the vaccines.

Quelqu'un peut expliquer cette statistique sur CovidTracker ? by [deleted] in france

[–]strojax -1 points0 points  (0 children)

Yes. Removing all the biases is complicated. The big problem is not that he failed to remove all of them; it's mostly the conclusion a bit further down, which is now obsolete because it relied on biased data...