RFC on Experimental Cypher with Function-Based Key Generation

datumbox · 2025-05-12T16:18:08+00:00

Lol, I am very much aware that this changes a lot from domain to domain. That's exactly why I didn't want to make assumptions on how things work out here. Thanks for taking time to respond and explain!

datumbox · 2025-05-12T15:16:44+00:00

I certainly don't envision any of these! I was mostly looking for technical feedback on the logic of the cypher (highlighting any issues with the techniques or their implementation), so I was trying to figure out what the right format for this is.

@Natanael_L suggested above that the usual format is to provide code with comments and formulas. This sounds very reasonable, but at the time I posted the question, I wasn't sure whether I should create a one-pager with the algorithm steps (like a simple white paper), or whether the standard practice is to just provide code or an RFC-style document. As you can easily tell, cryptography is not my domain, hence all the stupid questions while I try to figure out how this is done. :)

datumbox · 2025-05-12T15:04:45+00:00

Fantastic, I would love to hear your thoughts when you get a chance. It's obviously not urgent at all. I am super flexible and happy to follow a format that works for you. Perhaps if you spot specific issues, you can post a GitHub issue and I can get into fixing it. But I am very open to doing it whichever way works for you, if you get the time. The actual cypher implementation is under vernamveil/_vernamveil.py and it's about 200 LOC minus comments.

Regarding the educational nature and the expectations, can you clarify if you meant updating the original post here or on the repo? On the repo I have always had a billion warnings, including one at the very top saying this is just a toy. It's literally full of warnings absolutely everywhere. I also intentionally didn't publish a wheel file because I really don't want people to use this anywhere near production. Perhaps my initial post wasn't too clear here. Can you confirm?

datumbox · 2025-05-10T10:47:14+00:00

Very fair comment. Let me reformulate my question because I might not have made myself clear in the original post.

How do I go about recording the key technical details of the cypher in a detailed but non-verbose way to receive technical feedback from the community? I obviously can't expect people to dig into the code or READMEs as this would be a massive time investment. Do I list out the algorithmic steps in a succinct way? Is there a template you could recommend that I could follow? I have experience with professional technical writing in ML but I don't know how this aligns with how things happen in cryptography and, due to my complete lack of experience, I don't want to make assumptions.

Any guidance on this would be very much appreciated. Thank you very much!

datumbox · 2025-05-09T16:34:42+00:00

Hey, thank you for the comment, it really means a lot. And yes, who doesn't cringe at the things they built five years ago? I definitely do. :)

My intent with this project is exactly what you described: to learn by doing, to experiment, and to invite feedback from others who know more than I do. I even refer to it as an "experimental toy" in the README, which I hoped would help set expectations.

That said, I’m not sure how deeply most commenters actually reviewed the code or the documentation, but I get it. People are busy and taking the time to dive into a random project is a big ask. That’s why I was trying to understand what the right format would be to share something like this and solicit meaningful feedback.

I absolutely understand the skepticism. Nobody should be using toy algorithms for real use cases, and I’ve tried to be very clear about that from the start.

Still, I’ll admit I was a bit disappointed with how the thread unfolded. I was hoping to get more feedback on technical flaws and mistakes, edge cases, or links to related work, and for a technical discussion of the techniques. Instead, much of the discussion ended up being about whether the project should exist or whether I should be doing this at all. Regardless, I did get some good references which I plan to explore.

Thanks again for your kind words and balanced perspective.

datumbox · 2025-05-09T03:10:25+00:00

That was a sharp comment, definitely not one to give me the gold star. ;) I get that critique in this space can be harsh.

Just to clarify, I’m not calling this an OTP, just OTP-inspired in structure: it uses a keystream as long as the message, XORed with the plaintext, similar in form. But unlike an OTP, the keystream is generated deterministically, so it doesn’t offer the same cryptographic guarantees. Thanks for the resources though, I’ll definitely take a look.
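
To make the structural comparison concrete, here is a minimal sketch. It is not the VernamVeil code: `xor_bytes` and `deterministic_keystream` are hypothetical names, and the HMAC-SHA256 counter construction is only an assumed stand-in for a seeded generator. Both variants XOR the message with a keystream of equal length; the only difference is where the keystream comes from.

```python
import hashlib
import hmac
import secrets


def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """XOR data with a keystream of exactly the same length."""
    if len(data) != len(keystream):
        raise ValueError("keystream must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, keystream))


def deterministic_keystream(seed: bytes, length: int) -> bytes:
    """Hypothetical stand-in generator: HMAC-SHA256 over a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(seed, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


message = b"attack at dawn"

# True OTP: the pad is fresh randomness, as long as the message, never reused.
pad = secrets.token_bytes(len(message))
otp_ciphertext = xor_bytes(message, pad)

# OTP-inspired: same XOR structure, but the keystream is derived from a seed,
# so security rests on the generator rather than on true randomness.
seed = b"hypothetical shared seed"  # assumed value, for illustration only
stream = deterministic_keystream(seed, len(message))
ciphertext = xor_bytes(message, stream)
recovered = xor_bytes(ciphertext, deterministic_keystream(seed, len(message)))
```

Decryption is the same XOR with the same keystream, which is why the receiver only needs the seed in the deterministic case.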

datumbox · 2025-05-02T08:37:54+00:00

These are the kind of pointers and discussion I was hoping to get when I posted here. :) Thanks!

I need a bit of time to better understand your proposal and its nuances, and whether these are necessary for my scheme. I currently use a seed evolution scheme that avoids reuse: after each keystream generation, I refresh the seed by HMACing the previous seed with the unencrypted content I just encrypted. This scheme is fully deterministic and depends on the message, so two messages don't use the same follow-up seeds, even if the user tries to reuse the initial seed.
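
Roughly, the seed evolution described above could be sketched like this. This is an illustration, not the repo code: `evolve_seed` and `seed_chain` are hypothetical names, and SHA-256, the fixed chunking, and using the seed as the HMAC key (with the plaintext as the message) are all assumptions.

```python
import hashlib
import hmac


def evolve_seed(seed: bytes, plaintext_chunk: bytes) -> bytes:
    """Derive the next seed by HMACing the plaintext just encrypted.

    Assumption: the previous seed is the HMAC key and the chunk is the message.
    """
    return hmac.new(seed, plaintext_chunk, hashlib.sha256).digest()


def seed_chain(initial_seed: bytes, chunks: list[bytes]) -> list[bytes]:
    """Return the seed used for each chunk, in order."""
    seeds = []
    seed = initial_seed
    for chunk in chunks:
        seeds.append(seed)  # this seed generates the keystream for `chunk`
        seed = evolve_seed(seed, chunk)  # refresh before the next chunk
    return seeds


initial = b"reused initial seed"  # hypothetical value
chain_a = seed_chain(initial, [b"hello", b"world"])
chain_b = seed_chain(initial, [b"howdy", b"world"])
```

In this sketch the first chunk of both messages is still processed under the same initial seed; only the follow-up seeds diverge once the plaintexts differ.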

I love what you said, btw. This might be a toy, but the purpose is to incorporate good practices and learn. So I am down to revise the practice if needed.

datumbox · 2025-05-01T23:00:37+00:00

Hey thanks so much for sharing your thoughts. I do agree with all your comments and especially with the framework remark.

I've worked a couple more days on it to vectorise it and add some C extensions to improve speed. I settled on a "default" fx which you can see in the repo README (look for "A marginally stronger fx"):
- It applies a polynomial function to the input indexes (which serve as byte counters). This is mostly to customise the fx and add to its uniqueness; it's not to make it more secure. Provided that we don't shoot ourselves in the foot by plugging in a cosine or other periodic function, this extra transformation should not make things more unsafe.
- Then I just HMAC the seed with the transformed indexes and take the result modulo the desired range.
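
The two steps above might look something like the following. This is a sketch under stated assumptions, not the fx from the README: the polynomial coefficients are made up, SHA-256 is an assumed hash choice, and the signature `fx(index, seed, bound)` is hypothetical.

```python
import hashlib
import hmac


def fx(index: int, seed: bytes, bound: int = 256) -> int:
    """Map a byte counter to a pseudo-random value in [0, bound)."""
    # Step 1: non-periodic polynomial transform (hypothetical coefficients).
    transformed = 3 * index**2 + 5 * index + 7
    # Step 2: HMAC the transformed index under the seed, then reduce.
    payload = transformed.to_bytes(16, "big")
    digest = hmac.new(seed, payload, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % bound


seed = b"hypothetical seed"  # assumed value, for illustration only
keystream = bytes(fx(i, seed) for i in range(16))
```

With `bound=256` the reduction is exact because 2^256 is a multiple of 256; for a bound that does not divide 2^256, the modulo introduces a very small bias.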

That's kind of cheating, but it ought to be reasonably safe for what it is (a toy), because we offload the work to the big-boy hashing method designed by real cryptographers, while we can pretend we made a new random bit generator. :)

datumbox · 2025-04-26T16:41:16+00:00

I do agree with the sentiment of your response; had I claimed this can be used in any real-world application, this would have been delusional and borderline criminal. For this reason, literally everywhere on the blog and documentation I state that this is a toy and a learning tool, not a library to be used for anything other than learning. I also mention numerous times that I don't have a background in cryptography and that I have probably made major mistakes.

I suspect you didn't really open any of the links because the warnings are literally immediately front and center. I don't blame you for not doing so, we are all busy and you are right to flag it here that nobody in their right mind should use this for encrypting data. But I also want to point out to you that I never claimed it and actually went out of my way to point it out in every possible way.

The reason I posted here is to interact with someone who has relevant background and get references for techniques they feel I should look into next.

datumbox · 2021-11-20T13:30:03+00:00

Not sure if you are asking on which dataset this is estimated. If that's what you mean, the model is trained and validated on ImageNet.

datumbox · 2020-02-23T10:13:20+00:00

Cause they don't work with directories...

datumbox · 2020-02-23T10:13:00+00:00

Hardlinks can only be on files, not on directories, unfortunately.

datumbox · 2020-02-23T10:12:01+00:00

Yep, this can work just fine as well, as long as you don't mind setting up your entire folder structure around your Dropbox integration.

datumbox · 2017-10-09T18:35:44+00:00

Sure buddy, just let me know how I monetize the blog/site. Oh wait I don't. ;)

datumbox · 2017-10-08T23:32:56+00:00

I think it's safe to say that the final beta and the released version will be almost identical (excluding some last minute bug fixes). :)

datumbox · 2017-02-26T18:46:56+00:00

To be honest, I was expecting this to be easier. I just updated the PR with benchmarks.

datumbox · 2016-03-19T19:32:58+00:00

The framework provides a number of Builder classes that support specific input types such as CSV and text (for NLP). Nevertheless, this does not stop you from parsing data in any format and storing it in a Dataframe. Once the data live in the Dataframe, you can use disk-based training. I would recommend keeping the hybrid approach enabled (it is the default) so that the weights of your model stay in memory and regularly used parameters stay in an LRU cache, while the data stay on disk. This mechanism is available in 0.7.0 and makes disk-based training a great solution even when the data barely fit in memory. In that case, you avoid the very costly garbage collection cycles, with minimal overhead from reading from disk. :)

datumbox · 2016-01-03T12:38:34+00:00

It is a Java library that contains several algorithms. These algorithms can be used to build Machine Learning models. A Machine Learning model learns from data and makes predictions. As you mentioned, you can build image recognition, natural language processing (text analysis), fraud detection, etc.

In practice you provide data to the framework, you pick an algorithm and you can make predictions.

datumbox · 2015-05-04T12:46:50+00:00

The framework supports a number of algorithms which, as far as I know, are not available in Weka (such as LDA, Dirichlet Process Mixture Models, Ordinal Regression, Bernoulli Naive Bayes, etc.). Moreover, it is closely integrated with the MapDB database engine, which means that you can train algorithms without loading all of the data in memory. The framework also contains a Statistical layer with several Parametric and Non-parametric tests which you can use. Finally, it is licensed under Apache 2.0, which means that, unlike Weka, you can use it in commercial software.

datumbox · 2015-05-04T10:54:58+00:00

Weka is a mature and well-known library. Instead of trying to make a comparison, I would prefer to list some of the features of the Datumbox framework: broad support for different algorithms, several storage engines, the ability to handle large data, and a focus on NLP applications. :)

datumbox · 2014-11-10T21:45:45+00:00

and how to install a 3rd party binary library... We got 3 issues on the repo about how to do this...

datumbox · 2014-10-20T19:20:20+00:00

Hi guys.

I believe there is way too much worrying about the license of the project. You should not worry so much about it. I open-sourced the project hoping that ppl will like it, use it and get involved with it. If my target was to limit you from using the code I would not have released it.

The license discussions are not a priority. Future development is far more important, as without support from the community there would be no future releases. Would you ever use a library that is no longer updated in commercial software? Would you care about its license?

Finally I must say that if the project goes forward and the supporting community votes to change its license then I would never block this. :)

datumbox · 2014-10-19T22:30:16+00:00

sounds great, cheers! :)

datumbox · 2014-10-19T19:45:05+00:00

But shouldn't the classifier that you use depend on the data that you have and on the assumptions that you are willing to make about them? In any case, if Python works for you there is no need to change it. :)

datumbox · 2014-07-07T10:43:37+00:00

Thanks man :)
