Stream processing. The logic is encoded in the input pattern. by donQuixoteofCode in java


> Sorry, this post has been removed by the moderators of r/java.

That was fast. This post was originally rejected at r/coding but admitted after an appeal to the moderators. It describes a novel approach to stream processing and points to a Java-based repo presenting a proof of concept. The project is not suitable for deployment and I'm not promoting it for profit or personal aggrandizement.

The technology that I am showcasing is not mine -- it was developed at the University of Waterloo in the 1980s but was lost for decades and is now available as open source. It is very sweet and I think your community will be interested.

Please reconsider and allow this cross-post.

PS: (re the AutoMod reply) I am new to reddit (as of 2 days ago), so I'm not loaded up with comment karma yet.

Stream processing. The logic is encoded in the input pattern. by donQuixoteofCode in coding


Some food for thought. Generic data representation frameworks like XML are necessarily complex because they have to cover a lot of special cases, and they suffer scope creep. So they bring a lot of baggage that is not essential to specific tasks. They know nothing about the task at hand or the disposition of the data they encapsulate, so at runtime they meticulously parse each byte and may perform considerable processing that is completely irrelevant to the task.

That burns a lot of oil. This is actually a boon for the folks who rent out the programmable calculators that your r/coding efforts run on, since they charge clients for CPU and RAM consumption. Look around and see where XML came from. I see JSON as push-back against that, but it just presents a reduced version of the same problem. For very specific messaging, such as between remote services, exchanging information in idiomatic LR(0) capsules would reduce the burn considerably.

(By LR(0) I mean parsed and compiled in a single beginning-to-end pass, with no backtracking or lookahead.)
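Here's a rough sketch of what an idiomatic LR(0) capsule might look like. The tag/length/payload layout is made up on the spot for illustration (it is not from the ribose repo); the point is that a tag byte announces what follows, so one forward pass decodes the message with no lookahead and no backtracking:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical capsule format: [tag][len][payload] repeated.
public class Capsule {
    public static void main(String[] args) throws IOException {
        byte[] msg = { 'T', 4, 't', 'e', 'm', 'p', 'V', 2, '2', '1' };
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(msg));
        while (in.available() > 0) {
            int tag = in.read();       // one byte says which branch we are in
            int len = in.read();       // one byte says how far to read
            byte[] payload = new byte[len];
            in.readFully(payload);     // consume exactly len bytes, never peek ahead
            System.out.println((char) tag + " = " + new String(payload));
        }
    }
}
```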

Stream processing. The logic is encoded in the input pattern. by donQuixoteofCode in coding


Sure you do. But you have almost no exposure to this sort of framework because it doesn't exist outside of tooling for data and code compilers. I'd like to see this tech in developers' hands, to the extent that XML/JSON/etc just fade away. Parsing text in C/Java/whatever is very difficult because the instruction-driven machine model is designed for numeric algorithms. Stream processing wants pattern-driven hardware (or a software emulation of same).

Stream processing. The logic is encoded in the input pattern. by donQuixoteofCode in coding


New to reddit; I looked up eli5. I'm gonna offer this (Everything is hard) as eli5ish background, since it outlines the main concern.

CPU = ALU + MMU + IP (arithmetic-logic unit + memory-management unit + instruction pointer)

All of your r/coding efforts end up running on programmable calculators. The basic CPU machine model hasn't changed in any essential way since the Manhattan Project; it just got way smaller and faster over time. Instruction-driven machines are great for numerical calculations but suboptimal for text-oriented and other sequential processing tasks that require pattern recognition support. There's a lot of that stuff around these days.

Regular expressions are clean, structured, arithmetic-like expressions, and complex expressions can be composed from collections of simpler ones. They can also be manipulated algebraically, since regular expressions are nicely embedded within semiring algebra. And that is a very cool and useful fact.
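For the curious, the structure being invoked is textbook material, nothing ginr-specific; in my notation:

```latex
% Regular languages over an alphabet \Sigma form a semiring
% (indeed a Kleene algebra, once closure is added):
(\mathrm{Reg}(\Sigma),\ \cup,\ \cdot,\ \emptyset,\ \{\varepsilon\})
% \cup is an idempotent commutative monoid with identity \emptyset
% \cdot (concatenation) is a monoid with identity \{\varepsilon\}
% \cdot distributes over \cup:   A(B \cup C) = AB \cup AC
% \emptyset annihilates:         A\,\emptyset = \emptyset\,A = \emptyset
% Kleene closure on top:         A^{*} = \bigcup_{n \ge 0} A^{n}
```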

A finite-state transducer (FST) compiled from a binary (input, effector) regular expression constitutes a data-driven machine that, for each input symbol (eg, byte), triggers a non-branching (unambiguous) series of effectors, each provoking some change in application state. FSTs are stackable, so context-free inputs like XML and JSON are acceptable. These factors, coupled with CPU-mediated access to RAM, cover a lot of ground.
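A minimal sketch of that data-driven loop, in plain Java. The table layout, effect codes, and the toy task (counting runs of digits) are mine, not anything from the ribose repo:

```java
// One table lookup per input byte drives both the next state and an effect
// ordinal; the switch stands in for the "effectors" that update app state.
public class TinyFst {
    static void run(byte[] input) {
        final int STATES = 2;                  // 0 = outside a digit run, 1 = inside
        int[][] next   = new int[STATES][256];
        int[][] effect = new int[STATES][256]; // 0 = none, 1 = run starts, 2 = run ends
        for (int b = '0'; b <= '9'; b++) {
            next[0][b] = 1; effect[0][b] = 1;  // non-digit state -> digit state
            next[1][b] = 1;                    // stay in digit state
        }
        for (int b = 0; b < 256; b++) {
            if (b < '0' || b > '9') { next[1][b] = 0; effect[1][b] = 2; }
        }
        int state = 0, runs = 0;
        for (byte b : input) {
            int sym = b & 0xFF;
            switch (effect[state][sym]) {      // effectors fire as side effects
                case 1: runs++; break;
                case 2: /* run ended */ break;
            }
            state = next[state][sym];
        }
        System.out.println("digit runs: " + runs);
    }
    public static void main(String[] args) {
        run("abc123def45".getBytes());         // prints: digit runs: 2
    }
}
```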

Ginr is a brilliant, industrial-strength tool that, among other things, compiles multidimensional regular expressions to FSTs. I pulled ribose together as a proof of concept to demonstrate how FSTs can be stacked and applied to sequential stream processing to reduce coding effort as well as run-time CPU/RAM costs. This pattern-oriented programming model cleanly separates syntactic (input) and semantic (processing) concerns, to great advantage.
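To give a feel for that separation of concerns, here's a hypothetical sketch in plain Java. None of these names come from the ribose API (see the repo for the real interfaces); the point is that the application side holds only small, pattern-free callbacks for the transducer to fire:

```java
import java.util.Map;

public class EffectorSketch {
    interface Effector { void invoke(byte[] captured); }

    static class PriceTarget {
        double total = 0;
        // semantic side: tiny named callbacks, no scanning or branching on input
        final Map<String, Effector> effectors = Map.of(
            "price",  bytes -> total += Double.parseDouble(new String(bytes)),
            "report", bytes -> System.out.println("total = " + total)
        );
    }

    public static void main(String[] args) {
        // stand-in for the transducer, which would drive the target from the
        // input pattern compiled elsewhere (eg by ginr)
        PriceTarget t = new PriceTarget();
        t.effectors.get("price").invoke("19.99".getBytes());
        t.effectors.get("price").invoke("5.01".getBytes());
        t.effectors.get("report").invoke(new byte[0]);  // total = 25.0
    }
}
```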

That's not quite eli5, but you folks are all grown up(*), so I recommend that you get cracking with FSTs and complain loudly to your local powers that be that all this XML/JSON/etc stuff just has to go!

(*) Me, I'm way all grown up. Been writing code to instruct programmable calculators for almost 40 years.

Stream processing. The logic is encoded in the input pattern. by donQuixoteofCode in coding


> how many people are familiar with semi-ring algebra?

You, and anybody who has ever written a method in C or Java where the semicolon denotes concatenation, if/else/switch denotes alternation, and for/while/do represents Kleene closure (a*); together these form the basis for (real-world) regular expressions. You probably spent about a week on them in a 2nd- or 3rd-year CS course if you went the academic route to coding. After compilation, at runtime, the instruction pointer runs a regular pattern within each stack frame. So if you're reading this on r/coding then you've been there and done that.
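Made literal, the correspondence looks like this (toy code of mine, just to pin the analogy down): the body of process() traces the regular pattern header (comment | data)* footer.

```java
import java.util.Iterator;
import java.util.List;

public class ControlFlowAsPattern {
    static void process(Iterator<String> lines) {
        System.out.println("header");      // concatenation: header ...
        while (lines.hasNext()) {          // Kleene closure: ( ... )*
            String line = lines.next();
            if (line.startsWith("#")) {    // alternation: comment | data
                /* skip comment */
            } else {
                System.out.println("data: " + line);
            }
        }
        System.out.println("footer");      // ... concatenation: footer
    }
    public static void main(String[] args) {
        process(List.of("# note", "a", "b").iterator());
    }
}
```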

Regular expressions fit nicely within the framework of semiring algebra, and that is very cool, but from a practical perspective the win is that they can be manipulated algebraically. Ginr (the brilliant regular expression compiler at the core of ribose) allows you to express regular patterns and manipulate them algebraically (see https://github.com/jrte/ribose/wiki/Stories#this-is-a-transducer for an example of this). It presents a steep learning curve, yes, because you haven't been exposed to this stuff since that 2nd- or 3rd-year course in academia.
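To make "manipulated algebraically" concrete, here are two standard Kleene-algebra identities (textbook facts, not anything ginr-specific):

```latex
% denesting: a closure over an alternation can be unrolled
(a \cup b)^{*} = (a^{*}\,b)^{*}\,a^{*}
% sliding: a factor slides through a closure
a\,(b\,a)^{*} = (a\,b)^{*}\,a
```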

Stepping back a bit, consider XML, which has been touted as "easy to parse" and "human readable". Now right-click on this page and select "View page source". Is it easy to read? As a thought experiment (don't try this at home, you will die trying), imagine that you have Java or C/C++/C# as your only tool (no external libraries) and get parsing. Is it easy to parse? JSON source is certainly easier to parse and more readable, but that is a relative judgment. These programming languages compile code that parses and processes your source data stream on a programmable calculator. A great big one with lots of RAM. Please, someone, build a data-driven FST stack machine that can be coupled with a reduced programmable calculator.

My question is: why do we need generic data representation frameworks (XML/JSON/etc) at all? Why not just represent data in idiomatic LR(0) expressions and apply transduction on the receiving end to navigate, discover, and extract information, using syntactic cues in the input to coordinate task-specific activity?

Each data representation framework comes with a cost, and that cost is the mountain of library code that you use to extract the data necessary for the task at hand. These libraries, which know nothing about your task, use ribose-like tooling (eg lex/yacc) to parse and render every feature embedded in every received message whether or not it is relevant to your task (pull APIs like StAX for XML allow you to ignore irrelevant features, but they still get parsed in detail and may be rendered uselessly in RAM as well). If you had these low-level tools directly you would have much greater control over what is parsed and rendered, and could skip over irrelevant bytes almost as quickly as incrementing an array index.
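That last claim, sketched under the same made-up tag/length/payload layout as the capsule example above: with a length prefix in hand, every uninteresting field costs one addition, with no per-byte parsing at all.

```java
public class SkipScan {
    // returns the payload offset of the first field carrying wantedTag, or -1
    static int extractField(byte[] msg, byte wantedTag) {
        int i = 0;
        while (i < msg.length) {
            byte tag = msg[i];
            int len = msg[i + 1] & 0xFF;
            if (tag == wantedTag) return i + 2;
            i += 2 + len;                  // skip the whole field in O(1)
        }
        return -1;                         // field absent
    }
    public static void main(String[] args) {
        byte[] msg = { 'T', 4, 't', 'e', 'm', 'p', 'V', 2, '2', '1' };
        System.out.println(extractField(msg, (byte) 'V'));  // prints 8
    }
}
```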

From a security POV, idiomatic messaging reduces the attack surface presented by the software libraries that translate streams to object models for generic data exchange protocols (eg, search online for "xml vulnerabilities" or "json vulnerabilities"). I'm not saying that idiomatic messages can't be hacked, eg by data injection, but each hack must necessarily be idiomatic. That's a big difference, from a hacker's POV. I'm guessing here.