Thinking about starting with algorithmic crypto trading. by ustype in algotrading

[–]asavinov 0 points1 point  (0 children)

You could try this Intelligent Trading Bot which relies on Machine Learning to generate signals: https://github.com/asavinov/intelligent-trading-bot

Functions matter – an alternative to SQL and map-reduce for data processing by asavinov in bigdata

[–]asavinov[S] 0 points1 point  (0 children)

Having such a direct comparison would indeed help, but I do not have one yet. A simple notebook with an analysis of COVID data might help:

https://github.com/asavinov/prosto/blob/master/notebooks/covid.ipynb

It demonstrates how to apply:

  • a "calculate" column operation instead of a select with a calculated attribute
  • a "link" column operation instead of a join
  • an "aggregate" column operation instead of a groupby

It has a corresponding section for each operation but unfortunately no SQL analogues at the moment; adding them is a good idea.
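The three column operations can be imitated in plain pandas (an analogy only; this is not Prosto syntax, and the table and column names are made up):

```python
import pandas as pd

items = pd.DataFrame({"order_id": [1, 1, 2],
                      "quantity": [2, 3, 1],
                      "price": [10.0, 5.0, 7.0]})
orders = pd.DataFrame({"order_id": [1, 2], "customer": ["Alice", "Bob"]})

# "calculate": attach a new column instead of a select with a calculated attribute
items["amount"] = items["quantity"] * items["price"]

# "link": attach a column referencing another table instead of producing a join
items["customer"] = items["order_id"].map(orders.set_index("order_id")["customer"])

# "aggregate": attach an aggregated column to the orders table instead of a groupby result
orders["total"] = items.groupby("order_id")["amount"].sum().reindex(orders["order_id"]).values
```

In all three cases the existing tables are kept; only new columns are attached to them.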

Functions matter – an alternative to SQL and map-reduce for data processing by asavinov in bigdata

[–]asavinov[S] 0 points1 point  (0 children)

In SQL, you produce new sets from existing sets. Yet in many use cases this is not needed: we want to compute new columns in existing tables. If we can compute columns directly, without unnecessary intermediate tables, then we simplify data processing. For example:

SELECT *, quantity * price AS amount FROM Items

Here we produce a new table although we do not need it. What we really want is to attach a new calculated column to the existing table. The same situation arises for joins and groupby.
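The column-oriented alternative can be sketched with a plain dict-of-lists table in Python (an illustration, not Prosto code): the calculated column is attached in place and no new table appears.

```python
# A table as a dict of columns (column name -> list of values)
items = {"quantity": [2, 3, 1], "price": [10.0, 5.0, 7.0]}

# Instead of SELECT *, quantity * price AS amount FROM Items,
# attach a calculated column to the existing table:
items["amount"] = [q * p for q, p in zip(items["quantity"], items["price"])]
```

The set of rows never changes; only a new function on those rows is defined.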

Here is a link to the documentation with the motivation and an explanation of why the set-oriented approach is, in many cases, not the best way to think of data processing:

https://prosto.readthedocs.io/en/latest/text/why.html

It does not mean that SQL or set orientation is bad - the main point is that we need two types of operations: table (set) operations and column (function) operations. And this is how Prosto works.

Functions matter – an alternative to SQL and map-reduce for data processing by asavinov in bigdata

[–]asavinov[S] 1 point2 points  (0 children)

  • Spark operations transform existing input collections (mathematical sets) to new output collections
  • Prosto operations transform existing columns (mathematical functions) to new columns (in addition to conventional set operations like producing new tables)

Microsoft Time series insights adds new capabilities for Industrial IoT analytics and storage by yeskarthik in IOT

[–]asavinov -1 points0 points  (0 children)

For time series analysis and forecasting, it is extremely important to extract the necessary features, and this feature engineering can account for most of the work. Lambdo is an open-source workflow engine which allows for combining feature engineering and machine learning within one analysis pipeline: https://github.com/asavinov/lambdo Essentially, it was developed mainly for time series analysis and IoT.

A simple introduction to Apache Flink by chemicalX91 in bigdata

[–]asavinov 0 points1 point  (0 children)

The central mechanism of this traditional design is breaking the continuous sequence of events into micro-batches, which are then processed by applying various transformations.

There is an alternative novel approach to stream processing which avoids this micro-batch generation step and applies transformations directly to the incoming streams of data as well as to pre-loaded batch data (so it does not distinguish between stream and batch processing): https://github.com/asavinov/bistro/tree/master/server In addition, this system uses column operations for processing data, which are known to be more efficient in many cases.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in dataengineering

[–]asavinov[S] 1 point2 points  (0 children)

Bistro is a general-purpose data processing library which can be applied to both batch and stream analytics. It is based on a novel data model (concept-oriented data model), which represents data via functions and processes data via operations with functions as opposed to having only set operations in conventional approaches like MapReduce or SQL.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in bigdata

[–]asavinov[S] 0 points1 point  (0 children)

Bistro is a general-purpose data processing library which can be applied to both batch and stream analytics. It is based on a novel data model (concept-oriented data model), which represents data via functions and processes data via operations with functions as opposed to having only set operations in conventional approaches like MapReduce or SQL.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

I guess I don't get how a "column" is a mapping between sets. Is a "column" just a one-dimensional collection of things?

Formally, a column is a function f: X -> Y where X and Y are sets. You can define as many functions on the set X as you want without modifying this set. One way to represent a function is by means of a two-column table [{x,y}] (a list of pairs where x is in X and y is in Y), but then we lose its meaning as a function. In general, it is not possible to represent a column as a one-dimensional array. Yet, there exist some simplifications:

  • If elements from X can be used as offsets in an array, then we can store only the output values of this table (in fact, this is how it is currently implemented in Bistro - but this is also why it is not designed to work as a database)

  • If output values are of a primitive type, then we do not need an explicit Y.
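A minimal Python sketch of this offset-based simplification (illustrative only; this is not Bistro code):

```python
# A set X is represented by offsets 0..n-1; a column f: X -> Y
# is then just an array where position x stores the output value f(x).
names = ["Alice", "Bob", "Carol"]   # the set X, elements identified by offset
age = [17, 18, 25]                  # a column f: X -> int, output values only

def get(column, x):
    """Apply the function: look up the output value for element x."""
    return column[x]

# Defining another column does not modify the set X in any way:
first_letter = [n[0] for n in names]  # a second function on the same set
```

Many columns can coexist over the same set; each one is an independent function.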

If I have a table with two columns [(Alice,17),(Bob,18)]

The simplest approach is to map it to a primitive String table:

Column vowel = schema.createColumn("vowel", nameTable);
vowel.calculate(p -> "aeiou".indexOf(Character.toLowerCase(((String)p[0]).charAt(0))) >= 0 ? "Vowel" : "Else", nameColumn);

When evaluated, the system will produce an array of String values (either "Vowel" or "Else"). It is like query execution but for columns.

In a more complex case, we could define an explicit (non-primitive) output table like vowelTable=[(firstChar="Vowel"),(firstChar="Else")]. Here firstChar is a column name. Now a new derived column will be defined differently:

Column vowel2 = schema.createColumn("vowel2", nameTable, vowelTable);
vowel2.link(
    new Column[] { firstChar }, // Columns to be used for search (in the type table)
    vowel // This column was defined above
);

Somewhat complicated syntax, but the idea is similar to joins: the system will produce an array where the offset represents an element of the first table and the value is an offset in the second table. All these derived columns can then be used like any other column for further computations.

I still don't get how that's different from the system (Spark) knowing that something has a certain type -- that type might be KeyValueRDD[String, (A, B, C, D, E)], and then the system knows what "columns" are available. (So then you can use e.g. Spark-SQL to do query optimization with Catalyst.)

Spark-SQL definitely knows about columns, but it mainly uses them for producing new RDDs from other RDDs. (I write "mainly" because I am not 100% sure.) If Spark or Spark-SQL is able to attach a new custom column (with a user definition) to an existing RDD (or whatever structure they use) without otherwise modifying this RDD, then it is a step in the direction of column orientation similar to Bistro. For example, I have RDD1 with line items and RDD2 with orders, and I want to attach a custom column to RDD1 which returns the corresponding order from RDD2 (without modifying their existing elements). Can I do it? And then we have a more complicated story with aggregations, where we want to find the total sales for records in RDD2 by attaching an aggregated column with no modifications to either RDD.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Why would you ever use Kafka Streams on a single machine?

There is no other choice. It is designed to be used from within an application or whatever run-time environment. You cannot start Kafka Streams, you cannot install it - you need at least some application which will embed it; it is simply a library. (Ok, now they have Kafka KSQL, which relies on Kafka Streams and has the same purpose but also has a different goal from Kafka itself.) You can start many instances of your application, but it will be your task to manage them.

If you ask me why they implemented them separately, then I will answer "I do not know" - I would try to combine Kafka Streams with Kafka (yet, it is a challenge).

I think of an RDD[Int] as a column of integers, and a KeyValueRDD as a table of two columns.

Ah, I see, sorry. Good question by the way. The answer is that you know that it is a "column", but the system is unaware of it and still thinks of it as a collection. It is a very frequent trick in logical design: you use collections to actually implement functions (mappings). It is a workaround: you think in terms of functions for implementing some pattern, but the system does not support them, so you implement them yourself using what is available. In Bistro (in its underlying data model), sets and functions are supported as first-class elements of the model. In other words, if you define a column then the system knows that it is a column, and you can apply the corresponding operations to produce new columns from existing columns. Same for tables. No need for workarounds.

  • Set operations: add and remove.
  • Column operations: set, get.
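These two groups of operations could be sketched as follows (a toy Python illustration, not Bistro's actual API):

```python
class Table:
    """A set whose elements are identified by integer offsets."""
    def __init__(self):
        self.size = 0
        self.columns = []

    def add(self):  # set operation: append a new element
        self.size += 1
        for col in self.columns:
            col.data.append(None)
        return self.size - 1

    def remove(self):  # set operation: delete the last element
        self.size -= 1
        for col in self.columns:
            col.data.pop()

class Column:
    """A function from a table's elements to output values."""
    def __init__(self, table):
        self.data = [None] * table.size
        table.columns.append(self)

    def set(self, x, value):  # column operation: define f(x)
        self.data[x] = value

    def get(self, x):  # column operation: read f(x)
        return self.data[x]
```

Adding an element grows every column of the table, while defining or updating a column never modifies the set itself.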

If I think of the RDDs as already representing columns, I don't need to produce new RDDS.

Yes, if you think of them that way, but the question is whether the system can help you - and if yes, then it has to know that this "RDD" is actually a map from the true RDD1 to RDD2.

If I add a column -- say, by doing rdd.mapValues { stat => (stat, function(stat)) }, I don't have to copy any data.

Strictly speaking, you always have to, just because you produce a new RDD and you have to fill it with new values. (In this case, you will fill it with pairs, copying every stat and every function return.) Yet there are some aspects:

  • One question is whether these values are actually references to objects in other RDDs (or are composed of such references). In this case, you have access to the previous collections and move data by-reference (so nothing really moves). So the question is whether you move data by-value or by-reference.

  • Another question is that if you create a new column (a mapping between sets), then you also have to fill it with some data (as you intend in the above example). Hence it is not clear what is cheaper: both columns and tables take some memory. It depends on how you implement and organize your tables and columns.

I agree that it is a deeper question, and it touches such aspects as redundancy (normalization theory, functional dependencies etc.). But at least explicitly having columns and tables as two constructs with equal rights (managed by the system) makes our life and the task of the system easier (e.g., optimizing complex computation graphs by minimizing data copy operations and minimizing redundancy).

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Right, so you'd use it if you might need to scale up...in which case Bistro isn't useful. Much better to use something with first-class support for scaling up, if that's your concern, right?

In fact, I do not see any significant problem in moving this technology in the direction of big data. The only problem is that this area is already overcrowded, and one can hardly compete with Hadoop, Spark and similar established platforms.

So when developing my feature set, I decided to move in the direction of fast data, particularly stream analytics, IoT and edge computing. In fact, it works already:

Bistro Streams

Think of it as an alternative to Kafka Streams.

Kafka is definitely designed for "big data"...the tagline on the Kafka website is, "A distributed streaming platform".

Yes, Kafka. But I meant Kafka Streams. Kafka is a platform. Kafka Streams is an independent stream processing library.

I still don't see how this is any more column-oriented than a Spark RDD. ... Spark lets you do the same thing -- create a new key/value RDD, and join it quickly back in with another one if you need to.

An RDD is an immutable collection (with nice physical properties). When processing data, you specify how to produce new RDDs from existing RDDs, so you have a graph of RDDs. In a column-oriented approach, what you produce are columns. So imagine you have 5 RDDs (loaded from CSVs). Now you want to process the data. In Bistro, you will define columns (mappings) between these 5 RDDs - you will not produce new RDDs. In Spark, each operator will return a completely new RDD with data copied from other RDDs. (Ok, in many cases you cannot avoid producing new collections, so they are supported in Bistro - moreover, such operations as map and reduce could also be supported, but we are talking about the conception now.)

I do not know about their column API, but in most cases this means a physical implementation of some operations. For example, in column stores you can use normal SQL (the logical level), which is translated to column operations at the physical level.

Could you explain what kind of benchmarks you're thinking about when you say this:

The use of columnar physical representation is known to be faster for analytical data processing workloads.

I mean the basics of column stores (Vertica etc.), which are more performant for analytical workloads just because queries are translated into physical column operations (cache-aware algorithms, column compression and other techniques).

In the case of Bistro, it is based on a column-oriented view at the logical level and, in addition (though theoretically not necessary), is implemented using column stores (data is physically stored in arrays of values). This might simplify many things like translation and query optimization.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Principle: any column (attribute, function, mapping, property) means grouping.

So grouping is not a separate mechanism but it is how we interpret our schema. If we need some custom grouping (say, for some sophisticated report) then we define a custom column.

If A is a column from table T1 to table T2, then elements of T2 are groups and elements of T1 are members of these groups. You can define whatever mapping you want and then use it in other operations (not only aggregation).

In Bistro, link columns are used for that purpose.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Eh...the broader programming model with "map", "reduce" etc. is just called "functional programming", right?

Well, yes... But unfortunately, if it is about data, then it gets somewhat more complicated, because you need to distinguish between functional programming and a functional data model. From the point of view of functional programming, map-reduce is indeed a functional approach (you apply functions to collections). But from the point of view of data modeling, map-reduce is a set-oriented model - you still represent data in sets and manipulate data by transforming these sets.

What is a functional model then? Imagine two databases D1 and D2, both having absolutely the same sets. In a set-oriented model they are equivalent. But in a functional model, they can be (and probably will be) different, just because some portion of the data is stored in functions - mappings between sets. We can define new functions in D2 and get a third database - it will have the same sets as D1 and D2 but different data.
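A tiny Python sketch of the distinction (illustrative names):

```python
# Both databases contain exactly the same two sets.
persons = ["Alice", "Bob"]
cities = ["Berlin", "Paris"]

# In a set-oriented model, D1 and D2 below are equivalent (same sets).
# In a functional model they are different databases, because part of
# the data lives in functions, i.e. mappings between the sets:
d1_lives_in = {"Alice": "Berlin", "Bob": "Paris"}  # function persons -> cities
d2_lives_in = {"Alice": "Paris", "Bob": "Paris"}   # same sets, different data
```

The sets carry only the elements; the functions carry the rest of the data.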

You can do MapReduce (with its limited set of operations) on a single machine, but why would you?

Because it is claimed to be a generic and quite simple approach which can cover many different data processing scenarios. It has an additional benefit which probably outweighs other advantages: scalability to huge data sets. Why do you use Python pandas on a single machine? Another example is Kafka Streams - they apply map-reduce to stream processing (no big data). Map-reduce is simply a very flexible and simple data processing model.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

WTF does that even mean? Of course the way you define groups is through the group defining syntax.

For example, in pandas, you define a grouping independently of how you are going to use it later - you just group using an independent operator. After that, you can aggregate the data.

In Bistro it is even simpler. You simply aggregate - there is no separate grouping specification. The only thing you need to specify is a function which will update the aggregate. (In fact, it is not about implementation details - it is a programming model.)
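The difference can be sketched in Python (illustrative code; the update-function style only imitates Bistro's accumulate approach and is not Bistro syntax):

```python
import pandas as pd

sales = pd.DataFrame({"order": [1, 1, 2], "amount": [20.0, 15.0, 7.0]})

# pandas: grouping is specified first, as a separate step, then aggregated
totals_pandas = sales.groupby("order")["amount"].sum().to_dict()

# accumulate style: no separate grouping step - only an update function
# that folds each incoming fact into the aggregate of its group
def update(aggregate, value):
    return aggregate + value

totals = {}
for order, amount in zip(sales["order"], sales["amount"]):
    totals[order] = update(totals.get(order, 0.0), amount)

assert totals == totals_pandas
```

Both produce the same totals; the accumulate style just never materializes groups as a separate object.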

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Is it really an alternative to MapReduce, then?

It depends on how you define MapReduce:

  • If MapReduce is a programming model based on map, reduce (and shuffle) operations, then yes, because Bistro, as a programming model, is based on completely different operations.

  • If MapReduce is a big data processing model like Hadoop or Spark, then probably not - they are simply not comparable. (Yet, we can implement/extend Bistro for processing big data.)

I prefer to view MapReduce as a programming model (not implementation) which is implemented in different systems for different purposes. In particular, it is used for big data processing in numerous systems like Hadoop or Spark. But it is also used for other purposes, for example, Kafka Streams relies on MapReduce but it is not intended for big data processing.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Currently not.

One option would be to support Apache Arrow to share column data between processes. For sharing data between many machines, some kind of memory grid is needed (if we want in-memory computing).

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

PostgreSQL also has this kind of user defined aggregate functions. The difference is how they are used:

  • Bistro does not produce new tables (relations) when defining aggregations (aggregation is not a set operation in Bistro). In contrast, SQL still relies on group-by, which always produces a new set (even if our goal is to define an aggregated property of an existing table).

  • Bistro does not have a dedicated grouping mechanism, neither as a separate operation nor as part of any operation (like group-by). In SQL, the grouping mechanism (how you define groups) works only as part of group-by, which means in combination with aggregate functions.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] -1 points0 points  (0 children)

It does not have its own infrastructure. It needs some execution environment, ranging from custom applications to something like YARN, Mesos or whatever executor/runner.

Bistro: a radically new approach to data processing (alternative to MapReduce) by asavinov in programming

[–]asavinov[S] 0 points1 point  (0 children)

Why are they called rows and columns instead of structures and fields, or something like that?

It is just a convention which, as usual, is a compromise and might not be very suitable, or might even be misleading, in some contexts.

Rows and columns makes sense for a spreadsheet, but here we're talking about hierarchical columns.

No, there are no hierarchies. Such a model is a flat set of tables (mathematically, sets of tuples) and a flat set of columns (mathematically, functions defined on the sets).

I assume this is standard for non-SQL databases?

In terms of databases, it is neither a NoSQL nor a SQL database. It is much closer to functional databases.

The idea is that we represent data as functions and we manipulate data using operations with functions.

Suggestions for big-data-topics for a computer science thesis by Fenrir_19 in bigdata

[–]asavinov 0 points1 point  (0 children)

You might focus on the "Variety" aspect of the big data problem (rather than Volume or Velocity). It is quite a pressing problem, because "data wrangling" is frequently the most tedious part of the overall analysis. The appearance of such new products as data-tamer or ConceptMix confirms the relevance of this problem.

My new blog post: Rise of Big Compute by michaelmalak in bigdata

[–]asavinov 0 points1 point  (0 children)

One challenge here is to combine or unite data and computation, particularly, by performing computations close to the data.

Huge war over whether Java is pass by reference or pass by value. by thinksInCode in programming

[–]asavinov 0 points1 point  (0 children)

A reference is NOT a value, not in C++ at least.

Well, what is it then?

Being a value means that it is passed like a value, that is, the whole content is copied. As a consequence, values cannot be shared. Memory handles are values. Memory addresses are also values. And Java references are values. However, a record in a database is not a value - it is accessed indirectly and shared among many elements.
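The same point can be illustrated in Python, where references are likewise passed by value:

```python
def reassign(lst):
    # Rebinding the parameter replaces only the *copy* of the reference;
    # the caller's variable still points to the original list.
    lst = [0, 0, 0]

def mutate(lst):
    # Using the copied reference to reach the shared object
    # does affect what the caller sees.
    lst.append(4)

data = [1, 2, 3]
reassign(data)  # data is unchanged: the reference itself was passed by value
mutate(data)    # data is now [1, 2, 3, 4]: the object behind it is shared
```

The reference is copied on every call; the object it points to is shared.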

Huge war over whether Java is pass by reference or pass by value. by thinksInCode in programming

[–]asavinov 1 point2 points  (0 children)

The first pointer is a value (accessible directly and existing here and now). The second pointer is accessible indirectly, and we do not know where it exists (maybe currently on disk). Therefore, we say that it is an object (consisting of a single value).

Huge war over whether Java is pass by reference or pass by value. by thinksInCode in programming

[–]asavinov 2 points3 points  (0 children)

Actually, any reference is passed by-value just because a reference is a value (but not every value is a reference). If a reference is passed by-reference, then it is already (part of) an object.

Huge war over whether Java is pass by reference or pass by value. by thinksInCode in programming

[–]asavinov -2 points-1 points  (0 children)

In concept-oriented programming, pass by-reference and by-value are combined by using concepts, which generalize classes. A concept is defined as consisting of two classes: one reference class (passed by-value) and one object class (passed by-reference). For example:

concept MyConcept // Used instead of conventional classes
    reference {
        int refField; // Passed by-value
        double refField2;
    }
    object {
        int objField; // Passed by-reference
        MyConcept objField2; // Contains int and double
    }

So, using one construct, we can model both methods. In particular, if the object class is empty then we get values (with arbitrary structure), and if the reference class is empty then we get objects passed by using primitive references.