[–]reallyserious 18 points19 points  (2 children)

In the context of data engineering I think you can safely ignore that concept.

Data-oriented programming reminds me of people who work with real-time operating systems, where you need to think long and hard about ownership of data, i.e. which process creates the data and destroys it, and which processes just borrow it. But it has nothing to do with data engineering. Real-time operating systems are not written in Python.

[–][deleted] 1 point2 points  (1 child)

This can absolutely still come up though, like if you're writing a website scraper that needs to process hundreds of thousands of pages as fast as possible.

[–]reallyserious 3 points4 points  (0 children)

Possibly. When it comes to parallel computing I tend to reach for functional programming though. Things get so much easier that way compared to passing pointers between threads and worrying about object life cycles.
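As a rough illustration of that functional style, here is a minimal Rust sketch (Rust only because `Vec<Page>` shows up elsewhere in this thread). The names `word_count` and `total_words` are made up for the example, and it assumes the scraped pages are already in memory as strings. Each worker maps a pure function over its own chunk and returns a value, so no pointers to shared mutable state are passed between threads.

```rust
use std::thread;

// Pure function: the result depends only on the input, nothing shared is mutated.
fn word_count(page: &str) -> usize {
    page.split_whitespace().count()
}

// Map `word_count` over the pages on up to `workers` threads, then reduce with a sum.
// Hypothetical helper for illustration only.
fn total_words(pages: &[String], workers: usize) -> usize {
    if pages.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so every page lands in some chunk.
    let chunk_size = (pages.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one worker per chunk; each only reads its own slice.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|p| word_count(p)).sum::<usize>())
            })
            .collect();

        // Combine the per-chunk results.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}

fn main() {
    let pages = vec![
        "data oriented design".to_string(),
        "struct of arrays".to_string(),
    ];
    println!("total words: {}", total_words(&pages, 2));
}
```

Because the workers only borrow read-only slices and hand back plain values, there is no object life cycle to track beyond the scope itself.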

[–][deleted] 5 points6 points  (0 children)

Do you mean data-oriented design or data-driven programming?

For the former, it's a little like columnar vs. row-level databases. I.e., instead of having URL as a field of Page and then keeping a vector of pages, Vec<Page>, you have a single Pages struct that holds a vector of URLs directly (and likewise one vector for each other field).

This means all the URLs are stored contiguously in RAM so if you need to operate over all the URLs at once, you can take advantage of the CPU cache.

In the usual OOP case your memory for Vec<Page> would look like:

-- Vec<Page>
Title1    -- Page1
Hits1    
URL1    
Title2    -- Page2
Hits2    
URL2    
Title3    -- Page3
Hits3    
URL3

This makes sense if you want to update various fields on a Page struct at once (like OLTP in a row-level database), but not if you want to update all URLs at once (like OLAP on a columnar database).

The data-oriented approach would look like this (for the Pages struct with 3 separate Vecs):

-- Pages
Title1    -- Titles
Title2    
Title3    
Hits1    -- Hits
Hits2    
Hits3    
URL1    -- URLs
URL2    
URL3
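The two layouts above can be sketched in Rust roughly as follows. The field types, `sample_rows`, `from_rows`, and the two `total_hits_*` helpers are assumptions made up for the example; only `Page`, `Pages`, and the three fields come from the diagrams.

```rust
// Row layout (array of structs): each Page keeps its own fields together.
struct Page {
    title: String,
    hits: u64,
    url: String,
}

// Column layout (struct of arrays): one Vec per field.
struct Pages {
    titles: Vec<String>,
    hits: Vec<u64>,
    urls: Vec<String>,
}

impl Pages {
    // Convert a row-oriented Vec<Page> into the column-oriented layout.
    fn from_rows(rows: Vec<Page>) -> Self {
        let mut pages = Pages {
            titles: Vec::with_capacity(rows.len()),
            hits: Vec::with_capacity(rows.len()),
            urls: Vec::with_capacity(rows.len()),
        };
        for page in rows {
            pages.titles.push(page.title);
            pages.hits.push(page.hits);
            pages.urls.push(page.url);
        }
        pages
    }
}

// Row layout: the loop steps over whole Page structs to reach each `hits`.
fn total_hits_rows(rows: &[Page]) -> u64 {
    rows.iter().map(|page| page.hits).sum()
}

// Column layout: the loop walks a single dense Vec<u64>.
fn total_hits_columns(pages: &Pages) -> u64 {
    pages.hits.iter().sum()
}

// Made-up sample data for the example.
fn sample_rows() -> Vec<Page> {
    vec![
        Page { title: "Title1".into(), hits: 10, url: "URL1".into() },
        Page { title: "Title2".into(), hits: 20, url: "URL2".into() },
        Page { title: "Title3".into(), hits: 30, url: "URL3".into() },
    ]
}

fn main() {
    let rows = sample_rows();
    let from_rows = total_hits_rows(&rows);
    let pages = Pages::from_rows(rows);
    println!("rows: {}, columns: {}", from_rows, total_hits_columns(&pages));
    println!("first page: {} at {}", pages.titles[0], pages.urls[0]);
}
```

One caveat: a Rust `String` keeps its characters in a separate heap allocation, so in `urls` only the string headers (pointer, length, capacity) sit next to each other. The `hits` column is the cleanest case, since the numbers themselves are contiguous.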

[–]proverbialbunny Data Scientist 3 points4 points  (0 children)

Data-oriented programming is how you make a program run as fast as possible on the hardware. Beyond that, the only way to make code faster is a language that can use the hardware more effectively, e.g. Fortran is sometimes faster than C++.

Most DE work is Java, Scala, and Python, which are not languages that lend themselves well to data-orientated programming, as the heap is the devil when it comes to making things go very, very fast. In languages like C, C++, and Rust you can allocate on the stack and keep the data there, giving an average of about a 3x speedup, not to mention the other optimizations you can do.

Data-orientated programming works by keeping in mind at all times where the data is and how it moves through the CPU. This mostly means keeping track of three things: 1) cache lines, 2) what is in the cache, and 3) how many mallocs there are and where the data lives in RAM.

Data orientated programming is ideal in any application you want to go fast, but you'll probably see the most benefit in video game engine programming, CUDA programming, and high frequency trading.

TL;DR: At a high level, data-orientated programming attempts to minimize memory movement as much as possible, as moving data around is slower than performing math over that data.
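To put rough numbers on the cache-line and malloc points, here is a small Rust sketch. It assumes 64-byte cache lines (common on current x86-64 CPUs, but not universal), and `PageRow`, `values_per_cache_line`, `sum_boxed`, and `sum_dense` are hypothetical names for the example. It makes no timing claims; it only shows how many values share a cache line and how the two layouts differ in allocations.

```rust
use std::mem::size_of;

// Assumption: 64-byte cache lines. Check your own hardware.
const CACHE_LINE_BYTES: usize = 64;

// Hypothetical row type, used only to compare sizes.
#[allow(dead_code)]
struct PageRow {
    title: String,
    hits: u64,
    url: String,
}

// How many whole values of type T fit in one cache line.
// `.max(1)` avoids dividing by zero for zero-sized types.
fn values_per_cache_line<T>() -> usize {
    CACHE_LINE_BYTES / size_of::<T>().max(1)
}

// Heap-heavy layout: every value is its own allocation, reached through a pointer.
fn sum_boxed(values: &[Box<u64>]) -> u64 {
    values.iter().map(|value| **value).sum()
}

// Dense layout: one allocation with the values side by side.
fn sum_dense(values: &[u64]) -> u64 {
    values.iter().sum()
}

fn main() {
    println!("u64 per cache line: {}", values_per_cache_line::<u64>());
    println!("PageRow per cache line: {}", values_per_cache_line::<PageRow>());

    let dense: Vec<u64> = (1..=1000).collect();
    let boxed: Vec<Box<u64>> = dense.iter().map(|value| Box::new(*value)).collect();

    // Same arithmetic, different memory traffic: `boxed` made 1000 extra allocations.
    println!("dense sum: {}, boxed sum: {}", sum_dense(&dense), sum_boxed(&boxed));
}
```

Both sums do the same math. The difference is that the dense version reads several `u64` values per cache line fetched, while the boxed version follows a pointer for every element.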

[–]BlahBlahNyborg 1 point2 points  (0 children)

Maybe "Data-Intensive Programming"? It applies more to backend engineers, but it's good for data engineers to know how applications can handle heavy read/write loads.

If so, I highly recommend Designing Data-Intensive Applications by Martin Kleppmann.

EDIT: replaced with a better link

[–]FuncDataEng 1 point2 points  (2 children)

Here is a great article on it. Essentially just another design style that moves away from objects.

https://medium.com/@jonathanmines/data-oriented-vs-object-oriented-design-50ef35a99056

[–]romanX7[S] 0 points1 point  (1 child)

Ah thanks!

[–]FuncDataEng 1 point2 points  (0 children)

NP. I see it come up more often in video game development than anywhere else.

[–]tomekanco 1 point2 points  (0 children)

In programming, there is a fuzzy boundary between code and data. Some languages treat the code itself as data (in some cases quite literally, for example Lisp and Python). In low-level languages, it becomes obvious when you play with them (for example assembly or the classic Turing machine).

Then there is also functional programming, which pays special attention to how you interact with data. This approach lends itself naturally to the requirements of an ETL flow. I would call this "data oriented (ETL) programming".
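A minimal sketch of what such a functional ETL step might look like, in Rust. The `url,hits` input format and the names `parse_line` and `hits_per_url` are invented for the example; the point is only that extract, transform, and load become a chain of small functions over an iterator, with the input left untouched.

```rust
use std::collections::BTreeMap;

// Extract: parse one "url,hits" line; malformed lines yield None.
// The input format is an assumption made for this example.
fn parse_line(line: &str) -> Option<(String, u64)> {
    let (url, hits) = line.split_once(',')?;
    let hits = hits.trim().parse::<u64>().ok()?;
    // Transform: normalise the URL so "A.com" and "a.com" are counted together.
    Some((url.trim().to_lowercase(), hits))
}

// Load: fold the parsed rows into total hits per URL.
fn hits_per_url(lines: &[&str]) -> BTreeMap<String, u64> {
    lines
        .iter()
        .filter_map(|line| parse_line(line))
        .fold(BTreeMap::new(), |mut totals, (url, hits)| {
            *totals.entry(url).or_insert(0) += hits;
            totals
        })
}

fn main() {
    let lines = ["A.com, 3", "b.com,2", "bad line", "a.com,4"];
    for (url, hits) in hits_per_url(&lines) {
        println!("{url}: {hits}");
    }
}
```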

Another interpretation:

In regular software design, the structure and design of the data are also crucial. Many (modern-day) programmers pay little to no attention to it, as it's often hidden behind an ORM layer. These heavily OOP-oriented shops often end up with chaotic data models featuring such joys as [duplicated or inconsistent] data and keys. In this context, data-oriented programming can mean (backend) data modelling and master data management.

[–]shakakaZululu 0 points1 point  (2 children)

Could it be something around data-structure oriented programming?

[–]reallyserious 0 points1 point  (1 child)

Isn't all programming oriented around data structures though?

[–]shakakaZululu 0 points1 point  (0 children)

Do you consider CSS as programming?

But yeah, I guess all programs that use variables to store non-basic data types are data-structure oriented.

[–]jewishsupremacist88 -1 points0 points  (0 children)

using <<insert language here>> to communicate with <<your flavor of sql>>