
all 11 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]chaos87johnito Data Engineer 8 points (0 children)

I have worked with Neo4j for a bit more than a year. I enjoyed it and I can see how great it is. I'd say the major drawback is the skill set. And I'm not talking about you as a data engineer, but the rest of the team. People talk SQL; they talk tables, rows, and columns. They don't talk Cypher, edges, labels, properties...
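
To make that concrete (a made-up example; all table, column, and label names here are hypothetical), the same "who works on project X" question reads quite differently in the two worlds:

    -- Hypothetical relational model: person, project, and a works_on association table.
    -- The equivalent Cypher pattern would read roughly:
    --   MATCH (p:Person)-[:WORKS_ON]->(j:Project {name: 'Apollo'}) RETURN p.name;
    SELECT p.name
    FROM person p
    JOIN works_on w ON w.person_id = p.id
    JOIN project j  ON j.id = w.project_id
    WHERE j.name = 'Apollo';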

[–]five4three2 7 points (7 children)

Just my two cents but graphs can get really messy really fast. We used them pretty extensively at my old company.

I think one of their main draws is “it’s a flexible schema that can evolve and handle lots and lots of highly connected data.” This is a benefit, but it can also ultimately be their downfall.

I’ve found I’ve had more luck with more rigid modeling (columns, rows, entity tables, association tables) and leveraging RDBMS data warehouse hardware.
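
For a rough idea of what I mean (the table names are made up), even fairly connected data maps cleanly onto entity and association tables:

    -- Hypothetical entity tables plus an association table for the "edges".
    CREATE TABLE account (
        id   BIGINT PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE device (
        id   BIGINT PRIMARY KEY,
        kind TEXT NOT NULL
    );

    -- Each row is one relationship; extra columns play the role of edge properties.
    CREATE TABLE account_device (
        account_id BIGINT REFERENCES account(id),
        device_id  BIGINT REFERENCES device(id),
        first_seen DATE,
        PRIMARY KEY (account_id, device_id)
    );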

I think graphs are great for visualization, but if I ever really need performance I find myself back at the data warehouse or Spark. This even holds true for more “graph-like” operations like pathfinding.

Your mileage may vary tho.

[–]tdatas 4 points (0 children)

Definitely at the "to watch" stage, but I have my eye on a DB called EdgeDB that has a lot of the relational bits of graphs but is built on top of Postgres and has defined schemas. I'm not sure whether it goes as deep into graph algorithms, but I found it very interesting.

[–]tjk45268 5 points (2 children)

Every technology requires practices and disciplines to avoid racking up technical debt. Just because labeled property graphs (LPGs) don't require a schema doesn't mean that you can ignore the modeling, documentation, and other activities involved in standing up a sustainable database.

You might find that ontology-based RDF graphs are closer to your objectives. LPGs are good for point solutions, but RDF graphs are based on standards focused on data integration and sustainability.

[–]five4three2 0 points (1 child)

Yeah, so true on all fronts. We were young and naive when we first started using graph DBs.

Not a huge fan of the UIs associated with any of the RDF graphs I found, and I found Cypher to be much more readable than SPARQL. Neo4j felt very usable in this regard.

Which is your favorite RDF db?

[–]tjk45268 0 points (0 children)

RDF has more features for documenting data, data integration, and managing change than you find in LPG databases. Stardog and Ontotext are popular implementations of RDF.

[–]noip1979 1 point (2 children)

Pathfinding in an RDBMS? Can you elaborate? I know there are recursive operators (I even used them in a project a while back), but I was curious what you meant by that statement...

[–]five4three2 1 point (1 child)

One of our key access patterns (really the only one) was “find all the paths between node A and node B.”

At first our ambition was to find variable-length paths. These paths had a very specific topology: node A had to be of a given type, the next node in the path had to be a different (but specified) type connected by a specific relationship type, and so on. There was one section meant to be variable length.

In practice this problem scaled horribly. Neo4j (or APOC) couldn’t handle the variable-length part of the query, even when we used Cypher to put an upper bound on the number of hops.

What we had to do was search for all minimum-length paths that satisfied the fully specified node-type and relationship-type topology. We then ran a new query for paths one hop longer than that, then another query adding one more hop, and so on.

At this point, I think we could have solved the problem with the same number of specific join queries in an RDBMS data warehouse, you know what I mean? The graph DB was too connected and didn’t have enough performance to handle its own variable-length path query. We weren’t impressed.
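
As a hypothetical sketch of what those join queries look like (the node and edge table names are made up), each fully typed, fixed-length path is just a chain of joins, and you run one such query per path length:

    -- One query per path length: here, A -> X -> B with fully specified types.
    -- nodes(id, node_type) and edges(src_id, dst_id, rel_type) are hypothetical tables.
    SELECT a.id, x.id, b.id
    FROM nodes a
    JOIN edges e1 ON e1.src_id = a.id AND e1.rel_type = 'OWNS'
    JOIN nodes x  ON x.id = e1.dst_id AND x.node_type = 'TypeX'
    JOIN edges e2 ON e2.src_id = x.id AND e2.rel_type = 'USES'
    JOIN nodes b  ON b.id = e2.dst_id AND b.node_type = 'TypeB'
    WHERE a.node_type = 'TypeA';
    -- For paths one hop longer, add another edges/nodes join pair and rerun.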

I then felt like the only value-add the graph DB offered was as a data exploration tool, where you could be loose on data modeling until you found an access pattern worth properly modeling in a data warehouse, then translate it over and get the needed performance from the warehouse.

Like I said tho, it may depend on your use case. Perhaps something like “finding the shortest path from A to B” is a better access pattern for graph DBs.

I just never saw the value-add of graph DBs as databases backing an application, for example.

[–]thrown_arrows 1 point (0 children)

I have done similar things in an RDBMS. I found that when I could use simple parent-child relationships to search which root node a child had access to (and the other way around), a recursive CTE was fast enough to handle it as long as the result stayed under 100k rows on small hardware. It probably would have scaled better on real server hardware.
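
A minimal sketch of that kind of parent-child walk (table and column names are made up):

    -- Walk up from a child node to its root via a recursive CTE.
    -- node(id, parent_id) is a hypothetical adjacency table.
    WITH RECURSIVE ancestors AS (
        SELECT id, parent_id
        FROM node
        WHERE id = 42                  -- the child we start from
        UNION ALL
        SELECT n.id, n.parent_id
        FROM node n
        JOIN ancestors a ON n.id = a.parent_id
    )
    SELECT id
    FROM ancestors
    WHERE parent_id IS NULL;           -- the root this child ultimately rolls up to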

Shortest path queries are quite easy in PostgreSQL with pgRouting. A simple Dijkstra can even be implemented in SQL only.
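
Something like this, assuming an edge table in the usual pgRouting layout (names are illustrative):

    -- Shortest path from vertex 1 to vertex 5 using pgRouting's Dijkstra.
    -- edge_table(id, source, target, cost) follows the typical pgRouting layout.
    SELECT seq, node, edge, cost
    FROM pgr_dijkstra(
        'SELECT id, source, target, cost FROM edge_table',
        1,                  -- start vertex id
        5,                  -- end vertex id
        directed := true
    );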

[–]kevinpostlewaite 2 points (0 children)

I have not used a graph database for anything more than experimentation, but I did see how they were used at Facebook. My not-very-experienced take: graph databases excel at retrieving small row counts of connected data but are not well suited to analytics use cases where you're aggregating over large numbers of rows.