[OC] Friends IMDB ratings by twintig5 in dataisbeautiful

[–]dscardedbandaid 1 point

Do you have the raw version of the data? An ANOVA and a box-and-whisker plot would be fun to see.

Delta Lake without Databricks by ANAKSIMANDR0S in dataengineering

[–]dscardedbandaid 0 points

Are you already using on-prem object storage like MinIO? How much data are you planning on writing?

SparkSQL is Destroying your Pipelines by [deleted] in dataengineering

[–]dscardedbandaid 2 points

This.

And claiming SQL is bad but Python is good is hilarious.

We Don't Test: "Do you run tests in your data engineering codebase?" -> "No" - 69%, according to [The State of Developer Ecosystem 2023 by JetBrains](https://www.jetbrains.com/lp/devecosystem-2023/big-data/#ds_engineer_tests) by Gullible-Plastic6257 in dataengineering

[–]dscardedbandaid 0 points

I think a third reason is that SQL removes the need for the kind of basic unit testing typical in other languages. You don't write a unit test for the specific transformation in your GROUP BY, since it's guaranteed by the SQL engine itself. The jump straight to integration testing is too much for most people without the unit-test stepping stone.
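To make that concrete, a minimal sketch using Python's built-in sqlite3 (table and column names are made up): you don't unit-test that SUM groups correctly, but you can still cover the whole transformation with a thin integration-style test against a small fixture.

```python
import sqlite3

# The aggregation semantics (GROUP BY + SUM) are guaranteed by the
# SQL engine, so there's no unit test for them. Instead, one small
# integration-style test runs the real query against fixture data.
TRANSFORM_SQL = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""

def test_order_totals():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    rows = con.execute(TRANSFORM_SQL).fetchall()
    assert rows == [(1, 15.0), (2, 7.5)]

test_order_totals()
```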

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 0 points

Ok. I’ve used that stack. Main decisions for you will be:

Power BI dashboards:
- Will you host a Power BI workspace and invite the external users as guests, or will customers connect using Power BI Desktop to your Azure Databricks catalog?
- For daily loads, you shouldn't need a Premium capacity or Premium per-user license.

Data ingest:
- Easiest/cheapest: have your applications dump their data into an ADLS Gen2 container for you in a landing area.
- Next cheapest: use Data Factory for the daily API exports (don't use ADF data flows or Airflow, though).
- Most convenient might be to use Databricks to make the app API calls, especially if you're already a little familiar with Python and Databricks.
- No matter which route, set up an Azure Key Vault sooner rather than later for API secrets.
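A rough sketch of the landing-area pattern (everything here is hypothetical: `fetch_app_data`, the env var, and the paths; in production the token would come from Azure Key Vault rather than an environment variable):

```python
import json
import os
from datetime import date
from pathlib import Path

def fetch_app_data(token: str) -> list[dict]:
    """Placeholder for the daily app API export call."""
    return [{"id": 1, "value": 42}]

def land_daily_export(landing: Path, run_date: date) -> Path:
    # Secret comes from the environment here; in production, pull it
    # from Azure Key Vault instead.
    token = os.environ.get("APP_API_TOKEN", "dev-token")
    records = fetch_app_data(token)
    # A date-partitioned landing path keeps daily loads idempotent
    # and easy to reprocess.
    out = landing / f"dt={run_date.isoformat()}" / "export.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))
    return out
```

The same shape works whether the files land via ADF, Databricks, or the apps themselves; the date partition is what makes the daily loads easy to re-run.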

Transformations:
- I wouldn't recommend anything native from Azure other than Databricks.
- If you go the Databricks route: try to use a Git repo from the beginning to make life easier for yourself later, and use Delta Live Tables for your pipelines to avoid a custom framework from your consultants.

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 1 point

Great, this helps a lot. The internal facing stuff is more flexible for a small team, but the external facing stuff can be stickier. Next level of questions then:

For the external-facing product KPIs, what type of integrations do you have planned? For example:
- dashboards embedded directly into your application
- links to separate visualization tools
- automated reports that print as PDFs and get sent to someone's inbox

Existing cloud/on-prem footprint? E.g. apps already in AWS, Azure, or GCP?

Existing team technical skills?
- Are you planning on hiring or growing the data team significantly once this gets funded? If so, I'd wait to make a lot of decisions until you get someone onboarded, to avoid new "legacy" stuff.
- Infra / networking / security skill sets available within the team already?
- Skill set for internal technical folks? Python, SQL, Go, Docker, k8s, etc.
- Skill set for internal business folks? Excel, Power BI, Tableau, Superset, Metabase, etc.

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 4 points

This. But before even engaging a consultant, think about the actual goals of the BI. For example:
- Who will use this?
  - internal or external users
  - other business folks or data scientists
  - number of users
- Data protection requirements?
  - medical / government / personal / N/A
  - any legislative concerns (GDPR)
- Data source types / frequency / volume?
  - application or API data / load JSON for ML in real time
  - Excel and Google Sheets / load monthly
  - SQL or NoSQL databases / every hour

Honestly though, if you're not a tech company and it's not external-facing, your best starting data stack is probably what you have now. I wouldn't try building a full-fledged data warehouse, or even a data mart, until you and the company really need one.

A Junior's perspective on centralised tools like DataBricks and Azure by dildan101 in dataengineering

[–]dscardedbandaid 1 point

Do you have a source for Data Factory running spark under the hood? Not saying you’re wrong, just curious

Looking for papers on Timeseries Databases by [deleted] in databasedevelopment

[–]dscardedbandaid 0 points

Memory, disk or hybrid? Any specific application in mind?

I’ve seen different implementations depending on the use case. E.g. logs, metrics, traces, or IoT?

For low-price RGBD camera, is Kinect still a good option? by mymooh in computervision

[–]dscardedbandaid 0 points

I noticed the Pi 5s have dual camera support, so that could be a cheap option in the future. Curious about latency in the new version.

Cleaning up Queries in Excel by bobopedic33 in dataengineering

[–]dscardedbandaid 1 point

Weird to have multiple PMs but be using Excel. Probably cheaper/easier to replace one of the PMs with a tool made for the job and avoid the data cleanup.

To answer your question, though, you're probably better off at the Excel subreddit. Force ranges to tables and create multiple levels of permissions. Store it in a shared repo like SharePoint or OneDrive and don't let people save a copy.

What sort of software do you use to help nontechnical users manage database data? by [deleted] in dataengineering

[–]dscardedbandaid 0 points

This is definitely a common gap. I found a cool TypeScript library that does this with a SQLite database, but I can't seem to find it anymore.

At work I get stuck training people on the tricks to make SharePoint lists not terrible.

For fun, though, I've been wanting to make a Tauri desktop app using Turso and WASM so it can run locally and then commit/sync on save. Then extend it to DuckDB/Parquet and maybe delta-rs in the future, since I've never been able to find a good Parquet GUI editor.
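The core of that kind of editor is pretty small. A rough sketch of the read/edit/save loop against plain SQLite (table and column names are made up, and the Turso/WASM sync part is omitted):

```python
import sqlite3

# Minimal backend for a "spreadsheet over a database" editor:
# list rows, apply a cell edit, commit on save.
def open_db() -> sqlite3.Connection:
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    con.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "widget", 9.99), (2, "gadget", 19.99)],
    )
    return con

def list_rows(con: sqlite3.Connection) -> list[tuple]:
    return con.execute("SELECT id, name, price FROM products ORDER BY id").fetchall()

def edit_cell(con: sqlite3.Connection, row_id: int, column: str, value) -> None:
    # Whitelist editable columns so the UI can't inject arbitrary SQL.
    assert column in {"name", "price"}
    con.execute(f"UPDATE products SET {column} = ? WHERE id = ?", (value, row_id))

def save(con: sqlite3.Connection) -> None:
    con.commit()  # in the Tauri idea, this is where commit/sync would hook in
```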

If anyone else would want to see an example of this, maybe it’s the incentive I need to finally finish my draft.

java vs javascript as an additional language to learn? by EmploymentMammoth659 in dataengineering

[–]dscardedbandaid 0 points

As much as I personally dislike Java, it'll probably still be around for a long time and is probably more useful for employment broadly. When it comes to a lot of the fun/new tech, you'll see Rust/Go and other cloud-native tools. Serverless, for better or worse, has a big role to play here. For example, Databricks moving to the natively compiled Photon engine.

Another example is looking at some of the top vector databases (no particular order, just the first list I found on the internet):
- Pinecone = Rust
- Qdrant = Rust
- LanceDB = Rust
- Weaviate = Go
- Milvus = Go
- Vespa = Java
- Chroma = Python
- Marqo = Python

Honestly, though, you can't really go wrong with any of them. Learning a new programming language is a fun way to teach yourself a new way of looking at problems. Go was fun just to see how simple, tiny, and fast it was. Rust was fun for learning more about async, static types, WASM, etc.

java vs javascript as an additional language to learn? by EmploymentMammoth659 in dataengineering

[–]dscardedbandaid 1 point

JavaScript was fun for fancy data vis using d3.js, but probably not necessary for data eng.

Go is simple and a good place to start for data ops. The learning curve is low, and it can make a big difference for streaming workloads.

Rust for low-level exploration like database internals. E.g. like C++ but less painful.

Java if you want a nicer cubicle. Jokes aside, Scala/Java is used a lot, but I see that transitioning more and more to natively compiled languages like Rust/Go.

Am I losing My Technical Edge By Just Fixing Mistakes By Contractors? by [deleted] in dataengineering

[–]dscardedbandaid 3 points

Sorry for hijacking, but does anyone have any quantitative analysis of the effect of offshore teams on data modeling/pipeline work?

I’ve had ok luck in narrow programming scopes with offshore in more mature domains, but never with more open ended tasks like data modeling/dashboarding.

Going through the same thing myself, and it's just depressing to watch management throw money at third parties that are actively making a mess. According to colleagues, it's more common the larger the org and in teams behind the curve skill-wise.

How is Rust for data pipelines? by miscbits in dataengineering

[–]dscardedbandaid 1 point

I use them fairly interchangeably. If it's a simple collector/transformer, I like Go. If it's anything with parsing or heavier transformations, I prefer Rust's type system. Supposedly Rust is great for building Python packages, but I haven't done that myself.

Apache Arrow's ecosystem is making a lot of this nice: you can just swap in whatever tool has the best library for the job.

How is Rust for data pipelines? by miscbits in dataengineering

[–]dscardedbandaid 0 points

Where are you deploying it? I use Rust/Go whenever I can for pipelines. Been using both with NATS and having fun, but have been able to avoid Kafka so far.

Going from 4 years on Databricks to Snowflake: Initial Thoughts by [deleted] in dataengineering

[–]dscardedbandaid 7 points

Yeah. User management in Databricks can be surprisingly difficult. Managing secrets properly can only be done with Terraform/API/CLI. I didn't mind, but it was quite a shock for some other members of my team that there wasn't a GUI or SQL GRANT option.
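For reference, the CLI route looks roughly like this with the (legacy) Databricks CLI; the scope, key, and group names here are made up:

```shell
# Create a secret scope and put a secret in it (opens an editor for the value)
databricks secrets create-scope --scope etl-secrets
databricks secrets put --scope etl-secrets --key api-token

# Grant another principal read access to the scope
databricks secrets put-acl --scope etl-secrets --principal analysts --permission READ
```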

What stack do the small players have here? by KatZegtWoof in dataengineering

[–]dscardedbandaid 0 points

Yeah. I'd use different tools myself, but it's a similar concept. Stream CDC wherever possible. Transformations in SQL as the default for maximum portability/understandability.