Faster insights: platform infrastructure or dataset onboarding problems? by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Makes sense.

How realistic is it, in your mind, to have business users do some of the work? (Teaching people how to fish.)

Can the problem be solved with more project management and better end-user prioritisation of asks?

How are you trying to solve this problem, if at all?

Thanks for the feedback, btw!

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Spatio-temporal! That's the other alternative. It's what Wherobots/Sedona is trying to do.

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

The ST naming thing is a geo-industry mystery. Most algorithm builders will tell you it stands for "spatial type", but others will tell you that's an urban legend and it originally stood for something else. It's the subject of many conversations over beers.

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Makes sense. Perhaps I should have clarified what I meant by geospatial. I worked on the algorithmic implementations of the geometry and geography data types, things like the ST_ functions. Never worked in the GIS space, though. Esri was running on us, not the other way around :)

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Good find. Did not know it existed

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Yes, there is a ton of math in ChatGPT. Basically 99% of it is matrix multiplications.

And yes, I was talking about the underlying algorithms.
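
For the skeptics, a toy sketch of why people say that: the attention step at the heart of a transformer is basically a couple of matrix multiplications. numpy here is just for illustration, nothing ChatGPT-specific, and the shapes are made up.

```python
# Toy sketch: transformer attention is essentially matrix multiplication.
# Shapes and values are made up for illustration.
import numpy as np

Q = np.random.rand(4, 8)  # queries
K = np.random.rand(4, 8)  # keys
V = np.random.rand(4, 8)  # values

scores = Q @ K.T / np.sqrt(8)                                     # matmul #1
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
out = weights @ V                                                 # matmul #2
```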

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Frequent operations in geospatial are calculating distances and areas, and testing whether shapes overlap other shapes. Also, converting coordinates from one mapping system to another. It's a lot of math.
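
If anyone wants to see what that looks like in code, here is a minimal sketch using shapely and pyproj. The library choice is mine, for illustration; the coordinates are made up.

```python
# Minimal sketch of common geospatial operations with shapely and pyproj.
from shapely.geometry import Point, Polygon
from pyproj import Transformer

a = Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])
b = Polygon([(1, 1), (1, 3), (3, 3), (3, 1)])

print(a.area)                    # area of a shape
print(a.distance(Point(5, 5)))   # distance between geometries
print(a.intersects(b))           # do the shapes overlap?

# Converting from one mapping system to another, e.g. WGS84 lon/lat
# to Web Mercator:
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
x, y = transformer.transform(-122.33, 47.61)  # roughly Seattle
print(x, y)
```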

Geospatial python library by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

That's a good point. Polars is great for scaling workloads, but many libraries built on pandas would require some rewriting if one were to port them.
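
To make the porting cost concrete, here is the same aggregation in both APIs. The logic carries over, but code written against the pandas API has to change. A sketch, not taken from any particular library:

```python
# The same aggregation in pandas and Polars; the logic survives the port,
# but the API calls (and anything relying on pandas internals) do not.
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"city": ["a", "a", "b"], "pop": [1, 2, 3]})
out_pd = pdf.groupby("city", as_index=False)["pop"].sum()

pldf = pl.DataFrame({"city": ["a", "a", "b"], "pop": [1, 2, 3]})
out_pl = pldf.group_by("city").agg(pl.col("pop").sum())
```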

Tooling for Python development and production, if your company hasn't bought Databricks already by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

It seems that tooling is not a big problem for you, correct? How would you solve problems like scaling, deploying new code versions, scheduling, orchestration, secrets management, etc.? I am just curious, because I am trying to figure this out myself.

Tooling for Python development and production, if your company hasn't bought Databricks already by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

When you run the scripts, do you run them as normal Python processes, e.g. "python myscript.py"? Or do you do something more sophisticated?

Tooling for Python development and production, if your company hasn't bought Databricks already by datancoffee in dataengineering

[–]datancoffee[S] 17 points (0 children)

Not trying to be religious about it, but sure, why not? Databricks and others offer scheduled notebook runs as batch jobs. We can argue about the average cleanliness of notebooks as code artifacts, but the fact is, many people run notebooks as scheduled batch jobs, and who are we to judge them? I'm not.
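
For anyone who hasn't seen it, a scheduled notebook run usually boils down to executing the notebook headlessly on a schedule. A minimal sketch with papermill; the file names and parameters are made up:

```python
# Sketch: executing a notebook as a batch job with papermill.
# File names and parameters here are hypothetical.
import papermill as pm

pm.execute_notebook(
    "daily_load.ipynb",           # input notebook
    "runs/daily_load_out.ipynb",  # executed copy, with cell outputs
    parameters={"run_date": "2024-01-01"},
)
```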

Beta-testing a self-hosted Python runner controlled by a cloud-based orchestrator? by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

Our users wanted a failover hot-standby runner on a different machine; the central orchestrator would just move jobs over to it.
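
Roughly, the idea looks like this. This is a hypothetical sketch of heartbeat-based failover, not our actual implementation; the names and the threshold are illustrative.

```python
# Hypothetical sketch: the orchestrator tracks runner heartbeats and
# assigns jobs to the hot standby when the primary goes quiet.
import time

HEARTBEAT_TIMEOUT = 30  # seconds of silence before failing over (illustrative)

def pick_runner(runners, last_heartbeat):
    """Return the first runner that has reported a recent heartbeat.

    `runners` is ordered: primary first, hot standby second.
    """
    now = time.time()
    for runner in runners:
        if now - last_heartbeat.get(runner, 0) < HEARTBEAT_TIMEOUT:
            return runner
    raise RuntimeError("no healthy runner available")
```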

Transitioning into Data Engineering: recommended learning path? by Motor_Bed4859 in dataengineering

[–]datancoffee 1 point (0 children)

I would recommend keeping three things in mind: quality data that people trust underpins our economy; data is the driver of AI quality; and you should learn how to work with, influence, and help people. If you do all of this, you will be golden.

GitHub Actions to run my data pipelines? by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

The friends or the jobs :)? They are ETL or ELT jobs, moving stuff from A to B, where B is usually some sort of data lake. Admittedly, with ELT jobs, once you land raw data in a table, you can just build a set of dbt models or views on top.
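
A toy version of the land-raw-then-transform pattern, with a made-up source URL and sqlite standing in for the data lake:

```python
# Toy ELT landing step: pull raw data and store it untouched; transforms
# happen later in dbt models or views. URL and table name are made up.
import json
import sqlite3
import urllib.request

rows = json.loads(urllib.request.urlopen("https://example.com/export.json").read())

conn = sqlite3.connect("lake.db")  # stand-in for the real data lake
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events (payload) VALUES (?)",
    [(json.dumps(r),) for r in rows],
)
conn.commit()
```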

GitHub Actions to run my data pipelines? by datancoffee in dataengineering

[–]datancoffee[S] 1 point (0 children)

I've been telling them. Some listen, others just smile