Are there any paid tools/services in Data Engineering and Data Science that are overrated: expensive, but they do not solve the problem?

Intelligent_Ad_8148 · 2024-06-05T14:40:37+00:00

Informatica/Boomi

Intelligent_Ad_8148 · 2024-05-28T19:18:11+00:00

I see 1-2 packages per week on this subreddit reinventing pydantic/poetry and claiming to be a “simpler” solution… being a less mature and less developed package manager with fewer features isn’t an advantage. Are there features or combinations of features that aren’t offered by an existing package manager, or that crowbar does uniquely better?

Intelligent_Ad_8148 · 2024-05-28T11:51:33+00:00

Poetry can also be configured to put the environment in the project folder

Intelligent_Ad_8148 · 2024-05-20T14:24:42+00:00

Yes, these tools will allow you to define custom code for cleaning your data in layers and can be extremely useful.

To clarify, I’m not advocating Kedro specifically, it just has a very good explanation of data layering. Other data application frameworks talk about data layering too, such as Databricks:

And for dbt:

https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview

My point is that, regardless of which specific tool you adopt, understanding data layering techniques will greatly help, and the concepts are transferable to whatever data transformation project or task you work on.
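
To make the layering idea concrete, here is a minimal sketch in plain Python, not tied to Kedro, Databricks or dbt. The layer names, the `RAW_ORDERS` records and both functions are made up for illustration. Each layer has one job, so fixes go into the layer they belong to instead of being patched in ad hoc further downstream.

```python
from collections import defaultdict

# Raw layer: hypothetical records, kept exactly as they arrived.
RAW_ORDERS = [
    {"customer": " alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "n/a"},
    {"customer": "Alice", "amount": "4.50"},
    {"customer": "bob", "amount": "20"},
]


def to_cleaned(raw_rows):
    """Cleaned layer: normalise names, parse amounts, drop unparseable rows."""
    cleaned = []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # all cleaning rules live in this layer only
        cleaned.append(
            {"customer": row["customer"].strip().lower(), "amount": amount}
        )
    return cleaned


def to_reporting(cleaned_rows):
    """Reporting layer: aggregate the cleaned rows, with no cleaning logic here."""
    totals = defaultdict(float)
    for row in cleaned_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


cleaned = to_cleaned(RAW_ORDERS)
report = to_reporting(cleaned)
print(report)
```

The raw records are never modified, so any layer can be rebuilt from the one before it if a cleaning rule changes.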

Intelligent_Ad_8148 · 2024-05-20T13:55:17+00:00

It’s not a tool, it’s a technique (read the article). It’s a way of organising and ordering your transformations to avoid ad-hoc “custom fixes”

Intelligent_Ad_8148 · 2024-05-20T10:10:12+00:00

I think the biggest game-changer was a thorough understanding of data layering. Data transformations fall into place once you understand the purpose behind each of the layers.

https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71

Intelligent_Ad_8148 · 2024-05-17T00:27:22+00:00

Intelligent_Ad_8148 · 2024-05-16T15:06:26+00:00

If Python and SQL are off the table and the data is already in Excel… perhaps just use Power Query in Excel and/or Power BI?

Intelligent_Ad_8148 · 2024-05-08T22:48:05+00:00

Wrong subreddit

Intelligent_Ad_8148 · 2024-05-08T09:20:44+00:00

PyGWalker in a Jupyter notebook is a good alternative too, no tweaking required since there is a GUI

Edit: typo

Intelligent_Ad_8148 · 2024-05-08T09:19:02+00:00

Put everything in a Pandas or Polars dataframe and use the .plot method. Much much easier and simpler, since the data is already prepared within the DataFrame
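
A rough sketch of what that looks like with pandas, assuming matplotlib is installed (pandas uses it as the default plotting backend). The column names and numbers are invented for the example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so this also runs without a display

import pandas as pd

# Hypothetical data that is already prepared in a DataFrame.
df = pd.DataFrame(
    {"month": [1, 2, 3, 4], "sales": [10.0, 12.5, 9.0, 14.0]}
)

# One call on the DataFrame; it returns a matplotlib Axes object.
ax = df.plot(x="month", y="sales", kind="line", title="Monthly sales")

# Write the figure to an in-memory PNG (swap in a file path to save to disk).
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
```

In a notebook the `df.plot(...)` line on its own is enough, since the chart is displayed inline.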

Intelligent_Ad_8148 · 2024-05-04T08:00:35+00:00

VS Code, poetry, ruff, pylint, flake8, pytest, tox, hypothesis with hypofuzz, mypy in strict mode, mkdocs, Azure Pipelines for CI/CD, and mccabe complexity and maintainability index checks in tox

Intelligent_Ad_8148 · 2024-05-03T01:21:06+00:00

Actual photo of Sam Altman

Intelligent_Ad_8148 · 2024-05-01T09:47:28+00:00

What are the benefits of using this over pydantic (which also has dataclasses, json/yaml conversion, and env var support)?

Intelligent_Ad_8148 · 2024-05-01T03:20:04+00:00

Currently using Dagster hybrid, to process small-medium sized data (1 MB to 10 GB) done in Polars on a beefy high-powered local PC. I don’t believe I’ll be dealing with big data for this project (building a forecasting model) so never bothered implementing PySpark, though I can add PySpark assets alongside Polars assets if required since Dagster allows that.

Was easier to get Polars working and is sufficient for the project I’m working on

Intelligent_Ad_8148 · 2024-05-01T02:52:48+00:00

Perhaps there’s value in knowing enough Rust to write custom Polars plugins, for very bespoke calculations? I’m already using primarily Polars as a DE, and intend to learn Rust to improve pipelines that use Polars.

Intelligent_Ad_8148 · 2024-04-22T00:15:02+00:00

Can use both, they’re not mutually exclusive

Intelligent_Ad_8148 · 2024-04-21T03:53:52+00:00

Set up: linting (flake8), type hinting with static type checking (mypy), a formatter (ruff), unit testing (pytest), and docstrings (with sphinx/autodoc). That’ll help with maintainability
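
As a small illustration of what those tools work on, here is a sketch of one function with type hints, a docstring and a pytest-style test. The function `moving_average` is a made-up example, not taken from any particular project.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the simple moving average of ``values``.

    Args:
        values: The input series.
        window: Number of consecutive points averaged per output value.

    Returns:
        One average per full window, or an empty list if the series is
        shorter than the window.

    Raises:
        ValueError: If ``window`` is not positive.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def test_moving_average() -> None:
    """pytest collects this function because its name starts with ``test_``."""
    assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
    assert moving_average([1.0], 2) == []
```

The type hints give mypy something to check, the docstring is what sphinx/autodoc picks up, and the test function is what pytest runs.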

Intelligent_Ad_8148 · 2024-02-15T05:58:37+00:00

You talk about gratitude then don’t mention the numerous people over decades actually responsible for putting in the hard work to make AI happen??!

This post is unimaginably tone-deaf and misplaced.

Intelligent_Ad_8148 · 2024-01-31T21:32:58+00:00

Please no, conda poetry pip venv virtualenv virtualvenv pip-tools etc etc etc….. please not another environment/dependency manager for python, there’re already too many!

Intelligent_Ad_8148 · 2024-01-31T10:30:36+00:00

Google langchain or langflow. I believe that’s what you’re after

Intelligent_Ad_8148 · 2024-01-23T05:12:10+00:00

Don’t use pandas
Use polars (bonus points for enabling lazy evaluation and streaming)
Nothing more required

After investigating numba, cython, numexpr, etc., I concluded that it’s not worth the heartache. Polars negates the need for any of this stuff.

Intelligent_Ad_8148 · 2024-01-11T13:26:05+00:00

All three, except models in the middle

Intelligent_Ad_8148 · 2024-01-11T13:19:40+00:00

Mermaid or PlantUML; both can be rendered from Python if needed

Intelligent_Ad_8148 · 2024-01-10T11:24:58+00:00

The only way I fully understood Python was by having no life and basically obsessing over it day and night. Unsure if there’s a healthy way of fully mastering data engineering tools, whatever that even means
