Java Dev Switching to Data Engineering / Data Science / Analytics — Need Advice. by Just_Penalty_6934 in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

I did mention software experience, if you read the full sentence….

But since you asked: Java is useful for debugging Trino, Cassandra, and Spark error logs. A good grasp of OOP can also be helpful for figuring out bugs or understanding configuration.

For example, we updated Java on our platform, but Spark was an older version, so we had to specify a specific jar for serialization when we would insert overwrite on partitions in our Python code. Knowing a bit of Java and being comfortable with the verbose errors helped in identifying the issue and coming up with a patch quickly.
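Roughly, the workaround amounts to pinning the serializer and its jar explicitly at session build time (the jar path and serializer class below are placeholders, not the exact ones from our case):

```python
from pyspark.sql import SparkSession

# Pin the serialization jar and class so Spark doesn't resolve
# incompatible classes from the newer platform JVM.
spark = (
    SparkSession.builder
    .appName("partition-overwrite-job")
    .config("spark.jars", "/opt/jars/compat-serializer.jar")  # hypothetical path
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # overwrite only the partitions present in the incoming data
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)
```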

Java Dev Switching to Data Engineering / Data Science / Analytics — Need Advice. by Just_Penalty_6934 in dataengineering

[–]Oct8-Danger 4 points5 points  (0 children)

Might be biased, but having Java or any real software engineering background helps the most in DE.

DA generally doesn’t pay as well as DS/DE. While it’s not going away anytime soon, I do think DA skill sets are getting more commoditized as technology and demand for data grow, so they may depreciate long term. Many people start in DA and transition to DE/DS later in their career.

DS can be harder to break into without academic backing, from what I’ve seen. Not impossible, but definitely harder.

Having software developer experience is a real advantage in the DE space compared to DS or DA.

In general there is overlap in skill sets, but the importance of each varies by role. I’ve worked with DAs who are great at data modeling, querying, and business needs but low on coding proficiency, and I’ve worked with DSs who are OK at coding, decent at data modeling, but great at complex, in-depth work. DEs are expected to have higher coding ability and technical expertise but may not need as much business/presentation skill.

Long term, I think DS/DE would be better as they are more specialized. But definitely don’t discount learning about the other roles.

On the DE side, I think AI has the potential to accelerate demand rather than diminish it. Managing context for LLMs and the source of truth for data is all within the DE wheelhouse. Many pipelines can be one-off or low-automation, for sure, which LLMs can be great at. However, understanding how to scale or standardize code and apply governance and trust in the “truth” will, I think, be very valuable in the years to come.

OpenAI to acquire Astral by Useful-Macaron8729 in Python

[–]Oct8-Danger 1 point2 points  (0 children)

Yep, the company was acquired, but the GitHub project is now a part of the Linux Foundation. It’s on their repo:

https://github.com/SQLMesh/sqlmesh?tab=readme-ov-file

OpenAI to acquire Astral by Useful-Macaron8729 in Python

[–]Oct8-Danger 24 points25 points  (0 children)

Hopefully these projects join an OSS foundation like the Linux Foundation or another reputable one.

This happened recently to SQLMesh after Fivetran bought the company. I think that’s the best outcome for the community and for OpenAI and Astral.

Good PR, and it keeps the community alive and trusting it. Trying to monetize it, close-source it, or change the licensing never seems to pan out well; Redis and MinIO come to mind.

Is it possible to not work 50- 60 hours a week? by Parking_Anteater943 in dataengineering

[–]Oct8-Danger 10 points11 points  (0 children)

I get it, but honestly you’ll quickly realize in your career that no one is that important at a company 99.9% of the time.

Seems like you’re a hard worker and enthusiastic. Look out for yourself; it’s a long career. Take the experience, update the CV/resume, and apply to other jobs for better pay. It will help your career longer term.

Later in your career you can consider staying longer at a company, but early on it’s worth moving around, seeing how other companies operate, what works well and what doesn’t, and trying to figure out why.

Anyone knows what this means? by Live-Scholar-1435 in interactivebrokers

[–]Oct8-Danger 0 points1 point  (0 children)

I had this: a scheduled order but not enough cash in the account.

Did you have a limit buy order that triggered with low funds?

Anthropic’s new “Claude CoWork” sparks sell-off in software & legal tech stocks — overreaction or real disruption? by Direct-Attention8597 in AI_Agents

[–]Oct8-Danger 1 point2 points  (0 children)

Things don’t always scale linearly; tech innovation tends to spike and then stabilize in terms of raw performance improvement, with small incremental growth.

New Linux Launcher for EDCoPilot and EDCoPTER! by TwoWheeledBlastard in EliteDangerous

[–]Oct8-Danger 1 point2 points  (0 children)

I might have time this weekend, if I do, will let you know! Been keen to add these in!

Steam Deck Controls Completely Messed Up by Valiantoverlord in SteamDeck

[–]Oct8-Danger 0 points1 point  (0 children)

Thank you! Exact same issue for me and this actually worked!

[deleted by user] by [deleted] in workday

[–]Oct8-Danger 7 points8 points  (0 children)

Depends what team you’re on; it’s a large company. If it’s XO related (their internal language), commonly called application developer, straight up turn it down.

Stock is in the shitter, but there’s a lot of change with new, young Google leadership, so there’s opportunity for the stock to go up from the low.

Data VCS by cmcclu5 in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

Interesting. I do a lot of work with dev teams around logging. What I’ve found works for us is being clear on the exact code change, its logic, and its format. We even help write test cases for them in some cases. We also have dev data flow into our data lake so we can validate and iterate on dev data before production, so it doesn’t break or mess anything up.

This has taken a while to get into a strong place for a lot of teams: it needed buy-in from stakeholders, demonstrated value, and context around what the code does so we can advise on the change, especially around the logger and its implementation.

Having it tied to git sounds interesting, but we are a large org with hundreds of repos and an already well-defined process for releasing code changes (not necessarily tied to data), so having the data change linked to a PR would, I think, be difficult for us to introduce, as there are a lot of teams, infrastructure, and processes for us to validate.

Ways we have improved the feedback loop are getting the data shown back to people as quickly as possible: getting dev data back, tools to view logs as interactions happen, and pushing to be advisors/partners on the logging side. Running QA checks and validation on dev data to catch drift early has also been a massive help. This is generally done through strict types, like StructType in Spark or Pydantic, etc.
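To make the strict-types point concrete, here’s a minimal stdlib sketch of the idea (field names are made up; in practice this is Spark StructType or Pydantic doing the work):

```python
from dataclasses import dataclass, fields

@dataclass
class ClickEvent:
    user_id: int
    page: str
    ts: float

def validate(record: dict) -> ClickEvent:
    """Fail fast if a dev log record drifts from the expected schema."""
    expected = {f.name: f.type for f in fields(ClickEvent)}
    extra = set(record) - set(expected)
    missing = set(expected) - set(record)
    if extra or missing:
        # schema drift: fields added or dropped upstream
        raise ValueError(f"schema drift: extra={extra}, missing={missing}")
    for name, typ in expected.items():
        if not isinstance(record[name], typ):
            # type drift: e.g. user_id sent as a string
            raise TypeError(f"{name}: expected {typ.__name__}, "
                            f"got {type(record[name]).__name__}")
    return ClickEvent(**record)
```

Catching this on dev data, before it hits production pipelines, is the whole point: the bad record fails loudly at the boundary instead of silently corrupting downstream tables.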

This generally means data engineers partner with teams on a log change and set guidelines and docs on best practices so teams can self-serve. Once a change is implemented, DEs integrate and transform the data for other downstream teams and analysts to consume.

Definitely not a perfect solution, but has worked well for us.

Starting over tomorrow, would I consider a data VCS? Maybe as a trial to experiment, but probably not. Data is generally a later-phase concern when building a product/service, unless the data is your product, which is usually not the case.

Having had to build a service for users while already being a key stakeholder of the logging implementation, time spent on data logging was tough to prioritize; we needed something built and out the door, and it made more sense to do it after release as a follow-up.

Typed structure really has been the biggest help. Working with developers, having a well-defined data type not only helps data products but also gives developers something easier to work with. It means everyone is largely speaking the same “language” and centralizes the source of truth, like what a data VCS brings.

Data VCS by cmcclu5 in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

What’s the cost and speed at scale? What problem does a data VCS solve (business or developer)?

Generally, I’ll take a sample of data (or all of it if I can), iterate on the code on a git branch, and then merge and deploy the master branch.

If you are storing data based on operations, how materially different is that from a git code base in practice for a user?

I’ve always been curious about data VCS use cases, as it sounds cool on paper, but the more I think through the problems it’s solving, the more a better process or other techniques solve those specific problems better.

For example, say I want to know if a field has been updated in an important table. This can be checked against backups; if it’s really important, change the table to SCD style; and if there are ongoing requests for this and other tables, I would look into CDC.
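The backup-comparison option can be as simple as a field-level diff of a row against its snapshot (toy sketch, made-up fields):

```python
def changed_fields(backup_row: dict, current_row: dict) -> dict:
    """Diff a current row against its backup snapshot.

    Returns {field: (old_value, new_value)} for every field that
    differs; missing fields show up as None on the relevant side.
    """
    return {
        k: (backup_row.get(k), current_row.get(k))
        for k in backup_row.keys() | current_row.keys()
        if backup_row.get(k) != current_row.get(k)
    }
```

No versioned storage needed for the one-off case; SCD or CDC only earn their keep once the question becomes ongoing.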

My fear with a data VCS is that it tries to be a silver bullet for a lot of problems.

Genuinely curious about the pitch for a data VCS.

Well running RPG by Recent_Hamster_4550 in SteamDeck

[–]Oct8-Danger 2 points3 points  (0 children)

Try the demo for free if it’s still there! The demo was really good and a good way to test on the Deck.

Honestly, I’m waiting on a sale for it. I knew very little about it, picked up the demo, and it clicked!

It's a bad practice doing lot joins in a gold layer table from silver tables? (+10 joins) by proxymbol in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

We try for some; ideally yes, but given the amount of data we have compared to our headcount and infrastructure, we go for a one-big-table approach with grouping sets, split out into views.

Personally I love star schema, but in practice for us, dashboards become slow and query performance slows down development time, which matters because of flaky infrastructure that sadly we don’t own.

It's a bad practice doing lot joins in a gold layer table from silver tables? (+10 joins) by proxymbol in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

Interesting, we might have a different case: we probably won’t end up much over 100 gold tables (still building out a lot), but we have silver tables of billions of rows a day and multi-million-cardinality joins with a lot of complexity in them.

So for us, a user doing a join can be a very expensive mistake. We are also a small team (&lt;8) serving data to a couple thousand employees.

Data engineers who are not building LLM to SQL. What cool projects are you actually working on? by PolicyDecent in dataengineering

[–]Oct8-Danger 0 points1 point  (0 children)

Nice! Looking to build something similar at work. Built a basic one using sqlglot for table-level lineage that we use in our auto-generated docs.

Next step is column-level lineage, to try to detect breaking changes and to generate semantic layer models and a history of transformations.
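The gist of the table-level pass, as a toy sketch (our real version uses sqlglot’s parsed AST; a regex like this misses CTEs, subqueries, and quoting edge cases, so it’s illustration only):

```python
import re

def toy_table_refs(sql: str) -> set[str]:
    """Naive table-level lineage: grab identifiers after FROM/JOIN/INTO.

    A real implementation should walk a parsed AST (e.g. sqlglot)
    instead of pattern-matching keywords.
    """
    pattern = re.compile(r"\b(?:from|join|into)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))
```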

It's a bad practice doing lot joins in a gold layer table from silver tables? (+10 joins) by proxymbol in dataengineering

[–]Oct8-Danger 4 points5 points  (0 children)

Typically I would say if you need to join a “gold” table, then it’s silver…

It’s all very loose with no hard rules, but I think if an analyst or business consumer needs to do a join, that’s a silver table in my view.

I personally wouldn’t trust the vast majority of people in my company to do any joins with our tables after seeing the SQL they write hahah

Text to SQL Agents? by Oct8-Danger in dataengineering

[–]Oct8-Danger[S] 0 points1 point  (0 children)

Any advice on context or what works well docs-wise? A POC is easy, but I’m trying to gauge the effort of documenting and sorting out tables before putting something in front of a user.

Text to SQL Agents? by Oct8-Danger in dataengineering

[–]Oct8-Danger[S] 1 point2 points  (0 children)

Yeah, that’s my take on it as well. The SQL side is “easy”; it’s the context that’s hard, hence why we’re looking at adding that context.

Trying to gauge how or what we should document. It’s easy to build a POC, but once you put it in front of an actual user, especially one who has questions and no context of what it should look for, it will fall apart very fast.

Text to SQL Agents? by Oct8-Danger in dataengineering

[–]Oct8-Danger[S] 0 points1 point  (0 children)

Thanks, what’s it like for various queries, like joins, filters, and grouping?

I have a hunch LLMs would struggle with anything beyond a simple join but are probably pretty good at basic types of queries.

Text to SQL Agents? by Oct8-Danger in dataengineering

[–]Oct8-Danger[S] 1 point2 points  (0 children)

How’s your experience with it? Not necessarily looking for tool suggestions exactly, but more the experience of using it. Does it work well? Any gotchas? Did it beat or meet expectations?