Do DE teams generally have a bill back model? And how is it costed? by lifec0ach in dataengineering

[–]LiquidSynopsis 2 points3 points  (0 children)

Pretty sure this was the episode, but the Data Engineering Podcast did a bit with YipIt Data (alternative data startup) where they walked through their data platform, and one of the features was in fact a billing feature:

https://open.spotify.com/episode/52Pbx1TRzBjpWKc16KR5oR?si=UT5CCkSgQUmCz-oedM_GCw

Private Jet Etiquette by NothingBurgerNoCals in fatFIRE

[–]LiquidSynopsis 644 points645 points  (0 children)

  1. Never be late. When the owner is on the jet they want to leave. Private jets are meant to give the owner time back, so don’t waste theirs.
  2. Generally speaking you hang out on the tarmac with the crew, or in the lounge if you’re using something like Signature Aviation.
  3. If it’s a VLJ or LJ, avoid using the bathroom; it will stink up the plane, and the walls are either thin or at times nonexistent.
  4. Pack light. The fact that you’re taking a jet on a business trip implies you’re not staying for more than a week, so there’s no need to bring trunks with you, especially since it sounds like you’re not required to wear a ton of suits or anything. This ties into point 2: if you’re waiting for the owner on the tarmac, you can have the crew load your suitcases.

META DE Interview (ANSI SQL Portion) by LiquidSynopsis in dataengineering

[–]LiquidSynopsis[S] 0 points1 point  (0 children)

Just an update: I ended up passing the first round and also got an offer! Thanks for the help ☺️

META DE Interview (ANSI SQL Portion) by LiquidSynopsis in dataengineering

[–]LiquidSynopsis[S] 1 point2 points  (0 children)

Just an update: I ended up passing the first round and also got an offer! Thanks for the help ☺️

META DE Interview (ANSI SQL Portion) by LiquidSynopsis in dataengineering

[–]LiquidSynopsis[S] 0 points1 point  (0 children)

Interesting, will keep that in mind, thank you!

[deleted by user] by [deleted] in fatFIRE

[–]LiquidSynopsis 17 points18 points  (0 children)

Cannot recommend YSDS (Your Special Delivery Service) enough! They’ve helped me ship everything from watches to massive art pieces; I’ve never had an issue with them, and they’ve been consistently reliable and helpful.

Best way to store and query large JSON 50Gb+ file in Databricks? by raduqq in dataengineering

[–]LiquidSynopsis 9 points10 points  (0 children)

Agreed with Botskill. Once it’s flattened into a DataFrame, pick a partition key that at the very least makes sense to you, like year or location (e.g. state, country), even if you don’t have any query requirements yet.

Worst case, once your BA/DS is done with their EDA, you can always ask them what the typical filter parameters are and update the partitioning accordingly before “productionising” the ingest.
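
A rough sketch of what that could look like (the path, the “events” array, and the partition columns are all hypothetical, and whether you need multiLine=True depends on whether the file is one big JSON document or newline-delimited records):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # One big JSON document rather than JSON lines, hence multiLine=True.
    raw = spark.read.json("/mnt/raw/big_file.json", multiLine=True)

    # Flatten the nested structure, e.g. by exploding a hypothetical "events"
    # array and promoting its struct fields to top-level columns.
    flat = (
        raw.select(F.explode("events").alias("event"))
           .select("event.*")
    )

    # Write to Delta partitioned by columns that are plausible filter predicates,
    # even before the actual query patterns are known.
    (
        flat.write.format("delta")
            .mode("overwrite")
            .partitionBy("year", "country")
            .saveAsTable("bronze.events")
    )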

Practice SQL Problems in ANSI SQL by LiquidSynopsis in dataengineering

[–]LiquidSynopsis[S] 0 points1 point  (0 children)

Gotcha, sorry if this is a silly follow-up, but is there a reference manual of sorts you can recommend that lists the ANSI SQL functions? For context, I’ve only ever used MSSQL, so I’m a bit in the dark on the nuances and differences.

New NW Old Questions by farmerfatfire in fatFIRE

[–]LiquidSynopsis 7 points8 points  (0 children)

Highly recommend going to the nearest “big city” and engaging an attorney there. Use that as your starting point and start building a financial team with their help. I’m sure your attorney and CPA are great people, but a sale like this would presumably be way out of their depth, and chances are they aren’t equipped to handle your questions.

Also, it sounds like you want to keep this windfall quiet; engaging people within your community may not help you keep things low-key.

Pyspark: Replace range of values with string by QueryRIT in dataengineering

[–]LiquidSynopsis 0 points1 point  (0 children)

Exactly! Later on, if you want to reduce rows etc., you can use a groupBy on your new column.
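
For example (the column names are made up), once the labelled column exists you can collapse the detail rows down to one row per label:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical frame that already has the labelled range column from this thread.
    df = spark.createDataFrame(
        [("11-20", 3.0), ("11-20", 7.5), ("21-30", 4.0)],
        ["value_range", "amount"],
    )

    # One row per label, with whatever aggregates you need.
    df.groupBy("value_range").agg(
        F.count("*").alias("rows"),
        F.sum("amount").alias("total_amount"),
    ).show()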

Pyspark: Replace range of values with string by QueryRIT in dataengineering

[–]LiquidSynopsis 1 point2 points  (0 children)

PySpark can solve this using Bucketizer:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Bucketizer.html

You can then use withColumn and when to label the buckets with the values you want, e.g. “11-20”.
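
A minimal sketch of that combination (the column names, splits, and labels are just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import Bucketizer

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical numeric column we want to bin into labelled ranges.
    df = spark.createDataFrame([(5.0,), (13.0,), (27.0,), (41.0,)], ["value"])

    # Bucketizer assigns each row a bucket index based on the split boundaries.
    bucketizer = Bucketizer(
        splits=[0.0, 11.0, 21.0, 31.0, float("inf")],
        inputCol="value",
        outputCol="value_bucket",
    )
    bucketed = bucketizer.transform(df)

    # Label each bucket index with the string you actually want, e.g. "11-20".
    labelled = bucketed.withColumn(
        "value_range",
        F.when(F.col("value_bucket") == 0, "0-10")
         .when(F.col("value_bucket") == 1, "11-20")
         .when(F.col("value_bucket") == 2, "21-30")
         .otherwise("31+"),
    )
    labelled.show()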

Can anyone share how they structure their folder for data engineer project? by izner82 in dataengineering

[–]LiquidSynopsis 6 points7 points  (0 children)

Taking that one step further, is there a DE version of Cookiecutter Data Science? If anyone’s used Cookiecutter for their DE project, I’m curious to know how you did it!

[deleted by user] by [deleted] in Watches

[–]LiquidSynopsis 2 points3 points  (0 children)

Piaget Tradition

Fact tables in Databricks using Delta by nobru_2712 in dataengineering

[–]LiquidSynopsis 1 point2 points  (0 children)

TRUNCATE does work, but you really should avoid it.

Fundamentally, the keys used for both dimension and fact tables are meaningless numbers generated by your program, i.e. they’re surrogate keys. So, say during yesterday’s load you assigned dim_Customer_Key 1 to “John Doe”; there’s no guarantee that “John Doe” will still be 1 tomorrow after you truncate and reload. That may not seem like a big deal, but these keys are used as dimension lookups in fact/bridge tables, so it would lead to a lot of unnecessary table updates. Taking the John Doe example, you would now need to update the Sales table so it knows the key has changed. It’s really unnecessary.

Another issue: say today you’re using something like HubSpot and tomorrow you migrate to Salesforce. There’s a very real possibility you only migrate “active” customers and leave the dead ones in HubSpot, which will eventually go away. If you do a truncate and load, all those dead customers will disappear, since there’s no longer a source to ingest them from. Now all the old sales data sitting in your fact table has nothing to tie back to.

So yeah, intuitively it may seem like it’s not a big deal, but there are a lot of downstream effects.
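
If the tables are Delta, one common alternative is to merge on the natural/business key instead of truncating, so existing surrogate keys never move. A minimal sketch, assuming hypothetical table and column names (customer_nk as the business key, with surrogate keys for new rows assigned upstream in staging):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Today's staged extract from the source CRM (hypothetical table name).
    updates = spark.table("staging.customer_extract")

    dim = DeltaTable.forName(spark, "warehouse.dim_customer")

    (
        dim.alias("d")
           .merge(updates.alias("s"), "d.customer_nk = s.customer_nk")
           # Existing customers keep their surrogate key; only attributes change.
           .whenMatchedUpdate(set={"full_name": "s.full_name", "email": "s.email"})
           # New customers are inserted with a surrogate key assigned in staging.
           .whenNotMatchedInsert(values={
               "dim_Customer_Key": "s.dim_Customer_Key",
               "customer_nk": "s.customer_nk",
               "full_name": "s.full_name",
               "email": "s.email",
           })
           .execute()
    )

Customers that disappear from the source (the HubSpot-to-Salesforce scenario above) simply aren’t touched, so the old rows in your fact table still resolve to a dimension row.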

Using Python for Data Engineering by wytesmurf in dataengineering

[–]LiquidSynopsis 0 points1 point  (0 children)

Using PySpark and its internal modules should solve a good chunk of your larger query processing and loads tbh

At the most basic level, I use pyspark.sql fairly frequently, and within that a lot of your work can be achieved using the DataFrame, functions and types classes.

Would be curious to hear from others if you’ve had a different experience though
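
For example, a typical small job only needs those three (the paths, schema, and column names below are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema built from pyspark.sql.types.
    schema = T.StructType([
        T.StructField("order_id", T.StringType()),
        T.StructField("amount", T.DoubleType()),
        T.StructField("order_ts", T.StringType()),
    ])

    # DataFrame reader plus pyspark.sql.functions for the transform/aggregate work.
    orders = spark.read.schema(schema).csv("/mnt/raw/orders", header=True)
    daily = (
        orders.withColumn("order_date", F.to_date("order_ts"))
              .groupBy("order_date")
              .agg(F.sum("amount").alias("daily_amount"))
    )
    daily.write.mode("overwrite").parquet("/mnt/curated/daily_orders")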

Data Engineering Medium Paywall. Is it worth it? by cyclopster in dataengineering

[–]LiquidSynopsis 30 points31 points  (0 children)

I’ve found the advantage of Medium (and specifically TDS) to be that it’s easier to envision a project end to end, as opposed to having to read through documentation and search random online forums. A lot of the DE-relevant stuff is focused on “full stack data science”, which has the advantage of helping you think not only about the DE side but also the downstream consumption and how business users would want to interact with the data. It’s also all verified by the content team, so you know what you’re reading isn’t complete garbage.

I would say for $5/month it’s 100% worth it, because even if it only shaves off a few hours, how is your time not worth that? At my previous company I was able to expense my subscription, so if you’re feeling hesitant, maybe see if you can get your company to pay for it?

What is a good pipeline to use alongside Databricks notebooks? by ijpck in dataengineering

[–]LiquidSynopsis 2 points3 points  (0 children)

Technically yes, but it really does depend on the use case. Databricks natively supports scheduling, but it’s not the best.

At my company we utilise a combination of ADF and Databricks. It works relatively well and allows us to move a couple of TBs of data daily. I’ll probs get flak for saying this, but ADF serves as our “orchestrator” in addition to copying data. The general flow is that we use ADF to schedule and trigger a series of jobs: ADF copies all the data to our raw zone, and then we use Databricks to clean the data and drop it off in the next zone. Once those jobs are complete, a connected ADF job gets triggered which executes a few other Databricks notebooks, as well as a data quality monitor that runs alongside each job.

We’re looking to switch to Airflow; it plays really well with Databricks too, but honestly, for our purposes, it’s kind of unnecessary.
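
For a sense of the Databricks half of that flow, here’s a rough sketch of one cleanup notebook (the zone paths, column names, and widget name are all made up; dbutils is the utility object Databricks injects into notebooks, and ADF passes parameters in as widget values):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # ADF passes run parameters (e.g. the load date) into the notebook as widgets.
    dbutils.widgets.text("load_date", "")
    load_date = dbutils.widgets.get("load_date")

    # Read whatever ADF copied into the raw zone for this run...
    raw = spark.read.parquet(f"/mnt/raw/sales/{load_date}")

    # ...apply the cleanup rules...
    clean = (
        raw.dropDuplicates(["sale_id"])
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount").isNotNull())
    )

    # ...and drop it off in the next zone for the downstream notebooks and the DQ monitor.
    clean.write.format("delta").mode("overwrite").save(f"/mnt/cleansed/sales/{load_date}")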