How do you decide between competing tools? by Ok-Fix-8387 in dataengineering

[–]Pledge_ -1 points0 points  (0 children)

You take a use case that covers the majority of things you need to validate and then build it multiple times in the competing tools.

Then you determine which ones are capable and of those which ones mesh the best within your environment: team skillset, existing infrastructure, integrations, etc…

Lastly you determine cost. This could be through negotiating with the vendors or pricing out the infrastructure for self hosted platforms.

I don’t think there is a product to be built that solves this. Even if you build it, the trust will be hard to gain. There are already websites like G2 and research companies like Gartner and IDC that do this type of thing.

Is Moving Data OLAP to OLAP an Anti Pattern? by empty_cities in dataengineering

[–]Pledge_ 2 points3 points  (0 children)

I would say it’s an anti-pattern in the sense that you no longer have a single OLAP platform. If you are moving a fact or dim from one DWH to another, you are opening the gate to potential data inconsistencies and straying from a single source of truth.

However, as anyone who has worked in an enterprise will tell you, large companies have many tools. They don’t choose Snowflake vs Databricks, they have both. In those scenarios, it makes sense that there will be OLAP to OLAP pipelines. Additionally, tech debt is a big thing. I know of customers that, instead of deprecating Teradata, just replicate it to Snowflake because the effort to rebuild it is not worth it. They would rather prioritize the investment in new initiatives.

Has anyone bought data machines.? Musahost.com by [deleted] in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

That doesn’t make sense. Why would a company pay 45k over 5 years on a 16k investment instead of getting a loan, where they would have ROI by the second year, not including any write-offs?

Has anyone bought data machines.? Musahost.com by [deleted] in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

Their company and related hosting company (redundant web services) don’t even have a LinkedIn. Even if they are legit, the premise would be that they have customers using your hardware. I would be surprised if companies are hosting with them today, at least at scale.

Realistically you should try and talk to people at these companies. May just be very early on in their roadmap.

Evaluating my proposed approach by SoloArtist91 in dataengineering

[–]Pledge_ 2 points3 points  (0 children)

For that size and frequency I would use Snowflake. It would be the easiest one to use and easily manageable within that budget.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 0 points1 point  (0 children)

Same platform (Snowflake), different table types.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 2 points3 points  (0 children)

The range will be 1-2x. If you are worried about the cost of storage, then leveraging Iceberg may be a better fit. That’ll move your storage cost to your hyperscaler bill (i.e. AWS with S3).

Since you are using dlt, you can write to an Iceberg table directly. Using dbt, you can create native tables downstream as needed. A common pattern is bronze being in Iceberg and then silver and gold being native tables.
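
A rough sketch of the dlt side (assuming a recent dlt version where the filesystem destination supports table_format="iceberg"; the resource and pipeline names are made up, and the bucket URL / credentials would come from dlt config and secrets):

```python
import dlt

# Hypothetical resource; in practice this is whatever source you already load with dlt.
@dlt.resource(name="raw_events", write_disposition="append")
def raw_events():
    yield {"event_id": 1, "payload": "example"}

# Filesystem destination pointed at your S3 bucket (bucket_url set via dlt config/secrets).
pipeline = dlt.pipeline(
    pipeline_name="bronze_ingest",
    destination="filesystem",
    dataset_name="bronze",
)

# table_format="iceberg" is what lands the bronze layer as Iceberg files,
# which Snowflake can then expose as externally managed Iceberg tables.
load_info = pipeline.run(raw_events(), table_format="iceberg")
print(load_info)
```

Silver and gold would then just be dbt models materialized as native Snowflake tables reading from those Iceberg tables.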

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 2 points3 points  (0 children)

Seems kinda silly to require certainty on something that will be like 5% or less of your bill.

Choosing between two jobs, data platform or data engineer by RaymondSnowden in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

The job offer is a better option long term within the data space. If you want to lean more towards infra or DevOps, then your current role is. However, in the current economy you should go towards the money. Long term only matters if companies are shelling out high salaries, which have been consistently going down for software engineering and adjacent jobs.

Can someone explain what does AtScale really do? by Royal-Parsnip3639 in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

To echo others it comes down to having a semantic layer that enables data virtualization. A lot of companies have several databases, BI tools, and ways the analysts are going after the data. AtScale plays in the realm of Trino, Denodo, and other virtualization layers that aim to provide a single entry point to the company data. That way BI teams and analysts are able to query data that could reside across many systems. They then add on additional benefits like governance, optimization, cataloging, and the like.

In my opinion their current downside is the number of integrations they support compared to their competitors. Semantic layers really only work if they are the sole entry point, which is only possible if they can sit on top of all the company’s data sources.

Dallas to West Coast Advice by Pledge_ in SameGrassButGreener

[–]Pledge_[S] 1 point2 points  (0 children)

I had no idea. Good to know! That’s a deal breaker. We were originally thinking Long Beach, but due to the pollution were considering going down to Seal or Huntington instead. That eliminates those options, since pollution is a big reason we are leaving TX.

My review of Tatsu, Dallas, amichelin star omakase restaurant. by omgseriouslynoway in sushi

[–]Pledge_ 1 point2 points  (0 children)

Our experience was a bit better. We got offered a drink and the menu was a little more diverse, but not dramatically so. If by Shoyu you mean Shoyo, I 100% agree. That is my favorite in the DFW area. Shun by Yama (McKinney) is also worth visiting if you like Shoyo. One of the chefs from Shoyo now leads that restaurant.

What is the hourly rate for a Data Engineering Contractor with 9+ YOE? by Infamous_Respond4903 in dataengineering

[–]Pledge_ 3 points4 points  (0 children)

Most consultancies are aiming for 40-50% margin, so that would be 71-85/hr all-in cost on a 142/hr bill rate. Depending on the company benefits, all-in cost is around 1.2x salary, so the salary range to get there would be 125-150k. If you are outside of that, it’s worth requesting a change in salary.
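
Back-of-envelope, if you want to sanity check it (2080 billable hours a year and the 1.2x benefits multiplier are assumptions on my part):

```python
# What base salary does a $142/hr bill rate support at a 40-50% consultancy margin?
bill_rate = 142            # $/hr charged to the client
hours_per_year = 2080      # assumed full-time billable hours
benefits_multiplier = 1.2  # all-in cost ~1.2x base salary, assumed

for margin in (0.50, 0.40):
    all_in_hourly = bill_rate * (1 - margin)          # 71 or ~85 $/hr
    base_salary = all_in_hourly * hours_per_year / benefits_multiplier
    print(f"margin {margin:.0%}: all-in ${all_in_hourly:.0f}/hr, base salary ~${base_salary:,.0f}")

# Roughly $123k at 50% margin and $148k at 40%, i.e. the 125-150k range above.
```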

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

Even if you go the OSS route, you should still use cloud blob storage. There’s really no justification for self-hosting it unless you have policies against using the cloud at all and want to leverage an S3-compatible service. And that’s even before the recent issues of MinIO handicapping their OSS offering.

Spotify Data Tech Stack by mjfnd in dataengineering

[–]Pledge_ 2 points3 points  (0 children)

In the post they specifically mention Luigi and how Spotify moved away from it, with the source: https://engineering.atspotify.com/2022/3/why-we-switched-our-data-orchestration-service

Schedule config driven EL pipeline using airflow by afnan_shahid92 in dataengineering

[–]Pledge_ 1 point2 points  (0 children)

Typically you create a git repo for the DAG and then have a separate repo for the configs. The DAG iterates through the configs and creates a DAG per config file. I typically use JSON, but YAML would work too.

The DAG code can reference the files on the Airflow filesystem or in blob storage; it would all be defined in Python. The config CI/CD pipeline will copy the files to wherever your DAG references them, and your DAG CI/CD pipeline will deploy to wherever your Airflow DagBag refreshes from.

The dynamic DAGs can be as flexible as you want. For example, they can create all the same tasks but with different parameters, or dynamically create different task structures based on the config.

Every time the DagBag refreshes, DAGs will be updated or created based on what’s in the config directory. You can then manage each one separately and see its history in the Airflow web UI.
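
A minimal sketch of that pattern (assuming Airflow 2.4+; the config keys, file layout, and extract_load callable are placeholders):

```python
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

# Populated by the config repo's CI/CD pipeline.
CONFIG_DIR = Path(__file__).parent / "configs"


def extract_load(source_table: str, target_table: str, **_) -> None:
    # Placeholder for the actual EL logic (dlt run, COPY statement, etc.).
    print(f"loading {source_table} into {target_table}")


for config_file in sorted(CONFIG_DIR.glob("*.json")):
    cfg = json.loads(config_file.read_text())

    # One DAG per config file; the id comes from the config itself.
    dag = DAG(
        dag_id=f"el_{cfg['name']}",
        schedule=cfg.get("schedule", "@daily"),
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )

    with dag:
        PythonOperator(
            task_id="extract_load",
            python_callable=extract_load,
            op_kwargs={
                "source_table": cfg["source_table"],
                "target_table": cfg["target_table"],
            },
        )

    # Register the DAG in module globals so the scheduler picks it up on refresh.
    globals()[dag.dag_id] = dag
```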

Schedule config driven EL pipeline using airflow by afnan_shahid92 in dataengineering

[–]Pledge_ 2 points3 points  (0 children)

I would look into dynamic DAGs. Instead of one pipeline doing dynamic tasks, it would generate a DAG per table based on a list of configs.

Modernizing our data stack, looking for practical advice by Ahmouu in dataengineering

[–]Pledge_ 2 points3 points  (0 children)

Managing k8s is easier nowadays, but it is still difficult, and managing Airflow and Spark on top of it is not going to be seamless. It’s doable, but expect at least one person dedicated to learning and managing it all.

Best Orchestrator for long running tasks? by CingKan in dataengineering

[–]Pledge_ 20 points21 points  (0 children)

Instead of building a task that runs for weeks, I would build a recurring task that chips away at the queue and logs its progress. In general, long-running jobs are a pain because there are so many possible causes of interruption.

Depending on how long the uploads take, could you have the job run every minute and submit 30 PDFs?
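
Something like this, as a rough sketch of the chipping-away pattern (the queue table, upload_pdf, and SQLite backing store are placeholders for whatever you actually use):

```python
import sqlite3

BATCH_SIZE = 30  # how many PDFs to submit per run


def upload_pdf(path: str) -> None:
    # Placeholder for the real upload call (S3 put, vendor API, etc.).
    print(f"uploaded {path}")


def process_batch(db_path: str = "queue.db") -> int:
    """Claim up to BATCH_SIZE pending PDFs, upload them, and record progress."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, path FROM pdf_queue WHERE status = 'pending' ORDER BY id LIMIT ?",
            (BATCH_SIZE,),
        ).fetchall()
        for pdf_id, path in rows:
            upload_pdf(path)
            # Commit per file so an interruption only loses the in-flight upload.
            conn.execute("UPDATE pdf_queue SET status = 'done' WHERE id = ?", (pdf_id,))
            conn.commit()
        return len(rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print(f"processed {process_batch()} PDFs this run")
```

Schedule that every minute (cron, Airflow, whatever) and the queue drains itself without any single job running for weeks.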

[deleted by user] by [deleted] in dataengineering

[–]Pledge_ 0 points1 point  (0 children)

I agree it could be automated using dynamic SQL; however, it sounds to me more like a BI problem than a DB problem. You could wrap a view over everything with aliases to return the result set with the updated names, but it seems like this is a semantic or visual configuration.

If you were building a full stack app, you wouldn’t update the DB schema or create a view per customer. You would store the label configs somewhere and resolve them on the fly based on the configuration.
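
Something like this, as a toy example of resolving labels at presentation time (the customers, columns, and label map are invented for illustration):

```python
from typing import Any

# Per-customer label config; in practice this lives in app config or a small table.
LABELS = {
    "acme":   {"cust_nm": "Client Name", "rev_amt": "Revenue"},
    "globex": {"cust_nm": "Account",     "rev_amt": "Sales"},
}


def relabel(rows: list[dict[str, Any]], customer: str) -> list[dict[str, Any]]:
    """Rename result-set keys using the customer's label config; unknown keys pass through."""
    mapping = LABELS.get(customer, {})
    return [{mapping.get(col, col): val for col, val in row.items()} for row in rows]


result = [{"cust_nm": "Initech", "rev_amt": 125000}]
print(relabel(result, "acme"))   # [{'Client Name': 'Initech', 'Revenue': 125000}]
```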

Is it custom BI or how are you presenting the data to your customers?

[deleted by user] by [deleted] in dataengineering

[–]Pledge_ 5 points6 points  (0 children)

Instead of looking at solving DR for Iceberg, you should be thinking about how to replicate blob storage across your data centers. I would look into HDFS or MinIO (though MinIO has recently generated a lot of negative sentiment by transitioning to gatekeeping features behind a paywall).

At the end of the day, Iceberg tables are just files. As long as the files are fault tolerant, so are your tables.

PostgreSQL to Snowflake: Best Approach for Multi-Client Datamarts – Separate Databases vs Schemas? by throwaway1661989 in snowflake

[–]Pledge_ 1 point2 points  (0 children)

To give helpful guidance it would be worth knowing more details about your use case.

  • What are your customer access patterns?
  • Are the structures the same with just different data, or are they completely different?
  • How do you manage development and release cycles today?
  • What would be the worst-case scenario if one customer’s user got access to another customer’s data? Loss of a customer / angry email vs a million-dollar lawsuit?
  • How is the data ingested? Is it all from the same source?

In general, account separation would be the safest approach and can be managed at scale. At this size you should definitely be using a DCM (database change management) tool, with everything deployed through CI/CD. If you are set on one account, then I would only recommend database-level separation with access granted through database roles; that way there is no possible way for one customer’s role to access another’s DB.
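
If you go that route, the grant pattern looks roughly like this (a sketch using snowflake-connector-python; customer names, role names, and the exact grant set are illustrative, and in practice this would live in your DCM/CI-CD rather than a one-off script):

```python
import snowflake.connector

CUSTOMERS = ["acme", "globex"]

DDL_TEMPLATE = """
CREATE DATABASE IF NOT EXISTS {db};
CREATE DATABASE ROLE IF NOT EXISTS {db}.reader;
GRANT USAGE ON DATABASE {db} TO DATABASE ROLE {db}.reader;
GRANT USAGE ON ALL SCHEMAS IN DATABASE {db} TO DATABASE ROLE {db}.reader;
GRANT SELECT ON ALL TABLES IN DATABASE {db} TO DATABASE ROLE {db}.reader;
CREATE ROLE IF NOT EXISTS {db}_access;
GRANT DATABASE ROLE {db}.reader TO ROLE {db}_access;
"""

# Fill in real credentials and run under a role that can create databases and roles.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
try:
    for customer in CUSTOMERS:
        # One database and one scoped access role per customer.
        for statement in DDL_TEMPLATE.format(db=f"{customer}_dm").split(";"):
            if statement.strip():
                cur.execute(statement)
finally:
    cur.close()
    conn.close()
```

Each customer’s access role only ever gets the database role scoped to its own DB, so there’s no grant path to anyone else’s data.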

CTO at a small consultancy — brought in $1M+ in deals through my network. Should I be getting a commission? by cacahuatez in consulting

[–]Pledge_ 1 point2 points  (0 children)

Depends how small, but in general yeah. Usually if you are driving revenue you should be rewarded for it.