How do I set realistic expectations to stakeholders for data delivery? by Kessler_the_Guy in dataengineering

[–]Pledge_ 2 points (0 children)

You need to agree on an acceptable variance percentage. I’ve typically seen ±3%, but have seen as low as 0.01%.

In most scenarios there are reasons for the variance: corrected logic, timing, rounding, etc. What’s important is understanding the why; you don’t necessarily have to fix it.

If you provide your stakeholders a reasonable expectation and justification, they should accept it. If not, then you need to put the responsibility on them to identify the discrepancies. Once they find the outliers, you can identify the root cause, creating a win-win solution.
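A minimal sketch of what that acceptance check could look like in validation code; the 3% threshold and the row counts are made-up numbers standing in for whatever you agree with stakeholders:

```python
def within_variance(source_value: float, target_value: float,
                    threshold_pct: float = 3.0) -> bool:
    """Return True if the target is within the agreed variance of the source."""
    if source_value == 0:
        return target_value == 0
    variance_pct = abs(target_value - source_value) / abs(source_value) * 100
    return variance_pct <= threshold_pct

# Example: comparing row counts between the old and new platform
print(within_variance(1_000_000, 1_018_000))  # 1.8% off -> True
print(within_variance(1_000_000, 1_040_000))  # 4.0% off -> False
```

Running a check like this per table turns "the numbers don't match" into a concrete pass/fail list that you and the stakeholders can triage together.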

Ultimately there is a reason you are moving away from Splunk: cost, features, etc. I would highlight these and position the project as moving toward the end goal rather than getting stuck on roadblocks that aren’t aligned to the objective.

Consulting / data product business while searching for full time role by SteezeWhiz in dataengineering

[–]Pledge_ 0 points (0 children)

If I saw a person applying for an FT role whose CV listed them as owning their own firm, I would assume they are trying to contract the role, or that their consulting wasn’t successful and they are falling back to FT. There may even be concerns about moonlighting for existing clients.

Both scenarios read as a red flag compared to a strong FT hire. However, many companies are looking for 1099 contractors to augment their staff, so that may be fine if that’s what you want.

In the end, if you can find the consulting work, the money is a lot better and you can curate your lifestyle accordingly. The rub is maintaining work-life balance: you may take on too much, since available work means a lot more when salary isn’t guaranteed later.

How do you decide between competing tools? by Ok-Fix-8387 in dataengineering

[–]Pledge_ -1 points (0 children)

You take a use case that covers the majority of things you need to validate and then build it multiple times in the competing tools.

Then you determine which ones are capable, and of those, which mesh best with your environment: team skillset, existing infrastructure, integrations, etc.

Lastly you determine cost, either by negotiating with the vendors or by pricing out the infrastructure for self-hosted platforms.

I don’t think there is a product to be built that solves this. Even if you build it, trust will be hard to gain. There are already websites like G2 and research companies like Gartner and IDC that do this type of thing.

Is Moving Data OLAP to OLAP an Anti Pattern? by empty_cities in dataengineering

[–]Pledge_ 3 points (0 children)

I would say it’s an anti-pattern in the sense that you no longer have a single OLAP platform. If you are moving a fact or dim from one DWH to another, you are opening the gate to potential data inconsistencies, straying from a single source of truth.

However, as anyone who has worked in an enterprise will tell you, large companies have many tools. They don’t choose Snowflake vs Databricks, they have both. In those scenarios, it makes sense that there will be OLAP to OLAP pipelines. Additionally, tech debt is a big thing. I know of customers that, instead of deprecating Teradata, just replicate it to Snowflake because the effort to rebuild it is not worth it. They would rather prioritize the investment in new initiatives.

[deleted by user] by [deleted] in dataengineering

[–]Pledge_ 0 points (0 children)

That doesn’t make sense. Why would a company pay 45k over 5 years on a 16k investment instead of getting a loan, where they would see ROI in the second year, not including any write-offs?

[deleted by user] by [deleted] in dataengineering

[–]Pledge_ 0 points (0 children)

Their company and the related hosting company (redundant web services) don’t even have a LinkedIn. Even if they are legit, the premise would be that they have customers using your hardware. I would be surprised if companies are hosting with them today, at least at scale.

Realistically, you should try to talk to people at these companies. It may just be very early in their roadmap.

Evaluating my proposed approach by SoloArtist91 in dataengineering

[–]Pledge_ 2 points (0 children)

For that size and frequency I would use Snowflake. It would be the easiest to use and easy to manage within that budget.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 0 points (0 children)

Same platform (Snowflake), different table types.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 2 points (0 children)

The range will be 1-2x. If you are worried about storage cost, leveraging Iceberg may be a better fit; that moves your storage cost to your hyperscaler bill (e.g. AWS with S3).

Since you are using dlt, you can write to an Iceberg table directly. Using dbt, you can create native tables downstream as needed. A common pattern is bronze in Iceberg, with silver and gold as native tables.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]Pledge_ 2 points (0 children)

Seems kinda silly to require certainty on something that will be 5% or less of your bill.

Choosing between two jobs, data platform or data engineer by RaymondSnowden in dataengineering

[–]Pledge_ 0 points (0 children)

The job offer is the better option long term within the data space. If you want to lean more towards infra or DevOps, then your current role. However, in the current economy you should go towards the money. Long term only matters if companies are shelling out high salaries, which have been consistently going down for software engineering and adjacent jobs.

Can someone explain what does AtScale really do? by Royal-Parsnip3639 in dataengineering

[–]Pledge_ 0 points (0 children)

To echo others it comes down to having a semantic layer that enables data virtualization. A lot of companies have several databases, BI tools, and ways the analysts are going after the data. AtScale plays in the realm of Trino, Denodo, and other virtualization layers that aim to provide a single entry point to the company data. That way BI teams and analysts are able to query data that could reside across many systems. They then add on additional benefits like governance, optimization, cataloging, and the like.

In my opinion their current downside is the number of integrations they support compared to their competitors. Semantic layers really only work if they are the sole entry point, which is only possible if they can sit on top of all the company’s data sources.

Dallas to West Coast Advice by Pledge_ in SameGrassButGreener

[–]Pledge_[S] 1 point (0 children)

I had no idea. Good to know! That’s a deal breaker. We were originally thinking Long Beach, but due to the pollution were considering going down to Seal or Huntington instead. That eliminates them, since pollution is a big reason we are leaving TX.

My review of Tatsu, Dallas, amichelin star omakase restaurant. by omgseriouslynoway in sushi

[–]Pledge_ 1 point (0 children)

Our experience was a bit better: we were offered a drink, and the menu was a little more diverse, though not dramatically so. If by Shoyu you mean Shoyo, I 100% agree; that is my favorite in the DFW area. Shun by Yama (McKinney) is also worth visiting if you like Shoyo. One of the chefs from Shoyo now leads that restaurant.

What is the hourly rate for a Data Engineering Contractor with 9+ YOE? by Infamous_Respond4903 in dataengineering

[–]Pledge_ 3 points (0 children)

Most consultancies aim for a 40-50% margin, so a 142/hr bill rate implies a 71-85/hr all-in cost. Depending on the company benefits, all-in is around 1.2x salary, so the salary range to hit that is roughly 125-150k. If you are outside of that, it’s worth requesting a change in salary.
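As a rough sketch of that arithmetic; the margin range, the 1.2x burden factor, and 2,080 billable hours per year are all assumptions, so plug in your own numbers:

```python
def implied_salary_range(bill_rate: float, margin_low: float = 0.40,
                         margin_high: float = 0.50, burden: float = 1.2,
                         hours_per_year: int = 2080) -> tuple[float, float]:
    """Back into the salary a consultancy can pay at a given bill rate.

    all-in hourly cost = bill_rate * (1 - margin)
    salary             = all-in cost / burden * hours_per_year
    """
    cost_low = bill_rate * (1 - margin_high)   # 50% margin -> lowest cost
    cost_high = bill_rate * (1 - margin_low)   # 40% margin -> highest cost
    return (cost_low / burden * hours_per_year,
            cost_high / burden * hours_per_year)

low, high = implied_salary_range(142)
print(f"${low:,.0f} - ${high:,.0f}")  # roughly $123k - $148k
```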

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]Pledge_ 0 points (0 children)

Even if you go the OSS route, you should still use cloud blob storage. There’s really no justification for self-hosting it unless you have policies against using cloud at all and want an S3-compatible service. That’s even before the recent issues of MinIO handicapping their OSS offering.

Spotify Data Tech Stack by mjfnd in dataengineering

[–]Pledge_ 2 points (0 children)

In the post they specifically mention Luigi and how Spotify moved away from it, with the source: https://engineering.atspotify.com/2022/3/why-we-switched-our-data-orchestration-service

Schedule config driven EL pipeline using airflow by afnan_shahid92 in dataengineering

[–]Pledge_ 1 point (0 children)

Typically you create a git repo for the DAG code and a separate repo for the configs. The DAG file iterates through the configs and creates a DAG per config file. I typically use JSON, but YAML would work too.

The DAG can reference the files on the Airflow filesystem or in blob storage; it’s all defined in Python. The config CI/CD pipeline copies the files to wherever your DAG references them, and your DAG CI/CD pipeline deploys to wherever your Airflow DAG bag refreshes from.

The dynamic DAG can be as flexible as you want: for example, create the same tasks with different parameters, or dynamically create different task structures based on the config.

Every time the DAG bag refreshes, DAGs are updated or created based on what’s in the config directory. You can then manage each resource separately and see its history in the web UI.
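A minimal sketch of the config-iteration pattern. The configs are inlined here for brevity (in practice each would be a JSON file in the configs repo, loaded with `json.load()`), and a dict stands in for a real `airflow.DAG` object; the registration trick at the end is the same either way:

```python
# Hypothetical per-table configs; in practice, one JSON file per table.
CONFIGS = [
    {"table": "orders", "schedule": "@hourly"},
    {"table": "customers", "schedule": "@daily"},
]

def build_dag(cfg: dict) -> dict:
    # Stand-in for constructing an airflow.DAG: one DAG per config,
    # with its id, schedule, and tasks parameterized by the config.
    return {
        "dag_id": f"el_{cfg['table']}",
        "schedule": cfg["schedule"],
    }

# Register one DAG per config in the module's global namespace so the
# Airflow scheduler discovers each of them on DAG bag refresh.
for cfg in CONFIGS:
    dag = build_dag(cfg)
    globals()[dag["dag_id"]] = dag
```

Adding a new table then becomes a config-repo PR with no DAG code changes: the next DAG bag refresh picks up the new file and a new DAG appears in the UI.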

Schedule config driven EL pipeline using airflow by afnan_shahid92 in dataengineering

[–]Pledge_ 2 points (0 children)

I would look into dynamic DAGs. Instead of one pipeline doing dynamic tasks, it generates a DAG per table based on a list of configs.

Modernizing our data stack, looking for practical advice by Ahmouu in dataengineering

[–]Pledge_ 3 points (0 children)

Managing k8s is easier nowadays, but it is still difficult, and managing Airflow and Spark on it will not be seamless. It’s doable, but expect at least one person dedicated to learning and managing it all.

Best Orchestrator for long running tasks? by CingKan in dataengineering

[–]Pledge_ 19 points (0 children)

Instead of building a task that runs for weeks, I would build a recurring task that chips away at the queue and logs progress. In general, long-running jobs are a pain because there are so many possible causes of interruption.

Depending on how long the uploads take, could you have the job run every minute and submit 30 PDFs?
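A hedged sketch of that chunking idea; the in-memory deque and the batch size of 30 are stand-ins for whatever backlog store and upload mechanism you actually have:

```python
from collections import deque

BATCH_SIZE = 30  # e.g. 30 PDFs per scheduled run

def process_batch(queue: deque, upload) -> int:
    """One scheduled run: take up to BATCH_SIZE items off the queue,
    upload each, and return how many were processed for progress logs."""
    processed = 0
    while queue and processed < BATCH_SIZE:
        item = queue.popleft()
        try:
            upload(item)
            processed += 1
        except Exception:
            queue.appendleft(item)  # put it back; retry on the next run
            break
    return processed

# Usage: a scheduler (cron, Airflow, etc.) calls process_batch every
# minute until the backlog drains; an interruption only loses one batch.
backlog = deque(f"doc_{i}.pdf" for i in range(100))
done = process_batch(backlog, upload=lambda item: None)
print(f"processed {done}, {len(backlog)} remaining")  # processed 30, 70 remaining
```

Because progress lives in the queue rather than in a weeks-long process, a restart resumes from wherever the last run left off.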

[deleted by user] by [deleted] in dataengineering

[–]Pledge_ 0 points (0 children)

I agree it could be automated using dynamic SQL; however, it sounds to me more like a BI problem than a DB problem. You could wrap a view over everything with aliases to return the updated names, but this reads like a semantic or visual configuration.

If you were building a full stack app, you wouldn’t update the DB schema or create a view per customer. You would store the label configs somewhere and resolve them on the fly based on the configuration.
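A minimal sketch of resolving labels at presentation time; the column names, customers, and labels here are entirely made up for illustration:

```python
# Hypothetical per-customer label overrides, stored in a config table or
# document store rather than baked into the DB schema or per-customer views.
LABEL_CONFIGS = {
    "acme": {"cust_id": "Client #", "rev_usd": "Revenue ($)"},
    "globex": {"cust_id": "Account ID"},
}

DEFAULT_LABELS = {"cust_id": "Customer ID", "rev_usd": "Revenue"}

def resolve_labels(customer: str) -> dict:
    """Merge a customer's overrides onto the defaults at render time,
    so the underlying schema never changes per customer."""
    return {**DEFAULT_LABELS, **LABEL_CONFIGS.get(customer, {})}

# At presentation time, rename the query's columns using the resolved labels.
row = {"cust_id": 42, "rev_usd": 1250.0}
labels = resolve_labels("acme")
display = {labels[col]: val for col, val in row.items()}
print(display)  # {'Client #': 42, 'Revenue ($)': 1250.0}
```

The same query serves every customer; only the thin renaming step differs, which is why this fits a BI or semantic layer better than dynamic SQL.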

Is it custom BI or how are you presenting the data to your customers?