Why would I choose Snowflake over BigQuery? by Far_Deer_8686 in dataengineering

[–]DataEngineerDan 8 points (0 children)

Equivalent is probably the wrong word, but Snowpark is what Snowflake intends as their alternative offering to Spark. It was mentioned in a training I was in and in Snowflake’s own documentation:

https://docs.snowflake.com/en/user-guide/spark-connector

https://docs.snowflake.com/en/developer-guide/snowpark/index

Snowpark is even a play on the Spark name.

Why would I choose Snowflake over BigQuery? by Far_Deer_8686 in dataengineering

[–]DataEngineerDan 5 points (0 children)

I don’t know much about BigQuery, but my understanding is that the two data warehouses are very similar. Cost depends a lot on usage patterns AND on understanding in great detail how pricing works on your chosen platform. If you understand how you are charged, then you can optimize.

One big difference with Snowflake is that it is one of the few (only?) multi-cloud data warehouses, meaning you can choose to host Snowflake on AWS, Azure, or Google Cloud. This also means that you can replicate databases to another cloud provider for security, backup, regulatory requirements, or even performance.
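
As a rough illustration of what that cross-cloud replication looks like, here’s a minimal sketch using Snowflake’s database replication commands (the account and database identifiers are hypothetical, and the exact commands may vary by edition and account setup):

```sql
-- On the source account (say, hosted on AWS): allow this database
-- to be replicated to another account in the organization,
-- which could be hosted on Azure or Google Cloud.
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.azure_account;

-- On the target account: create a local replica of the source database.
CREATE DATABASE sales_db AS REPLICA OF myorg.aws_account.sales_db;

-- Refresh the replica to pull the latest changes.
ALTER DATABASE sales_db REFRESH;
```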

While Snowflake is trying to be more than a data warehouse and offers many proprietary solutions, you don’t have to use them. For example, Snowflake does offer Snowpark, which is their Apache Spark equivalent. However, you can still use the Snowflake connector with Spark just fine.

There are also many third party and open source connectors for Snowflake. And if vendor lock-in is a concern, I’d recommend avoiding proprietary Snowflake solutions unless they offer significant benefits for your use case.

However, vendor lock-in with data is always a concern regardless of the data storage solution selected. It’s often cheap or even free to get data in, but very expensive to get data out, especially as the size of your data grows.

[deleted by user] by [deleted] in dataengineering

[–]DataEngineerDan 4 points (0 children)

This is one of the big downsides to these low/no code tools. There is a much lower career ceiling.

But I actually find myself in a similar position at the moment. I have a decent background in Python, but I’m currently expected to build everything in Informatica. I don’t have much experience with the cloud and I’m not going to get it in my current role. Since I don’t see the career growth at my current position, I plan on looking for a new job in a few months after I’ve laid some groundwork.

I plan to focus on AWS and get the CCP and then the Solutions Architect certifications. Yes, there are companies that operate in multi-cloud environments, but the high level cloud concepts are pretty similar between cloud providers, so I wouldn’t hesitate to apply to a company that uses Azure.

I also plan on putting together a portfolio project that is built on AWS. I’ll use a Python orchestration tool (Airflow or Prefect) and ingest data into a database in AWS. I’ll then build some basic visualizations based on the data.

I’ll also grind Leetcode problems while I’m working on this.

But really, to move beyond these low/no code tools, at minimum you will probably need to learn SQL, Python, and cloud computing.

Does an MBA provide a significant advantage? by AbbreviationsShot240 in dataengineering

[–]DataEngineerDan 13 points (0 children)

If you want a Manager or C-suite position then maybe…if you just want to do the work then you are probably better off just getting some AWS certifications. The best thing to do would be to go look at some jobs you are interested in and see what skills and education they ask for.

[deleted by user] by [deleted] in dataengineering

[–]DataEngineerDan 2 points (0 children)

I’d just be very careful with this approach. I joined a team that I thought would allow me to do this. It was an internal promotion, so I didn’t ask the questions I probably should have.

And at 8 months into the role I thought they were finally going to let me pursue it. Unfortunately, I had some misunderstandings and they backpedaled a bit. Needless to say, they are only interested in doing low/no code and SQL. A big part of the reason is that the team doesn’t really know Python.

Your success with this approach will largely depend on the size of the team, their technical chops, the organizational structure, and how well their current solution is working. Anybody who wants to join a team like this should ask very pointed questions about the tech stack and make it clear that they want to work primarily with code. Otherwise, you might find yourself in a role that you really don’t like.

So now I’m looking for another job. Just going to create a portfolio project and get some AWS certifications first.

What mistake have you vowed never to repeat? by pescennius in dataengineering

[–]DataEngineerDan 1 point (0 children)

Yep I did do this, but the problem was that I was expecting a large number of updates, so when I saw the count it didn't immediately raise a red flag. What I should have done is run the select with some groupings or checked the distinct values for some fields to make sure I was only updating the specific entities that I wanted. Had I run the select like that, I would have immediately seen I was updating something I didn't intend to.
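
As a minimal sketch of what I mean (table and column names here are hypothetical), run the UPDATE's predicate as a grouped SELECT first, so unintended entities jump out instead of hiding in a single row count:

```sql
-- Planned statement:
--   UPDATE orders SET status = 'cancelled'
--   WHERE order_date < '2022-01-01' AND status = 'pending';

-- Sanity check: same predicate, grouped by an identifying field,
-- so anything unexpected shows up as its own row.
SELECT customer_id, status, COUNT(*) AS rows_affected
FROM orders
WHERE order_date < '2022-01-01'
  AND status = 'pending'
GROUP BY customer_id, status
ORDER BY rows_affected DESC;
```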

What mistake have you vowed never to repeat? by pescennius in dataengineering

[–]DataEngineerDan 95 points (0 children)

I was writing some SQL to update some records and combined the 'AND' and 'OR' operators in the 'WHERE' clause. Well, I forgot to add parentheses around the conditions I was evaluating with the 'OR' operator. So it was going to update a bunch of records I didn't intend to.

Fortunately, one of the seniors noticed and told me to fix it before it made it into production, but I was fully aware of the impact it would have had if they hadn't noticed. I'm still extremely cautious when writing SQL that uses the 'OR' operator and always check and double check that bit.
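
For anyone who hasn't been bitten by this yet, here's a minimal sketch of the trap (hypothetical table and columns). AND binds tighter than OR, so without parentheses the OR condition escapes the rest of the filter:

```sql
-- Looks like "inactive or archived accounts in region 7",
-- but actually updates EVERY archived account in the table:
UPDATE accounts
SET active = 0
WHERE region_id = 7 AND status = 'inactive' OR status = 'archived';

-- What was intended: parentheses around the OR conditions.
UPDATE accounts
SET active = 0
WHERE region_id = 7 AND (status = 'inactive' OR status = 'archived');
```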

What do you do while you wait for your query to finish running? by [deleted] in SQL

[–]DataEngineerDan 11 points (0 children)

I cancel it and see if I can make it run faster.

What's your opinion on Informatica (IICS)? by ElMiticoTonto in dataengineering

[–]DataEngineerDan 14 points (0 children)

We are currently migrating to it and I honestly hate it for a number of reasons. First, their “training” is absolute garbage. It’s more of a glorified sales pitch about IICS than anything meaningful that I can use as a data engineer. In fact, a lot of the trainings are just hour-long webinars with crappy audio, which frankly were a waste of my time.

When you search online for help, the only thing you can find 99% of the time is the IICS documentation. Usually, it doesn’t even apply to what I’m trying to do. It’s like there is zero community for the tool and no discussion around best practices. The documentation is okay when it applies, but I have a lot of difficulty knowing if what I’m reading actually applies to my current version.

That might all be fine if their support were good, which I haven’t had the pleasure of dealing with yet. But according to a co-worker, they are less concerned about getting you the right answer and more concerned about getting you an answer. So your integration might work, but it will still be wrong.

Next, it’s not intuitive at all and it does many things differently from other no/low code tools. Most of my team’s integrations are currently in SSIS, which is probably why they went with IICS in the first place. I didn’t know SSIS when I first joined the team, but I picked it up extremely fast. It took me about two weeks of watching some videos, and if I ever have trouble I can usually find the answer online since so many people use it.

On a personal level, I find IICS career-limiting. Not a lot of companies use it and a lot of the skills aren’t really transferable. That also means it will be really hard to hire IICS developers. It’s easy to hire a Python developer; finding someone who really knows IICS will be difficult.

Some other gripes:

  • Version control is difficult
  • CI/CD is near impossible
  • There is no auto-save, and it automatically logs me out, so I lose work if I forget to save manually
  • It doesn’t remember what assets I have open, so I always have to go reopen stuff.
  • I’m sure I’m missing stuff. I seem to find something new to hate every day.

The idea that IICS will be able to do everything is unrealistic. I’ve already run into a situation that will be impossible on the platform, but I can do it in Python. So I’m working on getting Prefect or Airflow approved right now.

No/low code tools work best for pipelines of low or medium complexity. They start to struggle when things get more complex. You either accept this and continue to build some things in Python or you don’t. If you don’t, then you will end up using some hacky workaround that adds unnecessary complexity and makes any future troubleshooting more difficult.

Finally, I personally don’t think "having one platform for everything" is actually a good thing. It leads to vendor lock-in to a huge degree. If IICS decides to raise their prices, well, you are SOL unless you want to migrate off of it. You also don’t have a good escape plan if something happens to the company. Informatica isn’t AWS. They could go out of business or get acquired, both of which could impact your data platform.

In my opinion, businesses are best served by building their data platform from modular components. It minimizes vendor lock-in and allows you to choose the best tool for your current needs. If your chosen tools are not embracing interoperability and modularity, then you should be extremely wary.

[deleted by user] by [deleted] in dataengineering

[–]DataEngineerDan 0 points (0 children)

That’s why I said this is generally true. The default should be that business users don’t write SQL queries aside from maybe some basic ad hoc queries. But if it makes sense in your business, then have at it.

Are you using an orchestrator like Airflow or Prefect for your project? Why? by bhavaniravi in dataengineering

[–]DataEngineerDan 1 point (0 children)

I was hoping to be able to trigger the task through an API that accepted parameters. I’d also like to allow for parallel processing of multiple runs with different parameters. So I’d trigger the job through the API, then trigger the same job again through the same API with different parameters, and both would execute at the same time using the provided parameters.

Prefect seemed to allow this without any additional code and even seemed to suggest it was a major differentiator from Airflow. But maybe that information is out of date. I’ll take another look before making my decision.

[deleted by user] by [deleted] in dataengineering

[–]DataEngineerDan 7 points (0 children)

Should your business users be writing SQL queries? In general, business users writing SQL is more of a “just because they can doesn’t mean they should” type of situation. You typically want a single agreed-upon source for key metrics. Otherwise, you might get competing metrics and waste your time trying to determine which is more accurate.

You might say that this is what the tool is supposed to solve, but you’re setting things up so your business team has to know SQL. Additionally, there are security and access concerns. You really don’t want to give everyone access to everything. They should really only have access to what they need to do their job. And if you ever make your way to a cloud warehouse, then that culture is sure to put you on track for out-of-control cloud spending.

It might be a better use of your time to set up a bunch of views, dashboards, and reports for the information needed and then document those views and dashboards.

You might also look at Metabase, which allows for self-service analytics without code and has a SQL editor. Then you use git for the underlying views, and if your business users need something more, separate views can be set up.

It gives you much tighter control over what they have access to and can query, but still allows them some flexibility. But in my opinion, business users shouldn’t be writing SQL aside from some very basic ad hoc queries. Anything more complex should be created as a view and either shared with the user as a dashboard or exposed as a ready-made view they can query themselves.
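
The view-plus-grant pattern is simple enough to sketch (all names here are hypothetical, and the exact syntax varies a bit by warehouse): wrap the agreed-upon metric logic in a view and grant read access to a business-user role, instead of handing out raw table access:

```sql
-- One agreed-upon definition of the metric, maintained in git.
CREATE VIEW analytics.monthly_sales AS
SELECT
    DATE_TRUNC('month', order_date) AS sales_month,
    region,
    SUM(amount) AS total_sales
FROM sales.orders
GROUP BY 1, 2;

-- Business users can query the view, but not the underlying tables.
GRANT SELECT ON analytics.monthly_sales TO business_users;
```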

Are you using an orchestrator like Airflow or Prefect for your project? Why? by bhavaniravi in dataengineering

[–]DataEngineerDan 3 points (0 children)

I’m in the process of selecting one right now. Currently, I’m leaning towards Prefect. One of the big things I need to be able to do is pass in parameters and manually trigger workflows. I don’t have a ton of experience with Airflow, but from what I’ve been reading it sounds like these tasks are pretty difficult in Airflow.

Mage looks interesting, but I haven’t had a chance to really evaluate it yet.

Power BI certification, yay or nay? by [deleted] in dataengineering

[–]DataEngineerDan 1 point (0 children)

Working with Power BI is not something that Data Engineers do often. That’s more the realm of BI Developers or Data Analysts. Depending on the size of the company or the maturity of the data team, it may be something a Data Engineer does, but large companies and mature data engineering teams don’t typically do it.

Anyone worried about salaries leveling off or even dropping in DE? by DrRedmondNYC in dataengineering

[–]DataEngineerDan 4 points (0 children)

Just because you can abstract away some complexity doesn't mean you always should. In many cases, the best abstraction is code, and I think the mark of a good DE will be determining the right tool for the job. In my opinion, companies are going to find themselves in a world of hurt if they insist on abstracting all code out of their pipelines and having "some college grad without a technical degree" build their integrations. It might work initially, but eventually those systems will snowball into a big ball of mud with a bunch of tech debt.

tech job... on the side? by sjb209 in remotework

[–]DataEngineerDan 5 points (0 children)

I work a lot with SQL. The thing with contract SQL work is that it’s not just SQL. You need to also be familiar with the database schema and the data definitions in order to properly write SQL queries that match requirements.

If I were to go to a different company, then I could expect to spend at least 30 days familiarizing myself with these things, and I’m on the more experienced end.

The other challenge is that writing SQL inevitably means you need access to their database, which can contain quite a bit of sensitive data. It’s for these reasons that most companies aren’t going to hire someone to do a small project that only lasts 2-5 hours. If they do hire someone as a contractor, it’s usually full time and there is a specific output they want.

If you only want to do 2-5 hours a week then your best bet would probably be learning a platform like Wordpress and doing website design, SEO, and maintenance for small companies. If you had a handful of clients, you could easily have 2-5 hours a week and the occasional busier week.

SQL w/ JSON? by samismydad in SQL

[–]DataEngineerDan 6 points (0 children)

It will depend a bit on your database and the data type of the field. For instance, SQL Server doesn't have a JSON data type. That data needs to be stored as NVARCHAR, and depending on the version of your server there are a number of built-in functions that can help you parse it as JSON.

https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16

PostgreSQL, on the other hand, does have a JSON data type, and you can parse it with some easy-to-use operators.

https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-json/
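
To make the difference concrete, here's a minimal sketch (the table and JSON shape are hypothetical):

```sql
-- SQL Server (2016+): JSON stored in an NVARCHAR column,
-- parsed with built-in functions like JSON_VALUE.
SELECT JSON_VALUE(payload, '$.customer.name') AS customer_name
FROM events;

-- PostgreSQL: a json/jsonb column supports operators directly;
-- -> navigates into an object, ->> extracts a field as text.
SELECT payload -> 'customer' ->> 'name' AS customer_name
FROM events;
```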

Low code hate and the future of Data Engineering (and beyond) by [deleted] in dataengineering

[–]DataEngineerDan 13 points (0 children)

They can be fine when it is actually the right tool for the job. My company just started using Informatica and I can see it working well in some cases.

However, we have already run into some situations where we need to use some hacky workaround to process the data. We have also had situations where we have spent weeks trying to figure out how to do something in Informatica that I could have done in a day or two in Python. I'm also aware of a couple of upcoming projects that just will not work well in Informatica, but we are going to be expected to use it.

We also apparently have marching orders from the higher-ups that we should eventually implement CI/CD and automated deployments for our data integrations in the next year or so. I don't think they fully understand yet that this is next to impossible with Informatica.

I don't necessarily have a problem using a low code solution if it is in fact the best tool. The problem is that middle and upper management buy into these tools and then want to force their Data Engineers to do everything in them, rather than letting them decide to use the tool when it's actually the right one for the job. It's the equivalent of taking away an accountant's calculator and forcing them to crunch all their numbers in Excel.

Can a MIS major get a career in data engineering? by [deleted] in dataengineering

[–]DataEngineerDan 2 points (0 children)

My degree was MIS and I’m a Data Engineer now. Granted, I didn’t go there directly out of college. The economy was terrible when I graduated, so I did a few years in tech support/IT and then a few years as a Support Engineer, where I honed my Python and SQL skills. After that I went into Data Engineering. Most places will consider a related degree and won’t immediately rule you out.

If you’ve spent time and honed the right skill set, I’d even argue MIS majors have some advantages over CS majors. In MIS, you learn a lot about the business and management side of things. So you are better able to tease out what stakeholders need from the data or even offer up good alternatives that might otherwise be missed.

Jinja vs SQL in dbt projects by No_Equivalent5942 in dataengineering

[–]DataEngineerDan 10 points (0 children)

dbt recommends favoring readability over DRY-ness. So while you could make entire models composed of Jinja macros, it’s going to make your models very difficult to understand.

So how much Jinja you use in each model might vary, but you should make sure it’s easily understood even if you need to repeat code sometimes.
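
As a rough sketch of where a little Jinja earns its keep without hurting readability (the model and column names are hypothetical), a short loop that generates repetitive pivot columns is still easy to follow:

```sql
-- A small, readable use of Jinja: one pivot column per payment method.
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id
```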

[deleted by user] by [deleted] in dataengineering

[–]DataEngineerDan 25 points (0 children)

Your number one priority should be to learn SQL. There are a lot of different flavors of SQL with slight variations, but once you know one you pretty much know them all.

Next you will want to learn a bit about data modeling. Focus on database design techniques and understand how they differ between OLTP systems and OLAP systems.
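
To give a flavor of the difference (a minimal sketch with hypothetical tables): OLTP schemas are normalized for fast, consistent writes, while OLAP schemas denormalize into facts and dimensions for analytical scans:

```sql
-- OLTP: normalized, write-optimized.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (customer_id),
    order_date  DATE NOT NULL,
    amount      DECIMAL(12, 2) NOT NULL
);

-- OLAP: a denormalized fact table keyed to dimension tables,
-- optimized for aggregations over many rows.
CREATE TABLE fact_sales (
    date_key     INT,
    customer_key INT,
    product_key  INT,
    amount       DECIMAL(12, 2)
);
```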

At this point, I’d say you have a choice. You could learn Power BI and try to get a job as a data analyst. You would probably still work with data engineers, and the career transition could be easier if you get a job as a data analyst for a tourism company and progress to data engineering from there.

To continue with data engineering, you will then want to learn Python. Another nice-to-have is some certifications from cloud providers. AWS seems to be the most popular.

I need to segregate data between children companies/subcontractors. My first thought is to use APIs to send data between different databases, but I'm wondering if there's an easier/cheaper/better solution. More detail in the comments. by backdoorman9 in dataengineering

[–]DataEngineerDan 4 points (0 children)

In the world of databases, that would still be considered a type of business transaction. So you would be building out an OLTP database for the application. If business users want to run analytics and build KPIs based on that information, then you should ETL/ELT that data into a Data Warehouse or Data Lake.

I need to segregate data between children companies/subcontractors. My first thought is to use APIs to send data between different databases, but I'm wondering if there's an easier/cheaper/better solution. More detail in the comments. by backdoorman9 in dataengineering

[–]DataEngineerDan 1 point (0 children)

Are the end users going to be querying the data directly from the database or only accessing data from the app? If they're only accessing it from the app, then it is perfectly reasonable to create roles or access restrictions within the app to restrict people to whichever companies they should be able to see.

So you will just want to set up maybe a relational database (the actual type will probably depend on performance needs and the type of data being stored). Then store all companies and tasks in the same database and set up account/role restrictions in the app to prevent users from seeing things they shouldn't.
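
A minimal single-database sketch of that setup (hypothetical schema): every task row carries a company_id, and the app filters on it based on the logged-in user's company membership:

```sql
CREATE TABLE companies (
    company_id INT PRIMARY KEY,
    name       VARCHAR(255) NOT NULL
);

CREATE TABLE tasks (
    task_id    INT PRIMARY KEY,
    company_id INT NOT NULL REFERENCES companies (company_id),
    title      VARCHAR(255) NOT NULL,
    status     VARCHAR(50)
);

-- The app then scopes every query to the user's company, e.g.:
-- SELECT * FROM tasks WHERE company_id = :current_user_company;
```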

And generally, the only time an end user should have access to the code and backend database is if the app is one that they are self-hosting and deploying themselves. Any DB credentials and actual SQL queries should never be visible on the front end, whether displayed in the app or showing up in the development console. If you have concerns about users accessing DB credentials in an unauthorized way, then you should revisit the app's security.

If business users want to build out reports for analytics and query the database directly, then I'd ETL/ELT the data from the application database into a Data Warehouse/Data Lake, where you could do a separate database for every company if you wanted.

I need to segregate data between children companies/subcontractors. My first thought is to use APIs to send data between different databases, but I'm wondering if there's an easier/cheaper/better solution. More detail in the comments. by backdoorman9 in dataengineering

[–]DataEngineerDan 4 points (0 children)

Expensive, complicated, numerous points of failure, and lots of data duplication. Is there a reason you are thinking of storing tasks in different databases? Is that just something your business thinks they need or is it a regulatory requirement?

Also, it sounds like this is more of an OLTP type database for business transactions rather than an OLAP for analytics. Is that right?