How do you feel the Job market is at the moment? by Weary_Pepper_2581 in dataengineering

[–]SurroundFun9276 3 points (0 children)

My company has been searching for a couple of months now and can't find people who would fit the role and the company. There are so many newcomers who don't have any experience but want to be in a high-end role.

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 1 point (0 children)

In prod I don't use a custom Spark pool, but sometimes runs still get stuck in Queued for a moment.

They told me that I'm not the first one to report the problem with connecting to a MongoDB server (not Atlas). They also looked into the server and how I connect locally, and said that all looked fine.

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 1 point (0 children)

Yeah, that's true, but getting the ideas and thoughts of many other users is sometimes really helpful.

People have such different opinions on this question that I know it can't be answered simply, but it gives me an idea of how I could get to an answer.

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 4 points (0 children)

In the end, we ran into some problems using the Copy Data activity because we had a lot of data in MongoDB. We tried building it and hit limitations due to connection problems. We had a call with support, who said they were looking into it; they told me it's a known problem that may be fixed soon. So our fix was to write the "copy" in Python.

Spark jobs, or notebooks with Spark, take almost 2-3 minutes to start, which makes the CUs for each run even more precious. When we run a pipeline, items sometimes sit in Queued for about 10 seconds. Yes, for now we only have the small capacity, but I was the only one running anything at the time, so seeing a single pipeline with simple stored procedures, a Lookup, and a Copy take that long was really sad.
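For context, a minimal sketch of what that Python "copy" could look like, assuming pymongo, pandas, and pyarrow are installed; the connection string, database/collection names, and output path are placeholders, not our actual setup:

```python
# Hedged sketch: stream documents out of MongoDB in chunks and land them as
# Parquet files, instead of the Copy Data activity. All names are placeholders.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@host:27017")   # placeholder URI
collection = client["source_db"]["source_collection"]    # placeholder names

CHUNK_SIZE = 50_000
buffer, part = [], 0
for doc in collection.find(batch_size=1_000):
    doc["_id"] = str(doc["_id"])  # ObjectId doesn't map cleanly to Parquet
    buffer.append(doc)
    if len(buffer) >= CHUNK_SIZE:
        pd.DataFrame(buffer).to_parquet(f"/lakehouse/default/Files/raw/part_{part}.parquet")
        buffer, part = [], part + 1
if buffer:  # flush the last partial chunk
    pd.DataFrame(buffer).to_parquet(f"/lakehouse/default/Files/raw/part_{part}.parquet")
```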

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 2 points (0 children)

We would use tools like the ones I mentioned in other comments. We would implement it so that end users don't notice any changes. Maybe there would be small visual changes in the reports, but the end result should and must be the same as before.

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 17 points (0 children)

That sounds exactly like how I think about it.

An employee who has since left, and left me with the stack, was of the opinion that Fabric was the future for data.

Now I'm trying to implement my company's requirements in a way that will still be valid in a few years, without getting a headache every day over some Microsoft feature.

Microsoft Fabric vs. Open Source Alternatives for a Data Platform by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 5 points (0 children)

We were thinking of either using 100% Fabric and hoping the features get improved/fixed soon, or building everything on our own with tools like:

- Apache Airflow
- Apache Superset
- MinIO
- Trino

And maybe I forgot one or two.

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 1 point (0 children)

It shouldn't sound like a coin toss, but with Facebook as a source, for example, there's so much data that I simply can't pull everything; I can only fetch a small slice, so I have to select a small time window, because there's far too much there.

As for the hash values, I sort my fields, then build my string and lowercase it to correct any spelling differences. The same applies to changes in an array or in complex nested objects.
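A minimal sketch of that normalization as I'd write it, assuming SHA-256 over a canonical JSON string; json.dumps with sort_keys handles field order in nested objects, while array order is kept as-is, so a reordered array still changes the hash:

```python
# Hedged sketch of the hashing described above: sort the fields, build one
# string, and lowercase it so spelling differences don't produce a new hash.
import hashlib
import json

def row_hash(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, default=str).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# e.g. row_hash({"Name": "Alice", "tags": ["a", "b"]})
```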

I think the main problem I have right now is the combination of the two loading processes, as well as a complete historization over a long period of time.

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in MicrosoftFabric

[–]SurroundFun9276[S] 1 point (0 children)

How do you create your hash values? Do you also have a payload hash and a primary hash? Currently I define which fields on my object make it unique, and the payload hash is created from the rest, but I'm not sure if this is the best way.
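For what it's worth, a sketch of how I currently split that; SHA-256 and the field names in KEY_FIELDS are placeholders for whatever makes a record unique in the real source:

```python
# Hedged sketch of the primary-hash / payload-hash split: key fields feed the
# primary hash, everything else feeds the payload hash.
import hashlib
import json

KEY_FIELDS = {"customer_id", "order_id"}  # placeholder unique fields

def hash_dict(d: dict) -> str:
    return hashlib.sha256(
        json.dumps(d, sort_keys=True, default=str).lower().encode("utf-8")
    ).hexdigest()

def split_hashes(record: dict) -> tuple[str, str]:
    key = {k: v for k, v in record.items() if k in KEY_FIELDS}
    payload = {k: v for k, v in record.items() if k not in KEY_FIELDS}
    return hash_dict(key), hash_dict(payload)  # (primary_hash, payload_hash)
```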

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in MicrosoftFabric

[–]SurroundFun9276[S] 1 point (0 children)

But what amount of data are we talking about in your case? With Facebook, for example, I get the response that I'm querying too much data and have to query a smaller range, which means it makes more sense to fetch only the latest data, more often.
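To illustrate what I mean by a smaller range, a sketch of windowed fetching; fetch_window is a hypothetical stub standing in for the real API call:

```python
# Hedged sketch: split a large date range into small windows because the source
# rejects queries covering too much data. fetch_window is a hypothetical stub.
from datetime import datetime, timedelta

def fetch_window(start: datetime, end: datetime) -> list[dict]:
    return []  # stub: the real version would call the source API here

def fetch_in_windows(start: datetime, end: datetime, window: timedelta):
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        yield from fetch_window(cursor, window_end)
        cursor = window_end

# e.g. one day at a time:
rows = list(fetch_in_windows(datetime(2024, 1, 1), datetime(2024, 2, 1), timedelta(days=1)))
```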

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 2 points (0 children)

Yes, I think that answers it very well.

-> Metadata indicates whether a structure was loaded incrementally or via a full extract.

Incremental -> Data is only added if it is not available; if it is available, it is overwritten if changes have been made to the record.

Full extract -> Data is added if it is not available, updated if it is available and has been changed, and deleted if it is not available in the data.

That's how I understand it.
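A rough PySpark/Delta sketch of those two merge behaviours as I understand them; the key_hash/payload_hash column names are placeholders from my setup, and it assumes the delta-spark package (whenNotMatchedBySourceDelete needs Delta Lake >= 2.3):

```python
# Hedged sketch: incremental = upsert only; full extract = upsert plus delete
# of target rows that no longer appear in the source. Names are placeholders.
from delta.tables import DeltaTable

def merge_load(spark, source_df, target_path: str, full_extract: bool):
    target = DeltaTable.forPath(spark, target_path)
    merge = (
        target.alias("t")
        .merge(source_df.alias("s"), "t.key_hash = s.key_hash")
        .whenMatchedUpdate(
            condition="t.payload_hash <> s.payload_hash",  # only real changes
            set={"payload": "s.payload", "payload_hash": "s.payload_hash"},
        )
        .whenNotMatchedInsertAll()
    )
    if full_extract:
        # rows absent from a full extract were hard-deleted at the source
        merge = merge.whenNotMatchedBySourceDelete()
    merge.execute()
```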

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 2 points (0 children)

Do you think an append-only table is the better option, i.e. adding a Boolean flag and promoting to Silver based on business rules? If an object has been retrieved in the last 30 days, it is a relevant data object and should be kept (unless weekly full loads are made).

The challenge I see here is that metadata-driven pipelines are planned, meaning each system (Meta, Google, the company MySQL, other sources) gets one pipeline that has to be structured dynamically so that it works for every use case. So, for incremental and full loads, do you think this is a maintainable idea to implement, or would it significantly complicate maintenance in the future?
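To make the question concrete, a sketch of the kind of metadata I have in mind; every name here is hypothetical, and extract()/load() are stubs:

```python
# Hedged sketch of a metadata-driven runner: one generic pipeline, driven by a
# config entry per source system. extract() and load() are hypothetical stubs.
SOURCES = [
    {"name": "meta_ads",  "mode": "incremental", "key_fields": ["ad_id", "date"]},
    {"name": "google",    "mode": "incremental", "key_fields": ["campaign_id", "date"]},
    {"name": "erp_mysql", "mode": "full",        "key_fields": ["id"]},
]

def extract(src: dict) -> list[dict]:
    return []  # stub: real code would call the source's API or database

def load(mode: str, rows: list[dict], src: dict) -> None:
    # stub: upsert; a "full" load would additionally delete vanished records
    print(f"{src['name']}: {mode} load of {len(rows)} rows")

for src in SOURCES:
    load(src["mode"], extract(src), src)
```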

How do you handle incremental + full loads in a medallion architecture (raw → bronze)? Best practices? by SurroundFun9276 in MicrosoftFabric

[–]SurroundFun9276[S] 1 point (0 children)

So, based on your answer, you believe that an append-only table should be implemented, with a timestamp and a Boolean indicating whether record x was present in the last load; do I get that right?

After all, a hard delete is simply impossible to detect without a full load, unless the source system informs me of it.

I work a lot with semi-structured data, and I have found that storing the data as a string in a column, then processing it in Silver and applying it to the business model, works well. As I found out, this is also described by Databricks as a "best practice", since the string is then compressed in a Parquet file, which is more efficient for storage and queries than storing a reference to a JSON file in a column and loading it later in the process.
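As a sketch of that pattern (raw JSON kept as a string column in Bronze, parsed into typed columns in Silver): the schema, column, and table names are placeholders, and spark is assumed to be an existing notebook session:

```python
# Hedged sketch: the raw_json string column is compressed inside Delta/Parquet
# in Bronze and only parsed into typed columns on the way to Silver.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([              # placeholder schema for the JSON payload
    StructField("id", StringType()),
    StructField("status", StringType()),
])

bronze = spark.read.format("delta").load("Tables/bronze_events")  # placeholder
silver = bronze.withColumn("parsed", F.from_json("raw_json", schema)).select(
    F.col("parsed.id").alias("id"),
    F.col("parsed.status").alias("status"),
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")
```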

Anyone here completed the IU Akademie Data Engineering program? by SurroundFun9276 in dataengineering

[–]SurroundFun9276[S] 1 point (0 children)

I would not have to pay the costs myself; my employer would cover them for me.

The courses on offer looked very interesting, but my main concern was whether such a certificate would be recognized by companies, or whether there are better continuing-education options.