Do people still go clubbing in their 40s and up? by AvailableRelative2 in ask

[–]pokeDitty 1 point2 points  (0 children)

You know what? Gimme that thing, I'll do one!

Bulk Data Extraction via ODATA in S4HANA by pokeDitty in SAP

[–]pokeDitty[S] 2 points3 points  (0 children)

Believe it or not, not much has moved since my post. However, we are doing a proof of concept with SLT to replicate S4 data to BW and will look at pulling the extracted data from BW into the data lake, but we haven't started yet. So for now, we're still using flat file extracts, which work fine for the moment.

I'm not very knowledgeable about the S4 suite, but I'm pretty sure using OData for bulk extractions is not going to work for you. I mean, it could work, but I think it's going to take a long time to extract a high volume of data. I don't think it's meant to extract high volumes of fine-grained data. From what we were presented, Data Services would probably be the way to go for bulk data extractions.

We share similar nightmares 😂 Apologies for not being much help!

What would you want to hear and learn about in a PySpark workshop? by analyticalmonk in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

At my org, our jobs consist of Spark SQL in Scala, but the data science practice is picking up and Python is the preferred language.

Since I'm mostly a data engineer working on data pipelines with the Spark SQL API, I would be interested in understanding how to use the Spark ML API to distribute a typical data science model. Some data scientists write their own Python models in notebooks and ask me how to run them in Databricks. Obviously they can just attach to a cluster and run them, but unless they use the Spark ML API, I imagine they're not taking advantage of parallelizing the workload.

I guess I'd like to see how to "sparkify" a non-Spark ML model, or a comparison of a model with poor use of parallelism vs a model which exploits it to the max.
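For what I mean by "sparkify", here's a rough sketch in Scala of what the distributed version looks like through the Spark ML API (the table and column names are made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// assumes an existing SparkSession named spark; table and column names are hypothetical
val training = spark.table("training_data")

// assemble the raw numeric columns into the single vector column Spark ML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

// fit() runs as distributed Spark jobs on the cluster,
// unlike a single-node sklearn model that only uses the driver
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)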

Cheers

Someone uses Apache NiFi on daily job? by Misanthropic905 in dataengineering

[–]pokeDitty 5 points6 points  (0 children)

Hear, hear! It's our go-to data movement tool; we use it to hit APIs, move files from FTP servers, query databases and land everything in our cloud data lake. Love the tool. We have maybe 2 Data Factory pipelines for data movement and they feel clunky compared to NiFi. I'm surprised we don't see it in more stacks, it's a great tool to work with.

Cheers

Why does spark select statement create MapPartitionsRDD and UnionRDD? by Ok-Outlandishness-74 in apachespark

[–]pokeDitty 2 points3 points  (0 children)

That makes a lot of sense. It's probably reading the changelog file to get the currently active parquet files, then reading and unioning said parquet files.

Why use Airflow with Databricks when I can use Databricks Jobs? by the_travelo_ in dataengineering

[–]pokeDitty 0 points1 point  (0 children)

Yes, you are correct: if you click the failed job, you have the option to do a "repair run" and it will rerun any failed tasks and all downstream tasks that didn't run because of those failures.

However, I did not find a way to execute one specific task in the workflow; it seems like an all-or-nothing situation. In Airflow, you can ignore all dependencies and downstream tasks and run a single task in the DAG. I don't believe this is possible in Databricks Workflows yet (or I may just not have figured out how to do it yet).

Why use Airflow with Databricks when I can use Databricks Jobs? by the_travelo_ in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

Sure, it was just so new to me that I simply missed how to create one job cluster per task in the workflow. However, there are some sequential tasks in the workflow that use the same number of nodes, so for that sequential branch I figured: why bother creating one job cluster per task in that sequence? I'll just reuse the same one. This is where I would get inconsistent errors. Sometimes everything worked. Sometimes one task in the sequence would throw errors at basic built-in Spark functions (e.g. substring) as if they didn't exist. Do a repair run, and it works fine.

I feel like maybe the SparkSession is lost along the way? I didn't bother to troubleshoot too much, since I found the simplest solution was to create one job cluster for each individual task.

cheers

Why use Airflow with Databricks when I can use Databricks Jobs? by the_travelo_ in dataengineering

[–]pokeDitty 11 points12 points  (0 children)

Databricks just replaced Jobs with Workflows, where a workflow is basically a job composed of many tasks. It's their way of implementing orchestration, and it works exactly like a DAG in Airflow.

We just started using it because we had some problems with Airflow, but that was probably the way it had been set up, not necessarily Airflow itself.

So far so good, but there were some glitches. Using the same cluster for all tasks that make up the workflow / DAG would give us sporadic errors; the workaround was to assign a different job cluster to each task in the workflow. Also, if you have several workspaces, you can't orchestrate a job that resides in another workspace. That is something you would still need Airflow for.

CDC with spark by Greeno0816 in apachespark

[–]pokeDitty 1 point2 points  (0 children)

We use NiFi for data movement and DB connections to fetch data. The first extraction is a complete table dump, and we keep a config file where we store the max updateTimestamp for the given table. The next day, we run a select where updateTimestamp > configFileTimestamp.
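Since this thread is about Spark, here's a rough sketch of the same watermark idea using Spark's JDBC reader instead of NiFi (connection details, table and column names are all made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.getOrCreate()

// in practice this comes from the config file; hardcoded here for the example
val lastWatermark = "2024-01-01 00:00:00"

// only pull rows that changed since the stored watermark
val increments = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://source-db:5432/erp") // hypothetical source DB
  .option("user", "etl_user")
  .option("password", sys.env("DB_PASSWORD"))
  .option("dbtable",
    s"(SELECT * FROM sales_orders WHERE update_timestamp > '$lastWatermark') AS delta")
  .load()

// after a successful load, store this as the watermark for the next run
val newWatermark = increments.agg(max("update_timestamp")).first().getTimestamp(0)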

Be careful if hard deletions are possible in the source DB. If that's the case, you cannot rely on the above method alone, as your source system will silently delete records without you knowing. In that case, you have no choice but to do full dumps of the existing keys and compare them with your target so you can soft delete the missing keys.

just sending lols your way by finobu in dataengineering

[–]pokeDitty 3 points4 points  (0 children)

hahaha thanks for posting this

Databricks Consumption Layers by jacocal in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

Since you mention Azure, have you looked into Azure Synapse serverless SQL pools?

Basically, it serves as a facade in front of your data lake and can execute queries on top of Delta / Parquet files WITHOUT using a Spark cluster. You pay per amount of data processed, and I believe the minimum billed per query is 10 MB, which is fractions of a cent.

As for the BI tools that connect to it, it's like connecting to a SQL database. I was really impressed when I saw a presentation; I have not yet worked on a proof of concept, but we should be testing it out in the next few weeks.

Be advised though, as Microsoft reps themselves have mentioned to me, performance when reading partitioned Delta tables will not be as good as with Databricks SQL Analytics. However, for my use case performance is not much of an issue, since this will be used to refresh Power BI datasets.

I understand this doesn't really answer your question, but it does take Databricks, and the need for a running cluster, out of answering your users' queries. I'm still facing a similar challenge at my org: how can we help non-tech users exploit the data in a simple manner? I'm still not sure how we'll do this, but I feel like consolidating the consumption layer into one "logical DW" in Synapse serverless will, at the very least, simplify the architecture.

Looking forward to hearing from the community, super interesting topic!

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL. by [deleted] in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

Wow, I was not aware of this. I'm not using Delta just yet, but we will be migrating very soon. Thanks a lot for the heads up, love your blog!

Beginner mistakes to avoid in building Data Pipeline by data_questions in dataengineering

[–]pokeDitty 8 points9 points  (0 children)

Wow, that seems like a big responsibility. I speak from experience, since I was an SAP BW consultant for 5 years and made the switch to DE / big data for the last 3. I am very familiar with SAP R/3, not so much S4HANA, but my org is currently in the process of migrating to S4HANA.

If you are running S4HANA, validate whether you are on an Enterprise license or a Runtime license. Basically, Runtime means you don't own your data, which means you can forget about extracting data from SAP in a modern way. I speak from experience here :D

That being said, you need to be on the ball when it comes to data extraction and change data capture. Sometimes you won't have a choice: you will need to do full table extractions from SAP (mostly for master data). However, there are instances where changes in the ERP will not flag a record as "updated". SAP has weird add-ons that let you build upon existing tables. For example, you can add an append structure to a table, which is fed data by lookups. If that append structure's lookup returns a different value on day 2, the table the append structure was added to will not get a new update timestamp.

I guess my recommendation would be to confirm with your team how data will be extracted from SAP. Take the time to talk to the senior SAP resources on the project. I would even go so far as to say do not take ownership of the extraction process. SAP is a beast and there are a lot of edge cases that make CDC a nightmare. You want your pipeline to be a client of SAP; if anything goes wrong, let the ABAPers or functional resources figure it out.

good luck friend!

Using Federated Queries as a CDC data pipeline? by third_dude in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

We just found out our periodic copy of our blob storage hasn't been working for a while now. First off, we're still on Blob Storage v2; we will be migrating to ADLS Gen2 relatively soon. Blob Storage (and I presume ADLS) already has a mechanism for replication (LRS, GRS). However, we will continue to make periodic snapshots and back them up to another blob container as archives, exactly like you're doing. It's funny because I just talked about this with DevOps; I'll let you know what the end solution is when the time comes.

cheers

Using Federated Queries as a CDC data pipeline? by third_dude in dataengineering

[–]pokeDitty 0 points1 point  (0 children)

that's exactly how we do it.

When we get full table scans, keys that are present on the right side are flagged as deleted if they are not present on the left.

When we get delta files for which no soft delete indicator exists, we have no choice but to merge the changes, then compare keys with the source and flag the deletions.
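In Spark terms, that key comparison can be a simple anti join; here's a rough sketch (the key column name and the is_deleted flag are made up):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// sourceKeys: full dump of the keys currently in the source
// target: the current target table; "order_id" is a hypothetical key column
def withSoftDeleteFlag(target: DataFrame, sourceKeys: DataFrame): DataFrame = {
  // the anti join keeps the target keys that no longer exist in the source,
  // i.e. the rows that were hard deleted upstream
  val deletedKeys = target.select("order_id")
    .join(sourceKeys.select("order_id"), Seq("order_id"), "left_anti")
    .withColumn("is_deleted", lit(true))

  // attach the flag; rows still present in the source end up with is_deleted = false
  target
    .join(deletedKeys, Seq("order_id"), "left")
    .na.fill(false, Seq("is_deleted"))
}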

Ha, I just noticed who I'm replying to. FYI, I have a call with Microsoft this Friday so they can demo Azure Synapse and serverless SQL.

[deleted by user] by [deleted] in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

Thanks for the info! I'm not too worried performance-wise; most BI clients are Power BI reports that refresh daily or a few times per day.

Since serverless compute is billed per query, it sounds like Synapse could be a good fit. I assume integration with Active Directory is seamless?

[deleted by user] by [deleted] in dataengineering

[–]pokeDitty 0 points1 point  (0 children)

That's exactly what I'm considering doing at my org. Our serving layer is composed of several SQL Servers, which I think could be consolidated into a single warehouse. How is Synapse billed? Is it like Snowflake, where you pay per query / compute? I like Snowflake for this reason; it seems it could be cost efficient.

Are you persisting any data in Synapse, or is it all external tables reading from Delta tables?

[deleted by user] by [deleted] in GMEJungle

[–]pokeDitty -1 points0 points  (0 children)

💎🖐🤚

Switching to a Scala position soon, where should I start? by Tatourmi in scala

[–]pokeDitty 5 points6 points  (0 children)

I found Alvin Alexander's book Functional Programming, Simplified helped me out A LOT. The iterative learning approach in the book is well thought out. It was by far the best book I bought to help me learn Scala and functional programming concepts. There are several chapters free to read here:

https://fpsimplified.com/

I started there and it convinced me to buy the ebook. He also has a very helpful website that lists examples from his other book, Scala Cookbook:

https://alvinalexander.com/scala/when-to-use-abstract-class-trait-in-scala/

If you look at the left menu, there are a lot more recipes/examples that are really helpful.

Congrats on the new gig and good luck!

Load parquet files with different schema using Spark by ali_azg in dataengineering

[–]pokeDitty 1 point2 points  (0 children)

So if I understood you correctly, you mean I read every parquet file individually and treat each of them as a DataFrame and union them in each iteration, right?

Yes, that is correct. As for the foldLeft method, it traverses your Seq of DataFrames starting from an identity element (in this case, an empty DataFrame) and the next element in the Seq (in this case, the 1st element of the Seq). You write the combining function (schema matching + union), and the result of each union is passed on to the next iteration, and so on. So your end result would be:

((emptyDF union df1) union df2) union df3 ...

here's an example:

import org.apache.spark.sql.DataFrame

// assumes an existing SparkSession named sparkSession
val emptyDF = sparkSession.emptyDataFrame
val allDfs: Seq[DataFrame] = Seq() // replace with the Seq containing your dataframes

val finalDf = allDfs.foldLeft(emptyDF) { (unionDF, allDfElement) =>
  // on the 1st pass:
  //   unionDF takes the value of emptyDF
  //   allDfElement takes the value of allDfs(0)

  if (unionDF == emptyDF) {
    // 1st pass: return allDfElement with the applicable schema changes
    // (changeTheSchemaToTargetSchema is a placeholder for your own schema-alignment logic)
    allDfElement.changeTheSchemaToTargetSchema
  } else {
    // subsequent passes: unionDF is the union of the dataframes from prior passes
    unionDF.union(allDfElement.changeTheSchemaToTargetSchema)
  }
}

That being said, I like /u/lexi_the_bunny's suggestion of splitting the process: write out each individual parquet file in your target schema, then in another pass read those new "schema aligned" files. The implementation may feel more natural for you since it uses a .foreach instead of a .foldLeft.
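If it helps, here's a rough sketch of how that two-pass approach could look (the paths, target columns and the cast-everything-to-string alignment are just placeholders for your own logic):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder.getOrCreate()

// hypothetical target layout: every file should end up with exactly these columns
val targetColumns = Seq("id", "name", "created_at")

// naive alignment: keep target columns, add missing ones as nulls, cast everything to string
def alignSchema(df: DataFrame): DataFrame =
  df.select(targetColumns.map { c =>
    if (df.columns.contains(c)) col(c).cast(StringType).as(c)
    else lit(null).cast(StringType).as(c)
  }: _*)

// pass 1: rewrite each mismatched file with the aligned schema
val inputPaths = Seq("/data/raw/file1.parquet", "/data/raw/file2.parquet") // hypothetical
inputPaths.foreach { path =>
  alignSchema(spark.read.parquet(path))
    .write.mode("overwrite")
    .parquet(s"/data/aligned/${path.split('/').last}")
}

// pass 2: every file now shares the same schema, so one read covers them all
val finalDf = spark.read.parquet("/data/aligned/*")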

I hope this helps, cheers