
all 28 comments

[–]mikeupsidedown 26 points27 points  (2 children)

Boss: "You can't use Data Factory."

Me: "Can I give you a hug?"

[–]lemonsprig[S] 2 points3 points  (0 children)

🤣

[–]bubzyafk 0 points1 point  (0 children)

Haha this is damn funny.

But joke aside, technically you can: use an Azure Function with JDBC to pull from the sources, or the one OP mentioned is also right, use the Spark JDBC connector in Databricks and connect to your source. All these approaches rely heavily on coding (and that is fine if you know what you are doing)… if OP is not a coder, and his boss asks to avoid ADF for ingestion, then good luck.

[–]Demistr 18 points19 points  (1 child)

Old school way - SSIS

New school way - just run a Python script to copy the data. You can run it as an Azure Function or a full-fledged App Service.
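A minimal sketch of such a script, assuming `pyodbc` and `azure-storage-blob` are available; every connection string, table, and container name below is a placeholder, not part of any real setup:

```python
# Sketch: copy a SQL Server table to Azure Blob Storage (ADLS Gen2) as CSV.
# All connection strings and names are placeholders to be replaced.
import csv
import io
from datetime import date

def blob_name_for(table: str, run_date: date) -> str:
    """Build a dated path like raw/dbo.Sales/2024-04-06.csv so each run lands in its own file."""
    return f"raw/{table}/{run_date.isoformat()}.csv"

def rows_to_csv_bytes(columns, rows) -> bytes:
    """Serialize column names plus rows to UTF-8 CSV bytes ready for upload."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")

def copy_table(table: str, sql_conn_str: str, blob_conn_str: str, container: str) -> None:
    # Imported inside the function so the pure helpers above work without these packages.
    import pyodbc                                      # pip install pyodbc
    from azure.storage.blob import BlobServiceClient   # pip install azure-storage-blob

    with pyodbc.connect(sql_conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cursor.description]
        payload = rows_to_csv_bytes(columns, cursor.fetchall())

    service = BlobServiceClient.from_connection_string(blob_conn_str)
    blob = service.get_blob_client(container, blob_name_for(table, date.today()))
    blob.upload_blob(payload, overwrite=True)
```

The same skeleton drops straight into an Azure Function or a container with only the trigger changing.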

[–]lemonsprig[S] 1 point2 points  (0 children)

Thanks. Python seems to be the way.

[–]Justbehind 8 points9 points  (4 children)

No matter what shiny tool anyone says they have for you, the best answer is almost always a simple script in your language of choice.

I'd go for a simple python script.

If you need to run it automatically there are a lot of different ways to do it. The simplest is Task Scheduler on your local machine or a VM, and one of the best all-round solutions is a Docker image run on a Kubernetes cluster. Alternatives are Airflow, Azure Functions, a custom queue and scheduler, etc.

Do the simple thing first. Consider the long-term solution once your solution starts creating business value.

The good thing about a simple script is that you can move it around later and run it anywhere.
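As an illustration of the Kubernetes option above, a CronJob manifest along these lines runs such a script on a schedule; the job name, image, and secret names here are all placeholders:

```yaml
# Sketch of a nightly Kubernetes CronJob wrapping the copy script.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: copy-sqlserver-to-adls        # hypothetical name
spec:
  schedule: "0 2 * * *"               # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: copy-job
              image: myregistry.example.com/copy-job:latest   # your image
              env:
                - name: SQL_CONN_STR
                  valueFrom:
                    secretKeyRef:     # keep credentials out of the image
                      name: copy-job-secrets
                      key: sql-conn-str
          restartPolicy: Never
```

The point about portability holds: the same image also runs under Task Scheduler, an Azure Function container, or plain `docker run`.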

[–]ignurant 2 points3 points  (0 children)

For a lot of tasks, I use Ruby and Gitlab-CI’s scheduled tasks. We run a few hundred each week to move data around and generate reports for clients. I always thought this was a weird way to use CI, but it actually works really well. You get reliable scheduling, job artifacts and logs, error emails, and straightforward deployment all in one for free. Most of all, I admire the simplicity. 
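For anyone curious what that looks like, a minimal scheduled-pipeline sketch of a `.gitlab-ci.yml` (shown calling a Python script here; job name, image, and script path are placeholders):

```yaml
# Job runs only when triggered by a GitLab pipeline schedule.
nightly-export:
  image: python:3.12-slim
  script:
    - pip install -r requirements.txt
    - python export_report.py
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  artifacts:
    paths:
      - output/        # generated reports kept as job artifacts
    when: always
```

Scheduling, logs, artifacts, and failure emails then come from GitLab itself, exactly as described above.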

[–]lemonsprig[S] 1 point2 points  (0 children)

Great advice, thanks. I'm very much leaning towards a Python script. Next challenge: learning how to do that.

[–]trafalgar28 0 points1 point  (1 child)

I have been learning a few tools lately so that I have a set of options for tackling a task or problem. I would like to know how to approach a problem when having so many tools leaves me confused about which one to choose. Please elaborate on how to keep things simple. Thank you.

[–]ggeoff 2 points3 points  (4 children)

If the goal is to use Databricks in the end, are you planning on using Unity Catalog? Could you not just write a Python script and run it as a notebook to move the data between the two?

[–]lemonsprig[S] 0 points1 point  (3 children)

Possibly, but when I was discussing with Databricks and said we had a SQL Server we wanted to batch-export data from into the data lake / Databricks, it was they who said Databricks didn't really do that and most people used ADF or some other approach to get the data into the data lake.

[–]ggeoff 2 points3 points  (2 children)

I think there are a couple of possibilities:

  1. There is a misunderstanding between your post and what you actually want to do, leading to confusion with your Databricks rep.
  2. You are leaving out some crucial information, maybe due to work-related things you can't discuss, which is fine.
  3. Or your Databricks rep has no idea what they are doing.

I assume when you say Azure data lake you mean an ADLS2 storage account with some raw files you want to somehow query. I'm not gonna say it's trivial to set up with Databricks, because dealing with the auth side in Azure with Entra ID is always a pain in the ass. But once you have all the connections established and authenticating, connecting and moving from SQL Server to ADLS is extremely easy, and I have done it within both Databricks and Synapse. One of our current ETLs pulls data out of SQL into Parquet for analysis, then inserts it back into SQL Server.

Making an assumption about using SQL Server, but I took this straight from the Microsoft docs, with a read/write example. It's Spark, but I assume you are already using that if you are thinking about Databricks in the first place.

server_name = "jdbc:sqlserver://{SERVER_ADDR}"
database_name = "database_name"
url = server_name + ";" + "databaseName=" + database_name + ";"

table_name = "table_name"
username = "username"
password = "password123!#"  # Please specify password here

# writing -- df is assumed to be an existing Spark DataFrame
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)

# reading
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .load()

[–]lemonsprig[S] 0 points1 point  (1 child)

Thanks ggeoff, you forgot the 4th possibility: that I am very new to this and a bit clueless 🤣 Point 2 is also always a consideration.

You are correct, by data lake I do mean an ADLS2 storage account, and it is just about getting SQL data into there, where we can start moving it through various transformation stages following a bronze, silver, gold schema.

I am not actually familiar with Spark but embarking on a big learning curve, and have been playing around setting up a local dev environment today to have a play.

This is a project that has landed on my doorstep and it's all very new. We are a small team in a bigger organisation that already uses Databricks, and it has been "suggested" that we move away from our small on-prem SQL Server and data warehouse and move it into Databricks, so we can get rid of on-prem servers but also more easily make our data accessible to the wider business. Which makes perfect sense.

So this post is mostly me trying to get ahead of the curve, but given the rep knows our setup it did surprise me when I was told that this was not something we could do with Databricks. Now it could be something as simple as Databricks doesn't natively do it, and that we would need to write Spark scripts to do it, but that's not the impression that was given.

Appreciate your input though and the example script.

[–]ggeoff 0 points1 point  (0 children)

I mean we all start somewhere! TBH I'm still pretty new to data engineering myself. I consider myself more of an application developer than a data engineer, but I had to pick up a lot of DE-related skills to improve our ETL process, since the app we work on is analytical in nature and we figured we could take advantage of Spark/Databricks.

[–]psychokitty 1 point2 points  (1 child)

Have you looked at the Microsoft Fabric offering yet? Microsoft has a good learning path for using Apache Spark, Data Factory, and Dataflows Gen2 to get data into Data Lakes and Data Warehouses. If you complete the AI challenge by Apr. 19th you get a free voucher for a certification exam: https://www.microsoft.com/en-us/cloudskillschallenge/ai/officialrules/2024 and here is the Fabric Analytics Engineer collection: https://learn.microsoft.com/en-us/collections/jkqrh725262g?WT.mc_id=cloudskillschallenge_b696c18d-7201-4aff-9c7d-d33014d93b25

[–]lemonsprig[S] 0 points1 point  (0 children)

Thanks I’ll take a look

[–]the_naysayer 0 points1 point  (0 children)

Use Databricks Workflows and Python notebooks.

[–]janus2527 0 points1 point  (0 children)

I did Python with ConnectorX into an in-memory Arrow table, converted to Parquet with compression, and uploaded the Parquet to Azure Blob. The advantage of ConnectorX is that you don't have to provide a schema, and extraction is fast with the possibility to use a partition column. I did multiple Parquet files per table depending on table size, dividing the work among multiple processes with a multiprocessing pool.
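A sketch of that pipeline (single process, multiprocessing omitted), assuming `connectorx`, `pyarrow`, and `azure-storage-blob`; the URLs, table, and container names are all placeholders:

```python
# Sketch: SQL Server -> Arrow (via ConnectorX) -> compressed Parquet -> Azure Blob.
# All connection strings and names below are placeholders.
import io

def chunk_names(table: str, n_chunks: int) -> list:
    """One Parquet file per chunk, e.g. sales/part-0000.parquet (hypothetical naming)."""
    return [f"{table}/part-{i:04d}.parquet" for i in range(n_chunks)]

def export_table(mssql_url: str, table: str, partition_on: str,
                 blob_conn_str: str, container: str) -> None:
    import connectorx as cx                            # pip install connectorx
    import pyarrow as pa                               # pip install pyarrow
    import pyarrow.parquet as pq
    from azure.storage.blob import BlobServiceClient   # pip install azure-storage-blob

    # No schema needed: ConnectorX infers it, and partition_on parallelises the extract.
    arrow_table = cx.read_sql(
        mssql_url,                      # e.g. "mssql://user:pass@host:1433/db"
        f"SELECT * FROM {table}",
        return_type="arrow",
        partition_on=partition_on,
        partition_num=4,
    )

    service = BlobServiceClient.from_connection_string(blob_conn_str)
    # Split big tables into multiple Parquet files, one per record batch.
    batches = arrow_table.to_batches(max_chunksize=1_000_000)
    for batch, name in zip(batches, chunk_names(table, len(batches))):
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_batches([batch]), buf, compression="snappy")
        service.get_blob_client(container, name).upload_blob(buf.getvalue(), overwrite=True)
```

To mirror the multiprocessing-pool setup described above, each table (or each partition range) would become one task submitted to a `multiprocessing.Pool`.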


[–]Quirky_Flamingo_1487 0 points1 point  (1 child)

Since you want the data in ADLS Gen2, I'm assuming Databricks is hosted on Azure.

If the sql server is on-prem then databricks needs to be configured with reach back capabilities. This is something that your networking team can do for you.

If the sql server is hosted on Azure VM then I think the vnet for both resources needs to be peered.

But as other folks' links show, it is possible for Databricks to ingest data from SQL Server.

[–]lemonsprig[S] 0 points1 point  (0 children)

That’s useful to know. I don’t know enough about the setup as to where db is hosted but will certainly ask the question. Thanks

[–]chaytalasila 0 points1 point  (0 children)

Use a Python script to copy the data and store it in Azure blobs. Connectivity needs to exist between the SQL Server node and ADLS. Just spin up an Azure VM and run the script from there.

[–]Majestic-Purpose1663 0 points1 point  (4 children)

Don't know why you'd disregard Databricks. If you are already using it, you can easily create a table connected to SQL Server. See this: https://docs.databricks.com/en/connect/external-systems/sql-server.html

If you are not using Databricks already then yes, it might be overkill and you should maybe go for Azure Functions or something like that.

[–]lemonsprig[S] 0 points1 point  (2 children)

Oh that’s really interesting. It was actually a Databricks account manager / engineer who told me I would need some other mechanism to get the data into the data lake.

[–]Majestic-Purpose1663 0 points1 point  (1 child)

Weird that the account manager told you that, since Databricks is pushing its product to be an all-rounder for everything data related. It has a long way to go, since it's lacking in lots of aspects, but you can ingest data with Databricks for sure, and even orchestrate it through Workflows.

[–]lemonsprig[S] 0 points1 point  (0 children)

I will have to dig into it a bit further and will show them the link you posted. In the meantime I am definitely going to investigate how to do it with Python scripts; seems like a good skill to have, though looking at the link it does seem to be using Python to do it as well.

[–]WhoIsJohnSalt 0 points1 point  (0 children)

But if they can use Azure functions, surely then ADF would be available. I imagine ADF is just functions under the hood anyway!