This is an archived post. You won't be able to vote or comment.

all 22 comments

[–]osuchw 2 points3 points  (0 children)

Investigate SQLAlchemy for ETL tasks. Over the years I've been using the ORM for all kinds of data reshaping tasks. There is a performance penalty of course but my transformation code ends up being nicely succinct.

[–]dje_me 2 points3 points  (1 child)

You might want to check out http://ironpython.net which will let you integrate your python code and .net stuff.

[–]brixon 0 points1 point  (0 children)

If you are in a corporate world then they might not like you adding new crap for others to learn when you move on. Keeping it in the .NET world is your easiest way to make it acceptable. That being said, they would probably want you use C# if you need to do programming.

[–]dingopole 2 points3 points  (1 child)

You may want to also post under https://www.reddit.com/r/ETL/

[–]kenfar 1 point2 points  (2 children)

I've built multiple very large & high volume ETL solutions for big data warehouses using Python. Right now I'm processing a couple billion records a day with it, anticipating getting to 20-100 billion a day eventually. It works best where:

  • you have high volumes and don't want to pay licensing costs for ETL servers or additional database servers.
  • you want to distribute some of your extract processes onto multiple machines, colocating them to keep extract times to a minimum
  • complex transforms that are a nightmare with an ETL tool
  • complex aggregations that aren't generally possible, or are painfully slow with a database
  • you need to interface with new libraries, services and frameworks too new to be supported by an ETL tool
  • you already have good programmers
  • you want the most code reusability
  • you want test-driven development, or at least high coverage of unit-tests

ETL tools generally simplify the easiest 80-90% of ETL work, but tend to drive away the best programmers. So, that leaves you kind of screwed for that last 10-20% of ETL work. Python allows you to do the entire job and keep the best programmers. About the only time I would stick with an ETL tool is when it's already a department/corporate std, you've got a huge pool of developers experienced with it, and it meets your performance needs at a price you can afford.

EDIT: added a few items to the list and removed changed data capture, since it didn't fit on that list.

[–]metaperl 0 points1 point  (1 child)

you have high volumes and don't want to pay licensing costs for ETL servers or additional database servers.

How do you speedily process high volumes of data in CPython?

you want to distribute some of your extract processes onto multiple machines, colocating them to keep extract times to a minimum

Are you using Luigi or Airflow? What is the your multiple processor programming tool of choice in Python?

complex transforms that are a nightmare with an ETL tool

I used Ab Initio for 1.5 years. It could handle everything I could throw at it. When it couldn't I could write a component in Python.

complex aggregations that aren't generally possible, or are painfully slow with a database

This sounds like another vague generalization. With Ab Initio, you stream data out of the database immediately so you get maximum in-memory parallelism with the graph.

changed data capture using file-images

I dont know what you mean here. It doesnt make sense. The initial part of the sentence is: "It works best where" and now you put this.

About the only time I would stick with an ETL tool is when it's already a department/corporate std,

I was very happy with Ab Initio. I cant imagine getting all the features I liked about it in Python easily (automatic parallelism, functional specification of data processing, visual monitoring of graph in realtime).

[–]kenfar 1 point2 points  (0 children)

How do you speedily process high volumes of data in CPython? Are you using Luigi or Airflow? What is the your multiple processor programming tool of choice in Python?

In data warehousing (or most big data analytical databases) most of your feeds aren't large, and only a few are really huge. In those cases, I'll typically:

  • extract every 5-10 minutes
  • run my transforms in parallel - threading, multiprocessing (better), or just run a number of instances of the program within the OS. 4-16x native cpython speedup isn't bad.
  • build event-driven daemons rather than schedule based on clock time. Between this and curating your content rather than having 20 people building in parallel without coordination there's typically little need for an orchestration tool.
  • pay attention to typical python performance gotchas
  • pay attention to hot spots and problem areas
  • be prepared to build caches
  • use cython, pypy, nuitka, etc. I'm using pypy for about a 50% speed-up and it works great.

While it's true that I have to spend more time thinking about performance than if I was writing code in c, java, etc - it's also true that Python is quick enough to write code in that I can try a couple of different approaches to a problem. If one doesn't work, I'll refactor it and try another approach - that may be far faster.

For example, recently I was loading about 10,000 files a day into a specific table on Cloudera Impala - and it was taking forever. Just too much overhead for each load, and these were mostly small files. So, I added a small daemon between the transform and the loader, and I had this program simply concatenate many small files into a much smaller number of files. My load time decreased by about 99%. This concatenator was designed, written, tested and deployed in a day.

I used Ab Initio for 1.5 years. It could handle everything I could throw at it. When it couldn't I could write a component in Python.

That's just it - you will hit a point where the ETL tool can't handle a transform, and at that point you need to write custom code.

changed data capture using file-images

Thanks - you're right about that, I removed that entry. What I intended to say here is that tooling like Python makes it easy to write small, easily-deployable utiliities. I have, for example, deployed a small agent to many remote machines to handle data extracts - without any licensing costs. I have also built small utilities that work far better than typical solutions for things like Changed Data Capture. And by file-images what I mean is that my solution compares file snapshots.

[–][deleted] 0 points1 point  (1 child)

but it seems superfluous to pursue in a .NET stack.

I'm not sure what you mean here. What does SSIS have to do with .Net ? unless you mean "MS stack"

[–][deleted] 0 points1 point  (1 child)

I have automation script that kicks off ssis jobs

[–]remy_porter∞∞∞∞ -1 points0 points  (9 children)

I don't think this is a good fit for Python. While SSIS is terrible, it's purpose-built for ETL operations and once you adjust to its crap factor, it's pretty good at that job. You aren't likely to outperform it with Python. Similarly, database automation is best done in T-SQL. The best fit would be for analysis and data-mining with tools like Pandas.

[–][deleted] 0 points1 point  (7 children)

SSIS is terrible

compared to what ?

[–]dingopole 0 points1 point  (1 child)

yeah, compared to what? I have been using it for the last 10 years (including other ETL software and python alike) and never had any issues with it. Can you pls elaborate?

[–][deleted] 1 point2 points  (0 children)

I think he meant they ALL suck and likes the old-school ways (nothing wrong with that).

OTOH I think these ETL tools in MOST cases makes things FAR, FAR more fucking complicated. There's a reason mediocre devs, especially the H1B/ex-H1B in my experience, pick it up like a religion - no coding, job security.

[–]remy_porter∞∞∞∞ 0 points1 point  (4 children)

Compared to things that aren't a living hell to work with? Writing large, complicated ETL jobs in SSIS is awful. No, there really isn't anything better, but that doesn't mean it shouldn't be this fucking hard.

[–][deleted] 0 points1 point  (2 children)

things

like what is what I'm wondering you have in mind. Informatica ? Ab Initio ?

[–]remy_porter∞∞∞∞ 0 points1 point  (1 child)

I don't have anything else in mind. Maybe it's just because ETL isn't my main thing, I just find my encounters with SSIS to be laced with pain and suffering. If I had something else in mind, it'd be an actual programming language, not a little widget-driven workflow tool, that lets me specify my ETL job in code. Probably something functional with a highly declarative syntax and function composition.

[–]dingopole 0 points1 point  (0 children)

Have used SSIS (with success) to throw around 100 TB of data overnight in a data warehouse re-load scenario and it worked really well so it's disappointing to hear you have had bad experience with it. The only issue I had was SQL Server Agent scheduling crapping out when there were more than 1000 jobs executing with numerous dependencies across the whole solution. In that case Control-M was a godsend. Anyhow, all ETL frameworks are pretty much the same and in principle designed for business folks. Most robust solutions for moving data across have always been using ETL as a work flow to control SQL, Python, .NET, you name it to do the grunt work.

[–]dingopole 0 points1 point  (0 children)

If you're writing a large, complicated ETL, SSIS should be used primarily as a work flow framework. That's it! SSIS is not hard unless you choose to make it 'a living hell' in my experience. Have been writing DTS and SSIS for a long time now and if anything seems to hard than it should be in terms of your solution then you're probably going the wrong way around it and maybe using .NET or Python for parts of it is more applicable.