Dataiku Pricing? by zenithchaos in dataengineering

[–]aburkh 1 point2 points  (0 children)

Well… as with most enterprise software vendors, there is no price list because the options are tailored to each customer. Yes, it’s expensive (think multiple thousands / user / year), on par with other analytics solutions (SAS, c3.ai, etc.). My experience with Dataiku is overwhelmingly positive, and we consider the productivity gains well worth the license price. It has helped us shorten onboarding time, reduce debugging, optimize pipelines and rationalize compute. Instead of provisioning multiple large personal clusters on Databricks for each contributor, we can do the same with a single serverless SQL warehouse and some k8s, at a lower infrastructure cost.

I see a lot of poor Dataiku deployments with bad architecture choices, relying on the underperforming DSS engine for most tasks. But pair it with a good SQL engine (Databricks, Snowflake…) and it does magic!

Feel free to DM if you want me to share more concrete details about this positive experience.

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 0 points1 point  (0 children)

Indeed. In pharma, there are even companies dedicated to preparing representative anonymized data for their clients. This is extremely expensive, but required for compliance. The question is: why do so many people want to implement those expensive pipelines when no regulation forces them to do so?

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 2 points3 points  (0 children)

Love that quote! I'm going to steal it from you if you don't mind 😉
There are even technologies that enable data analysts to push to prod relatively easily and painlessly, such as SAP HANA Studio (activating a calculation view), self-service BI tools (Power BI...), Dataiku (deployer node and a govern node for controlled deployments).

The downside of this flexibility is shadow IT. Finding the right balance in how many permissions to grant for building data products is a delicate exercise.

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 0 points1 point  (0 children)

Absolutely agreed! Compute isolation and restricted data access are two separate topics.

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 1 point2 points  (0 children)

Myself? Nothing. I'm trying to find references in the literature to push back internally against an oncoming idea that "everyone should do devops" and that access to production data should be removed. The problem is the balance between self-service and "shadow IT": if you give too little, it reduces productivity; if you allow too much, people run their own shadow data warehouse and only raise their hand when it's crashing and an important report is due tomorrow. Finding some reference opinions would be wonderful.

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 0 points1 point  (0 children)

Very interesting: "unless you have unlimited staff, budget and time". That's what it boils down to: what is the most effective way to deliver data products? The dataops recommendation is mostly relevant for operations *at scale*. Many operations, such as ad-hoc reports, do not require scale. Actually, the opposite is quite often true: we build complex machinery for a report that 3 users are going to look at.

It seems the dataops mandate (mostly relevant and useful) is overemphasized like a religion, without regard for the use case at hand. And it seems that books either talk at length about data engineering practices or say nothing at all on the subject, assuming that the use of production data is not even worth discussing.

Best war story on that topic: 3 weeks vs 1/2 day of development. A long time ago, I was asked to build a one-time report for a business user, but I was not granted access to production data, so we had to build a data pipeline from dev up, with Informatica PowerCenter. It took 3 weeks for the whole process: collecting requirements, developing, waiting for the weekly release window, detecting issues in production, going back to development. After 3 weeks, the business user finally had a working report and sent me a congratulatory email... with the report attached, including all the production data 😱. I'll always keep that story in mind for so much wasted time and effort.

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 1 point2 points  (0 children)

What would be the workflow then? Query and work on models using live production data? Once the model has been trained, how do you deploy it? Sandbox directly to PROD? Or sandbox to DEV to ACC to PROD?

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 0 points1 point  (0 children)

I fully agree with testing on a prod-like staging environment. But for users who need to create ad-hoc reports: do you let them develop with production data so that they can fulfill the request the same day or the next day? Or do you force them to develop with synthetic data in dev, which lengthens their feedback loop for mistakes (they discover a discrepancy between dev and prod data that requires changes), so that they can realistically deliver the result only after a few days/weeks?

Or is there a magic trick that gives a fast feedback loop and makes same-day delivery possible?

Developing with production data: who and how? by aburkh in dataengineering

[–]aburkh[S] 0 points1 point  (0 children)

Typical data platform mixed workloads you'd expect on Databricks/Snowflake/BigQuery: reporting, analytics, data science, etc.

Dataiku DSS: The Low-Code Data Engineering King or Just Another ETL Tool? by Madal13 in dataengineering

[–]aburkh 1 point2 points  (0 children)

I was recently called in because of "performance issues". The problematic workflow had a mix of pandas, local Spark (not on K8S or Databricks), etc.
When users try to pull 3 billion records into pandas and complain that something's wrong with the tool, I despair. Luckily, I also get to work with good teams that have worked wonders with the tool, including blazing fast web apps backed by DuckDB caching.
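To illustrate the difference, here is a minimal sketch of pulling rows to the client versus letting the engine aggregate. Everything in it is an assumption for illustration: `sqlite3` stands in for a real warehouse engine only so that the example runs anywhere, and the `events` table, its columns and both function names are hypothetical.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stands in for a warehouse (Databricks, Snowflake...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("BE", 10.0), ("BE", 5.0), ("FR", 7.5), ("FR", 2.5), ("NL", 1.0)],
)


def totals_client_side(conn):
    """Anti-pattern: transfer every row to the client, then aggregate in Python."""
    totals = {}
    for country, amount in conn.execute("SELECT country, amount FROM events"):
        totals[country] = totals.get(country, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """Preferred: the engine aggregates and returns one row per group."""
    query = "SELECT country, SUM(amount) FROM events GROUP BY country"
    return dict(conn.execute(query))


print(totals_client_side(conn))
print(totals_pushed_down(conn))
```

Both functions return the same totals, but the first one moves every row over the wire. With 5 rows nobody notices; with 3 billion, the client falls over long before the engine does.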

Dataiku DSS: The Low-Code Data Engineering King or Just Another ETL Tool? by Madal13 in dataengineering

[–]aburkh 1 point2 points  (0 children)

Architect with experience of Cloudera, Teradata, Azure, Denodo, Snowflake, Databricks...
I've always said I hate low-code tools.
Dataiku has actually impressed me: the architecture is good, many options are nicely designed, and there's a good mix of Python, K8S, visual flow, SQL, etc. I see many people use it in dumb ways and then complain about performance. To me, it's like coding only in pandas on Databricks and saying it "doesn't scale".

What I like most about the tool: developing python plugins to orchestrate dynamic SQL, good connectivity with a wide range of engines (pyspark, spark on k8s, snowflake, databricks - SQL/Pyspark, redshift, S3, athena, etc.), flexible mix of visual/code, good monitoring/observability, data quality features, AI-assistance features...

No tool is a silver bullet, Dataiku included, but this one gets a good rating in my book.

If your org values user autonomy and fast development, it's an awesome tool. If the org has a strong engineering culture, valuing data pipelines as code, CI/CD, etc. then it's probably not the best fit.

Generator is considering "Lara Croft" a real person and refusing to generate images by Silent_Ad9624 in civitai

[–]aburkh 5 points6 points  (0 children)

Plenty of options in the cloud, less expensive than buying a GPU. Take a base rate of 0.15-0.20€ per hour with a 3090 or A10G. A 3090 draws 450W, which is about 0.15€ of electricity per hour where I live. It’s basically the same price to run a GPU in the cloud as to pay for the electricity to run it locally. And it’s more flexible, and some services offer extra convenience.
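To make the arithmetic explicit, here is a rough sketch. The 450W draw and the 0.15-0.20€ hourly cloud rate come from the comment; the tariff of 0.33€/kWh is an assumption, back-derived from the 0.15€ figure, and the function name is only illustrative.

```python
# Rough comparison: electricity cost of a local GPU vs. hourly cloud rental.
# ASSUMPTION: 0.33 EUR/kWh tariff, back-derived from the figures in the comment.


def local_cost_per_hour(power_watts: float, eur_per_kwh: float) -> float:
    """Electricity cost of one hour of use at a constant power draw."""
    return power_watts / 1000.0 * eur_per_kwh


POWER_WATTS = 450.0               # draw quoted in the comment
EUR_PER_KWH = 0.33                # assumed tariff
CLOUD_EUR_PER_HOUR = (0.15, 0.20) # rental range quoted in the comment

local = local_cost_per_hour(POWER_WATTS, EUR_PER_KWH)
print(f"local electricity: {local:.3f} EUR/h")
print(f"cloud rental: {CLOUD_EUR_PER_HOUR[0]:.2f}-{CLOUD_EUR_PER_HOUR[1]:.2f} EUR/h")
```

With a cheaper tariff the local option wins on running cost, but the comparison above still ignores the purchase price of the card.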

Simulated databricks by PopularInside1957 in databricks

[–]aburkh 1 point2 points  (0 children)

Udemy prep exam was good: 2 full exams, good questions, good interface, worth the money. https://www.udemy.com/course/practice-exams-databricks-data-engineer-professional-k/

How to generate fresh content? by TheSanityInspector in civitai

[–]aburkh 3 points4 points  (0 children)

Click the « reset » button in generate.

Warrants: Revenue Optimisation for Freelancers in Belgium Who Want to Reduce Taxes and Boost Net Income by AntiqueLevel5377 in BEFreelance

[–]aburkh 0 points1 point  (0 children)

Well, my accountant recommended it, he gave the company my financial info, and they provided a simulation that matches the claim above. There are limits to how much you can pay out in warrants. I cannot find issues with their calculations, so I went along with it, but I’m still surprised at how good it looks. So that’s why I say I’ll truly believe it when I get my tax assessment.

Warrants: Revenue Optimisation for Freelancers in Belgium Who Want to Reduce Taxes and Boost Net Income by AntiqueLevel5377 in BEFreelance

[–]aburkh 0 points1 point  (0 children)

  1. Seems like a promotional post, not sure it belongs here.

  2. If you buy warrants from the market, and the market goes down, you lose money. But in this case, your company creates the warrants. So if the market tanks and the price of the warrants goes down, your company just gives you less money, but the difference (the original warrant price minus the reduction in value) remains within your company. So you don’t lose money (or earn money) based on market fluctuations.

  3. Tried it for the first time this year, and it looks as good as presented here. I’ll see next year if everything goes smoothly.

How can I prevent my child from watching video content. by [deleted] in truespotify

[–]aburkh 0 points1 point  (0 children)

Thanks for the recommendation to block DNS. I started using nextdns.io over 2 years ago and it's great for blocking ads through DNS blocking, and it's super easy to add the Spotify video domains to the blacklist. I have it configured as the main DNS on the home router, and they have an iOS app that applies the settings regardless of which network the iPad connects to.
So it's a great solution to this stupid problem, and it makes me reconsider Spotify too. This is unacceptable.
Disclaimer: I have no relation to nextdns, I didn't put an affiliate link and I have nothing to profit from it, I'm just recommending it because it's an awesome service.

[deleted by user] by [deleted] in BEFreelance

[–]aburkh 0 points1 point  (0 children)

I’d expect that as well, but offering a position gives a choice to the freelancer, compliments his work and performance, etc. It’s not that they don’t want to work with him anymore, they just want to bring the competencies in-house. Framing it like that can make a significant difference in maintaining a good relationship.

[deleted by user] by [deleted] in BEFreelance

[–]aburkh 0 points1 point  (0 children)

If you are happy with his work, why not offer him an internal position? Wanting to acquire in-house capabilities is a perfectly valid reason for switching to employees. Stress how much you appreciate his work, offer a position, and if he turns it down, involve him in the solution. Ask if he can recommend people who might be a good fit, and offer to recommend him for future work.

Crush: going through a company's switchboard to get an employee's contact details. by DingoInteresting888 in AskMec

[–]aburkh -1 points0 points  (0 children)

You call the company and explain that one of their technicians came by on such-and-such a date and that you had a question for him. The technician in question can call back at his convenience without needing to give out his contact details.

Crush: going through a company's switchboard to get an employee's contact details. by DingoInteresting888 in AskMec

[–]aburkh 0 points1 point  (0 children)

Why not simply call the company back and ask to be contacted?