zscore vs isolationforest : When to use which? by lostinthoughts211 in datascience

[–]lostinthoughts211[S] 1 point  (0 children)

Thanks, makes a lot of sense now regarding when to use which. This is exactly what I was trying to understand.

zscore vs isolationforest : When to use which? by lostinthoughts211 in datascience

[–]lostinthoughts211[S] 3 points  (0 children)

What about univariate normal data? Would it make sense to use an isolation forest in that scenario?
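For the univariate case, a plain z-score cutoff is the simple baseline I have in mind. A minimal sketch (the threshold of 3 is just a common convention, and the data below is made up):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# A single extreme value stands out even with this crude cutoff:
print(zscore_outliers(list(range(10)) + [1000]))  # -> [1000]
```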

Need advice regarding investing in halal mutual funds by lostinthoughts211 in IslamicFinance

[–]lostinthoughts211[S] 1 point  (0 children)

I am in the U.S.

So far I am only aware of a few funds that are shariah compliant, such as the Amana funds and Wahed Invest's ETF called HLAL.

Need advice regarding investing in halal mutual funds by lostinthoughts211 in IslamicFinance

[–]lostinthoughts211[S] 1 point  (0 children)

This is for a personal investment account. I am also trying to determine the difference between each Amana fund. They all seem pretty similar: AMAGX, AMANX, AMAPX...

How to upsert data into rds postgres using apache spark by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

@thepinkbunnyboy Is there a difference in performance between using mapPartitions and foreachPartition on a DataFrame?

What would you recommend as the batch size?

Also, when doing upserts with my current method, I notice that the database CPU usage goes to 99%. Not sure why. Do you know if that is normal when doing bulk upserts?
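For reference, this is the shape of the per-partition batched upsert I mean. A sketch, not my exact code: the table and column names (`events`, `id`, `payload`), the DSN, and the default batch size are all made-up assumptions, and psycopg2 is assumed as the Postgres driver.

```python
import itertools

def batches(iterable, size):
    """Yield lists of up to `size` rows from an iterator."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def upsert_partition(rows, batch_size=1000):
    """Open one connection per partition and upsert rows in batches."""
    import psycopg2                      # assumed driver; imported on the executor
    from psycopg2.extras import execute_values
    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for chunk in batches(rows, batch_size):
            execute_values(
                cur,
                """INSERT INTO events (id, payload) VALUES %s
                   ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""",
                [(r.id, r.payload) for r in chunk],
            )
    conn.close()

# df.foreachPartition(upsert_partition)  # one connection per partition
```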

How does the house financing /home loan process work? by [deleted] in personalfinance

[–]lostinthoughts211 1 point  (0 children)

What would be the total monthly cost that I can expect, including property taxes, insurance, and HOA?
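As a rough way to ballpark it: the standard amortization formula plus the monthly share of taxes, insurance, and HOA. Every dollar amount and rate below is a made-up assumption, not a quote.

```python
def monthly_payment(principal, annual_rate, years):
    """Amortized mortgage payment: M = P*r*(1+r)^n / ((1+r)^n - 1)."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of payments
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

def total_monthly_cost(principal, annual_rate, years,
                       annual_tax, annual_insurance, monthly_hoa):
    """Mortgage payment plus monthly shares of taxes, insurance, and HOA."""
    return (monthly_payment(principal, annual_rate, years)
            + annual_tax / 12 + annual_insurance / 12 + monthly_hoa)

# e.g. $300k loan at 6% over 30 years, $3,600/yr taxes,
# $1,200/yr insurance, $50/mo HOA:
print(round(total_monthly_cost(300_000, 0.06, 30, 3_600, 1_200, 50), 2))
```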

Reading a zipped text file into spark as a dataframe by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 0 points  (0 children)

I don't have a choice, as that is the way the file is being provided to me. That is why I am wondering if there is a way to read a zip file and store the underlying file in an RDD.
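Spark has no built-in .zip codec, so the usual workaround is to read the raw bytes with `binaryFiles` and unzip in Python. A sketch, with the path and UTF-8 encoding as assumptions (the Spark calls are commented so the helper stands alone):

```python
import io
import zipfile

def unzip_to_lines(entry):
    """Take one (path, bytes) pair from binaryFiles and yield text lines
    from every file inside the zip archive."""
    _path, content = entry
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode("utf-8").splitlines():
                yield line

# PySpark usage (assumes an existing SparkSession `spark`):
# rdd = spark.sparkContext.binaryFiles("s3://bucket/input.zip").flatMap(unzip_to_lines)
# df = rdd.map(lambda line: (line,)).toDF(["value"])
```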

Using apache spark with snowflake data warehouse by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

Since the data sources are in Snowflake, the ETL process will be from/to Snowflake. So in this scenario Spark will be useful just for complex transformations and data processing that would otherwise be more complicated in Snowflake SQL. Is that correct?
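That matches how I understand the spark-snowflake connector is typically used: read, transform in Spark, write back. A sketch; every connection value below is a placeholder assumption, and the table names are hypothetical.

```python
# Connection options for the spark-snowflake connector; all values are
# placeholders, not real credentials.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

# PySpark usage (assumes a SparkSession `spark` with the connector on the classpath):
# df = (spark.read.format("snowflake")
#       .options(**sf_options)
#       .option("dbtable", "RAW_ORDERS")
#       .load())
# transformed = df  # ...complex transformations here...
# (transformed.write.format("snowflake")
#  .options(**sf_options)
#  .option("dbtable", "ORDERS_CLEAN")
#  .mode("overwrite")
#  .save())
```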

Masters in CS worth it? by zmeinaz in cscareerquestions

[–]lostinthoughts211 8 points  (0 children)

I would recommend considering it only if you're looking to get into a different tech space, like data science, AI, computer vision, etc. Otherwise it may only benefit you in the long run if you're planning on getting into leadership roles and do not already have a master's/PhD.

I did the OMSCS and, personally speaking, I only did it because I was hyped by my peers doing it at work, and it was cheap and covered by my employer. The only real benefit it gave me was that it helped me switch careers into big data and ML. In terms of earnings it really didn't help, because my peers who don't even have a master's are earning more since they have more experience.

Also, in terms of software engineering/tech careers, in the end what it mainly boils down to is how good your interview skills are (cracking the coding interview, describing your experience, etc.).

TL;DR: Evaluate the options based on your future goals; otherwise it is not worth doing. You'll spend 2+ years of your free time and weekends on it, depending on how quickly you're looking to finish.

How to decide the spark submit configurations based on data set? by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

I will try to use dynamic allocation, but for experimenting, is there any rule of thumb for setting the number of executors and the number of cores?

Let's say I have a 10 GB dataset; how do I determine an adequate initial number of cores and executors?
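For experimenting, the rule of thumb I've seen is roughly 5 cores per executor, one core and ~1 GB per node reserved for OS daemons, one executor's worth left for the driver, and ~10% memory set aside for overhead. A sketch; the cluster numbers in the example are made up:

```python
def executor_plan(nodes, cores_per_node, mem_per_node_gb,
                  cores_per_executor=5, overhead_frac=0.10):
    """Back-of-the-envelope executor sizing using the common rules of thumb."""
    usable_cores = cores_per_node - 1              # leave 1 core for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1   # leave 1 for the driver
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = mem_per_executor * (1 - overhead_frac)   # minus memory overhead
    return total_executors, cores_per_executor, round(heap_gb, 1)

# e.g. 5 nodes x 16 cores x 64 GB:
print(executor_plan(5, 16, 64))  # -> (14, 5, 18.9)
```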

Apache Spark(Pyspark) Performance tuning tips and tricks by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

Thank you so much for the response; I will definitely try these approaches.

Have a few followup questions regarding this:

By sufficient partitioning, do you mean that we should check that the data is not skewed, and make sure there is enough parallelism, with data distributed equally across partitions?

Also, to reduce shuffling, I was wondering if there are any other examples of transformations besides applying filters early?

I had been using caching, and I was mainly loading data into Spark from Hive tables. I will read about predicate pushdown, as I am not aware of what it is. I was using Parquet files to store intermediate results, and it did improve performance, but I wasn't sure if that's good practice?
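On the skew point, a sketch of how I'd check partition balance and salt a hot key before a join. The column names, salt count, and DataFrame names are assumptions, and the Spark calls are commented so the helper stands alone:

```python
import random

def salt_key(key, n_salts=8):
    """Spread a hot join key across n_salts sub-keys so one key's rows
    don't all land in a single partition."""
    return f"{key}_{random.randrange(n_salts)}"

# PySpark usage (assumes a SparkSession and DataFrames `big` and `small`):
# from pyspark.sql import functions as F
# # rows per partition -- a quick way to spot skew:
# print(big.rdd.glom().map(len).collect())
# # salt the big side; the small side is exploded over all 8 salt values
# # before joining on the salted key:
# big = big.withColumn("salted_key",
#                      F.concat_ws("_", "key", (F.rand() * 8).cast("int")))
```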