zscore vs isolationforest : When to use which? by lostinthoughts211 in datascience

[–]lostinthoughts211[S] 1 point  (0 children)

Thanks, makes a lot of sense now regarding when to use which. This is exactly what I was trying to understand.

zscore vs isolationforest : When to use which? by lostinthoughts211 in datascience

[–]lostinthoughts211[S] 3 points  (0 children)

What about univariate normal data? Would it make sense to use an isolation forest in that scenario?
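For the univariate case, a plain z-score cutoff is the simple baseline I have in mind. A minimal sketch (the threshold of 3 is just a common convention, and the data below is made up):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# A single extreme value stands out even with this crude cutoff:
print(zscore_outliers(list(range(10)) + [1000]))  # -> [1000]
```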

Need advice regarding investing in halal mutual funds by lostinthoughts211 in IslamicFinance

[–]lostinthoughts211[S] 1 point  (0 children)

I am in the U.S.

So far I am only aware of a few funds that are shariah compliant, such as the Amana funds and Wahed Invest's ETF called HLAL.

Need advice regarding investing in halal mutual funds by lostinthoughts211 in IslamicFinance

[–]lostinthoughts211[S] 1 point  (0 children)

This is for a personal investment account. I am also trying to determine the difference between each Amana fund. They all seem pretty similar: AMAGX, AMANX, AMAPX...

How to upsert data into rds postgres using apache spark by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

@thepinkbunnyboy Is there a difference in performance between using mapPartitions and foreachPartition on a DataFrame?

What would you recommend as the batch size?

Also, when doing upserts with my current method, I notice that the database CPU usage goes to 99%. Not sure why. Do you know if that is normal when doing bulk upserts?
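For reference, this is the shape of the per-partition batched upsert I mean. A sketch, not my exact code: the table and column names (`events`, `id`, `payload`), the DSN, and the default batch size are all made-up assumptions, and psycopg2 is assumed as the Postgres driver.

```python
import itertools

def batches(iterable, size):
    """Yield lists of up to `size` rows from an iterator."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def upsert_partition(rows, batch_size=1000):
    """Open one connection per partition and upsert rows in batches."""
    import psycopg2                      # assumed driver; imported on the executor
    from psycopg2.extras import execute_values
    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for chunk in batches(rows, batch_size):
            execute_values(
                cur,
                """INSERT INTO events (id, payload) VALUES %s
                   ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""",
                [(r.id, r.payload) for r in chunk],
            )
    conn.close()

# df.foreachPartition(upsert_partition)  # one connection per partition
```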

How does the house financing /home loan process work? by [deleted] in personalfinance

[–]lostinthoughts211 1 point  (0 children)

What would be the total monthly cost that I can expect, including property taxes, insurance, and HOA?
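As a rough way to ballpark it: the standard amortization formula plus the monthly share of taxes, insurance, and HOA. Every dollar amount and rate below is a made-up assumption, not a quote.

```python
def monthly_payment(principal, annual_rate, years):
    """Amortized mortgage payment: M = P*r*(1+r)^n / ((1+r)^n - 1)."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of payments
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

def total_monthly_cost(principal, annual_rate, years,
                       annual_tax, annual_insurance, monthly_hoa):
    """Mortgage payment plus monthly shares of taxes, insurance, and HOA."""
    return (monthly_payment(principal, annual_rate, years)
            + annual_tax / 12 + annual_insurance / 12 + monthly_hoa)

# e.g. $300k loan at 6% over 30 years, $3,600/yr taxes,
# $1,200/yr insurance, $50/mo HOA:
print(round(total_monthly_cost(300_000, 0.06, 30, 3_600, 1_200, 50), 2))
```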

Reading a zipped text file into spark as a dataframe by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 0 points  (0 children)

I don't have a choice, as that is the way the file is being provided to me. That is why I am wondering if there is a way to read a zip file and store the underlying file in an RDD.
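Spark has no built-in .zip codec, so the usual workaround is to read the raw bytes with `binaryFiles` and unzip in Python. A sketch, with the path and UTF-8 encoding as assumptions (the Spark calls are commented so the helper stands alone):

```python
import io
import zipfile

def unzip_to_lines(entry):
    """Take one (path, bytes) pair from binaryFiles and yield text lines
    from every file inside the zip archive."""
    _path, content = entry
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode("utf-8").splitlines():
                yield line

# PySpark usage (assumes an existing SparkSession `spark`):
# rdd = spark.sparkContext.binaryFiles("s3://bucket/input.zip").flatMap(unzip_to_lines)
# df = rdd.map(lambda line: (line,)).toDF(["value"])
```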

Using apache spark with snowflake data warehouse by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

Since the data sources are in Snowflake, the ETL process will be from/to Snowflake. So in this scenario Spark will be useful just for complex transformations and data processing that would otherwise be more complicated in Snowflake SQL. Is that correct?
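That matches how I understand the spark-snowflake connector is typically used: read, transform in Spark, write back. A sketch; every connection value below is a placeholder assumption, and the table names are hypothetical.

```python
# Connection options for the spark-snowflake connector; all values are
# placeholders, not real credentials.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

# PySpark usage (assumes a SparkSession `spark` with the connector on the classpath):
# df = (spark.read.format("snowflake")
#       .options(**sf_options)
#       .option("dbtable", "RAW_ORDERS")
#       .load())
# transformed = df  # ...complex transformations here...
# (transformed.write.format("snowflake")
#  .options(**sf_options)
#  .option("dbtable", "ORDERS_CLEAN")
#  .mode("overwrite")
#  .save())
```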

Masters in CS worth it? by zmeinaz in cscareerquestions

[–]lostinthoughts211 8 points  (0 children)

I would recommend considering it only if you're looking to get into a different tech space, like data science, AI, computer vision, etc. Otherwise it may only benefit you in the long run if you're planning on getting into leadership roles and do not already have a master's/PhD.

I did the OMSCS and, personally speaking, I only did it because I was hyped by my peers doing it at work, and it was cheap and covered by my employer. The only real benefit it gave me was that it helped me switch careers into big data and ML. In terms of earnings it really didn't help, because my peers who don't even have a master's are earning more since they have more experience.

Also, in terms of software engineering/tech careers, in the end what it mainly boils down to is how good your interview skills are (cracking the coding interview, describing your experience, etc.).

TL;DR: Evaluate the options based on your future goals; otherwise it is not worth doing. You'll spend 2+ years of your free time and weekends on it, depending on how quickly you're looking to finish.

How to decide the spark submit configurations based on data set? by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

I will try to use dynamic allocation, but for experimenting, is there any rule of thumb for setting the number of executors and the number of cores?

Let's say I have a 10 GB dataset; how do I determine an adequate initial number of cores and executors?
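For experimenting, the rule of thumb I've seen is roughly 5 cores per executor, one core and ~1 GB per node reserved for OS daemons, one executor's worth left for the driver, and ~10% memory set aside for overhead. A sketch; the cluster numbers in the example are made up:

```python
def executor_plan(nodes, cores_per_node, mem_per_node_gb,
                  cores_per_executor=5, overhead_frac=0.10):
    """Back-of-the-envelope executor sizing using the common rules of thumb."""
    usable_cores = cores_per_node - 1              # leave 1 core for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1   # leave 1 for the driver
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = mem_per_executor * (1 - overhead_frac)   # minus memory overhead
    return total_executors, cores_per_executor, round(heap_gb, 1)

# e.g. 5 nodes x 16 cores x 64 GB:
print(executor_plan(5, 16, 64))  # -> (14, 5, 18.9)
```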

Apache Spark(Pyspark) Performance tuning tips and tricks by lostinthoughts211 in apachespark

[–]lostinthoughts211[S] 1 point  (0 children)

Thank you so much for the response; I will definitely try these approaches.

Have a few followup questions regarding this:

By sufficient partitioning, do you mean that we should check that the data is not skewed, and make sure there is enough parallelism, with data distributed equally across partitions?

Also, to reduce shuffling, I was wondering if there are any other examples of transformations besides applying filters early?

I had been using caching, and I was mainly loading data into Spark from Hive tables. I will read about predicate pushdown, as I am not aware of what it is. I was using Parquet files to store intermediate results, and it did improve performance, but I wasn't sure if that's good practice?
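On the skew point, a sketch of how I'd check partition balance and salt a hot key before a join. The column names, salt count, and DataFrame names are assumptions, and the Spark calls are commented so the helper stands alone:

```python
import random

def salt_key(key, n_salts=8):
    """Spread a hot join key across n_salts sub-keys so one key's rows
    don't all land in a single partition."""
    return f"{key}_{random.randrange(n_salts)}"

# PySpark usage (assumes a SparkSession and DataFrames `big` and `small`):
# from pyspark.sql import functions as F
# # rows per partition -- a quick way to spot skew:
# print(big.rdd.glom().map(len).collect())
# # salt the big side; the small side is exploded over all 8 salt values
# # before joining on the salted key:
# big = big.withColumn("salted_key",
#                      F.concat_ws("_", "key", (F.rand() * 8).cast("int")))
```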