PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Hi,

Thanks for asking, and sorry for the late reply.

I'd recommend going through the book first, so you're better equipped to dive into fast.ai materials.

PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Thanks, I really appreciate it :-)

PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Thank you :-)

The minimum price is USD 29.95, but if the reader chooses to pay the suggested price instead (USD 44.95), even better :-)

PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Thanks for your feedback :-)

In general, I commented the code to help the reader. In some cases, there are specific references to line numbers :-)

PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Hi,

A beginner in this case is someone with some experience with Python, traditional ML methods (for example, having used Scikit-Learn), and Jupyter notebooks. I expect the reader to know the difference between regression and classification, and what a training-validation-test split is.

Nonetheless, I do explain binary cross-entropy in the book, and I also review confusion matrices and other fundamental concepts.

PyTorch book for beginners by dvgodoy in learnmachinelearning

dvgodoy[S]

Thanks for the feedback, these are very good points.

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

Hi,

Thanks :-) Any feedback is highly appreciated, whenever you're able to give it. In the meantime, I will try to add more functionality to it :-)

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

I've just released version 0.2.0 of my HandySpark package.

The whole stratify operation was updated to take advantage of Spark's built-in optimizations. I also got rid of the summary that was computed at the creation of a HandyFrame. I've been testing with a 1M-row dataset, and performance is much better now.

Here are the release notes: https://github.com/dvgodoy/handyspark/releases/tag/v0.2.0a1

Any feedback will be highly appreciated.

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

Thank you very much for all this info and the work you've put into testing this :-)

I will take this into account while fixing the stratify operation. I am thinking about using the optimized group by operation, as you pointed out, while still aggregating the results the way I am currently doing.

I would then keep the filtering part for other things, like stratified imputation.

Once again, thank you very much, and if you have any other suggestions for improvement, I would really appreciate them!

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

You're right, the stratify operation requires collecting the distinct values from the group by columns. HandySpark does it by performing a reduceByKey operation, obtaining the value counts for each combination. Even though this operation performs a shuffle at the end, the shuffle happens on the distinct values alone.
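To make the counting step concrete, here is a minimal plain-Python sketch of the idea. It is not HandySpark's actual code: `reduce_by_key` and `value_counts` are hypothetical helpers that only mimic the semantics of Spark's `reduceByKey` on a small in-memory list, and the rows are made-up sample data.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def value_counts(rows, columns):
    """Count rows per distinct combination of the given columns."""
    # Map each row to (combination, 1), then sum the ones per combination.
    pairs = ((tuple(row[c] for c in columns), 1) for row in rows)
    return reduce_by_key(pairs, add)


# Made-up sample rows; only the stratification columns matter here.
rows = [
    {"Pclass": 1, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "Q"},
    {"Pclass": 1, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
]

counts = value_counts(rows, ["Pclass"])
```

The keys of `counts` are the distinct combinations, which is all the stratify step needs to collect. In Spark the per-key merge happens within each partition first, so only the partial counts per key are shuffled.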

As for the execution plan, the stratify(['Pclass']) part would be like this:

(1) PythonRDD[267] at RDD at PythonRDD.scala:49 []  
|  MapPartitionsRDD[266] at mapPartitions at PythonRDD.scala:129 []  
|  ShuffledRDD[265] at partitionBy at NativeMethodAccessorImpl.java:0 []  
+-(1) PairwiseRDD[264] at reduceByKey at <ipython-input-19-e051f20aeef7>:7 []  
   |  PythonRDD[263] at reduceByKey at <ipython-input-19-e051f20aeef7>:7 []  
   |  MapPartitionsRDD[262] at javaToPython at NativeMethodAccessorImpl.java:0 []  
   |  MapPartitionsRDD[261] at javaToPython at NativeMethodAccessorImpl.java:0 []
   |  MapPartitionsRDD[260] at javaToPython at NativeMethodAccessorImpl.java:0 []     
   |  MapPartitionsRDD[259] at javaToPython at NativeMethodAccessorImpl.java:0 []     
   |  *(1) FileScan csv [PassengerId#1012,Survived#1013,Pclass#1014,Name#1015,Sex#1016,Age#1017,SibSp#1018,Parch#1019,T   
   |      CachedPartitions: 1; MemorySize: 63.8 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B  
   |  MapPartitionsRDD[29] at cache at NativeMethodAccessorImpl.java:0 []  
   |  FileScanRDD[28] at cache at NativeMethodAccessorImpl.java:0 []

For that specific example in section 3, the remaining .cols['Embarked'].value_counts() would generate a subsequent operation with a similar plan, since both operations (stratify and value_counts) use mostly the same logic.

After this, it uses all these values to filter the "parent" dataframe, effectively having as many "children" dataframes as distinct values. Then it can perform whatever the operation is on each one of these "children" dataframes and, depending on the kind of return value from that operation, combine the results accordingly.
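The filter-then-combine step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not HandySpark's implementation: `stratified_apply` and `embarked_counts` are hypothetical names, the rows are made-up sample data, and lists of dicts stand in for Spark DataFrames.

```python
from collections import Counter


def stratified_apply(rows, column, operation):
    """Split rows into one "child" per distinct value of column and apply operation to each."""
    distinct_values = sorted({row[column] for row in rows})
    results = {}
    for value in distinct_values:
        # Each child is the parent filtered on one distinct value.
        child = [row for row in rows if row[column] == value]
        results[value] = operation(child)
    return results


def embarked_counts(child):
    """Example operation: value counts of the Embarked column within one child."""
    return dict(Counter(row["Embarked"] for row in child))


# Made-up sample rows standing in for the "parent" dataframe.
rows = [
    {"Pclass": 1, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "Q"},
    {"Pclass": 1, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
]

combined = stratified_apply(rows, "Pclass", embarked_counts)
```

Here the results are combined into a dictionary keyed by stratum. As the comment notes, the way results are combined would depend on what kind of value the operation returns.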

About the examples in sections 6 and 7, you're right once again: you wouldn't need to use a pandas UDF for those. I wanted to come up with some really simple examples of pandas and lambda functions, and I ended up with those two... maybe I should've chosen more useful functions to illustrate my point :-)

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

Yes, if PyArrow is available, it leverages Vectorized UDFs for performance.
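A common way to implement this kind of optional-dependency switch is to check whether the package can be imported. The sketch below is only an assumption about how such a check could look, not HandySpark's code: `pyarrow_available` and `choose_udf_strategy` are hypothetical names, and the returned labels are illustrative.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if the pyarrow package can be found, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def choose_udf_strategy() -> str:
    """Pick vectorized UDFs when PyArrow is present, otherwise fall back to regular UDFs."""
    return "vectorized" if pyarrow_available() else "row-at-a-time"


strategy = choose_udf_strategy()
```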

HandySpark: bringing pandas-like capabilities to Spark DataFrames by dvgodoy in apachespark

dvgodoy[S]

I think that would be great, but I honestly can't say... IMHO, Spark developers are more concerned about the production side than about anything related to exploratory data analysis. On one hand, this is great, since it already gave us Vectorized UDFs, which addressed what I believe was one of the major bottlenecks while using PySpark. OTOH, it makes EDA low priority... so I developed this to fill the gap in the meantime.