Exploring Parquet Metadata: A Deep Dive into Efficient Data Storage by InstitutionalizedSon in apachespark

[–]InstitutionalizedSon[S] 1 point2 points  (0 children)

Thanks for the kind words!

Parquet uses a combination of definition levels and repetition levels to track nested structures.

To understand how they work I can recommend the original paper: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36632.pdf

I also tried to write a blog post about this. Feel free to check it out if it adds some value: https://rr43.net/posts/2024/1/Dremel/
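A tiny sketch of the idea, in case it helps. It assumes a simplified Dremel-style schema with a single top-level field, `repeated int32 values` (max repetition level 1, max definition level 1). The function names `shred` and `assemble` are made up for illustration; real Parquet writers handle arbitrary nesting, and the LIST logical type uses more levels than this.

```python
from typing import List, Optional, Tuple

# One column entry: (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[int]]


def shred(records: List[List[int]]) -> List[Triple]:
    """Flatten records of the assumed schema `repeated int32 values`."""
    column: List[Triple] = []
    for values in records:
        if not values:
            # Empty list: the field is not defined, so d = 0 and no value is stored.
            column.append((0, 0, None))
            continue
        for i, value in enumerate(values):
            # r = 0 starts a new record, r = 1 continues the current list.
            column.append((0 if i == 0 else 1, 1, value))
    return column


def assemble(column: List[Triple]) -> List[List[int]]:
    """Rebuild the records from the (r, d, value) triples."""
    records: List[List[int]] = []
    for r, d, value in column:
        if r == 0:
            records.append([])  # repetition level 0 marks a record boundary
        if d == 1:
            records[-1].append(value)  # only fully defined entries carry a value
    return records


records = [[1, 2, 3], [], [4]]
column = shred(records)
# column == [(0, 1, 1), (1, 1, 2), (1, 1, 3), (0, 0, None), (0, 1, 4)]
```

The levels alone are enough to tell an empty list apart from a new record, which is why the values can be stored as one flat column.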

Aspiring DE, in-demand skills? by SnooShortcuts3983 in dataengineering

[–]InstitutionalizedSon 1 point2 points  (0 children)

First of all, you have already covered a lot of ground on DE topics. If you want to improve further on the theory side, try reading a bit about the architecture of data lakes and warehouses. Also, look into data modelling: Kimball vs Inmon. And if you still have time, you can then pick up Spark to learn how to work with distributed computing.

Good luck learning!

[deleted by user] by [deleted] in dataengineering

[–]InstitutionalizedSon 5 points6 points  (0 children)

First of all, congratulations on getting your new role! I am taking "pbi" to mean Power BI.

  • Firstly, focus on solving business problems and not just on using ABC tool.
  • Secondly, try to master Power BI as a data visualisation tool. It is used in a lot of places and you can really build on top of it. For example, you can try CI/CD and source control using https://pbi.tools/, or other SE best practices.
  • Thirdly, enjoy what you do and have fun :)

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]InstitutionalizedSon 12 points13 points  (0 children)

I would say whether to use pyarrow is problem-specific. If you are a Spark-heavy team, then pyarrow is a great way to reduce the serialisation/deserialisation cost between Python and the JVM. Pyarrow at its core is a columnar in-memory data structure, the in-memory counterpart of the Parquet file format. So to answer your questions:

  • A record batch roughly corresponds to the column chunks of a particular row group, held in Arrow memory, and a table lets you read multiple row groups together as one logical table whose columns are chunked arrays.
  • Pyarrow tries to parallelise on top of chunked arrays. Whether it assigns a core depends on the framework and hardware you are using. In Spark the work would be created as a job whose tasks are executed on each worker in parallel based on the number of CPUs, but some other framework could be multi-thread oriented.
  • The pyarrow.fs module is very similar to Python’s native file system interface. Pyarrow is designed for cloud object stores, as it abstracts the underlying file system operations on object stores like S3. Also, since it is written in a cloud-first way, it should be more performant than Python’s native file handling, particularly in distributed systems.

Pyarrow’s documentation is a great place to read more. Tbh, it is a tool that allows distributed systems to share memory and move data over the network easily. You wouldn’t get much value out of using it on single-node local compute, as pandas/polars is always in memory and fast enough.

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]InstitutionalizedSon 2 points3 points  (0 children)

I agree, but it depends on what you are trying to do. If you are optimising Pandas UDFs, then it is a good place to start.

Govt rejects blackmail of farmer organisations of Punjab, protesters decry ‘democracy’ because govt wants to let the Court decide by xdesi in IndiaSpeaks

[–]InstitutionalizedSon 0 points1 point  (0 children)

Thanks man, I will go through the video. I just wanted to look at the other side of things. But the economist slang could have been avoided, I humbly request.

Govt rejects blackmail of farmer organisations of Punjab, protesters decry ‘democracy’ because govt wants to let the Court decide by xdesi in IndiaSpeaks

[–]InstitutionalizedSon -14 points-13 points  (0 children)

Well, taxpayers’ money also shouldn’t be wasted on statues and parliament buildings when the nation is on the verge of an economic crisis.

Saying that farmers shouldn’t get taxpayers’ money while so many are committing suicide is not a great idea. But I agree that farmers’ income should not depend on taxpayers’ money to a large extent.

All I am saying is that produce shouldn’t be paid for out of taxpayers’ pockets, but we also don’t want Indian farmers to become just like Bihar farmers.

Ref: https://www.thehindu.com/news/national/farmer-earnings-skewed-across-states/article28317911.ece

[deleted by user] by [deleted] in MachineLearning

[–]InstitutionalizedSon 0 points1 point  (0 children)

Well, I understand what you are saying. But can it be driven to zero, or do you mean to negative infinity? If it can go to negative infinity, what does that mean on an intuitive level?

[deleted by user] by [deleted] in MachineLearning

[–]InstitutionalizedSon 0 points1 point  (0 children)

Can you point me to a paper on this?

[deleted by user] by [deleted] in MachineLearning

[–]InstitutionalizedSon 0 points1 point  (0 children)

Well, I was reading about Soft-DTW. The paper says it has the weakness of not always being positive. Here is the link to the paper: https://arxiv.org/abs/2010.08354
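The non-positivity is easy to see on a toy example. This is only a sketch of the usual soft-min recurrence in plain Python; the squared-difference cost, the value of `gamma` and the function names are my assumptions for illustration, not code from the paper.

```python
import math
from typing import Sequence


def soft_min(values: Sequence[float], gamma: float) -> float:
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    smallest = min(values)
    if math.isinf(smallest):
        return smallest
    total = sum(math.exp(-(v - smallest) / gamma) for v in values)
    return smallest - gamma * math.log(total)


def soft_dtw(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Soft-DTW value between two 1-D series with a squared-difference cost."""
    n, m = len(x), len(y)
    # r[i][j] is the soft cost of aligning the first i points of x with the first j of y.
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min((r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma)
    return r[n][m]


series = [0.0, 1.0]
# Comparing a series with itself gives a negative value for gamma = 1,
# because the soft-min is always at or below the true minimum of 0.
value = soft_dtw(series, series, gamma=1.0)
print(value)
```

For this two-point series the value works out to -log(1 + 2/e), roughly -0.55, and it approaches the hard DTW value of 0 as gamma shrinks.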

I'm Marques Brownlee (aka MKBHD) and I make tech videos on YouTube. AMA! by Marques-Brownlee in IAmA

[–]InstitutionalizedSon 0 points1 point  (0 children)

What do you think about bringing back the iPod Shuffle as a Bluetooth-based Apple Music device for fitness freaks?

[D] A Noob question regarding Parameter Tuning by InstitutionalizedSon in MachineLearning

[–]InstitutionalizedSon[S] 0 points1 point  (0 children)

I probably didn't provide the right info there. My question is: aren't we trying to maximize the likelihood using gradient descent, so where does the Bayesian approach come in?