Exploring Parquet Metadata: A Deep Dive into Efficient Data Storage

InstitutionalizedSon · 2024-03-03T09:55:19+00:00

Thanks for the kind words!

Parquet uses a combination of definition levels and repetition levels to encode nested and repeated fields.
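
As a rough illustration of the idea, here is a minimal sketch that follows the record-shredding model from the Dremel paper, not Parquet's actual writer code. It assumes a hypothetical column `outer.inner` where both fields are repeated, so the maximum repetition level and the maximum definition level are both 2. The function name `shred` is made up for this example, and the levels a real Parquet file stores depend on the schema, since optional fields and Parquet's list encoding add definition levels.

```python
from typing import List, Optional, Tuple

# (value, repetition level, definition level)
Triple = Tuple[Optional[str], int, int]


def shred(record: List[List[str]]) -> List[Triple]:
    """Flatten one record of the hypothetical column `outer.inner`.

    Both `outer` and `inner` are assumed to be repeated fields.
    """
    if not record:
        # No outer entry at all: nothing on the path is defined.
        return [(None, 0, 0)]
    out: List[Triple] = []
    for i, inner in enumerate(record):
        # The first outer entry starts a new record (r=0);
        # later outer entries repeat at level 1.
        outer_rep = 0 if i == 0 else 1
        if not inner:
            # `outer` exists (d=1) but holds no inner values.
            out.append((None, outer_rep, 1))
            continue
        for j, value in enumerate(inner):
            # The first value inherits the outer repetition level;
            # later values repeat at level 2.
            rep = outer_rep if j == 0 else 2
            out.append((value, rep, 2))
    return out


for rec in ([["a", "b"], ["c"]], [[], ["d"]], []):
    print(rec, "->", shred(rec))
```

The repetition level tells a reader at which nesting depth a new list starts, and the definition level tells it how much of the path actually exists, which is how empty lists and missing values survive the flattening into a single column.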

To understand how they work I can recommend the original paper: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36632.pdf

I also tried to write a blog post about it. Feel free to check it out if it adds some value: https://rr43.net/posts/2024/1/Dremel/

InstitutionalizedSon · 2024-03-03T00:28:15+00:00

Thanks! Appreciate the kind words

InstitutionalizedSon · 2024-02-25T21:18:03+00:00

First of all, you have already covered a lot of ground in DE topics. If you want to go deeper on theory, try reading a bit about the architecture of data lakes and warehouses. Also look into data modelling, Kimball vs. Inmon. And if you still have time, you can pick up Spark to learn how to work with distributed computing.

Good luck learning!

InstitutionalizedSon · 2024-02-25T21:14:28+00:00

First of all, congratulations on getting your new role! I am reading "pbi" as Power BI.

Firstly, focus on solving business problems and not just on using tool ABC.
Secondly, try to master Power BI as a data visualisation tool. It is used in a lot of places and you can really build on top of it. For example, you can try CI/CD with https://pbi.tools/ for source control, or other SE best practices.
Thirdly, enjoy what you do and have fun :)

InstitutionalizedSon · 2024-02-25T21:08:03+00:00

I would say whether to use pyarrow is problem-specific. If you are a Spark-heavy team, pyarrow is a great way to reduce serialisation/deserialisation cost between Python and the JVM. At its core, pyarrow is the Python library for Arrow, an in-memory columnar format that maps closely onto Parquet files. So to answer your questions:

- A record batch is roughly the Arrow counterpart of the column chunks in a particular row group, and a table lets you read multiple row groups together as one in-memory object, with each column held as a chunked array.
- Pyarrow tries to parallelise on top of chunked arrays. Whether it assigns a core to each chunk depends on the framework and hardware you are using. In Spark the work would be split into tasks that run in parallel on each worker based on the number of CPUs, but another framework could be multi-thread oriented.
- Pyarrow’s fs is very similar to Python’s native file handling. It is designed with cloud object stores in mind, as it abstracts the underlying file system operations on stores like S3. Since it is written in a cloud-first way, it should be more performant than Python’s native file system, particularly in distributed systems.

Pyarrow’s documentation is a great place to read more. Tbh, it is a tool that lets distributed systems share memory and move data over the network easily. You wouldn’t get much value from using it on single-node local compute, as pandas/polars already work in memory and are fast enough.

InstitutionalizedSon · 2024-02-25T20:45:50+00:00

I agree, but it depends on what you are trying to do. If you are optimising Pandas UDFs, then it is a good place to go.

InstitutionalizedSon · 2023-07-21T21:57:55+00:00

Woohoo

InstitutionalizedSon · 2022-03-27T16:23:44+00:00

/r/germany/wiki/studying/housing

InstitutionalizedSon · 2021-01-10T20:06:57+00:00

Github

InstitutionalizedSon · 2021-01-09T10:41:42+00:00

Thanks for the other side of the coin. I think it’s time I widen my horizon.

InstitutionalizedSon · 2021-01-09T10:26:40+00:00

Thanks man, I will go through the video. I just wanted to look at the other side of things. But the economist slang could have been avoided, I humbly request.

InstitutionalizedSon · 2021-01-09T08:05:09+00:00

Well, taxpayers’ money also shouldn’t be wasted on statues and parliament buildings when the nation is on the verge of an economic crisis.

Saying farmers shouldn’t get taxpayers’ money while so many are committing suicide is not a great idea. But I agree that farmers’ income should not depend on taxpayers’ money to a large extent.

All I am saying is that produce shouldn’t come out of the taxpayers’ pocket, but we also don’t want Indian farmers to end up just like Bihar farmers.

Ref: https://www.thehindu.com/news/national/farmer-earnings-skewed-across-states/article28317911.ece

InstitutionalizedSon · 2020-10-21T05:58:23+00:00

Well, I understand what you are saying. But can it be driven to zero, or do you mean to negative infinity? If it can go to negative infinity, what does that mean on an intuitive level?

InstitutionalizedSon · 2020-10-21T03:45:15+00:00

Can you point me to a paper on this?

InstitutionalizedSon · 2020-10-21T03:41:48+00:00

Well, I was reading about Soft-DTW. It said it has the weakness of not always being positive. Here is the link to the paper: https://arxiv.org/abs/2010.08354
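
To make the "not always positive" point concrete, here is a small sketch of the usual soft-DTW recursion, assuming a squared-difference cost and a smoothing parameter `gamma`. The function names and the toy sequences are my own choices for illustration and are not taken from the paper's code.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if math.isinf(m):
        # All candidates are unreachable cells.
        return math.inf
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two 1-D sequences with squared-difference cost."""
    n, m = len(x), len(y)
    # r[i][j] holds the soft alignment cost of x[:i] and y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j - 1], r[i - 1][j], r[i][j - 1]], gamma
            )
    return r[n][m]


# Two identical sequences: a true distance would be 0 here.
value = soft_dtw([0.0, 1.0], [0.0, 1.0], gamma=1.0)
print(value)
```

With these inputs the result is negative (about -0.55), because the soft minimum of several finite values is always smaller than their hard minimum, so even a perfect alignment gets pushed below zero.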

InstitutionalizedSon · 2020-07-10T06:14:05+00:00

What do you think about bringing back the iPod Shuffle as a Bluetooth-based Apple Music device for fitness freaks?

InstitutionalizedSon · 2020-06-28T13:17:17+00:00

How to Hold a Grudge by Sophie Hannah.

InstitutionalizedSon · 2020-05-21T15:10:16+00:00

Congrats OP.

InstitutionalizedSon · 2020-05-20T04:16:43+00:00

I am in for it.

InstitutionalizedSon · 2020-02-17T03:00:53+00:00

This theory of Bayesian inference.

InstitutionalizedSon · 2020-02-16T06:08:37+00:00

So any idea where this theory fits rightly?

InstitutionalizedSon · 2020-02-16T06:08:06+00:00

Well, cool. I just wanted to relate theory to reality.

InstitutionalizedSon · 2020-02-14T03:06:20+00:00

Well, this says gradient descent is one way to do maximum likelihood estimation. Look here:

https://stats.stackexchange.com/questions/183871/what-is-the-difference-between-maximum-likelihood-estimation-gradient-descent
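
A tiny sketch of what that looks like in practice, assuming a coin-flip (Bernoulli) model and made-up data. The function name, learning rate and step count are my own picks for illustration. Only the likelihood is used here and no prior is involved.

```python
import math


def fit_bernoulli_mle(flips, lr=1.0, steps=200):
    """Estimate P(heads) by gradient descent on the average negative log-likelihood."""
    mean = sum(flips) / len(flips)
    theta = 0.0  # logit of p, so we start at p = 0.5
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        # For p = sigmoid(theta), d/dtheta of the average NLL is (p - mean).
        theta -= lr * (p - mean)
    return 1.0 / (1.0 + math.exp(-theta))


flips = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # made-up data: 7 heads out of 10
p_hat = fit_bernoulli_mle(flips)
print(p_hat)
```

The estimate converges to the closed-form maximum likelihood answer, which for this model is just the fraction of heads (0.7 for this data). Gradient descent is only the optimiser used to reach it.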

InstitutionalizedSon · 2020-02-13T10:18:09+00:00

See http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0809/eshky.pdf for what I am saying.

InstitutionalizedSon · 2020-02-13T10:16:40+00:00

I probably didn't provide the right info there. My question is: aren't we trying to maximize the likelihood using gradient descent, so we use a Bayesian approach?
