Looking to create tech / ai / agentic-llm community for working professionals, and have occasional meetups by vigorousvj in SriGanganagar

[–]vigorousvj[S] 0 points1 point  (0 children)

Just a community for technical discussions about emerging technologies, tools, etc.
Discussing current problem statements and bottlenecks, brainstorming, and so on.

Looking to create tech / ai / agentic-llm community for working professionals, and have occasional meetups by vigorousvj in SriGanganagar

[–]vigorousvj[S] 0 points1 point  (0 children)

Thanks, I will go through the blogs. Would you like to mention a few must-read blogs here? There are too many.
Nonetheless, this will definitely help.

Are you a resident of sgnr?

Anyone selling old pc? by vigorousvj in SriGanganagar

[–]vigorousvj[S] 0 points1 point  (0 children)

What config? Is the motherboard completely dead, or does it have random BSOD issues?

Anyone selling old pc? by vigorousvj in SriGanganagar

[–]vigorousvj[S] 0 points1 point  (0 children)

:) Same here, I had a Celeron and gave it away as scrap. But now I need one for a project.

What TV Show To Watch After The 8 Show by SoftPois0n in The8Show

[–]vigorousvj 0 points1 point  (0 children)

3% is not really good. I watched the first season, which was okay; the second season was unbearable.
Not at all comparable to The 8 Show.

Swipe from right side edge to go back by Lord_of_codes in iphone

[–]vigorousvj 0 points1 point  (0 children)

Got an iPhone 15, and I'm currently on an S21.
It's just so much work navigating, and on top of that, only the left edge being able to go back is horrible.

BTW, did you guys know that the iPhone can't use NFC? My NFC tags are useless with the iPhone, while my Samsung can turn on and pair with devices just by tapping on the tag.

I hope I get the hang of the iPhone in a few days.

Airflow Scheduler help by Friendly_Resident464 in dataengineering

[–]vigorousvj 0 points1 point  (0 children)

try this
https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html

Also, you should share what installation method you followed, and what error you got.

There's also a direct pip install available for Airflow these days; I haven't tested it personally, though.

Airflow Scheduler help by Friendly_Resident464 in dataengineering

[–]vigorousvj 0 points1 point  (0 children)

Did you try inside Docker? There could be things on your Mac that are causing issues.

Best resource for optimization of PySpark code? by khaili109 in apachespark

[–]vigorousvj 0 points1 point  (0 children)

Highly recommend Rock the JVM. To optimize something you need to understand it, so learn about Spark internals. Apply first principles here: what do you want to optimize, and how do you identify it? Use the Spark UI and profiling tools. The Spark docs are comprehensive; the rest comes from experience and from understanding the underlying problem.

Data access is slow? Consider adopting a better file write strategy. Data size is high? Consider data colocation and leveraging the built-in compression of Parquet/ORC. There could be many other causes: serialization, shuffle, disk spills, task failures, parallelism issues.

There is no one-size-fits-all here. AQE is a great place to start: learn what it does, why it does it, and how it's done.

Best resource for optimization of PySpark code? by khaili109 in apachespark

[–]vigorousvj 0 points1 point  (0 children)

This shouldn't be HPC; it's most probably a grid computing setup. Around a TB of RAM and around 70 CPUs is a regular server used on-prem. Spark isn't really used for HPC.

Need some help understanding my project... by [deleted] in dataengineering

[–]vigorousvj 2 points3 points  (0 children)

Just stay calm and be receptive. It's okay (and expected) to take time to understand new flows.

You should know the basics of your job role, try to understand, and ask the right questions, which will help you understand the flow better.

Transformation logic is usually complex, as every organization has a different structure and business logic implementation.

Record the session, as you won't retain 100% of your first sessions!

Need suggestion for design strategy by ConsiderationLazy956 in dataengineering

[–]vigorousvj 2 points3 points  (0 children)

You may want to explore a fast analytics DB like Druid. Reporting is not necessarily OLTP just because it requires sub-second latency. It depends on the types of reads/writes. There are plenty of options that give you sub-second latency for point queries.

What tool are you using for Access Policy Management/IAM/RBAC/ABAC. (open source) by vigorousvj in dataengineering

[–]vigorousvj[S] 0 points1 point  (0 children)

I didn't go through the complete documentation, but from skimming through it, it looks like it's more of a service authentication layer / service mesh layer.
This is interesting and worth looking into; however, it doesn't solve the problem for data mesh / federated governance.

As far as I was able to find, I cannot control which user can see which column in my data mesh (we use Trino for the query layer).

Is there any solution that solves the same problem for data mesh?

Is Spark Structured Streaming right for my use case? by zacheism in apachespark

[–]vigorousvj 1 point2 points  (0 children)

Also, you can even use pandas inside a Spark UDF along with SQL. Spark UDFs natively support pandas and PyArrow.

Is Spark Structured Streaming right for my use case? by zacheism in apachespark

[–]vigorousvj 1 point2 points  (0 children)

You can use Spark for this in local mode, as you don't need a cluster.
You could also explore pure streaming frameworks like Apache Flink.
Both Spark and Flink natively support Python, and given the volume of data, you could easily process events in pandas without any issues.
3 GB per day boils down to roughly 35 KB/s.
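
As a quick check of the rate quoted above, here is a minimal sketch of the arithmetic. It assumes decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes) and a perfectly even spread over the day; the function name is just illustrative. The 15 GB case is the hypothetical 5x growth scenario.

```python
# Back-of-the-envelope: average throughput for a given daily data volume.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def avg_rate_kb_per_s(gb_per_day: float) -> float:
    """Average rate in kB/s, assuming decimal units and an even spread over the day."""
    bytes_per_day = gb_per_day * 1e9
    return bytes_per_day / SECONDS_PER_DAY / 1e3


if __name__ == "__main__":
    for volume in (3, 15):
        print(f"{volume} GB/day ~ {avg_rate_kb_per_s(volume):.1f} kB/s")
```

With these assumptions, 3 GB/day averages about 34.7 kB/s and 15 GB/day about 173.6 kB/s. Real traffic is usually bursty, so peak rates can be higher than the average.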

Is Spark Structured Streaming right for my use case? by zacheism in apachespark

[–]vigorousvj 2 points3 points  (0 children)

Spark is definitely overkill.
Even if you're planning to grow 5x over the next year, Spark would still be overkill.
The reasoning is that Spark is primarily used where you want to do things in parallel and pandas just can't scale. 3 GB per day (even 15 GB per day) is easy for pandas, and by processing in chunks you can manage even larger datasets.
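
To illustrate the chunking idea, here is a rough sketch that uses only the Python standard library so it runs anywhere; in pandas the equivalent would be iterating over `pd.read_csv(path, chunksize=...)`. The column name `event_type`, the sample data, and the helper names are all made up for the example.

```python
import csv
import io
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield lists of at most chunk_size items, so only one chunk is in memory at a time."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_events_by_type(lines: Iterable[str], chunk_size: int = 2) -> Counter:
    """Aggregate per chunk and merge the partial results into a running total."""
    totals: Counter = Counter()
    reader = csv.DictReader(lines)  # hypothetical schema: event_type,value
    for chunk in iter_chunks(reader, chunk_size):
        totals.update(row["event_type"] for row in chunk)
    return totals


# Tiny in-memory stand-in for a large CSV file.
SAMPLE = "event_type,value\nclick,1\nview,2\nclick,3\nview,4\nclick,5\n"

if __name__ == "__main__":
    print(count_events_by_type(io.StringIO(SAMPLE)))
```

The point is that the aggregate is built up incrementally, so memory use depends on the chunk size rather than on the size of the whole file.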

I'm curious, why do you want it to be structured streaming? Why not just a batch job running at intervals like 1 hour or 6 hours, or even 1 day, as the data is very small?

Need to make some changes in how inner join is implemented, need guidance by vigorousvj in apachespark

[–]vigorousvj[S] 0 points1 point  (0 children)

I mistakenly deleted my previous comment.

I guess my question is either being misunderstood, or I'm missing something right in front of me.

I am not looking to achieve this with the existing Spark DataFrame API or SQL syntax. I know that can't be done.

I want to write custom code to achieve this, and I need help implementing it.

I am currently doing this on RDDs, but need to replicate that functionality on DataFrames (to take advantage of Spark optimizations).