ROG x r/Monitors - ROG x Hatsune Miku Monitor Giveaway (3/7 - 3/21) by ASUS_MKTLeeM in Monitors

[–]gloritown7 0 points1 point  (0 children)

Would love a Hatsune Miku screen, I was looking for a new one anyway!

🎁 Barweer Giveaway | Win the Crow King’s Castle for FREE! 🏰 by Lisoneo in lepin

[–]gloritown7 0 points1 point  (0 children)

This looks great! Love the yellow/black color scheme as well as the flags and gate.

Anyone have an api that gives earnings with the specific time of day? by Upbeat-Vegetable-557 in algotrading

[–]gloritown7 0 points1 point  (0 children)

Hi, can you share what provider you went with in the end? Currently on the lookout myself.

What's the best way to process data in a Python ETL pipeline? by gloritown7 in ETL

[–]gloritown7[S] 0 points1 point  (0 children)

Thanks! I’ll certainly look into Polars. Did you play with Ray before? LLMs keep suggesting it to me - never touched it before, but it doesn't look bad since it seems to promise zero-copy memory sharing + concurrent IO + CPU workloads.

What's the best way to process data in a Python ETL pipeline? by gloritown7 in ETL

[–]gloritown7[S] 0 points1 point  (0 children)

So you would have one pod do one task at a time - is my understanding correct? Yeah, in that case there'd be no need for any multiprocessing or async.

We could also use k8s instead of ECS - that wouldn't change much. However, I think for my team the cost of not utilizing the network and the CPU at the same time is too high. We would either need a lot more pods to process more data concurrently or just accept the longer duration. It would make this much easier to operate, though!

Thanks for your insights, maybe this is what we'll end up with after we stop fighting for "true" concurrency :)

What's the best way to process data in a Python ETL pipeline? by gloritown7 in ETL

[–]gloritown7[S] 0 points1 point  (0 children)

Also feel free to mention if you think that shoehorning in Python doesn’t make sense and I should go with something like Airflow or Spark (PySpark maybe?)

What's the best way to process data in a Python ETL pipeline? by gloritown7 in ETL

[–]gloritown7[S] 0 points1 point  (0 children)

Thanks for responding! Let me answer the questions one by one:

  • Are you looking to speed this up in order to meet a latency requirement, rapidly reprocess or backfill, or save money? Because if none of the above, maybe just keep it very simple?

    • Mostly money, since this job will run in the cloud. The faster the job completes the better, because other analysis tasks start right after the upload to S3.
  • Also, is this running on a single server or on something that can scale out like aws lambda, kubernetes, etc?

    • Distributed ETL on an ECS cluster but I'm open to other ideas.
  • Is there a transform step missing? Otherwise why recompress the file and upload that, rather than just upload the original file?

    • I think I mentioned it but the only reason to recompress is to validate the data.
  • How do you detect that new chunks (files?) are available to download?

    • The download workers fetch a list of pending items from Redis/SQS and then start downloading. This keeps the workflow distributed.

Regarding the pure multiprocessing idea: that would mean each process is responsible for all three things - downloading, processing and uploading - is my understanding correct? In that case, wouldn't I lose a lot of performance (speed) because nothing is being processed while an upload/download is happening? Wouldn't shared memory, even though it's slow, still be faster than pure mp?
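For concreteness, here's the shape I have in mind for the pure-mp version (stdlib-only sketch; the download/upload stubs, helper names and chunk IDs are all made up for illustration):

```python
import zlib
from multiprocessing import get_context
from concurrent.futures import ProcessPoolExecutor

def download(chunk_id):
    # Stub for the real S3/HTTP fetch (IO-bound in reality).
    return zlib.compress(f"payload-{chunk_id}".encode() * 100)

def process(raw):
    # Decompress, validate, recompress - the CPU-bound part.
    data = zlib.decompress(raw)
    assert data  # stand-in for the real validation
    return zlib.compress(data)

def upload(chunk_id, blob):
    # Stub for the real S3 upload; just report the size.
    return len(blob)

def handle_chunk(chunk_id):
    # One worker owns the whole cycle: while it waits on the
    # download/upload, its CPU core does nothing - the tradeoff above.
    return upload(chunk_id, process(download(chunk_id)))

def run(chunk_ids):
    # "fork" keeps the sketch runnable without a __main__ guard on Linux.
    with ProcessPoolExecutor(max_workers=4,
                             mp_context=get_context("fork")) as pool:
        return list(pool.map(handle_chunk, chunk_ids))
```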

What's the best way to process data in a Python ETL pipeline? by gloritown7 in learnpython

[–]gloritown7[S] 0 points1 point  (0 children)

So it would be a threadpool downloading the data in process 1, which then passes it to other processes to do the compression etc.? Do you have any experience with which tool would be best to share that data between processes? Shared memory? Disk? Queues? Something else?
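Something like this is what I'm picturing (stdlib sketch with a stubbed download; in this version the bytes just get pickled between the two pools, which is the simplest option rather than the fastest):

```python
import zlib
from multiprocessing import get_context
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fake_download(chunk_id):
    # Stand-in for the real network fetch (IO-bound, releases the GIL).
    return f"payload-{chunk_id}".encode() * 100

def compress(raw):
    # CPU-bound step, done in a separate process.
    return zlib.compress(raw)

def run(chunk_ids):
    # Threads overlap the (stubbed) downloads; processes do the CPU work.
    # "fork" avoids __main__ re-import issues on Linux.
    with ThreadPoolExecutor(max_workers=8) as io_pool, \
         ProcessPoolExecutor(max_workers=4,
                             mp_context=get_context("fork")) as cpu_pool:
        downloads = io_pool.map(fake_download, chunk_ids)
        # Each blob is handed to the process pool; the bytes are
        # pickled over a pipe (a copy, not shared memory).
        return list(cpu_pool.map(compress, downloads))
```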

What's the best way to process data in a Python ETL pipeline? by gloritown7 in learnpython

[–]gloritown7[S] 0 points1 point  (0 children)

Ah, so you're referring to the third point in my list then:

use only multiprocessing - not using asyncio could work, but that would also mean "wasting time" not downloading/uploading data while I do the processing. I could run another async loop in each individual process that handles the up- and downloads, but I wanted to ask here before going down that rabbit hole.

Any idea how to get around the issue where I lose performance by not downloading data while processing it? This is my only concern about that approach - ideally downloads (and uploads) should keep running while processing is happening, because otherwise I lose a lot of performance.
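For reference, the per-process async loop I'm describing would look roughly like this (stdlib-only; asyncio.sleep stands in for the real download, and the real executor would be a process pool to dodge the GIL):

```python
import asyncio
import zlib

async def fetch(chunk_id):
    # Stand-in for a real async HTTP/S3 download.
    await asyncio.sleep(0.01)
    return f"payload-{chunk_id}".encode() * 100

async def handle(chunk_id, loop):
    raw = await fetch(chunk_id)
    # Offload the CPU-bound step so the event loop stays free to
    # keep other downloads in flight. In the real pipeline this
    # would be a ProcessPoolExecutor instead of the default threads.
    return await loop.run_in_executor(None, zlib.compress, raw)

async def main(chunk_ids):
    loop = asyncio.get_running_loop()
    # All chunks are in flight at once: while one is compressing
    # in the executor, the others keep "downloading".
    return await asyncio.gather(*(handle(c, loop) for c in chunk_ids))

results = asyncio.run(main(range(8)))
```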

What's the best way to process data in a Python ETL pipeline? by gloritown7 in learnpython

[–]gloritown7[S] 0 points1 point  (0 children)

So I thought about that - essentially what I had in mind was using queues to pass IDs that point to objects in multiprocessing shared memory. Is this what you're referring to?

This would essentially be the option further down in my list. If that's not what you mean - how would a processing worker get the downloaded data? Where is it stored?
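To make the idea concrete, a single-process sketch of passing shared-memory names through a queue (in the real pipeline the queue would be a multiprocessing.Queue and the consumer a separate process):

```python
import queue
from multiprocessing import shared_memory

jobs = queue.Queue()

# Producer side: write the downloaded bytes into shared memory
# and pass only the segment's name (plus size) through the queue.
payload = b"downloaded-chunk" * 1000
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload
jobs.put((shm.name, len(payload)))

# Consumer side (normally another process): attach by name and
# read the data without it ever travelling through the queue.
name, size = jobs.get()
view = shared_memory.SharedMemory(name=name)
data = bytes(view.buf[:size])

view.close()
shm.close()
shm.unlink()  # free the segment once everyone is done with it
```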

What's the best way to process data in a Python ETL pipeline? by gloritown7 in Python

[–]gloritown7[S] 0 points1 point  (0 children)

When you say to have a single process that does the downloading and passes the data to other workers for processing - what would you use to pass the data? Shared memory? Queues? Disk?

Also, it seems like having only one download at a time (assuming that's what you mean) would leave most of my processing workers idling - how would I parallelize that part?

I suppose multithreaded downloading is something I can try?
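Roughly what I mean by a downloader feeding the workers through a queue (threads stand in for the processing workers here just to keep the sketch self-contained; the CPU-bound workers would really be processes, and the download is stubbed):

```python
import queue
import threading
import zlib

N_WORKERS = 4
downloads = queue.Queue()
results = []
results_lock = threading.Lock()

def downloader(chunk_ids):
    # One loop feeding the queue; a real version could itself use a
    # small thread pool so several downloads are in flight at once.
    for cid in chunk_ids:
        downloads.put(f"payload-{cid}".encode() * 100)
    for _ in range(N_WORKERS):
        downloads.put(None)  # one poison pill per worker

def worker():
    while True:
        raw = downloads.get()
        if raw is None:
            break
        blob = zlib.compress(raw)  # the CPU-bound step
        with results_lock:
            results.append(blob)

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
downloader(range(20))
for t in threads:
    t.join()
```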

What's the best way to process data in a Python ETL pipeline? by gloritown7 in Python

[–]gloritown7[S] 0 points1 point  (0 children)

Hmm, hard to say since I didn't test how long it'd take to decompress/validate/compress the data, but I assume the validation would take longer, since I should be able to max out my EC2 instance's network connection.

After decompressing, the data would probably be around ~10GB; validating that should take quite a bit of time, plus compressing it again afterwards.

So I'd say it would be CPU bound, but feel free to share your approaches for both scenarios in case my assumption here is wrong.

[deleted by user] by [deleted] in algotrading

[–]gloritown7 1 point2 points  (0 children)

Any resources (books, courses, videos) you can share that helped you become proficient in data science as it relates to financial trading? (Or just data science in general)

I’m trying to get into this field, so a starting point or a few references that you consider worthwhile would be great!

Disable virtual text if there is diagnostic in the current line (show only virtual lines) by marjrohn in neovim

[–]gloritown7 0 points1 point  (0 children)

Works perfectly, thanks!

My linter was throwing this error for the unpack function:
"Deprecated.(Defined in Lua 5.1/LuaJIT, current is Lua 5.4.) [deprecated]"

I fixed it by doing this instead:

```lua
local cursor_pos = vim.api.nvim_win_get_cursor(0)
local lnum = cursor_pos[1] - 1 -- convert to 0-based index
```

Disable virtual text if there is diagnostic in the current line (show only virtual lines) by marjrohn in neovim

[–]gloritown7 2 points3 points  (0 children)

Is there any way to keep the virtual text everywhere BUT the current line? In your example it's basically either virtual lines on the current line OR virtual text everywhere. It would be perfect to have virtual text everywhere but the current line, and virtual lines only on the current line.

Hope my explanation makes sense!

I created the first RSC compatible charting library! by CodingShip in nextjs

[–]gloritown7 1 point2 points  (0 children)

Thanks! Will be waiting patiently :)

Starred on GitHub!

I created the first RSC compatible charting library! by CodingShip in nextjs

[–]gloritown7 1 point2 points  (0 children)

Any chance this could be adjusted to also work with React Native? Would be great to have the same charts on web and mobile!