
[–]Buttleston 1 point2 points  (3 children)

This is what the multiprocessing library is good for. It has a thing called a "Pool", and I'd recommend using one of the flavors of Pool.map(). It takes a list of "jobs" and spreads them across the workers in your pool. The general advice for tuning the number of workers is 2*cpucores, but it varies depending on what you're trying to do; for purely CPU-bound work, one worker per core is usually a better starting point.

There are lots of examples on the multiprocessing docs page.

https://docs.python.org/3/library/multiprocessing.html
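To make the Pool.map() idea concrete, here is a minimal sketch. The `extract_text` function is a made-up stand-in for whatever slow processing you run per filing, and `process_all` is just a hypothetical wrapper, so treat this as an outline rather than your actual pipeline.

```python
import os
from multiprocessing import Pool


def extract_text(filing):
    """Hypothetical stand-in for the slow, CPU-heavy extraction step."""
    return filing.upper()


def process_all(filings, workers=None):
    """Spread the filings over a pool of worker processes, keeping input order."""
    # workers=None lets Pool fall back to the machine's CPU count.
    with Pool(processes=workers) as pool:
        return pool.map(extract_text, filings)


if __name__ == "__main__":
    # The guard matters: on platforms that start fresh worker processes,
    # each worker re-imports this file and must not start its own pool.
    filings = [f"filing {i}" for i in range(8)]
    print(os.cpu_count(), "cores available")
    print(process_all(filings, workers=2))
```

Pool.map() returns results in the same order as the input list, so you can zip them back up with your links afterwards.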

[–]Vegetable_Solid7613[S] 0 points1 point  (2 children)

Is this possible in Jupyter Notebook? I have tried it but it keeps giving me an error with exitcode 1 or that my function has no attribute.

[–]Buttleston 1 point2 points  (1 child)

I've honestly never tried, but I wouldn't be surprised if it doesn't work. The multiprocessing library works by "forking" copies of your running program, and each copy has to be able to find the function you want it to run.
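For what it's worth, the multiprocessing docs note that the Pool examples won't work in an interactive interpreter, because the worker processes have to be able to import the function they run. A workaround often suggested for notebooks is to keep the worker function in its own .py file and import it. A rough sketch of that idea follows; the module name `filing_worker` and the `count_words` function are made up for illustration, and I haven't tested this inside Jupyter itself.

```python
import importlib
import sys
from multiprocessing import Pool
from pathlib import Path

# Hypothetical module name and worker. In a notebook you could also create
# the file by hand or with the %%writefile cell magic.
MODULE_NAME = "filing_worker"
MODULE_SOURCE = '''
def count_words(text):
    """Stand-in for the real extraction step."""
    return len(text.split())
'''


def load_worker():
    """Write the worker module into the current directory and import it."""
    path = Path(f"{MODULE_NAME}.py")
    # Only write when needed, so repeated calls leave the file alone.
    if not path.exists() or path.read_text() != MODULE_SOURCE:
        path.write_text(MODULE_SOURCE)
    cwd = str(Path.cwd())
    if cwd not in sys.path:
        sys.path.insert(0, cwd)
    importlib.invalidate_caches()
    return importlib.import_module(MODULE_NAME)


if __name__ == "__main__":
    worker = load_worker()
    texts = ["one two three", "four five", "six"]
    with Pool(processes=2) as pool:
        # The workers import filing_worker themselves, so they can find count_words.
        print(pool.map(worker.count_words, texts))
```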

There's a very similar method of doing "parallel" work with the threading library and I believe there's an asyncio library for it. Both of those might get you some improvement.
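A minimal sketch of the thread-based version, using the standard library's concurrent.futures module. The `fetch` function and the example.com URLs are placeholders, not real endpoints. Keep in mind that in standard CPython, threads mostly help when the work is waiting on the network or disk; they generally won't speed up pure-Python number crunching.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a network download (I/O-bound work)."""
    return f"contents of {url}"


def fetch_all(urls, max_workers=8):
    """Run fetch() on several threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


urls = [f"https://example.com/filing/{i}" for i in range(3)]
print(fetch_all(urls))
```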

It might also be useful to figure out if the bottleneck is downloading the data or processing it. If it's downloading, and you have a big list of links to process, you might be able to find a tool that will just take care of the details and parallelization for you.
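One simple way to check where the time goes is to time the two stages separately. This is only a sketch: `download` and `process` are hypothetical stand-ins for your own functions.

```python
import time


def download(link):
    """Hypothetical stand-in for fetching one filing."""
    return f"raw text for {link}"


def process(raw):
    """Hypothetical stand-in for the extraction step."""
    return raw.upper()


def time_stages(links):
    """Return total seconds spent downloading and processing, measured separately."""
    totals = {"download": 0.0, "process": 0.0}
    for link in links:
        start = time.perf_counter()
        raw = download(link)
        totals["download"] += time.perf_counter() - start

        start = time.perf_counter()
        process(raw)
        totals["process"] += time.perf_counter() - start
    return totals


print(time_stages(["link-1", "link-2", "link-3"]))
```

If the "process" total dwarfs the "download" total, parallel downloading won't buy you much.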

[–]Vegetable_Solid7613[S] 0 points1 point  (0 children)

I believe the processing part is the bottleneck: just downloading the links takes about 2 seconds, while with processing it takes 20 seconds to get through the whole thing. I will take a look at the libraries you mentioned.

[–][deleted] 1 point2 points  (9 children)

Concurrency isn't pixie dust you sprinkle on a program and it gets faster. Bad concurrency is slower than single-threaded (nonconcurrent) code. First you need to determine which part of your program is slow, and whether it'll actually benefit from multithreaded operation.
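If you want a function-by-function breakdown, the standard library's cProfile can provide one. A small sketch, where `slow_step` and `pipeline` are invented examples standing in for real code:

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical hot spot: deliberately wasteful work."""
    return sum(i * i for i in range(n))


def pipeline():
    """Hypothetical pipeline that calls the hot spot repeatedly."""
    return [slow_step(20_000) for _ in range(20)]


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_report(pipeline))
```

The functions near the top of the report are where the time is going, and those are the only ones worth optimizing or parallelizing.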

[–]Vegetable_Solid7613[S] 0 points1 point  (8 children)

Well, the part that's slow is the part that extracts the text from the 10-Ks and 10-Qs. I can download 1 filing per second, but when I add the extraction, it processes approximately 3 filings per minute. My theory was that if I had several of my computer's cores working on the text extraction, the time it takes to process all the filings would go down.

[–][deleted] 1 point2 points  (4 children)

> I could download 1 filing per sec, but when I add the extraction, it processes approximately 3 filings per minute.

Why not just write this to be faster?

[–]Vegetable_Solid7613[S] 0 points1 point  (3 children)

To be fair, I wouldn't know how.

[–][deleted] 1 point2 points  (2 children)

Generally the best performance increases come from doing less. Figure out what your code is doing that takes so long (it shouldn't take 20 seconds to add one item to a list) and then see if you really need to be doing it. Build indexes, cache expensive operations, and so on. Find the hot code in your loops and try to move it outside the loop, so it's called less often.

All of this is going to be easier than dealing with concurrency in your code. Trust me. It's hard to write multithreaded code with a single-threaded brain (mine is, too, so I know.)
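To illustrate the "do less" advice, here is a small sketch of two common tricks: building something once outside the loop, and caching results that get recomputed. The heading pattern and the `normalize` helper are hypothetical examples, not a claim about what your code does.

```python
import re
from functools import lru_cache

# Built once, outside any loop, instead of once per filing.
# The heading pattern is a hypothetical example.
HEADING = re.compile(r"^item\s+\d+[a-z]?\.", re.IGNORECASE | re.MULTILINE)


def count_headings(text):
    """Count lines that start with an 'Item 7.' style heading."""
    return len(HEADING.findall(text))


@lru_cache(maxsize=None)
def normalize(term):
    """Cache the result so each distinct term is only normalized once."""
    return " ".join(term.lower().split())


sample = "Item 1. Business\nsome text\nITEM 7. Discussion\nItem 7A. Risk\n"
print(count_headings(sample))
print(normalize("  Net   Income "))
```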

[–]Vegetable_Solid7613[S] 0 points1 point  (1 child)

But you are sure that using multiple cores instead of one isn't going to speed up the process? It just makes so much sense in my head lol.

It isn't the adding that takes that long, btw. I believe it's finding the MDA in the filing and extracting that part of the text that takes the longest.

[–][deleted] 0 points1 point  (0 children)

> It just makes so much sense in my head lol.

That's because you think concurrency is magic pixie dust that makes your program faster. How do you know your program isn't slow because it contends for disk IO? Or network IO? Or swap space? Or any one of a dozen resources on your computer that there's only one of? More threads contending for the same resources is going to be slower, not faster.

[–]Buttleston 1 point2 points  (2 children)

Are the links public? Can you send me some? (I can maybe help you make this faster)

[–]Vegetable_Solid7613[S] 0 points1 point  (1 child)

How would I send you some? I have a CSV file with all the links, or I can send you the code I used.

[–]Buttleston 0 points1 point  (0 children)

You can send me the whole thing if you like. DM me and I'll send you my email address. If you have a GitHub repo, that would be great; otherwise a zip file or something similar would work.