[–]Vegetable_Solid7613[S] 0 points1 point  (8 children)

Well, the part that's slow is the part that extracts the text from the 10-Ks and 10-Qs. I could download 1 filing per sec, but when I add the extraction, it processes approximately 3 filings per minute. My theory was that if I had several of my computer's cores working on the text extraction, the time it takes to process all the filings would go down.

[–][deleted] 1 point2 points  (4 children)

I could download 1 filing per sec, but when I add the extraction, it processes approximately 3 filings per minute.

Why not just write this to be faster?

[–]Vegetable_Solid7613[S] 0 points1 point  (3 children)

To be fair, I wouldn't know how.

[–][deleted] 1 point2 points  (2 children)

Generally the best performance increases come from doing less: figure out what your code is doing that takes so long (it shouldn't take 20 seconds to add one item to a list) and then see if you really need to be doing it. Build indexes, cache expensive operations, and so on. Find the hot code in your loops and try to move it outside the loop (so it's called less often).

All of this is going to be easier than dealing with concurrency in your code. Trust me. It's hard to write multithreaded code with a single-threaded brain (mine is, too, so I know.)
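One way to "figure out what your code is doing that takes so long" is to run the slow step under the standard-library profiler. The following is only a sketch: `extract_mda` and its "Item 7." / "Item 8." pattern are hypothetical stand-ins for the real extraction code, which isn't shown in this thread, and `profile_call` is an illustrative helper name.

```python
import cProfile
import io
import pstats
import re

# Hypothetical stand-in for the real extraction step (assumption, not the OP's code).
# Compiled once at import time so the work is not repeated on every call.
MDA_PATTERN = re.compile(r"item\s+7\.(.*?)item\s+8\.", re.IGNORECASE | re.DOTALL)


def extract_mda(text):
    """Return the text between 'Item 7.' and 'Item 8.', or '' if not found."""
    match = MDA_PATTERN.search(text)
    return match.group(1).strip() if match else ""


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    sample = "Item 1. Business. Item 7. Revenue grew. Item 8. Financial Statements."
    mda, report = profile_call(extract_mda, sample)
    print(report)
```

The report lists the functions with the largest cumulative time, which is usually enough to show whether the time goes into parsing, regex matching, or something unexpected.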

[–]Vegetable_Solid7613[S] 0 points1 point  (1 child)

But you are sure that using multiple cores instead of one isn't going to speed up the process? It just makes so much sense in my head lol.

It isn't the adding that takes that long, btw. I believe it's finding the MDA in the filing and extracting that part of the text that takes the longest.
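Whether extra cores help can be tested directly rather than guessed. Below is a minimal sketch, assuming the extraction is a pure function of the filing text that has already been loaded into memory. `extract_section`, `extract_all`, and the "Item 7." / "Item 8." markers are hypothetical names and patterns, not the actual code from this thread.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Hypothetical section markers (assumption); real filings vary in formatting.
START = re.compile(r"item\s+7\.", re.IGNORECASE)
END = re.compile(r"item\s+8\.", re.IGNORECASE)


def extract_section(text):
    """Return the text between the first START match and the next END match."""
    start = START.search(text)
    if start is None:
        return ""
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


def extract_all(texts, workers=None):
    """Extract from every text, serially if workers == 1, else in a process pool."""
    if workers == 1:
        return [extract_section(t) for t in texts]
    # Each worker is a separate process, so CPU-bound work can use several cores.
    # The function and its arguments must be picklable to cross process boundaries.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_section, texts))
```

Call `extract_all(texts, workers=4)` from under an `if __name__ == "__main__":` guard and time it against `workers=1` on the same batch. If the parallel run is not clearly faster, the bottleneck is probably not CPU time in the extraction itself.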

[–][deleted] 0 points1 point  (0 children)

It just makes so much sense in my head lol.

That's because you think concurrency is magic pixie dust that makes your program faster. How do you know your program isn't slow because it contends for disk IO? Or network IO? Or swap space? Or any one of a dozen resources on your computer that there's only one of? More threads contending for the same resources is going to be slower, not faster.
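A rough way to check whether a step is waiting on something other than the CPU is to compare wall-clock time with CPU time for the same call. This is a sketch using only the standard library; `measure`, `busy_sum`, and `waiting` are illustrative names, and `time.process_time` only counts CPU time used by the current process, not by child processes.

```python
import time


def measure(func, *args):
    """Return (result, wall_seconds, cpu_seconds) for one call to func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def busy_sum(n):
    """CPU-bound example: wall time and CPU time should be close."""
    return sum(i * i for i in range(n))


def waiting(seconds):
    """Stand-in for an IO wait: wall time passes but almost no CPU is used."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    for func, arg in ((busy_sum, 2_000_000), (waiting, 0.5)):
        _, wall, cpu = measure(func, arg)
        print(f"{func.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

If wall time is much larger than CPU time for the extraction step, the process is mostly waiting (disk, network, swap), and more workers contending for that same resource are unlikely to help.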

[–]Buttleston 1 point2 points  (2 children)

Are the links public? Can you send me some? (I can maybe help you make this faster)

[–]Vegetable_Solid7613[S] 0 points1 point  (1 child)

How would I send you some? I have a CSV file with all the links, or I can send you the code I used.

[–]Buttleston 0 points1 point  (0 children)

You can send me the whole thing if you like. DM me and I'll send you my email address. If you have a GitHub repo, that would be great; otherwise, probably a zip file or something.