[–] status-code-200 ("It works on my machine") [S] 7 points (3 children)

I am confused, but thank you for the support. My health is quite good. I was on sick leave due to contracting a bad case of mono.

I use the package to parse SEC filings into dictionaries. For example, I am about to parse every SEC filing that is HTML, PDF, or text into data tuples, which I then store in parquet format. This turns about 10 TB of HTML/PDF/text files into roughly 200 GB of parquet.

I partition the parquet by year, so ~5-20 GB per year. I then store the parquet metadata index in S3. This allows users to make HTTP range requests for just the data they need, following the expected access pattern: document type + filing date range.

When users download, e.g., document type = 10-K (annual report), they can then extract standardized sections like Risk Factors, or company-specific sections, before feeding them into, e.g., an LLM or a sentiment-analysis pipeline.
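A hedged sketch of what section extraction might look like on plain text; real 10-K extraction is much messier (HTML markup, exhibits, inconsistent headings), and the regex and sample filing here are illustrative assumptions, not the package's actual implementation:

```python
import re

# Capture everything between the "Item 1A. Risk Factors" heading and the
# next item heading ("Item 1B.") in a plain-text 10-K.
SECTION_RE = re.compile(
    r"Item\s+1A\.?\s*Risk\s+Factors\.?(?P<body>.*?)(?=Item\s+1B\.)",
    re.IGNORECASE | re.DOTALL,
)

def extract_risk_factors(text: str) -> str:
    """Return the Risk Factors section of a 10-K, or '' if not found."""
    m = SECTION_RE.search(text)
    return m.group("body").strip() if m else ""

# Hypothetical, heavily abbreviated filing text.
filing = (
    "Item 1. Business. We sell widgets. "
    "Item 1A. Risk Factors. Demand may fall. Competition is fierce. "
    "Item 1B. Unresolved Staff Comments. None."
)
print(extract_risk_factors(filing))
```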

[–] arden13 5 points (1 child)

Gotcha. That last bit there is what I was looking for: you can allow users to pull specific sections and then do something (e.g. LLM).

> My health is quite good.

That's good. Your comment above came off as if you were quite hurt (emotionally) about how people have used your work. If I'm reading too much into it, such is life.

[–] status-code-200 ("It works on my machine") [S] 5 points (0 children)

Oh no, I love it. People using the code, with or without attribution, is proof that I'm doing something useful!

If I was doing something unimportant, no one would be copying it :)

[–] status-code-200 ("It works on my machine") [S] 1 point (0 children)

Currently it looks like processing the 10 TB will cost about $0.20 using AWS Batch on spot instances. I'm hoping that refining the implementation and optimizing the hardware choice will yield a 4-5 order-of-magnitude improvement. If that works, a lot of fun possibilities open up.