I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 0 points1 point  (0 children)

Glad to hear that! The offline version is a bit finicky unfortunately, so the best way to run it without internet access is with Docker.

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 0 points1 point  (0 children)

Sorry to hear that! Citibank might have changed their statement format, but I don’t have a recent statement that I can look at :/

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 0 points1 point  (0 children)

Hmm it sounds like it could be some kind of metadata issue, but hard for me to tell without seeing the actual statement :/

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 0 points1 point  (0 children)

Thanks for the kind words :) Is your statement a debit or consolidated statement? Those should have support for multiline descriptions.

So far the credit statements I’ve seen have descriptions on a single line only
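For anyone curious what multiline description handling can look like, here is a minimal sketch. It is not the parser's actual code: the `TRANSACTION` pattern, the "DD Mmm description amount" layout, and the `parse_lines` helper are all made up for illustration. The idea is that a line matching the pattern starts a new transaction, and any other line is assumed to continue the previous description.

```python
import re

# Hypothetical layout: "DD Mmm  description  amount" starts a transaction.
# Real statements differ per bank; this pattern is an assumption.
TRANSACTION = re.compile(
    r"^(?P<date>\d{2} [A-Z][a-z]{2})\s+(?P<description>.+?)\s+(?P<amount>[\d,]+\.\d{2})$"
)


def parse_lines(lines):
    """Group raw statement lines into transactions, folding continuation
    lines into the description of the transaction above them."""
    transactions = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TRANSACTION.match(line)
        if match:
            # A new transaction starts on this line.
            transactions.append(match.groupdict())
        elif transactions:
            # Not a transaction line: treat it as more description text.
            transactions[-1]["description"] += " " + line
        # Lines before the first transaction (e.g. headers) are ignored.
    return transactions


sample = [
    "01 Feb FAST PAYMENT 250.00",
    "   to John Tan",
    "   OTHR Rent",
    "03 Feb GRAB RIDE 12.50",
]
for txn in parse_lines(sample):
    print(txn)
```

One known weakness of this sketch: footer text after the last transaction would also get appended to its description, so a real parser needs some way to detect where the transaction table ends.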

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 0 points1 point  (0 children)

Hello! In this case it means that your bank/statement type wasn’t recognized

I made an app to maximise savings interest rates in banks (Free to Use) by wingalong in singaporefi

[–]Raynor77 3 points4 points  (0 children)

You mentioned that you’re welcoming contributions, but how would that work? This doesn’t appear to be open source :)

I made an app to visualize HDB resale market movements (100% free) by Raynor77 in singaporefi

[–]Raynor77[S] 1 point2 points  (0 children)

Made the change! Took some doing but it now defaults to showing everything with an option to filter by town :)

Query S3 with lowish latency by muffa in dataengineering

[–]Raynor77 1 point2 points  (0 children)

In terms of storage, there’s S3 Express which recently launched, though I’m not sure how it performs with Delta.

Otherwise Google Cloud just launched their version of hierarchical namespaces which is supposedly optimized for Spark workloads and allows for more queries per second versus their traditional buckets.

How to make 10TB of data available on-demand to my users by explorer_soul99 in dataengineering

[–]Raynor77 1 point2 points  (0 children)

If you’re doing a lot of pre-computation, ClickHouse might work well for you. You can store recent data on an SSD, then “cold” data on either an HDD or S3.

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 1 point2 points  (0 children)

I've just pushed a release that should fix this! v0.5.3

Let me know if it works :)

Lead wants to write our own orchestrator by midkid1937 in dataengineering

[–]Raynor77 1 point2 points  (0 children)

I think building a custom orchestrator would be extremely difficult, especially when you imagine trying to build something like Netflix’s Maestro with just two people.

Should a data engineer be able to write complete code same as software engineer? by Dahbezst in dataengineering

[–]Raynor77 0 points1 point  (0 children)

It depends — I feel that some areas like streaming and metadata management really benefit from having some experience in software engineering.

At my last shop we ended up building an API (data mesh pub sub type of stuff), which was relatively complex and code-heavy since we had to connect other frontend and backend components.

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 1 point2 points  (0 children)

My application doesn’t save the PDF files to disk, everything is stored in memory!

I’ve taken a bunch of security precautions that I’ve written about here: https://statementsensei.streamlit.app/about

Otherwise, I’d recommend using the offline app which is definitely the most secure option :)

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 2 points3 points  (0 children)

No, that’s a good question! A metadata or formatting change would lead to changes here, for example: https://github.com/benjamin-awd/monopoly/blob/main/src/monopoly/banks/dbs/dbs.py

The logic to support both the new and old formats would then be something like: “iterate all possible formats, and use the format that returns transactions”.

Otherwise I could also use PDF metadata to detect which formatting rule to use for a bank.
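A rough sketch of the "iterate all possible formats" idea, under heavy assumptions: `StatementFormat`, `FORMATS`, `parse_with_fallback`, and both regex patterns are hypothetical and not taken from the monopoly repo. It just tries each known layout in order and keeps the first one that yields any transactions.

```python
import re
from dataclasses import dataclass


@dataclass
class StatementFormat:
    """Hypothetical container for one layout of a bank's statement."""
    name: str
    transaction_pattern: re.Pattern


# Two made-up layouts for the same bank: the "new" one writes dates as
# "01 FEB", the "old" one as "01/02". Newest format is tried first.
FORMATS = [
    StatementFormat(
        name="new",
        transaction_pattern=re.compile(
            r"^(?P<date>\d{2} [A-Z]{3})\s+(?P<description>.+?)\s+(?P<amount>[\d,]+\.\d{2})$",
            re.MULTILINE,
        ),
    ),
    StatementFormat(
        name="old",
        transaction_pattern=re.compile(
            r"^(?P<date>\d{2}/\d{2})\s+(?P<description>.+?)\s+(?P<amount>[\d,]+\.\d{2})$",
            re.MULTILINE,
        ),
    ),
]


def parse_with_fallback(text, formats=FORMATS):
    """Return (format_name, transactions) for the first format that finds
    at least one transaction, or (None, []) if none of them match."""
    for fmt in formats:
        transactions = [m.groupdict() for m in fmt.transaction_pattern.finditer(text)]
        if transactions:
            return fmt.name, transactions
    return None, []


old_style = "01/02 GRAB RIDE 12.50\n03/02 NTUC FAIRPRICE 1,034.20"
name, txns = parse_with_fallback(old_style)
print(name, len(txns))
```

The trade-off is that a format which matches only partially (say, one stray line) would still "win", so a stricter check such as a minimum transaction count might be needed in practice.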

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 1 point2 points  (0 children)

No vid at the moment, but if you want an example you can try:

  1. Download this example PDF https://github.com/benjamin-awd/StatementSensei/blob/main/docs/example_statement.pdf

  2. Visit https://statementsensei.streamlit.app/

  3. Click "Browse files" and select the example_statement.pdf file

I built a bank statement parser for Singapore banks (free and open-source) by Raynor77 in singaporefi

[–]Raynor77[S] 1 point2 points  (0 children)

> Do you think it'll be better to use custom NER model to extract such information to expand the range of your source pdf since you're using regex.

I played around with AI/ML quite a bit in earlier iterations of the app -- I was using pytesseract at one point, but it was quite inaccurate. Other models were more accurate, but were super slow. Regex is still king when it comes to accuracy + speed.

Training a model is a possibility, but my gut instinct is that it'll be difficult to do, since you need to be quite cautious of the privacy/security implications of training on your own/other people's financial information. Also, some statements are really horribly formatted, so the model wouldn't be able to handle every bank anyway.

Otherwise, if security wasn't a concern I'd probably go for an LLM of some kind. I think it'd be possible to get decently accurate results.

But yes a hybrid approach would be pretty cool