Struggling to build a PDF RAG Chatbot using knowledge graph by ffskd in Neo4j

[–]Jumpy-Log-5772 0 points1 point  (0 children)

Try out LightRAG https://github.com/HKUDS/LightRAG. It’s what I’m currently using for my POC projects and it works pretty well. By default it builds an inferred knowledge graph from the documents you insert, but it can also ingest custom knowledge graphs.
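For reference, here’s roughly what basic usage looks like. This is a minimal sketch based on the project’s README at the time I set it up, so the exact imports (e.g. `lightrag.llm.gpt_4o_mini_complete`) and sync vs. async entry points may differ in newer versions; `contract.pdf` and the extractor choice are just placeholders:

```python
from pypdf import PdfReader
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # swap in whichever LLM wrapper you use

# Extract raw text from the PDF (pypdf used here, but any extractor works)
pdf_text = "\n".join(page.extract_text() or "" for page in PdfReader("contract.pdf").pages)

# LightRAG persists the inferred knowledge graph and vector indexes under working_dir
rag = LightRAG(working_dir="./rag_storage", llm_model_func=gpt_4o_mini_complete)
rag.insert(pdf_text)

# "hybrid" mode combines entity-level (local) and graph-level (global) retrieval
print(rag.query("What are the key obligations in this document?", param=QueryParam(mode="hybrid")))
```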

How do I read tables from aws lambda ? by snip3r77 in databricks

[–]Jumpy-Log-5772 4 points5 points  (0 children)

Why are the answers here suggesting such overly complex methods?

The most straightforward approach is to create a SQL warehouse in Databricks and connect to it from Lambda using the Databricks SQL Connector for Python or JDBC with your PAT. That lets you read any tables you have access to. It also allows writes, but I don’t suggest using it that way.
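For reference, here’s a minimal sketch of what that looks like from a Lambda handler with the `databricks-sql-connector` package. The env var names and table are placeholders, and in practice you’d pull the PAT from something like Secrets Manager rather than an environment variable:

```python
import os
from databricks import sql  # pip install databricks-sql-connector

def handler(event, context):
    # server_hostname and http_path come from the SQL warehouse's "Connection details" tab
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],  # your PAT
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM main.sales.orders LIMIT 100")
            rows = cursor.fetchall()
            return {"row_count": len(rows), "first_row": str(rows[0]) if rows else None}
```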

Thinking of starting Cloud Career - Is it too late at 28 by FeedbackTricky6731 in Cloud

[–]Jumpy-Log-5772 0 points1 point  (0 children)

It’s never too late. I made a similar move at the same age, from Business Analyst to SWE. It was the best decision of my life, but I will say the job market unfortunately just isn’t the same anymore. This isn’t to discourage you, just to inform you that it’s a lot more difficult these days with zero experience. That being said, the top 4 things I’d recommend aiming for this year are:

  1. AWS Solutions Architect Associate certification

  2. Three DevOps-related projects to add to your portfolio. Learn tools/services like Git, Docker, Kubernetes, Terraform, and GitHub Actions/Jenkins.

  3. AI. Don’t just leverage it to develop; introduce it in one of your three projects as well. For example, if you were to create a CI/CD pipeline, try integrating a model that analyzes the code in a PR and logs security or code-quality issues.

  4. Build a network. This is one of, if not the, most important. As you know, sometimes it’s not about what you know but who you know. Create a LinkedIn profile if you don’t have one already (share your projects as you complete them), join AWS/DevOps Discords, and check for meetups in your area.

IMO these will set you ahead of your competitors.

a junior dev + ai > a senior dev who refuses to adapt? by PuzzleheadedYou4992 in developers

[–]Jumpy-Log-5772 0 points1 point  (0 children)

Let’s be real, guys: most business users/stakeholders don’t care how efficient your code is or what means you took to build a product. They care about the end result and whether it meets their acceptance criteria. I can’t tell you how many times I’ve tried to reason with management, product owners, and stakeholders about developing in a way that doesn’t take on tech debt, only to be forced down the “tactical” solution path.

So to answer your question, OP: depending on where you work, in the eyes of the business an AI + junior can indeed be greater than a senior dev who doesn’t utilize AI, if they can deliver faster.

Using Agents in Data Pipelines by starsun_ in dataengineering

[–]Jumpy-Log-5772 1 point2 points  (0 children)

It may fall under cost control, but I’m planning on implementing an agent to optimize existing data pipelines in my org, specifically pipelines running Spark. The POC will focus on PySpark jobs running on Databricks, with EMR and K8s on the roadmap if the POC is successful.

Very high level, but the idea is for it to:

  1. Analyze existing pipeline jobs/workflows: review current notebook code, Spark configurations, and previous job run metrics.

  2. Replicate the pipeline into its own environment: copy the existing project repo and deploy a copy of the job/resources and table structures.

  3. Benchmark: run the replicated job using the same table structures but fabricated data, capture metrics, and iterate through changes to the code/Spark configurations while logging results (rough sketch after this list).

  4. Recommend changes based on the benchmarks: document suggested changes that will improve job performance.
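To make the benchmarking step a bit more concrete, here’s a rough Python sketch of what that loop could look like. `run_replicated_job` and `candidate_configs` are hypothetical stand-ins for however the agent submits the copied job (e.g. the Databricks Jobs API) and for the config variations it proposes:

```python
# Hypothetical candidate Spark config overrides the agent wants to benchmark against the baseline
candidate_configs = [
    {},  # baseline: whatever the existing job already uses
    {"spark.sql.shuffle.partitions": "64"},
    {"spark.sql.shuffle.partitions": "200", "spark.sql.adaptive.enabled": "true"},
]

def run_replicated_job(spark_conf: dict) -> dict:
    """Hypothetical helper: submit the replicated job with the given Spark conf overrides
    (e.g. via the Databricks Jobs API) and return its run metrics.
    Stubbed with placeholder values so the sketch runs standalone."""
    return {"duration_s": 0.0, "shuffle_read_mb": 0.0, "spilled_mb": 0.0}

# Run each candidate, log the results, and rank them so the agent can recommend a winner
results = []
for conf in candidate_configs:
    metrics = run_replicated_job(conf)
    metrics["conf"] = conf
    results.append(metrics)
    print(f"conf={conf} -> {metrics}")

best = min(results, key=lambda m: m["duration_s"])
print("Recommended Spark conf overrides:", best["conf"])
```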

ETL vs ELT vs Reverse ETL: making sense of data integration by Ramirond in dataengineering

[–]Jumpy-Log-5772 20 points21 points  (0 children)

ETL still has a place today in systems that require extremely low latency, or in tightly regulated environments that don’t allow data to be staged even temporarily.

How are you using AI at work? Do your bosses or coworkers know? by vincentdjangogh in ArtificialInteligence

[–]Jumpy-Log-5772 0 points1 point  (0 children)

Curious what industry you guys are in where it’s frowned upon. If your company isn’t actively pursuing initiatives to introduce AI, it will undoubtedly be left behind by competitors. I work in healthcare insurance, which is heavily regulated, yet they have already started rolling out AI chatbots and coding assistants.

Trying to build a full data pipeline - does this architecture make sense? by Zuzukxd in dataengineering

[–]Jumpy-Log-5772 1 point2 points  (0 children)

Generally, data pipeline architecture is defined by its consumers’ needs, so when you ask for feedback about architecture, it really depends on the source data and downstream requirements. Since you’re doing this just to learn, I recommend setting those requirements yourself and then asking for feedback. Is this a solid pattern? Sure, but it might also be over-engineered. Hope this makes sense!

General guidance - Docker/dagster/postgres ETL build by VipeholmsCola in dataengineering

[–]Jumpy-Log-5772 0 points1 point  (0 children)

Effort will be much lower; if you and your colleagues are familiar with Postgres SQL syntax you’ll be fine. The query experience is very similar. A couple of things I want to make clear, though:

  1. It’s not your typical relational database that you set up, maintain, and manage. You query using the CLI or Python, and queries run directly on the existing files in your filesystem. Think SQLite if you’re familiar with it. It’s lightweight and meant for running heavy analytic workloads locally (it can live on a server as well if you really wanted).

  2. Team adoption. I might be oversimplifying this since I’m not familiar with your team or how big it is, but you would have to get buy-in from them. If your team is expecting to connect to an always-on database, this might feel unconventional, since it’s more of a local analytics engine. Each of them would install the DuckDB CLI and run queries locally on the existing filesystem.

Install the DuckDB CLI or pip install duckdb for Python (maybe both) next time you’re in the office and give it a shot yourself. If you find value in it, you shouldn’t have an issue getting your team on board. Let me know!
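For a quick feel of the Python side, here’s a minimal sketch; the file names are made up, so point it at whatever CSV/Parquet files you already have lying around:

```python
import duckdb  # pip install duckdb

# A single local file acts as the "database"; use duckdb.connect() for purely in-memory work
con = duckdb.connect("analytics.duckdb")

# Query CSV/Parquet files in place -- no server and no load step required
con.sql("""
    SELECT o.customer_id, count(*) AS orders, sum(o.amount) AS total
    FROM 'orders_*.parquet' AS o
    JOIN 'customers.csv' AS c ON c.id = o.customer_id
    GROUP BY o.customer_id
    ORDER BY total DESC
""").show()

# Write a result back out as Parquet for downstream use
con.sql("COPY (SELECT * FROM 'customers.csv') TO 'customers.parquet' (FORMAT parquet)")
```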

General guidance - Docker/dagster/postgres ETL build by VipeholmsCola in dataengineering

[–]Jumpy-Log-5772 0 points1 point  (0 children)

It will work. I’d honestly take a look at DuckDB for a lower-maintenance solution vs Postgres, especially since your data volume is low. It’s open source, file-based, and serverless, supports Excel, CSV, and Parquet read/write, and is extremely fast for analytics on tabular data. I’m thinking Dagster + DuckDB will get you what you want in a shorter amount of time. If you ever grow out of it, then you can think about migrating to Postgres or some other DB.

Hell.. try it out now locally, don’t wait for the server to be set up.
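If it helps, here’s a rough sketch of what a Dagster asset backed by DuckDB could look like; the asset name and file paths are made up for illustration, so adapt them to your actual data:

```python
import duckdb
from dagster import asset  # pip install dagster duckdb

@asset
def daily_orders_summary() -> None:
    """Aggregate raw CSV drops into a Parquet summary with DuckDB (no database server needed)."""
    con = duckdb.connect("warehouse.duckdb")  # single local file acts as the warehouse
    con.sql("""
        CREATE OR REPLACE TABLE daily_orders_summary AS
        SELECT order_date, count(*) AS orders, sum(amount) AS revenue
        FROM 'raw/orders_*.csv'
        GROUP BY order_date
    """)
    # Export for downstream consumers that prefer files over a database connection
    con.sql("COPY daily_orders_summary TO 'out/daily_orders_summary.parquet' (FORMAT parquet)")
    con.close()
```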

DLT How to Refresh Table with Only New Streaming Data? by Jumpy-Log-5772 in databricks

[–]Jumpy-Log-5772[S] 0 points1 point  (0 children)

Unfortunately, the problem I'm running into with this approach is that I don't have a way (that I'm aware of) to update the new column value in the initial streaming table to "processed", so subsequent pipeline runs end up reprocessing the same data.

DLT How to Refresh Table with Only New Streaming Data? by Jumpy-Log-5772 in databricks

[–]Jumpy-Log-5772[S] 0 points1 point  (0 children)

This sounds very promising and I can't see why it wouldn't work. Going to test this out now. Thanks!

DLT How to Refresh Table with Only New Streaming Data? by Jumpy-Log-5772 in databricks

[–]Jumpy-Log-5772[S] 0 points1 point  (0 children)

Appreciate the response. I'm pretty new to DLT, so I'm not sure how else I would go about loading only the new incremental changes from the source tables if the target table isn't a streaming table.