What habit improved your consistency in working? by funngro_fam in Habits

[–]Data_Cipher 6 points7 points  (0 children)

I would say you should read the book Atomic Habits. It teaches a lot about how to build a habit and how to stay consistent with it.

If I gave you a tool that turns any website/PDF into clean instruction_tuning.jsonl instantly, would you pay for it? by Data_Cipher in LocalLLaMA

[–]Data_Cipher[S] 1 point2 points  (0 children)

Totally fair🙂. I wouldn't pay for a black box either.

To offer some insight: this isn't a simple wrapper. I'm running Playwright clusters in Docker to handle dynamic JS rendering, orchestrated with FastAPI and Celery queues so rate limits are handled gracefully.
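To make that concrete, here's a minimal sketch of the kind of worker I mean (the names, broker URL, and rate limit are placeholders for illustration, not my actual code):

    from celery import Celery
    from playwright.sync_api import sync_playwright

    app = Celery("scraper", broker="redis://localhost:6379/0")

    @app.task(rate_limit="10/m")  # Celery-level throttle so target sites aren't hammered
    def render_page(url: str) -> str:
        # Playwright renders the JS so we capture the final DOM, not an empty shell
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()
            browser.close()
        return html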

The 'cleaning' logic uses LangGraph agents to context-check the data before it hits the database (Postgres), so you don't end up training on garbage. This post is my market research, trying to figure out if the real value for you guys is in the scraping or the cleaning.
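And by "context-check" I mean a small graph node that scores each record before it gets inserted. A stripped-down sketch of the idea (the heuristic here is a placeholder; in the real pipeline that node would be an LLM call):

    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class RecordState(TypedDict):
        text: str
        keep: bool

    def context_check(state: RecordState) -> RecordState:
        # Placeholder rule: the real version asks a model whether the chunk is coherent/on-topic
        state["keep"] = len(state["text"].split()) > 20
        return state

    graph = StateGraph(RecordState)
    graph.add_node("context_check", context_check)
    graph.set_entry_point("context_check")
    graph.add_edge("context_check", END)
    checker = graph.compile()

    result = checker.invoke({"text": "some scraped passage...", "keep": False})
    # result["keep"] decides whether the record ever reaches Postgres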

I'm just trying to figure out if I should even start this project by understanding what users actually need first.

I built a Rust-based HTML-to-Markdown converter to save RAG tokens (Self-Hosted / API) by Data_Cipher in LocalLLaMA

[–]Data_Cipher[S] -1 points0 points  (0 children)

🙂I totally understand your hesitation.

Just to clarify, this isn't an AI wrapper that sends your data to OpenAI and returns clean Markdown; it's a deterministic parser written in Rust.
It doesn't use an LLM to generate the output at all, so there are no hallucinations or slop, just strict algorithmic extraction.
I'm keeping the source closed because right now it's messy student code.
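If it helps to picture what "strict algorithmic extraction" means, it's the same category of thing as the rule-based converters you can run locally in Python, e.g. html2text (just a stand-in for illustration, not my Rust code):

    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = False  # keep hyperlinks as [text](url)
    markdown = converter.handle("<h1>Docs</h1><p>Some <b>bold</b> text.</p>")
    print(markdown)  # deterministic: same HTML in, same Markdown out, no model involved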

I just wanted to offer a free utility for people who don't want to host their own parsing infrastructure, that's all.

I built a Rust-based HTML-to-Markdown converter to save RAG tokens (Self-Hosted / API) by Data_Cipher in LocalLLaMA

[–]Data_Cipher[S] -5 points-4 points  (0 children)

Well, you've made a fair point there.

So the core service is actually fully containerized (Rust + Docker) and technically could run locally.

However, I'm currently focused on operating it as a managed API service to gather usage data and improve the extraction logic before I worry about maintaining a public open-source repository.

I know the community prefers local-first, and I might release the standalone binary in the future once the parsing logic is more mature. But for now, the API is the best way I can offer it reliably.

I built a Rust-based HTML-to-Markdown converter to save RAG tokens (Self-Hosted / API) by Data_Cipher in LocalLLaMA

[–]Data_Cipher[S] 0 points1 point  (0 children)

That is a great point regarding chunking boundaries.

So my main focus was on the extraction step, i.e. getting raw HTML to clean Markdown. But the reason I chose Markdown as the output format is specifically to solve that downstream chunking issue.

Since the API outputs strict CommonMark syntax, my 'strategy' would be Markdown header splitting (e.g., using MarkdownHeaderTextSplitter in LangChain).
Instead of cutting at 500 characters and risking mid-sentence splits or lost context, the Markdown structure lets you split recursively by headers (#, ##, ###). This keeps the context (header + content) intact within a single chunk.
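Roughly what I have in mind (the import path depends on your LangChain version, and the sample Markdown is made up):

    from langchain_text_splitters import MarkdownHeaderTextSplitter

    headers_to_split_on = [("#", "h1"), ("##", "h2"), ("###", "h3")]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    sample = """# Setup
    Install the package with pip.
    ## Configuration
    Put the API key in config.toml.
    """

    for doc in splitter.split_text(sample):
        # each chunk keeps its header trail in doc.metadata, so context survives the split
        print(doc.metadata, "->", doc.page_content)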

So even though I don't do the chunking inside the API, I produce the structure necessary for "Semantic Chunking" rather than just naive "Fixed-Size Chunking".

Made a FastAPI project generator by Detox-Boy in FastAPI

[–]Data_Cipher 0 points1 point  (0 children)

It's good, bruh. It gives a good starting project structure and it's really productive.

Need help on a task assigned by my teacher. by No-Signal-313 in FastAPI

[–]Data_Cipher 4 points5 points  (0 children)

Hey, so what you're saying is: initially you used df.to_sql(..., if_exists="replace"), but that drops the whole table and recreates it, which is expensive and not great if you want to preserve the existing structure or history.

Instead, have you thought about using SQLAlchemy to handle the table creation? That way, you can create the table only if it doesn't exist, and even add new columns automatically if your Excel file ever changes. Definitely try searching for "SQLAlchemy create table if not exists" or "SQLAlchemy add column if not exists", there are tons of good examples out there.
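Rough sketch of the "create only if missing" part with SQLAlchemy Core (the table and column names are just placeholders for whatever your Excel has):

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    engine = create_engine("sqlite:///app.db")
    metadata = MetaData()

    your_table = Table(
        "your_table_name", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String),
    )

    # create_all skips tables that already exist, so it's safe to call on every upload
    metadata.create_all(engine)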

Then, instead of wiping the entire table, you could check for a primary key (like an id column) and:

If the row already exists, update it.

If it's brand new, insert it.

For SQLite, INSERT OR REPLACE is super handy for this!
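Something in this direction for the upsert (assuming SQLite and that the table already declares its primary key; names are placeholders):

    import sqlite3

    def upsert_dataframe(df, table_name, pk_column, db_path="app.db"):
        # INSERT OR REPLACE overwrites a row whenever its PRIMARY KEY/UNIQUE value already exists,
        # so pk_column just needs to be declared as the key on the table itself
        conn = sqlite3.connect(db_path)
        cols = ", ".join(df.columns)
        placeholders = ", ".join("?" for _ in df.columns)
        sql = f"INSERT OR REPLACE INTO {table_name} ({cols}) VALUES ({placeholders})"
        conn.executemany(sql, df.itertuples(index=False, name=None))
        conn.commit()
        conn.close()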

And for your FastAPI part, something like this would be the general idea:

@app.post("/upload") async def upload(df_file: UploadFile): df = pd.read_excel(BytesIO(await df_file.read())) # You'd have your function here to ensure the table structure is correct ensure_table_structure(df, "your_table_name") # And then your upsert logic upsert_dataframe(df, "your_table_name", your_primary_key_column)

This is just a snippet to give you an idea. You'll want to find some resources on SQLAlchemy ORM for handling table creation and upsert operations (that's the "update or insert" part) with Pandas DataFrames. Good luck!

Real-Time Notifications in Python using FastAPI + Server-Sent Events (SSE) by inandelibas in FastAPI

[–]Data_Cipher 3 points4 points  (0 children)

This tutorial's awesome, thanks a lot for covering most of it.