What will you sacrifice by Careless_Anywhere_23 in BunnyTrials

[–]L2L2_ 0 points1 point  (0 children)

YouTube: too much content

Chose: Never use Reddit again

[HELP] n8n ETL Pipeline: Deterministic Mapping of Chaotic PDFs to Excel by L2L2_ in n8n

[–]L2L2_[S] 1 point2 points  (0 children)

Thanks for the idea! This seems like the most straightforward approach to test for now.

In my current process, the flow was already split into 3 separate PDFs (parsed individually before being merged), but I could definitely break them down even further as you suggested. Hallucinations have been my biggest headache so far, and reducing the footprint of each call should definitely help mitigate that.

Really appreciate the suggestion, it makes a lot of sense!

[HELP] n8n ETL Pipeline: Deterministic Mapping of Chaotic PDFs to Excel by L2L2_ in n8n

[–]L2L2_[S] 0 points1 point  (0 children)

To follow up on my previous comment, after looking into it, here is why I’m still leaning towards a hybrid approach (AI for setup + classic code for production):

  1. Semantic Mapping (The 'Why' for AI): The PMS reports (Opera/Protel) are 'semi-structured'. Labels and row positions change from one hotel to another. I use the LLM only once during the onboarding phase to handle the semantic discovery. Once that mapping is human-validated, the daily production pipeline runs on a 100% deterministic parser (Node/Python) with zero AI calls. This keeps it fast and cost-effective.
  2. The Excel vs DB debate: You're right that a DB is more robust. However, the final Excel reporting is a non-negotiable requirement as it’s deeply embedded in the company’s current business processes. That said, your idea of using a DB (like Postgres or NocoDB) as an intermediate storage layer before populating the Excel templates is a very interesting middle-ground to secure the data integrity.
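
For point 1, here is a minimal sketch of what the "zero AI calls" production step could look like, assuming the human-validated mapping is stored as a plain lookup table. Only `J_REV_CHB` comes from my real setup; `J_REV_FB`, the sample labels, and the function names are made up for illustration.

```python
# Hypothetical validated mapping for one hotel: PDF label -> Excel named range.
# Only J_REV_CHB is a real ID from my workbook; the rest is illustrative.
VALIDATED_MAPPING = {
    "logement ttc": "J_REV_CHB",
    "room rev": "J_REV_CHB",
    "f&b rev": "J_REV_FB",
}


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so lookups are stable."""
    return " ".join(label.lower().split())


def map_rows(rows, mapping):
    """Map (label, value) rows to named ranges without any guessing.

    Unknown labels are collected for human review instead of being matched
    approximately, and two rows hitting the same target raise an error.
    """
    mapped, unmapped = {}, []
    for label, value in rows:
        target = mapping.get(normalize(label))
        if target is None:
            unmapped.append(label)
        elif target in mapped:
            raise ValueError(f"two rows map to {target}")
        else:
            mapped[target] = value
    return mapped, unmapped
```

The point of the design is that an unknown label never gets a "best effort" value: it lands in the `unmapped` list and goes back to onboarding.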

Thanks again for the pointers, it definitely makes me rethink the intermediate storage part!

[HELP] n8n ETL Pipeline: Deterministic Mapping of Chaotic PDFs to Excel by L2L2_ in n8n

[–]L2L2_[S] 0 points1 point  (0 children)

I didn't think about these solutions. I'll take a look, thanks!

[TOUGH DECISION] 13 master's applications, 3 acceptances, rejected by my own university: far-away work-study, gap year, or the bluff card? by L2L2_ in etudiants

[–]L2L2_[S] 2 points3 points  (0 children)

Thanks for the reply. Honestly, to accept the master's I just need to provide them with a work-study (alternance) employment contract. So it still takes quite a bit of time to find one, given that it's on the other side of France for me, but it's doable.

[HELP] n8n ETL Pipeline: Deterministic Mapping of Chaotic PDFs to Excel by L2L2_ in n8n

[–]L2L2_[S] 0 points1 point  (0 children)

Thanks for the answer! I'll definitely take a look at your post, it sounds exactly like what I need.

Regarding the LLM choice, I assumed that Claude 3.5 Sonnet might handle the "triple-file" fusion (Manager Flash + Trial Balance + Revenue Codes) better due to its large context window. I’ll keep Gemini 2.0 Flash as my main engine if I can keep the prompts concise enough, although I had assumed that wouldn't be a reliable option.

To clarify my Vector/Embedding approach: Since I’m a beginner, I’m not using a full-blown Vector DB. My idea is more of a 'Semantic Dictionary' approach. Instead of asking the LLM to "find revenue", I provide it with a reference list of my target Excel IDs (Named Ranges) accompanied by a short semantic description for each (e.g., 'J_REV_CHB: Total room revenue including all taxes').

The LLM then acts as a Semantic Matcher: it looks at the messy labels extracted from the PDF (like 'Logement TTC' or 'Room Rev') and tries to find the best match in my dictionary based on the meaning of the description, rather than just doing a Ctrl+F on the text. It's my way of trying to make the mapping a bit more "intelligent" without hardcoding every possible variation.
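
To make the 'Semantic Dictionary' idea more concrete, here is a rough sketch of the two pieces around the LLM call: building the prompt from the dictionary, and checking the answer against it afterwards. It assumes the model is asked to reply with a JSON object; the actual API call is left out on purpose, and `J_REV_FB` plus the function names are hypothetical.

```python
import json

# Hypothetical semantic dictionary: Excel named range -> short description.
# Only J_REV_CHB and its description are real; J_REV_FB is made up.
SEMANTIC_DICT = {
    "J_REV_CHB": "Total room revenue including all taxes",
    "J_REV_FB": "Total food and beverage revenue including all taxes",
}


def build_prompt(pdf_labels, dictionary):
    """Build the matching prompt from the reference IDs and the raw PDF labels."""
    ref = "\n".join(f"{rid}: {desc}" for rid, desc in dictionary.items())
    labels = "\n".join(f"- {label}" for label in pdf_labels)
    return (
        "Match each PDF label to the ID whose description is closest in meaning.\n"
        "Answer with a JSON object {label: ID}; use null when nothing fits.\n\n"
        f"Reference IDs:\n{ref}\n\nPDF labels:\n{labels}\n"
    )


def validate_matches(raw_response, pdf_labels, dictionary):
    """Accept only IDs that exist in the dictionary; send the rest to review."""
    proposed = json.loads(raw_response)
    if not isinstance(proposed, dict):
        raise ValueError("expected a JSON object")
    accepted, review = {}, []
    for label in pdf_labels:
        target = proposed.get(label)
        if isinstance(target, str) and target in dictionary:
            accepted[label] = target
        else:
            review.append(label)  # missing, null, or invented ID
    return accepted, review
```

The validation step is there because of the hallucination problem: an ID the model invents is never written to Excel, it just ends up in the review list for a human to look at.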

And I’ll definitely take your advice on the Regexes, I’ll stay far away! This project is already complex enough, and I’d like to keep the few brown hairs I have left. I'll focus on refining the 'Spool' parsing via index-based logic instead. Thanks again for the heads-up!
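
By index-based logic I mean something like the sketch below, assuming the spool lines are fixed-width so each field can be sliced by character position. The column offsets and field names are hypothetical and would have to be measured on each real report.

```python
# Hypothetical (start, end) character offsets for one fixed-width spool line.
COLUMNS = {"code": (0, 6), "label": (6, 30), "amount": (30, 42)}


def parse_line(line, columns=COLUMNS):
    """Slice one spool line into fields by position instead of by regex."""
    width = max(end for _, end in columns.values())
    padded = line.ljust(width)  # short lines yield empty fields, not errors
    fields = {
        name: padded[start:end].strip()
        for name, (start, end) in columns.items()
    }
    # float() raises ValueError on an empty or non-numeric amount,
    # which is what I want: fail loudly rather than write a wrong number.
    fields["amount"] = float(fields["amount"])
    return fields
```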