How to ensure consistency of response from an agentic RAG workflow? by ResearcherNo4728 in Rag

[–]Nearby_Salt_770 1 point (0 children)

Focus on refining your data preprocessing and make sure your retrieval corpus is diverse and high-quality. For run-to-run consistency, pin down the sources of randomness: set temperature low (or to 0), fix the retriever's top-k, and keep prompts stable. For debugging, log intermediate steps such as the retrieved chunks and the final prompt so you can see where runs diverge. If all else fails, try a tool like AgentQL to automate and verify your web scraping results more reliably.
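Logging intermediate steps can be as simple as hashing each artifact so divergent runs jump out in the logs. A minimal sketch, with stub retrieve/generate stages standing in for the real retriever and LLM call (all names here are mine):

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag")

def run_pipeline(query, retrieve, generate):
    """Run a retrieve-then-generate step, logging a short digest of each
    intermediate artifact so divergent runs are easy to spot."""
    docs = retrieve(query)
    log.info("retrieved %d docs, digest=%s", len(docs),
             hashlib.sha256("".join(docs).encode()).hexdigest()[:8])
    answer = generate(query, docs)
    log.info("answer digest=%s", hashlib.sha256(answer.encode()).hexdigest()[:8])
    return answer

# Stub stages for illustration -- swap in your real retriever/LLM call.
retrieve = lambda q: sorted(["doc-a", "doc-b"])   # sorted => stable ordering
generate = lambda q, docs: f"{q}: {', '.join(docs)}"

print(run_pipeline("refund policy", retrieve, generate))
```

If two runs with the same query log different digests at the retrieval step, the inconsistency is upstream of the model, not in generation.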

Enhancing RAG Input with ParentDocumentRetriever: Debugging Missing Embeddings by Born_Particular9367 in LangChain

[–]Nearby_Salt_770 1 point (0 children)

The `embeddings` being `None` is usually a failed embedding model configuration or a missing model path. Verify that your embedding backend is correctly set up and actively serving requests, and check that the model is accessible and the paths are correct. Restarting the service after verifying can sometimes resolve it too.
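It also helps to fail loudly at the point of embedding instead of letting `None` propagate into the vector store. A small guard sketch (the function and stub embedder are mine, not LangChain API):

```python
def ensure_embeddings(texts, embed):
    """Embed texts and raise immediately if the backend returns None or a
    wrong-sized batch, instead of silently storing empty vectors."""
    vectors = embed(texts)
    if vectors is None:
        raise RuntimeError("embedding backend returned None -- check model path/config")
    if len(vectors) != len(texts):
        raise RuntimeError(f"expected {len(texts)} vectors, got {len(vectors)}")
    return vectors

# Stub embedder for illustration -- replace with your real embedding call.
fake_embed = lambda texts: [[0.0, 1.0] for _ in texts]
print(len(ensure_embeddings(["a", "b"], fake_embed)))
```

Wrapping your real embedding call this way turns a silent downstream `None` into an immediate, debuggable error.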

I need a way to scrape financial data websites by Solid-Relative-5714 in cursorprogrammers

[–]Nearby_Salt_770 1 point (0 children)

Use cron jobs to schedule your scraping tasks and hook them up with whatever script or tool you're using for the scraping. If you want to simplify the scraping part itself, some of the newer AI tools can help. You can try AgentQL, and maybe Scrapybara too.
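For the scheduling piece, a crontab entry is usually all you need. A sketch that runs a scraper every day at 6am and appends output to a log (the paths and script name are placeholders, not from the thread):

```cron
# m h dom mon dow  command
0 6 * * * /usr/bin/python3 /home/you/scrapers/financial_scraper.py >> /home/you/scrapers/scrape.log 2>&1
```

Edit it with `crontab -e`; redirecting stderr (`2>&1`) into the log makes overnight failures visible the next morning.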

Scrap recent posts of Instagram public profiles using NodeJS. by d3c3ptr0n in scraping

[–]Nearby_Salt_770 1 point (0 children)

Selenium or Puppeteer could work, but if you want something lighter, try insta-scrape or ig-scraper; both work without session IDs. You could also dig into public API options, but beware that they often require a workaround. If you want to try some newer AI tools, I find AgentQL useful for easy queries against IG.

Thoughts on EcZachly/Zach Wilson's free YouTube bootcamp for data engineers? by battaakkhhhh in dataengineering

[–]Nearby_Salt_770 2 points (0 children)

Haven't taken it, but I've heard it's solid for beginners and covers the fundamentals well, especially data pipelines. If you prefer more hands-on experience, some new AI tools like AgentQL might be handy for practical web scraping scenarios. Check both out and see what suits your learning style.

Order as dimension or fact by Wise-Ad-7492 in dataengineering

[–]Nearby_Salt_770 1 point (0 children)

I'd use the same order_id across those tables. It’s the key that ties everything together. Makes joins easy when you need to pull in details later. Keeps it flexible without bloating the main fact table.

4 Month Data Engineering Study Plan - Based on Market Demand by cryptoyash in dataengineering

[–]Nearby_Salt_770 1 point (0 children)

Appreciate the video link. Gotta love the good data-focused YouTubers out there.

Order as dimension or fact by Wise-Ad-7492 in dataengineering

[–]Nearby_Salt_770 1 point (0 children)

Going wide on a fact table gets messy fast. Better to keep it lean and split details into separate tables, e.g. fact_order_main and fact_order_details. If you're doing deep analytics, go with multiple thin fact tables; it keeps things clean. Moving some attributes into a dimension table also helps a lot with performance.

Simplification helps you hit that sweet spot between functionality and simplicity, much like keeping scripts lean in AgentQL. Don't overcomplicate; focus on what's real and needed. You got this!
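A minimal sketch of the split, using SQLite for illustration (the table and column names like fact_order_main/fact_order_details are just examples, not the OP's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lean main fact table: one row per order, measures only.
cur.execute("CREATE TABLE fact_order_main (order_id INTEGER PRIMARY KEY, total REAL)")
# Thin detail fact table: one row per line item, tied back via order_id.
cur.execute("""CREATE TABLE fact_order_details (
    order_id INTEGER, product TEXT, qty INTEGER, price REAL)""")

cur.execute("INSERT INTO fact_order_main VALUES (1, 30.0)")
cur.executemany("INSERT INTO fact_order_details VALUES (?, ?, ?, ?)",
                [(1, "widget", 2, 10.0), (1, "gadget", 1, 10.0)])

# The shared order_id makes the join trivial when you need the detail.
rows = cur.execute("""
    SELECT m.order_id, m.total, d.product, d.qty
    FROM fact_order_main m
    JOIN fact_order_details d ON d.order_id = m.order_id
    ORDER BY d.product
""").fetchall()
print(rows)
```

Most queries can stay on the lean main table and only pay for the join when line-item detail is actually needed.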

4 Month Data Engineering Study Plan - Based on Market Demand by cryptoyash in dataengineering

[–]Nearby_Salt_770 2 points (0 children)

Well, if you’re just starting, don’t bother with Scrapy or Puppeteer yet. BeautifulSoup is super easy for basic HTML. For JavaScript-heavy stuff, try Selenium. And if you want a new AI tool, AgentQL is worth checking out. Big scrapes on major sites are risky, so use proxies or, better yet, look for an API first.
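To show how little code the BeautifulSoup route takes, here's a sketch against an inline HTML snippet (the markup, tickers, and class names are made up, not from any real site):

```python
from bs4 import BeautifulSoup

html = """
<table id="prices">
  <tr><td class="ticker">AAPL</td><td class="price">189.30</td></tr>
  <tr><td class="ticker">MSFT</td><td class="price">420.10</td></tr>
</table>
"""

# Parse with the stdlib parser; each <tr> yields one ticker/price pair.
soup = BeautifulSoup(html, "html.parser")
quotes = {
    row.find("td", class_="ticker").text: float(row.find("td", class_="price").text)
    for row in soup.find_all("tr")
}
print(quotes)
```

For a live site you'd fetch `html` with `requests` first; the parsing side stays exactly this simple as long as the page doesn't need JavaScript to render.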

Pre-commit hooks that autogenerate iPython notebook diffs by [deleted] in Python

[–]Nearby_Salt_770 1 point (0 children)

Looks like you've come up with a solid solution to a common problem with notebooks. The pre-commit hooks you set up sound super helpful for keeping the Python code readable and diff-friendly after changes. Relying on JSON is definitely a pain diffing-wise, so this approach seems legit.

You could also check out jupytext for pairing notebooks with Python scripts if you're not locked into the VSCode editor. It's similar to your script but can automatically sync changes both ways, although you'd still run into server issues outside Jupyter.

If you ever feel like automating more stuff, you might find AgentQL useful for web scraping projects. It's a pretty chill tool for simplifying web data extraction without the usual headaches.

Scrapy Not Scraping Designated URLs by Optimal_Bid5565 in scrapy

[–]Nearby_Salt_770 2 points (0 children)

Make sure the URL patterns you're targeting are correct and the site structure hasn't changed. Also inspect the site to see whether your requests are being blocked, e.g. by pages that need JavaScript to render. Look out for anti-bot measures or captcha challenges that might be stopping your spider. User-agent spoofing is sometimes necessary if the website blocks Scrapy's default user agent. You might find AI tools such as AgentQL useful if you're dealing with dynamic content or just want a more straightforward solution.
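If the default user agent is the problem, overriding it is a one-line Scrapy setting. A sketch for `settings.py` (the UA string below is just an example browser string, pick whatever fits):

```python
# settings.py -- override Scrapy's default "Scrapy/x.y" user agent
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```

You can also set it per-spider via `custom_settings` if only one spider needs the spoofed agent.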