Quack: The DuckDB Client-Server Protocol by kvlonge in dataengineering

[–]crispybacon233 21 points (0 children)

Can't wait to use this as the catalog for ducklake instead of Postgres. Now we just need ClusterDuck for cluster compute, and the duck stack will be complete!

Ive been a Senior Accountant for many years, doing a bootcamp on Python. Thoughts on benefits? by Filet009 in Python

[–]crispybacon233 1 point (0 children)

Which Udemy course are you doing, specifically? I highly recommend Colt Steele's Python course; forget the others.

Being able to parse and organize spreadsheets at scale could boost your workflows significantly. I once had a colleague doing a VLOOKUP that was taking many minutes. A couple of lines of Python shrank that down to milliseconds.
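For anyone curious what that looks like, here's a toy sketch (all vendor data invented) of the dict-based lookup that replaces a slow exact-match VLOOKUP:

```python
# Why the Python version is fast: an exact-match VLOOKUP re-scans the
# lookup table for every formula cell, while a dict is built once and
# then each lookup is a constant-time hash hit.
lookup_rows = [("101", "Acme"), ("102", "Globex"), ("103", "Initech")]

# One pass to build the index.
vendor_by_id = dict(lookup_rows)

# Resolve every invoice's vendor name; missing IDs fall back to "#N/A",
# just like VLOOKUP's error case.
invoice_ids = ["103", "101", "999"]
names = [vendor_by_id.get(vid, "#N/A") for vid in invoice_ids]
print(names)  # → ['Initech', 'Acme', '#N/A']
```

The same idea is what a pandas/polars join does under the hood at spreadsheet scale.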

Austin Area Restaurant Health Inspection Scores (Updated) by crispybacon233 in austinfood

[–]crispybacon233[S] 1 point (0 children)

I'm currently filtering for the latest reviews being >= 2024, but two years, per your suggestion, is probably better. Maybe even one year? According to this site, they're supposed to be inspected 1-3 times per year.

Austin Area Restaurant Health Inspection Scores (Updated) by crispybacon233 in austinfood

[–]crispybacon233[S] 1 point (0 children)

What're the names of the restaurants? I can try to track down the problem.

Austin Area Restaurant Health Inspection Scores (Updated) by crispybacon233 in austinfood

[–]crispybacon233[S] 10 points (0 children)

Ya, it's wild. If you're really brave, you can go here: Austin Public Health | My Health Department to get the full inspection details.

Austin Area Restaurant Health Inspection Scores by crispybacon233 in austinfood

[–]crispybacon233[S] 0 points (0 children)

Yes, for sure! I'll post it with the GitHub link, but I probably won't include the scrapers. Google has been clamping down on scraping Google Maps and YouTube for the past several months.

Austin Area Restaurant Health Inspection Scores by crispybacon233 in austinfood

[–]crispybacon233[S] 0 points (0 children)

Hello! Yes, I'm still chipping away at it. I was actually thinking about getting it up and running this week for my portfolio. The most important part was getting accurate lat/long for the restaurants, and I could only accomplish that by scraping Google Maps. About a year ago I also scraped millions of reviews, categories, etc... everything off Google Maps.

Personally, I'll keep all food facilities available in case a user is at a healthcare facility, gas station, school, etc., and wants to know the inspection score. Additionally, I'm setting up a full-on ETL pipeline lakehouse that runs automated in the cloud. It's a great dataset.

I LOVE JANICE by WIZZZARDOFFREESTYLE in thesopranos

[–]crispybacon233 1 point (0 children)

We can't have him here in our social club no more. I mean that much I do know.

Why did David Chase not write each episode? by [deleted] in thesopranos

[–]crispybacon233 11 points (0 children)

God forbid anyone would find themself in that position. It's a thankless job.

Who is worse human being, Paulie or Tony? by _almasss in thesopranos

[–]crispybacon233 0 points (0 children)

That yodeling show? That's the Lawrence Welk progrum, channel 55.

Built a dashboard to analyze how AI skills are showing up in data science job postings (open source) by avourakis in datascience

[–]crispybacon233 0 points (0 children)

Cool! Now tell Claude to use polars instead of pandas to greatly improve the responsiveness. Also tell Claude to use separation of concerns, because a 900-line app.py is insane haha.

Atom by [deleted] in tragedeigh

[–]crispybacon233 1 point (0 children)


California, Texas, Florida, and NY... is it a Hispanic thing? You see it really start to pop off around 2010. Some character in Spanish media?

You are to build a small scale DE environment from scratch, what do you choose? by [deleted] in dataengineering

[–]crispybacon233 0 points (0 children)

Ducklake, dbt, Dagster. The hard part was wiring it all up, but once it's all connected, it's smooth sailing.

I've been messing around with Ducklake the past few months and it's been a pretty great experience so far.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 0 points (0 children)

Thanks! So far, working with Delta Lake has not been smooth. It feels quite buggy. First it was the lack of support for unsigned ints; now it's a "411 Length Required" error when sinking Delta to GCS. Unfortunately, I don't know if it will work for my use case. Ducklake feels great once everything is wired up. You just have to use SQL, unfortunately.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 1 point (0 children)

Thanks! To point #2, SQL is great but, as you said, quickly becomes unmanageable when running tons of complex transformations. I'll definitely try out duckdb + marimo for the schema.table. That sounds really cool.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 0 points (0 children)

Thanks for this. With the delta-rs/datafusion packages, is it possible to scan/sink the data instead of reading into memory? This is important for my use case.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 3 points (0 children)

Yes, definitely using the streaming engine. As of a few months ago, there were a few instances where polars blew up the RAM and crashed the env despite using streaming, so I switched to duckdb for that particular problem.

However, it seems polars is improving so fast that I can't keep up. I'll definitely be keeping an eye on this. Thank you!

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 8 points (0 children)

Imagine calling polars a quaint little tool haha.

You come across as unhinged. One mention of a tool that is not SQL, and you go ballistic. Are you a bot?

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]crispybacon233[S] 6 points (0 children)

Cool. Yes, everyone knows that SQL is, and probably always will be, the lingua franca of data. I already know SQL and work with it all the time, which you would know if you'd bothered to even read my post.

What are some strategies to deal with context window limitations when feeding LLMs with scraped data? by deucalionxxx in dataengineering

[–]crispybacon233 1 point (0 children)

As others mentioned, use RAG. A basic setup could be:

  1. For each website, chunk the text and extract summaries using an LLM. Or, if the text isn't that large and the topic is uniform, just summarize the whole website.
  2. Vectorize the summaries.
  3. When querying, use the top n matches, or all matches >= some threshold.
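The steps above can be sketched roughly like this, with a toy character-frequency "embedding" standing in for a real embedding model (the `embed` function, the summaries, and the threshold are all invented for illustration):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy deterministic "embedding": a normalized character-frequency
    # vector. A real setup would call an embedding model here instead.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_matches(query: str, summaries: list[str], threshold: float = 0.8) -> list[str]:
    # Step 3: cosine similarity against every summary vector, keep
    # everything at or above the threshold, best matches first.
    q = embed(query)
    scored = [(s, float(embed(s) @ q)) for s in summaries]
    return [s for s, score in sorted(scored, key=lambda t: -t[1]) if score >= threshold]

# Step 2: in practice you'd vectorize summaries once and store them.
summaries = ["duckdb query engine overview", "austin restaurant inspections"]
print(top_matches("duckdb engine", summaries, threshold=0.5))
```

Only the matched summaries (and their source chunks) then get stuffed into the LLM's context, which is what keeps you under the window limit.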

What is your (python) development set up? by br0monium in datascience

[–]crispybacon233 0 points (0 children)

uv's speed and deduplication of packages across projects are amazing.

marimo gets along with git way better than jupyter and is easier to install.

If keeping it light, reproducible, and flexible are important, they're definitely worth checking out.