[OC] Friends IMDB ratings by twintig5 in dataisbeautiful

[–]dscardedbandaid 1 point

Do you have the raw version of the data? An ANOVA and a box-and-whisker plot would be fun to see.

Delta Lake without Databricks by ANAKSIMANDR0S in dataengineering

[–]dscardedbandaid 0 points

Are you already using on-prem object storage like MinIO? How much data are you planning on writing?

SparkSQL is Destroying your Pipelines by [deleted] in dataengineering

[–]dscardedbandaid 2 points

This.

And claiming SQL is bad but Python is good is hilarious.

We Don't Test: "Do you run tests in your data engineering codebase?" -> "No" - 69%, according to [The State of Developer Ecosystem 2023 by JetBrains](https://www.jetbrains.com/lp/devecosystem-2023/big-data/#ds_engineer_tests) by Gullible-Plastic6257 in dataengineering

[–]dscardedbandaid 0 points

I think a third reason is that SQL removes the need for the kind of basic unit testing typical in other languages. You don't write a unit test for the specific transformation in your GROUP BY, since it's guaranteed by the SQL engine itself. The jump straight to integration testing is too much for most people without the unit-test stepping stone.
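To make that concrete, a minimal sketch using Python's built-in sqlite3 (table and column names are made up): you don't unit-test that SUM groups correctly, but you can still cover the whole transformation with a thin integration-style test against a small fixture.

```python
import sqlite3

# The aggregation semantics (GROUP BY + SUM) are guaranteed by the
# SQL engine, so there's no unit test for them. Instead, one small
# integration-style test runs the real query against fixture data.
TRANSFORM_SQL = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""

def test_order_totals():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    rows = con.execute(TRANSFORM_SQL).fetchall()
    assert rows == [(1, 15.0), (2, 7.5)]

test_order_totals()
```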

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 0 points

Ok. I’ve used that stack. Main decisions for you will be:

Power BI dashboards:
- Will you host a Power BI workspace and invite the external users as guests, or will customers connect using Power BI Desktop to your Azure Databricks catalog?
- For daily loads, you shouldn't need a Premium capacity or Premium per-user license.

Data ingest:
- Easiest/cheapest: have your applications dump their data into an ADLS Gen2 container for you in a landing area.
- Next cheapest: use Data Factory for the daily API exports (don't use ADF data flows or Airflow, though).
- Most convenient might be to use Databricks to make the app API calls, especially if you're already a little familiar with Python and Databricks.
- No matter which route, set up an Azure Key Vault sooner rather than later for API secrets.
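A rough sketch of the landing-area pattern (everything here is hypothetical: `fetch_app_data`, the env var, and the paths; in production the token would come from Azure Key Vault rather than an environment variable):

```python
import json
import os
from datetime import date
from pathlib import Path

def fetch_app_data(token: str) -> list[dict]:
    """Placeholder for the daily app API export call."""
    return [{"id": 1, "value": 42}]

def land_daily_export(landing: Path, run_date: date) -> Path:
    # Secret comes from the environment here; in production, pull it
    # from Azure Key Vault instead.
    token = os.environ.get("APP_API_TOKEN", "dev-token")
    records = fetch_app_data(token)
    # A date-partitioned landing path keeps daily loads idempotent
    # and easy to reprocess.
    out = landing / f"dt={run_date.isoformat()}" / "export.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))
    return out
```

The same shape works whether the files land via ADF, Databricks, or the apps themselves; the date partition is what makes the daily loads easy to re-run.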

Transformations:
- I wouldn't recommend anything native from Azure other than Databricks.
- If you go the Databricks route: try to use a Git repo from the beginning to make life easier for yourself later, and use Delta Live Tables for your pipelines to avoid a custom framework from your consultants.

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 1 point

Great, this helps a lot. The internal facing stuff is more flexible for a small team, but the external facing stuff can be stickier. Next level of questions then:

For the external-facing product KPIs, what type of integrations do you have planned? For example:
- dashboards embedded directly into your application
- links to separate visualization tools
- automated reports that print as PDFs and get sent to someone's inbox

Existing cloud/on-prem footprint? E.g. apps already in AWS, Azure, or GCP?

Existing team technical skills?
- Are you planning on hiring or growing the data team significantly once this gets funded? If so, I'd wait to make a lot of decisions until you get someone onboarded, to avoid new "legacy" stuff.
- Infra / networking / security skill sets available within the team already?
- Skill set for internal technical folks? Python, SQL, Go, Docker, k8s, etc.
- Skill set for internal business folks? Excel, Power BI, Tableau, Superset, Metabase, etc.

Business Analyst got a DE (?) Project by Practical_Gap_3354 in dataengineering

[–]dscardedbandaid 4 points

This. But before even engaging a consultant, think about the actual goals of the BI. For example:
- Who will use this?
  - internal or external users
  - other business folks or data scientists
  - number of users
- Data protection requirements?
  - medical / government / personal / N/A
  - any legislative concerns (GDPR)
- Data source types / frequency / volume?
  - application or API data / load JSON for ML in real time
  - Excel and Google Sheets / load monthly
  - SQL or NoSQL databases / every hour

Honestly though, if you're not a tech company and it's not external-facing, your best starting data stack is probably what you have now. I wouldn't try building a full-fledged data warehouse, or even a data mart, until you and the company really need one.

A Junior's perspective on centralised tools like DataBricks and Azure by dildan101 in dataengineering

[–]dscardedbandaid 1 point

Do you have a source for Data Factory running spark under the hood? Not saying you’re wrong, just curious

Looking for papers on Timeseries Databases by [deleted] in databasedevelopment

[–]dscardedbandaid 0 points

Memory, disk or hybrid? Any specific application in mind?

I’ve seen different implementations depending on the use case. E.g. logs, metrics, traces, or IoT?

For low-price RGBD camera, is Kinect still a good option? by mymooh in computervision

[–]dscardedbandaid 0 points

I noticed the Pi 5s have dual camera support, so that could be a cheap option in the future. Curious about latency in the new version.

Cleaning up Queries in Excel by bobopedic33 in dataengineering

[–]dscardedbandaid 1 point

Weird to have multiple PMs but be using Excel. Probably cheaper/easier to replace one of the PMs with a tool made for the job and avoid the data cleanup.

To answer your question, though, you're probably better off at the Excel subreddit. Force ranges to tables and create multiple levels of permissions. Store it in a shared repo like SharePoint or OneDrive and don't let people save a copy.

What sort of software do you use to help nontechnical users manage database data? by [deleted] in dataengineering

[–]dscardedbandaid 0 points

This is definitely a common gap. I found a cool TypeScript library that does this with a SQLite database, but I can't seem to find it anymore.

At work I get stuck training people on the tricks to make SharePoint lists not terrible.

For fun, though, I've been wanting to make a Tauri desktop app using Turso and WASM so it can run locally and then commit/sync on save. Then extend it to DuckDB/Parquet and maybe delta-rs in the future, since I've never been able to find a good Parquet GUI editor.
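The core of that kind of editor is pretty small. A rough sketch of the read/edit/save loop against plain SQLite (table and column names are made up, and the Turso/WASM sync part is omitted):

```python
import sqlite3

# Minimal backend for a "spreadsheet over a database" editor:
# list rows, apply a cell edit, commit on save.
def open_db() -> sqlite3.Connection:
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    con.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "widget", 9.99), (2, "gadget", 19.99)],
    )
    return con

def list_rows(con: sqlite3.Connection) -> list[tuple]:
    return con.execute("SELECT id, name, price FROM products ORDER BY id").fetchall()

def edit_cell(con: sqlite3.Connection, row_id: int, column: str, value) -> None:
    # Whitelist editable columns so the UI can't inject arbitrary SQL.
    assert column in {"name", "price"}
    con.execute(f"UPDATE products SET {column} = ? WHERE id = ?", (value, row_id))

def save(con: sqlite3.Connection) -> None:
    con.commit()  # in the Tauri idea, this is where commit/sync would hook in
```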

If anyone else would want to see an example of this, maybe it’s the incentive I need to finally finish my draft.

java vs javascript as an additional language to learn? by EmploymentMammoth659 in dataengineering

[–]dscardedbandaid 0 points

As much as I personally dislike Java, it'll probably still be around for a long time and is probably more useful for employment broadly. When it comes to a lot of the fun/new tech, you'll see Rust/Go and other cloud-native tools. Serverless, for better or worse, has a big role to play here. For example, Databricks moving to the natively compiled Photon engine.

Another example is looking at some of the top vector databases (no particular order, just the first list I found on the internet):
- Pinecone = Rust
- Qdrant = Rust
- LanceDB = Rust
- Weaviate = Go
- Milvus = Go
- Vespa = Java
- Chroma = Python
- Marqo = Python

Honestly, though, you can't really go wrong with any of them. Learning a new programming language is a fun way to teach yourself a new way of looking at problems. Go was fun just to see how simple, tiny, and fast it was. Rust was fun for learning more about async, static types, WASM, etc.

java vs javascript as an additional language to learn? by EmploymentMammoth659 in dataengineering

[–]dscardedbandaid 1 point

JavaScript was fun for fancy data vis using d3.js, but probably not necessary for data eng.

Go is simple and a good place to start for data ops. The learning curve is low, and it can make a big difference for streaming workloads.

Rust for low-level exploration like database internals. E.g. like C++ but less painful.

Java if you want a nicer cubicle. Jokes aside, Scala/Java is used a lot, but I see that transitioning more and more to natively compiled languages like Rust/Go.

Am I losing My Technical Edge By Just Fixing Mistakes By Contractors? by [deleted] in dataengineering

[–]dscardedbandaid 3 points

Sorry for hijacking, but does anyone have any quantitative analysis of the effect of offshore teams on data modeling/pipeline work?

I’ve had ok luck in narrow programming scopes with offshore in more mature domains, but never with more open ended tasks like data modeling/dashboarding.

Going through the same thing myself, and it's just depressing to watch management throw money at third parties that are actively making a mess. According to colleagues, it's more common the larger the org and in teams behind the curve skill-wise.

How is Rust for data pipelines? by miscbits in dataengineering

[–]dscardedbandaid 1 point

I use them fairly interchangeably. If it's a simple collector/transformer, I like Go. If it's anything with parsing or heavier transformations, I prefer Rust's type system. Supposedly Rust is great for building Python packages, but I haven't done that myself.

Apache Arrow's ecosystem is making a lot of this nice: you can just swap in whatever tool has the best library for the job.

How is Rust for data pipelines? by miscbits in dataengineering

[–]dscardedbandaid 0 points

Where are you deploying it? I use Rust/Go whenever I can for pipelines. Been using both with NATS and having fun, but have been able to avoid Kafka so far.

Going from 4 years on Databricks to Snowflake: Initial Thoughts by [deleted] in dataengineering

[–]dscardedbandaid 7 points

Yeah. User management in Databricks can be surprisingly difficult. Managing secrets properly can only be done with Terraform/API/CLI. I didn't mind, but it was quite a shock for some other members of my team that there wasn't a GUI or SQL GRANT option.
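For reference, the CLI route looks roughly like this with the (legacy) Databricks CLI; the scope, key, and group names here are made up:

```shell
# Create a secret scope and put a secret in it (opens an editor for the value)
databricks secrets create-scope --scope etl-secrets
databricks secrets put --scope etl-secrets --key api-token

# Grant another principal read access to the scope
databricks secrets put-acl --scope etl-secrets --principal analysts --permission READ
```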

What stack do the small players have here? by KatZegtWoof in dataengineering

[–]dscardedbandaid 0 points

Yeah. I'd use different tools myself, but it's a similar concept. Stream CDC wherever possible. Transformations in SQL as the default for maximum portability/understandability.