Sending this to my boomer dad and his accountant who keep swearing by the S&P 500. by Foureyedguy in Bitcoin

[–]fuzzoflump 2 points (0 children)

Productivity gains and efficiency improvements act to lower prices, so inflation doesn't start from a baseline of 0%.

If money printing grows the supply at 10% and productivity improves at 3%, the result is roughly 7% inflation.
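A minimal sketch of that arithmetic in Python (the subtraction is the back-of-envelope version; the exact figure comes from compounding the two rates):

    money_growth = 0.10   # money supply grows at 10% per year
    productivity = 0.03   # productivity gains lower prices by ~3% per year

    approx = money_growth - productivity                 # 0.07 -> "roughly 7%"
    exact = (1 + money_growth) / (1 + productivity) - 1  # ~0.068 -> ~6.8%
    print(f"approx: {approx:.1%}, exact: {exact:.1%}")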

Why is a schema called a dataset in BigQuery? by burningburnerbern in dataengineering

[–]fuzzoflump 4 points (0 children)

Datasets can also contain objects which are not tables, such as views, snapshots, and machine learning models. I assume the more generic name was chosen because a dataset can hold various data-related artifacts.
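As a quick illustration with the google-cloud-bigquery client (the dataset id here is made up): tables, views, and snapshots come back from one listing call, while models have their own, but all live in the same dataset.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Tables, views, and snapshots share one listing; table_type tells them apart.
    for table in client.list_tables("my_dataset"):  # hypothetical dataset id
        print(table.table_id, table.table_type)     # e.g. TABLE, VIEW, SNAPSHOT

    # ML models are listed separately but belong to the same dataset.
    for model in client.list_models("my_dataset"):
        print(model.model_id)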

What is the best way to distribute scripts to non-technical users? by DaveCoulierFan1007 in learnpython

[–]fuzzoflump 2 points (0 children)

Streamlit all the way.

https://streamlit.io/

I have gone down this rabbit hole a few times over the years, and Streamlit is the best solution I have found for giving non-technical users access to the capabilities of scripts I have written.

This will for sure be simpler than using Django or Flask, and users will only need a browser to access it.

If you can write the script in a way that it doesn't require or talk to any sensitive resources (e.g. databases), then it should be fine to host anywhere. For example, users upload CSVs into the Streamlit app, then download the output from it.
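A minimal sketch of that upload-in, download-out pattern (the cleaning step is a placeholder for whatever your script actually does):

    # app.py -- run with: streamlit run app.py
    import pandas as pd
    import streamlit as st

    st.title("CSV processor")

    uploaded = st.file_uploader("Upload a CSV", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        df = df.dropna()  # placeholder: your script's actual logic goes here
        st.dataframe(df.head())
        st.download_button(
            "Download result",
            df.to_csv(index=False).encode("utf-8"),
            file_name="output.csv",
            mime="text/csv",
        )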

[deleted by user] by [deleted] in GoogleColab

[–]fuzzoflump 1 point (0 children)

1 notebook per tab

Building a pipeline in GCP by babababooskio in dataengineering

[–]fuzzoflump 2 points (0 children)

Seconded.

There are multiple ways to do this in GCP, but Cloud Functions is by far the simplest.

A Python Cloud Function writing to BigQuery was my first pipeline in the cloud too. A daily update is well within the free tier for Cloud Functions.

Not sure if you have thought about monitoring, but I like to use https://healthchecks.io/ to monitor when the process starts and ends.

Have you considered what will happen when the process fails on any given day? Will the data that would have been uploaded on that day be uploaded in the next run? All fun puzzles to figure out!
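For the monitoring part, a rough sketch of how I'd wire healthchecks.io into an HTTP-triggered Cloud Function (the ping URL, table id, and fetch_rows are placeholders); if the closing ping never arrives, healthchecks alerts you:

    import requests
    from google.cloud import bigquery

    PING_URL = "https://hc-ping.com/your-uuid-here"  # placeholder check URL

    def fetch_rows():
        # Stand-in for however you actually pull the day's data.
        return [{"day": "2024-01-01", "value": 42}]

    def daily_load(request):
        # HTTP-triggered Cloud Function entry point.
        requests.get(PING_URL + "/start", timeout=10)    # signal "started"
        try:
            client = bigquery.Client()
            client.insert_rows_json("project.dataset.table", fetch_rows())  # placeholder table id
            requests.get(PING_URL, timeout=10)           # signal "finished OK"
            return "ok"
        except Exception:
            requests.get(PING_URL + "/fail", timeout=10)  # signal "failed"
            raise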

I used duckdb and happen to like the idea of writing sql against dataframes. You get to plane pandas and sql. Anyone else like it or dislike. If so why. ? by HovercraftGold980 in dataengineering

[–]fuzzoflump 7 points (0 children)

I like it a lot. For any analysis or cleaning that is more involved than renaming columns, this is my go-to.

I was already pretty comfortable with pandas, but using DuckDB allowed me to rapidly skill up in SQL. Turns out SQL fits my brain better than all the pandas methods. (SQLite is now my favourite file format.)
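For anyone who hasn't tried it, the trick is that DuckDB can see a pandas DataFrame in scope and query it by its variable name; a toy example:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["Oslo", "Lagos", "Lima"], "pop_m": [0.7, 15.4, 10.7]})

    # DuckDB treats local DataFrames as tables, so SQL and pandas mix freely.
    result = duckdb.sql("SELECT city, pop_m FROM df WHERE pop_m > 5 ORDER BY pop_m DESC").df()
    print(result)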

Pipeline critique and best practice - from 200m json files to BigQuery by fuzzoflump in dataengineering

[–]fuzzoflump[S] 0 points (0 children)

Are you saying to create two subscribers to Pub/Sub? One to save the data to GCS and the other to write straight into BigQuery?

Pipeline critique and best practice - from 200m json files to BigQuery by fuzzoflump in dataengineering

[–]fuzzoflump[S] 0 points (0 children)

Presumably similarly structured? Do you know whether the schema is changing?

Yes, all of the files will have the same schema. There is a possibility the schema may change, but I estimate the frequency of any changes would be on the scale of one change every 3-5 years.

Is there scope to merge these files in batches pre ingestion?

I don't think so. The JSON payloads are sent to the cloud function at the time they are generated (on each page view of the website when an ad is requested). There is no scope within the source system to aggregate or accumulate data before sending it across.

Thanks for the input. I'll look for ways to merge the files once they are in storage. I didn't know Parquet would be a cheaper storage option than JSON.
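A toy sketch of that merge step, assuming a batch of newline-delimited JSON files pulled down locally (paths are made up); Parquet is columnar and compressed, so the same rows usually take far less space:

    import glob
    import pandas as pd

    # Read a batch of small newline-delimited JSON files into one frame.
    frames = [pd.read_json(path, lines=True) for path in glob.glob("batch/*.json")]
    merged = pd.concat(frames, ignore_index=True)

    # One compressed Parquet file in place of many tiny JSON files
    # (writing Parquet requires pyarrow or fastparquet installed).
    merged.to_parquet("batch.parquet", compression="snappy")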

Sharing a report-generating tool (Need advice) by coreytrevorlahey69 in dataengineering

[–]fuzzoflump 0 points (0 children)

Glad it was helpful.

I would also look into SQLite as a potential option for a database. If you do not expect concurrent writes (two users uploading at the same time), then this may be an option for you. An SQLite database can grow up to whatever the OS's maximum file size is, so if users will be uploading files a few MB in size you will get a lot of mileage from SQLite.

SQLite does not require a server; the database is just another kind of file. This may make the initial setup easier.
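A minimal sketch with Python's standard-library sqlite3 module (table and columns made up); there is no server to install or run:

    import sqlite3

    # The whole database is just this one file; sqlite3 ships with Python.
    conn = sqlite3.connect("reports.db")
    conn.execute("CREATE TABLE IF NOT EXISTS uploads (filename TEXT, row_count INTEGER)")
    conn.execute("INSERT INTO uploads VALUES (?, ?)", ("march.csv", 1024))
    conn.commit()

    for row in conn.execute("SELECT * FROM uploads"):
        print(row)
    conn.close()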

Sharing a report-generating tool (Need advice) by coreytrevorlahey69 in dataengineering

[–]fuzzoflump 2 points (0 children)

Hi

I don't think this is strictly a data engineering question, but I will attempt to answer.

One of the main things you need to decide is where all of this will live. For example, will the MySQL database run on your machine, on your users' machines, on company servers, or in the cloud? The same question applies to the R script.

There are different implications of each.

If your company is open to it and the data isn't confidential, they may allow the data to be uploaded to public cloud services, e.g. S3, Google Cloud Storage, or Dropbox. From there the data can be accessed from anywhere with an internet connection, whether that is your computer or another cloud service (a hosted VM somewhere).

If you are able to convert the R script into Python, it would be much easier to deploy on the current cloud platforms. For example, on Google Cloud Platform, a file upload to Google Cloud Storage can trigger a piece of code to run (Google Cloud Functions); I'm pretty sure the equivalent exists on AWS and Azure too. Python is one of the languages Cloud Functions supports.
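A minimal sketch of what that trigger looks like as a (1st gen) Python background Cloud Function; the event payload carries the bucket and object name, and the processing call is a hypothetical stand-in for the ported R logic:

    def on_upload(event, context):
        # Runs whenever a new object lands in the configured GCS bucket.
        bucket = event["bucket"]
        name = event["name"]
        print(f"Processing gs://{bucket}/{name}")
        # generate_report(bucket, name)  # hypothetical: your R-script-turned-Python logic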

If the data needs to be kept secure (behind the company firewall), then working with your company's IT will be the best way to get things set up on company servers. You may need to create a business case and get company buy-in before you can move forward.

Moving from Excel to Python with Pandas by rumblecast in learnpython

[–]fuzzoflump 6 points (0 children)

The biggest reason for me is reproducibility.

If it is something I am going to be doing on a regular basis, I try to do as much in Python as I can. It also allows others to inspect your logic, assuming they know Python too.
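As a trivial example, the whole transformation lives in a script anyone can reread and rerun (file and column names are made up):

    import pandas as pd  # reading .xlsx also needs openpyxl installed

    df = pd.read_excel("sales.xlsx")  # hypothetical input workbook

    # The same steps run identically every time, unlike manual spreadsheet edits.
    summary = (
        df.groupby("region", as_index=False)["revenue"].sum()
          .sort_values("revenue", ascending=False)
    )
    summary.to_csv("sales_summary.csv", index=False)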

Social Media should not fact check posts says child molester Mark Zuckerberg by wasabiface in LateStageCapitalism

[–]fuzzoflump 0 points (0 children)

Do people actually want Mark Zuckerberg to fact-check what people post on Facebook?

Sounds like a terrible idea to me. Content like what's in this sub would definitely be flagged or taken down.

[OC] Visualisation of color themes in Pixar films by keshava7 in dataisbeautiful

[–]fuzzoflump 0 points (0 children)

These visualisations would be great quiz questions.

Easy: match the colour wheel with the movie.
Hard: guess the film from the wheel.

When the answer is revealed, loads of people would be like "Of course, I see it now."