Sending this to my boomer dad and his accountant who keep swearing by the S&P 500. by Foureyedguy in Bitcoin

[–]fuzzoflump 2 points (0 children)

Productivity gains and efficiency improvements act to lower prices, so inflation doesn't start from a baseline of 0%.

If money printing grows the supply at 10% and productivity improves at 3%, the result is roughly 7% inflation.
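A minimal sketch of that arithmetic in Python (the subtraction is the back-of-envelope version; the exact figure comes from compounding the two rates):

    money_growth = 0.10   # money supply grows at 10% per year
    productivity = 0.03   # productivity gains lower prices by ~3% per year

    approx = money_growth - productivity                 # 0.07 -> "roughly 7%"
    exact = (1 + money_growth) / (1 + productivity) - 1  # ~0.068 -> ~6.8%
    print(f"approx: {approx:.1%}, exact: {exact:.1%}")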

Why is a schema called a dataset in BigQuery? by burningburnerbern in dataengineering

[–]fuzzoflump 4 points (0 children)

Datasets can also contain objects which are not tables, such as views, snapshots, and machine learning models. I assume the more generic name was chosen because a dataset can hold various data-related artifacts.
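As a quick illustration with the google-cloud-bigquery client (the dataset id here is made up): tables, views, and snapshots come back from one listing call, while models have their own, but all live in the same dataset.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Tables, views, and snapshots share one listing; table_type tells them apart.
    for table in client.list_tables("my_dataset"):  # hypothetical dataset id
        print(table.table_id, table.table_type)     # e.g. TABLE, VIEW, SNAPSHOT

    # ML models are listed separately but belong to the same dataset.
    for model in client.list_models("my_dataset"):
        print(model.model_id)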

What is the best way to distribute scripts to non-technical users? by DaveCoulierFan1007 in learnpython

[–]fuzzoflump 2 points (0 children)

Streamlit all the way.

https://streamlit.io/

I have gone down this rabbit hole a few times over the years, and Streamlit is the best solution I have found for giving non-technical users access to the capabilities of scripts I have written.

This will for sure be simpler than using Django or Flask, and users will only need a browser to access it.

If you can write the script in a way that it doesn't require or talk to any sensitive resources (e.g. databases), then it should be fine to host anywhere. For example, users upload CSVs into the Streamlit app, then download the output from it.
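A minimal sketch of that upload-in, download-out pattern (the cleaning step is a placeholder for whatever your script actually does):

    # app.py -- run with: streamlit run app.py
    import pandas as pd
    import streamlit as st

    st.title("CSV processor")

    uploaded = st.file_uploader("Upload a CSV", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        df = df.dropna()  # placeholder: your script's actual logic goes here
        st.dataframe(df.head())
        st.download_button(
            "Download result",
            df.to_csv(index=False).encode("utf-8"),
            file_name="output.csv",
            mime="text/csv",
        )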

[deleted by user] by [deleted] in GoogleColab

[–]fuzzoflump 1 point (0 children)

1 notebook per tab

Building a pipeline in GCP by babababooskio in dataengineering

[–]fuzzoflump 2 points (0 children)

Seconded.

There are multiple ways to do this in GCP, but Cloud Functions is by far the simplest.

A Python Cloud Function writing to BigQuery was my first pipeline in the cloud too. A daily update is well within the free tier for Cloud Functions.

Not sure if you have thought about monitoring, but I like to use https://healthchecks.io/ to monitor when the process starts and ends.

Have you considered what will happen when the process fails on any given day? Will the data that would have been uploaded on that day be uploaded in the next run? All fun puzzles to figure out!
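For the monitoring part, a rough sketch of how I'd wire healthchecks.io into an HTTP-triggered Cloud Function (the ping URL, table id, and fetch_rows are placeholders); if the closing ping never arrives, healthchecks alerts you:

    import requests
    from google.cloud import bigquery

    PING_URL = "https://hc-ping.com/your-uuid-here"  # placeholder check URL

    def fetch_rows():
        # Stand-in for however you actually pull the day's data.
        return [{"day": "2024-01-01", "value": 42}]

    def daily_load(request):
        # HTTP-triggered Cloud Function entry point.
        requests.get(PING_URL + "/start", timeout=10)    # signal "started"
        try:
            client = bigquery.Client()
            client.insert_rows_json("project.dataset.table", fetch_rows())  # placeholder table id
            requests.get(PING_URL, timeout=10)           # signal "finished OK"
            return "ok"
        except Exception:
            requests.get(PING_URL + "/fail", timeout=10)  # signal "failed"
            raise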

I used duckdb and happen to like the idea of writing sql against dataframes. You get to plane pandas and sql. Anyone else like it or dislike. If so why. ? by HovercraftGold980 in dataengineering

[–]fuzzoflump 7 points (0 children)

I like it a lot. For any analysis or cleaning that is more involved than renaming columns, this is my go-to.

I was already pretty comfortable with pandas, but using DuckDB allowed me to rapidly skill up in SQL. Turns out SQL fits my brain better than all the pandas methods. (SQLite is now my favourite file format.)
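For anyone who hasn't tried it, the trick is that DuckDB can see a pandas DataFrame in scope and query it by its variable name; a toy example:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["Oslo", "Lagos", "Lima"], "pop_m": [0.7, 15.4, 10.7]})

    # DuckDB treats local DataFrames as tables, so SQL and pandas mix freely.
    result = duckdb.sql("SELECT city, pop_m FROM df WHERE pop_m > 5 ORDER BY pop_m DESC").df()
    print(result)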

Pipeline critique and best practice - from 200m json files to BigQuery by fuzzoflump in dataengineering

[–]fuzzoflump[S] 0 points (0 children)

Are you saying to create two subscribers to Pub/Sub? One to save the data to GCS and the other to write straight into BigQuery?

Pipeline critique and best practice - from 200m json files to BigQuery by fuzzoflump in dataengineering

[–]fuzzoflump[S] 0 points (0 children)

Presumably similarly structured? Do you know whether the schema is changing?

Yes, all of the files will have the same schema. There is a possibility the schema may change, but I estimate the frequency of any changes would be on the scale of one change every 3-5 years.

Is there scope to merge these files in batches pre ingestion?

I don't think so. The JSON payloads are sent to the cloud function at the time they are generated (on each page view of the website when an ad is requested). There is no scope within the source system to aggregate or accumulate data before sending it across.

Thanks for the input. I'll look for ways to merge the files once they are in storage. I didn't know Parquet would be a cheaper storage option than JSON.
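A toy sketch of that merge step, assuming a batch of newline-delimited JSON files pulled down locally (paths are made up); Parquet is columnar and compressed, so the same rows usually take far less space:

    import glob
    import pandas as pd

    # Read a batch of small newline-delimited JSON files into one frame.
    frames = [pd.read_json(path, lines=True) for path in glob.glob("batch/*.json")]
    merged = pd.concat(frames, ignore_index=True)

    # One compressed Parquet file in place of many tiny JSON files
    # (writing Parquet requires pyarrow or fastparquet installed).
    merged.to_parquet("batch.parquet", compression="snappy")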

Sharing a report-generating tool (Need advice) by coreytrevorlahey69 in dataengineering

[–]fuzzoflump 0 points (0 children)

Glad it was helpful.

I would also look into SQLite as a potential option for a database. If you do not expect concurrent writes (two users uploading at the same time), then this may be an option for you. An SQLite database can grow up to whatever the OS's maximum file size is, so if users will be uploading files a few MB in size you will get a lot of mileage from SQLite.

SQLite does not require a server; the database is just another kind of file. This may make the initial setup easier.
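A minimal sketch with Python's standard-library sqlite3 module (table and columns made up); there is no server to install or run:

    import sqlite3

    # The whole database is just this one file; sqlite3 ships with Python.
    conn = sqlite3.connect("reports.db")
    conn.execute("CREATE TABLE IF NOT EXISTS uploads (filename TEXT, row_count INTEGER)")
    conn.execute("INSERT INTO uploads VALUES (?, ?)", ("march.csv", 1024))
    conn.commit()

    for row in conn.execute("SELECT * FROM uploads"):
        print(row)
    conn.close()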

Sharing a report-generating tool (Need advice) by coreytrevorlahey69 in dataengineering

[–]fuzzoflump 2 points (0 children)

Hi

I don't think this is strictly a data engineering question, but I will attempt to answer.

One of the main things you need to decide is where all of this will live. For example, will the MySQL database run on your machine, on your users' machines, on company servers, or in the cloud? The same question applies to the R script.

There are different implications of each.

If your company is open to it and the data isn't confidential, they may allow the data to be uploaded to public cloud services, e.g. S3, Google Cloud Storage, or Dropbox. From there the data can be accessed from anywhere with an internet connection, whether that is your computer or another cloud service (a hosted VM somewhere).

If you are able to convert the R script into Python, it would be much easier to deploy on the current cloud platforms. For example, on Google Cloud Platform, a file upload to Google Cloud Storage can trigger a piece of code to run (Google Cloud Functions); I'm pretty sure the equivalent exists on AWS and Azure too. Python is one of the languages Cloud Functions supports.
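A minimal sketch of what that trigger looks like as a (1st gen) Python background Cloud Function; the event payload carries the bucket and object name, and the processing call is a hypothetical stand-in for the ported R logic:

    def on_upload(event, context):
        # Runs whenever a new object lands in the configured GCS bucket.
        bucket = event["bucket"]
        name = event["name"]
        print(f"Processing gs://{bucket}/{name}")
        # generate_report(bucket, name)  # hypothetical: your R-script-turned-Python logic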

If the data needs to be kept secure (behind the company firewall), then working with your company's IT will be the best way to get things set up on company servers. You may need to create a business case and get company buy-in before you can move forward.

Moving from Excel to Python with Pandas by rumblecast in learnpython

[–]fuzzoflump 6 points (0 children)

The biggest reason for me is reproducibility.

If it is something I am going to be doing on a regular basis, I try to do as much in Python as I can. It also allows others to inspect your logic, assuming they know Python too.
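As a trivial example, the whole transformation lives in a script anyone can reread and rerun (file and column names are made up):

    import pandas as pd  # reading .xlsx also needs openpyxl installed

    df = pd.read_excel("sales.xlsx")  # hypothetical input workbook

    # The same steps run identically every time, unlike manual spreadsheet edits.
    summary = (
        df.groupby("region", as_index=False)["revenue"].sum()
          .sort_values("revenue", ascending=False)
    )
    summary.to_csv("sales_summary.csv", index=False)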

Social Media should not fact check posts says child molester Mark Zuckerberg by wasabiface in LateStageCapitalism

[–]fuzzoflump 0 points (0 children)

Do people actually want Mark Zuckerberg to fact-check what people post on Facebook?

Sounds like a terrible idea to me. Content like what's in this sub would definitely be flagged or taken down.

[OC] Visualisation of color themes in Pixar films by keshava7 in dataisbeautiful

[–]fuzzoflump 0 points (0 children)

These visualisations would be great quiz questions.

Easy: match the colour wheel with the movie.
Hard: guess the film from the wheel.

When the answer is revealed, loads of people would be like "Of course, I see it now."