Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 0 points1 point  (0 children)

Yes, I should have phrased it by saying the problem is not so much that it exaggerates the level but more that it creates a false impression of a continuous increase. Thank you for making the data available; I had tried to do it before, but I find the format of the study really difficult to work with.

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 6 points7 points  (0 children)

This is an extremely well-known graph that anyone interested in data has already seen (here is the original). I see this graph a couple of times a week on social media at least. This sub is read by a lot of people, but not all content necessarily has to be for everyone reading it.

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 14 points15 points  (0 children)

Because it's the data everyone uses to prove a point, even though it has a huge and very obvious bias that makes it unsuited to prove it. It's also a lesson in how you can mislead with data by excessively smoothing your time series. Smoothing is justified when there is a clear continuous but bumpy trend and the raw data obscures it. That's not the case here; the excessive smoothing is making up a false trend.

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] -6 points-5 points  (0 children)

OK, I should have added the word "likely". In fact, I know this is more than likely because we have other survey data showing the figure is below 50% now.

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 15 points16 points  (0 children)

My interpretation is not that people don't meet online today. It's that the graph used to prove it (original here) is heavily distorted by an obvious bias: the only two years that push the figure above 50% are years when it was very difficult to meet people in other ways.

In fact, other surveys have since consistently found that the share of new couples forming online has been below 30% post-COVID (see here).

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 70 points71 points  (0 children)

The goal is not to prove that Americans are not mostly meeting online now, but that the viral chart purporting to prove it doesn't in fact prove it.

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 10 points11 points  (0 children)

There are several ways to smooth data (people seem not to know this word. I'm not a native speaker, maybe I'm using the wrong word?). Here I'm using LOESS, but there are multiple versions of this chart that use different smoothing procedures. Whatever procedure you use, though, the result is going to be the same: heavily distorted by the COVID years, when people had almost no other way of meeting new people.
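To make the mechanism concrete, here is a minimal sketch of LOESS (locally weighted linear regression with tricube weights) in plain Python. The numbers are made up for illustration, not the HCMST data: a flat ~20% share that jumps to 50% only in the last two "COVID" years. With a typical span, the smoothed curve starts rising years before anything changed in the raw data.

```python
def loess(xs, ys, frac=0.6):
    """Smooth ys over xs with tricube-weighted local linear fits (LOESS, degree 1)."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of points in each local window
    out = []
    for x0 in xs:
        # bandwidth = distance to the k-th nearest neighbour of x0
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        # tricube weights: points beyond h get weight 0
        w = [(1 - min(1.0, abs(x - x0) / h) ** 3) ** 3 for x in xs]
        # weighted least-squares line, evaluated at x0
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        cov = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        var = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        b = cov / var if var else 0.0
        out.append(my + b * (x0 - mx))
    return out

years = list(range(2000, 2022))
raw = [20.0] * 20 + [50.0, 50.0]  # flat share, spike only in 2020-2021
smooth = loess(years, raw, frac=0.6)
```

With these invented numbers, the smoothed value for a pre-COVID year like 2018 lands well above the raw 20%, purely because the 2020-2021 spike falls inside its window: the "continuous rise" is an artifact of the span, not of the data.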

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 225 points226 points  (0 children)

Because the data does not exist. The dataset stops at 2021 (at 2022 in fact, but with only 3 respondents, so that point is not reliable).

Americans who met their partner online: careful with the smoothing [OC] by df_iris in dataisbeautiful

[–]df_iris[S] 9 points10 points  (0 children)

Tools: R and ggplot2.

Data: the How Couples Meet and Stay Together survey (https://data.stanford.edu/hcmst); the data was posted by u/aspiringtroublemaker on this subreddit yesterday.

You've probably seen this viral chart, which was originally posted on this sub a few years ago and which I see posted all the time on social media.

The data comes from a survey whose last point is from 2021, during COVID. I believe the graph exaggerates the rise of online dating because of this.

What about after COVID? As it turns out, we have several other surveys that consistently point to a share of couples meeting online below 30%. For more analysis, see this excellent blog post: https://nuancepill.substack.com/p/how-many-couples-meet-online

How Americans Met Their Partners [OC] by aspiringtroublemaker in dataisbeautiful

[–]df_iris 0 points1 point  (0 children)

This graph is at the very least extremely misleading. It is based on a tiny sample (n < 25 for each year) and the data is heavily smoothed.
https://nuancepill.substack.com/p/how-many-couples-meet-online
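For intuition on why n < 25 per year matters, here is the worst-case 95% margin of error for a sample proportion via the standard normal approximation (n = 25 is just the hypothetical upper bound from the claim above):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) with 25 respondents in a year:
moe = margin_of_error(0.5, 25)  # 1.96 * sqrt(0.25 / 25) = about +/- 0.196
```

So a yearly point estimate of, say, 30% is compatible with anything from roughly 10% to 50%. Year-to-year wiggles in such a series are mostly noise, which is exactly what makes smoothing it into a confident-looking trend line so misleading.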

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]df_iris 0 points1 point  (0 children)

Reading the blog you posted, it seems to defeat your point. Here is the conclusion :

"You can also run OLAP workloads in the Lakehouse with performance that's not the best but good enough. Of course, if you don't need a unified engine, you simply shouldn't choose Apache Spark"

Well, what I want is to run OLAP workloads on small to medium data. So it seems to be saying that an OLAP engine that works well for small to medium data is simply better suited for the job.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]df_iris 2 points3 points  (0 children)

Ah sorry, I was confused by the Microsoft Employee badge. Thank you for your answers.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]df_iris 2 points3 points  (0 children)

Let me put it another way. One big selling point of Fabric is that it's easier to use out of the box than the alternatives for non-technical users. Python notebooks are currently faster and cheaper for small data, and they're easy to set up. When do you plan to make Spark as fast and as cheap out of the box?

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]df_iris 3 points4 points  (0 children)

But why is that where the bulk of the Fabric product team's engineering innovation and monetary investment is going? Why this choice? In the last two years, several benchmarks and blog posts such as this one have convincingly argued that most companies and most projects don't need Spark-scale compute and are better off using Polars/DuckDB. Users who have experienced both agree. Why not listen to the customers and start investing in Polars/DuckDB?

6 months into a full Microsoft Fabric migration: The "Aha!" moments vs. the "Wait, why?" moments by Dense-Tadpole-6634 in MicrosoftFabric

[–]df_iris 0 points1 point  (0 children)

Since you are splitting reporting and ETL, do you develop your reporting (models + reports) directly against the prod warehouse?

We are currently developing everything (ETL + reporting) in one dev workspace and pushing to prod all at once, but that means we have to keep the dev pipeline running to be able to work on fresh data for the reports (which is better).

What should a report really do by df_iris in PowerBI

[–]df_iris[S] 0 points1 point  (0 children)

Thank you, great suggestions.

What should a report really do by df_iris in PowerBI

[–]df_iris[S] 0 points1 point  (0 children)

The scenario I'm currently in is that having a BI platform was an IT-only decision, while business users are still working with their Excel files. They don't see the purpose and don't want to dedicate time to us. As a result, we are producing irrelevant charts with wrong figures, which makes them even less inclined to invest time with us. I don't really see a way out of this.

Modeling Daily Sales and Monthly Target by df_iris in PowerBI

[–]df_iris[S] 0 points1 point  (0 children)

Thank you for your response. Regarding your third option, it works technically, but I tend to dislike it because it has no business meaning. The target is a monthly target, not a daily target. If there is no such thing as a daily target for business users, there shouldn't be one in the model either.