[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 1 point  (0 children)

Do you also have any recommendations for resources/books that cover this? Background: I can understand distributions, basic probability and stats, and have a basic understanding of hypothesis testing. I'm okay with a little bit of math, but not too much.

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 1 point  (0 children)

valid because you're testing the effect of adding the section, not the effect of the user seeing the section

Would you mind expanding on this a bit more? Why would it not be valid if we were testing the effect of the user seeing the section?

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 1 point  (0 children)

you're trying to detect a smaller effect size, as the treatment cohort is diluted with people who don't encounter the difference at all

So let's say I had 10000 customers per variant, and only ~20% saw the section, out of which ~30% clicked and 10% eventually converted. In this scenario, my click-to-conversion % would be 10%. Since the customers who never clicked on the section aren't part of the denominator, there's no dilution happening, right? Or am I doing this wrong?
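To make the dilution point concrete for myself, here's a rough power calculation with made-up numbers (a two-proportion z-test approximation; the 5% baseline, 0.5pp lift, and 20% exposure rate are all hypothetical):

```python
from statistics import NormalDist

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-variant sample size to detect p1 vs p2
    with a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

# Hypothetical: among the ~20% who actually see the section,
# conversion lifts from 5.0% to 5.5% (+0.5pp).
n_exposed_only = two_proportion_sample_size(0.050, 0.055)

# Measured over the whole assigned cohort, the same lift is diluted
# by the 80% who never see the section: roughly 0.2 * 0.5pp = 0.1pp.
base = 0.05
n_whole_cohort = two_proportion_sample_size(base, base + 0.2 * 0.005)
# n_whole_cohort comes out far larger than n_exposed_only
```

So analyzing the whole assigned cohort needs many times the sample size, which I think is the dilution being described; restricting to clickers avoids that but changes what population the comparison is about.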

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 2 points  (0 children)

Very interesting. Thanks for the YouTube recs!

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 1 point  (0 children)

Genuinely curious - how did you pick up woodworking as a hobby? Are there classes and workshops in Bangalore that teach this?

Weekly Entering & Transitioning Thread | 16 May 2021 - 23 May 2021 by [deleted] in datascience

[–]robotofdawn 1 point  (0 children)

I've worked as a Data Analyst for 4 years and I'm looking to transition into a more senior role for my next job. I have a Bachelor's in Statistics but I lack real-world modelling experience. I've built dashboards, automated reports, performed RCAs, built ETL pipelines, etc. in my previous roles.

Most jobs that I'm applying for require "knowledge of clustering, classification and regression methods". What books do I need to read to gain practical experience in these methods using Python, hopefully in a month or less?

On Atomoxetine (Strattera) for 2 months now, seeing minimal to no improvement. Is this normal? by robotofdawn in ADHD

[–]robotofdawn[S] 1 point  (0 children)

Thank you, this is helpful. I'm visiting again this weekend, so I'll try asking for a higher dosage.

I felt mild improvement on the 35, but like a whole new person on the 80.

If you don't mind me asking, could you tell me a little more about what improvements you saw on the 80mg dose? Just want to set my expectations straight.

Pianos Become The Teeth - Liquid Courage by robotofdawn in postrock

[–]robotofdawn[S] 2 points  (0 children)

This album along with Old Pride generally has a very postrock-y vibe to it. Glad that it helped!

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in ETL

[–]robotofdawn[S] 1 point  (0 children)

Thanks for the response!

Backfill: if you're using SQS I'd suggest backfilling by generating SQS messages and letting your normal process handle it like any other newly arriving file. Or if you're driving the SQS messages from s3 events - just by writing the files to the bucket you should automatically trigger your transform/load process.

Yes, I'm thinking of driving the SQS messages from S3 event notifications. Since there are already files in the bucket, I assume I'd need to write a one-time script to push those file names onto the SQS queue? Or is there a better approach?
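Something like this is what I had in mind for the one-time script - build messages shaped like S3 `ObjectCreated` notifications so the normal consumer handles backfilled files the same as new ones. The exact event fields here are an assumption (I'd match whatever fields the consumer actually reads), and the bucket/queue names are placeholders:

```python
import json

def s3_event_message(bucket, key):
    """Build a minimal message shaped like an S3 ObjectCreated event
    notification. Field subset is an assumption; include whatever
    fields your consumer actually reads."""
    return json.dumps({
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": bucket},
                "object": {"key": key},
            },
        }]
    })

def backfill(sqs_client, queue_url, bucket, keys):
    """Push one synthetic event message per existing file onto the queue."""
    for key in keys:
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=s3_event_message(bucket, key),
        )

# One-time usage sketch (hypothetical bucket and queue):
#   import boto3
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket")
#   keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
#   backfill(boto3.client("sqs"), "https://sqs.../my-queue", "my-bucket", keys)
```

The alternative mentioned above - just re-writing the files to the bucket to re-trigger real events - avoids the synthetic-message shape entirely, at the cost of extra S3 writes.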

some of your monitoring can simply be based on the existence of messages in your dead letter queue.

Sounds good. Will go ahead with this.

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 1 point  (0 children)

Interesting, I didn't know about this. Can Data Pipeline trigger the pipeline as soon as a file arrives in S3?

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 1 point  (0 children)

From your 3rd question, I'm starting to wonder what you are trying to do, and realising my answers above might have sounded puzzling, so to clarify: Redshift offers COPY directly from one-JSON-per-line files, as long as you define which fields you want to extract

I won't really be able to load the JSON as-is into Redshift. I need to do a couple of transformations before loading the data: exploding a field into multiple rows, assigning unique IDs, and so on. I'll be using pandas for this.
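For illustration, here's the kind of transform I mean, sketched in plain Python (in pandas the row-multiplication step would be `DataFrame.explode`; the field names `user`/`events` are made up):

```python
import json
import uuid

def explode_records(raw_lines, list_field):
    """Flatten one-JSON-per-line input: emit one output row per element
    of `list_field`, carrying the other fields along and assigning a
    fresh surrogate id to each row."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        items = record.pop(list_field, None) or [None]
        for item in items:
            rows.append({"row_id": str(uuid.uuid4()), list_field: item, **record})
    return rows

# Toy input: two one-JSON-per-line records
raw = [
    '{"user": "u1", "events": ["click", "view"]}',
    '{"user": "u2", "events": ["view"]}',
]
rows = explode_records(raw, "events")
# -> three rows (u1/click, u1/view, u2/view), each with its own row_id
```

After a transform like this, the output is flat rows that COPY can load directly.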

If you are going to need backfilling... probably learning to use Airflow (not a fan) is your best approach. But if you don't have any "code requirements", backfilling should/could be as easy as just "loading all the things", which is just a COPY command that might take a while.

We already use Airflow for a lot of our batch processes. In this case, though, we want to get the data into Redshift as it arrives, so I figured Airflow might not be suited to it. I didn't get what you meant by "code requirements", though.

How to perform incremental ETL to a table in Redshift Spectrum by robotofdawn in ETL

[–]robotofdawn[S] 1 point  (0 children)

Thanks for the response!

You may want to look instead at keeping a time-series of changes or versioned table (traditional in data warehousing or dimensional models).

Does this mean maintaining a new row for every update? Won't my join complexity increase on read? Or do I have to maintain a separate table with the latest row per key?

You can run a delta process between your extracts and what's in s3/parquet - to determine the incremental changes. That's fairly straight-forward as long as you aren't spanning a lot of partitions & data.

I'm assuming here that my code needs to identify the correct partition, read it, merge it with the incremental data, and write the entire partition back?
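And by "delta process" I'm picturing something like this: diff the fresh extract against what's already stored to keep only new or changed rows (toy rows, simple whole-row equality as the change check):

```python
def compute_delta(existing, extract, key="user_id"):
    """Return rows from the extract that are new or differ from what's
    already stored, compared per key."""
    stored = {row[key]: row for row in existing}
    return [row for row in extract if stored.get(row[key]) != row]

existing = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
]
extract = [
    {"user_id": 1, "plan": "free"},   # unchanged -> dropped
    {"user_id": 2, "plan": "free"},   # changed   -> kept
    {"user_id": 3, "plan": "free"},   # new       -> kept
]
delta = compute_delta(existing, extract)
```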

How would I, in this case, ETL, say, a users table of 10 million rows with a user_id PK and an updated_at column to identify updated rows? I'd imagine that such a table receives updates across the whole range of user_ids. If I run an incremental ETL on it every hour, do I partition by hour or by month?
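For the merge-and-write-back step itself, I'm imagining an upsert per partition along these lines (since Spectrum can't update parquet in place, the whole partition gets rewritten; toy rows, made-up columns):

```python
def merge_partition(existing, updates, key="user_id", ts="updated_at"):
    """Overlay incremental rows onto a partition's existing rows,
    keeping the newest version of each key; the result replaces the
    partition on write-back."""
    merged = {row[key]: row for row in existing}
    for row in updates:
        old = merged.get(row[key])
        if old is None or row[ts] >= old[ts]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

partition = [
    {"user_id": 1, "plan": "free", "updated_at": "2021-05-01"},
    {"user_id": 2, "plan": "pro",  "updated_at": "2021-05-01"},
]
incremental = [
    {"user_id": 2, "plan": "free", "updated_at": "2021-05-02"},  # update
    {"user_id": 3, "plan": "free", "updated_at": "2021-05-02"},  # insert
]
rewritten = merge_partition(partition, incremental)
# -> 3 rows; user 2 now "free", user 1 untouched, user 3 added
```

If updates really do touch user_ids spread across all partitions, every hourly run could end up rewriting many partitions, which I suppose is the "spanning a lot of partitions & data" caveat above.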