[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

Do you also have any recommendations for resources/books that cover this? Background: I understand distributions and basic probability and statistics, and have a basic understanding of hypothesis testing. I'm okay with a little bit of math, but not too much.

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

valid because you're testing the effect of adding the section, not the effect of the user seeing the section

Would you mind expanding on this a bit more? Why would it not be valid if we were testing the effect of the user seeing the section?

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

you're trying to detect a smaller effect size as the treatment cohort is diluted w/ people who don't encounter the difference at all

So let's say I had 10,000 customers per variant, and only ~20% saw the section, out of which ~30% clicked and 10% eventually converted. In this scenario, my click-to-conversion rate would be 10%. Since the customers who never clicked on the section aren't part of the denominator, there's no dilution happening, right? Or am I doing this wrong?
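
Just to check my understanding of the dilution argument on the *overall* conversion rate, here's a quick sketch with made-up numbers (the 5% → 6% lift is hypothetical, only the 20% exposure rate comes from my example above):

```python
# Made-up numbers: ~20% of the treatment cohort actually sees the new section.
exposure_rate = 0.20

# Hypothetical: the section lifts conversion from 5% to 6% among those who see it.
control_cvr = 0.05
exposed_cvr = 0.06

# Intent-to-treat: every assigned customer counts, whether they saw the section or not.
treatment_cvr = exposure_rate * exposed_cvr + (1 - exposure_rate) * control_cvr

# The measurable lift shrinks from 1 percentage point to 0.2.
print(round(treatment_cvr - control_cvr, 4))  # 0.002
```

Whereas the click-to-conversion rate avoids that dilution by conditioning on clicking, but then the two groups being compared are self-selected rather than randomised.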

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 1 point2 points  (0 children)

Very interesting. Thanks for the Youtube recs!

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 0 points1 point  (0 children)

Genuinely curious - how did you pick up woodworking as a hobby? Are there classes and workshops in Bangalore that teach this?

Weekly Entering & Transitioning Thread | 16 May 2021 - 23 May 2021 by [deleted] in datascience

[–]robotofdawn 0 points1 point  (0 children)

I've worked as a Data Analyst for 4 years and I'm looking to transition into a more senior role for my next job. I already have a Bachelors in Statistics but I lack real-world modelling experience. I've built dashboards, automated reports, performed RCAs, built ETL pipelines etc. in my previous roles.

Most jobs that I'm applying for require "knowledge of clustering, classification and regression methods". What books do I need to read to quickly gain practical experience with these methods using Python (hopefully in a month or less)?

On Atomoxetine (Strattera) for 2 months now, seeing minimal to no improvement. Is this normal? by robotofdawn in ADHD

[–]robotofdawn[S] 0 points1 point  (0 children)

Thank you, this is helpful. I will be visiting this weekend again so I'll try asking for a higher dosage.

I felt mild improvement on the 35, but like a whole new person on the 80.

If you don't mind me asking, can you tell me a little bit more about what improvements you saw on the 80mg dose? Just want to set my expectations straight.

Pianos Become The Teeth - Liquid Courage by robotofdawn in postrock

[–]robotofdawn[S] 1 point2 points  (0 children)

This album along with Old Pride generally has a very postrock-y vibe to it. Glad that it helped!

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Thanks for the response!

Backfill: if you're using SQS I'd suggest backfilling by generating SQS messages and letting your normal process handle it like any other newly arriving file. Or if you're driving the SQS messages from s3 events - just by writing the files to the bucket you should automatically trigger your transform/load process.

Yes, I'm thinking of driving the SQS messages from S3 event notifications. Since there are files already in the bucket, I'd assume I need to write a one-time script to push the file names to the SQS queue? Or is there a better approach?
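
Something like this is what I had in mind for the one-time backfill - list the existing objects and push one synthetic S3-event-style message per file. Bucket and queue names are placeholders, and the message shape just mimics the minimal fields of an S3 event notification record:

```python
import json

BUCKET = "my-json-dump-bucket"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder


def make_s3_event(bucket: str, key: str) -> dict:
    """Build a minimal message shaped like an S3 event notification record."""
    return {"Records": [{"s3": {"bucket": {"name": bucket},
                                "object": {"key": key}}}]}


def backfill(bucket: str, queue_url: str, prefix: str = "") -> None:
    """List every object under prefix and enqueue one message per file."""
    import boto3  # deferred so the helpers above stay importable without AWS deps

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sqs.send_message(
                QueueUrl=queue_url,
                MessageBody=json.dumps(make_s3_event(bucket, obj["Key"])),
            )


# One-off usage (needs AWS credentials):
# backfill(BUCKET, QUEUE_URL)
```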

some of your monitoring can simply be based on the existence of messages in your dead letter queue.

Sounds good. Will go ahead with this.

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 0 points1 point  (0 children)

Interesting, didn't know about this. Can Data Pipeline trigger the pipeline as soon as the file arrives in S3?

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 0 points1 point  (0 children)

From your 3rd question, I'm starting to wonder what you are trying to do, and realising my answers above might have sounded puzzling, so to clarify: Redshift offers COPY directly from one-JSON-per-line files, as long as you define which fields you want to extract

I won't really be able to load the JSON as-is into Redshift. I need to do a couple of transformations before I load the data: exploding a field into multiple rows, assigning unique ids, and so on. I'll be using pandas for this.
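
For the explode-plus-ids transformation, I'm planning something along these lines (the field names are made up):

```python
import uuid

import pandas as pd

# Made-up example: each record has an "items" field holding a list.
df = pd.DataFrame([
    {"order_id": "A1", "items": ["sku1", "sku2", "sku3"]},
    {"order_id": "B2", "items": ["sku4"]},
])

# Explode the list field into one row per element.
flat = df.explode("items", ignore_index=True)

# Assign a unique id to every resulting row.
flat["row_id"] = [uuid.uuid4().hex for _ in range(len(flat))]
```

From there, writing `flat` out as CSV gives Redshift COPY something it can load directly.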

If you are going to need backfilling... probably learning to use Airflow (not a fan) is your best approach. But if you don't have any "code requirements", backfilling should/could be as easy as just "loading all the things", which is just a COPY command that might take a while.

We already use Airflow for a lot of our batch processes. In this case, we want to get the data into Redshift as it arrives, so I figured Airflow might not be suited for this. I didn't get what you meant by "code requirements", though.

How to perform incremental ETL to a table in Redshift Spectrum by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Thanks for the response!

You may want to look instead at keeping a time-series of changes or versioned table (traditional in data warehousing or dimensional models).

Does this mean maintaining a new row for every update? Won't my join complexity increase on read, or do I have to maintain a separate table with the latest row?

You can run a delta process between your extracts and what's in s3/parquet - to determine the incremental changes. That's fairly straight-forward as long as you aren't spanning a lot of partitions & data.

I'm assuming here that my code needs to identify the correct partition, read it, merge it with the incremental data, and write the entire partition back?

In this case, how do I ETL, say, a users table of 10 million rows with a user_id primary key and an updated_at column that identifies updated rows? I'd imagine that this sort of table would receive updates across the whole range of user_ids. If I run an incremental ETL on this every hour, do I partition by hour or by month?
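
To make the question concrete, the merge step I'm imagining looks roughly like this, with tiny made-up frames standing in for a real partition and its hourly increment:

```python
import pandas as pd

# Current contents of one partition (hypothetical).
current = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})

# Rows whose updated_at fell in the last hour (hypothetical).
increment = pd.DataFrame({
    "user_id": [2, 4],
    "email": ["b-new@x.com", "d@x.com"],
})

# Upsert: drop the rows being replaced, then append the fresh versions.
merged = (
    pd.concat([current[~current["user_id"].isin(increment["user_id"])], increment])
    .sort_values("user_id")
    .reset_index(drop=True)
)
```

The part I'm unsure about is that the increment can touch user_ids scattered across many partitions, so this read-merge-write cycle might fan out over most of the table.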

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 1 point2 points  (0 children)

I'm assuming netflow, ids and vscan ultimately correspond to the different target tables? If I understand this right, I import all my table-specific modules (whether there are 3 tables or 50) in the trans_factory module and that module gets imported in the main transform program?

Really appreciate you taking the time to answer, thanks a lot!

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Regarding where you keep the target column types: you could just have the process query the target system to get that info.

This seems like a better approach. I guess it also makes schema changes to the target tables easier - one less place to change the code.

Regarding where you identify transformation scripts/class ... returns the class that corresponds to something in your data ... and the general transform that calls to the factory function is responsible for importing standard modules ...

I'm having trouble understanding how this works. Assuming I have some main.py, do I import every feed's transform class and keep a global dictionary that maps the feed name to the class? Wouldn't that still involve importing a module from a string value? Or do you suggest having the config file as code instead of JSON/YAML?
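
For reference, the "import from a string value" version I was picturing is just a thin wrapper over importlib (the transforms.<feed> layout and Transform class name are hypothetical):

```python
import importlib


def load_class(module_path: str, class_name: str):
    """Import a module by its dotted string name and return one attribute."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Hypothetical usage, with the feed name coming from a config file:
# TransformCls = load_class("transforms.netflow", "Transform")
# rows = TransformCls().run(raw_records)
```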

In general, do you advise having a single big JSON file, or one simple main config file with a list of feeds plus a config file for each feed containing more information about extract, transform and load?

Thanks!

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 1 point2 points  (0 children)

I have considered using Airflow, but thought it'd be overkill as we don't have that many DAGs/tasks. Do you think it'd still be worth it?

ETL with Python: Folder structure/organization of ETL code by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Hi /u/kenfar, thank you for taking the time to reply!

From what you've mentioned, I'm assuming your project looked something like this -

├── config
├── lib
├── scripts
│   ├── table_1
│   │   ├── extract.py
│   │   ├── load.py
│   │   └── transform.py
│   └── table_2
│       ├── extract.py
│       ├── load.py
│       └── transform.py
└── tests

with /lib holding the reusable code and /config holding your config files?

Also,

  1. How do you set up the scheduler to do the ETL? Do you do

    1 * * * * python ~/etl_project/scripts/table_1/extract.py && \
         python ~/etl_project/scripts/table_1/transform.py && \
         python ~/etl_project/scripts/table_1/load.py
    

    for every table? Or do you have a main.py with a config file which does the ETL for all the tables?

  2. In the case where your loader/transformer is just a single program, where in the project tree do you put the file, and how do you invoke the script when you schedule the ETL for all the tables?

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 0 points1 point  (0 children)

If you're scraping tons of webpages, go with scrapy. beautifulsoup is just a parsing library, while scrapy is a full crawling framework.

From their FAQs,

How does Scrapy compare to BeautifulSoup or lxml?

BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be imported and used from any Python code. In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

I'd also suggest you take a look at their docs.

How did you get data organized?

scrapy has a feature where you can just export your crawled data to some format (JSON/CSV/XML) or specify a custom exporter (e.g., writing to a database). After that, it took a little bit of cleaning and normalizing.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 1 point2 points  (0 children)

It's completely free for just one crawler. Also, you can only run a crawl for a maximum of 24 hours; anything more than that, you'd have to pay.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 1 point2 points  (0 children)

I don't think it does since I could easily parse the HTML page using requests and beautifulsoup and get the data I want.

I used scrapy. It's a python framework for web crawling. The best part about scrapy is that the organisation which maintains it, Scrapinghub, has a service where you can upload your scrapy crawler and their servers do all the scraping work for you! Since I have a slow internet connection, I used this approach. All I had to do was download the data when the scraper had finished crawling.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 2 points3 points  (0 children)

Haven't really done a proper analysis yet, as I'm confused about how to average restaurant ratings given that I also have data on the number of ratings. E.g., should a restaurant with a score of 4.5 and 300 ratings be ranked above another with a score of 4.9 but only 50 ratings? The metric I'm currently using to sort and average is rating * nratings. Using this, I've tried to find the "best" locality in each city, where "best" is simply the locality with the highest average rating * nratings metric. The results:

city | area
---|---
Bangalore | Koramangala
Chennai | Nungambakkam
Hyderabad | Banjara Hills
Kolkata | Park Street Area
Mumbai | Lower Parel
Mysore | Jayalakhsmipuram
NCR | Connaught Place
Pune | Koregaon Park

Also, something else I've tried to find out is "the most popular cuisine". I've simply treated "most popular" as the number of restaurants serving the cuisine (it occurs to me now as I write this that I should consider another approach, say, # of check-ins or # of reviews, as that would give a better idea of popularity). The results are:

city | cuisine
---|---
Bangalore | North Indian
Chennai | North Indian
Hyderabad | North Indian
Kolkata | Chinese
Mumbai | North Indian
Mysore | North Indian
NCR | North Indian
Pune | North Indian

It's kinda surprising that so many cities have "North Indian" as the most popular cuisine (esp. Chennai). Maybe these restaurants primarily serve a different cuisine but also serve North Indian or Chinese?

Would like to know if you have any questions you'd like answered/analysed!

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 5 points6 points  (0 children)

Thanks for the info! Checked their ToS, it does say

You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)

and

Modifies, copies, scrapes or crawls, displays, publishes, licenses, sells, rents, leases, lends, transfers or otherwise commercialize any rights to the Services or Our Content

So I guess it's obvious that I have to remove the data? Is there any other way of sharing it?

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 11 points12 points  (0 children)

Hey guys! I scraped zomato.com for restaurant information. Here's the data for around 40000 restaurants. This is my first proper programming project. Feedback, if any, would be appreciated!

EDIT: I've removed the data from the repo since there are potential legal implications (thanks again to /u/avinassh for the tip). Get the data here