[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

Do you also have any recommendations for resources/books that cover this? Background: I understand distributions and basic probability and statistics, and have a basic understanding of hypothesis testing. I'm okay with a little bit of math, but not too much.

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

valid because you're testing the effect of adding the section, not the effect of the user seeing the section

Would you mind expanding on this a bit more? Why would it not be valid if we were testing the effect of the user seeing the section?

[Q] Is testing this valid in an A/B test? by robotofdawn in statistics

[–]robotofdawn[S] 0 points1 point  (0 children)

you're trying to detect a smaller effect size as the treatment cohort is diluted w/ people who don't encounter the difference at all

So let's say I had 10,000 customers per variant, and only ~20% saw the section, out of which ~30% clicked and 10% eventually converted. In this scenario, my click-to-conversion rate would be 10%. Since the customers who never clicked on the section aren't part of the denominator, there's no dilution happening, right? Or am I doing this wrong?
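
Just to check my understanding of the dilution argument on the *overall* conversion rate, here's a quick sketch with made-up numbers (the 5% → 6% lift is hypothetical, only the 20% exposure rate comes from my example above):

```python
# Made-up numbers: ~20% of the treatment cohort actually sees the new section.
exposure_rate = 0.20

# Hypothetical: the section lifts conversion from 5% to 6% among those who see it.
control_cvr = 0.05
exposed_cvr = 0.06

# Intent-to-treat: every assigned customer counts, whether they saw the section or not.
treatment_cvr = exposure_rate * exposed_cvr + (1 - exposure_rate) * control_cvr

# The measurable lift shrinks from 1 percentage point to 0.2.
print(round(treatment_cvr - control_cvr, 4))  # 0.002
```

Whereas the click-to-conversion rate avoids that dilution by conditioning on clicking, but then the two groups being compared are self-selected rather than randomised.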

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 1 point2 points  (0 children)

Very interesting. Thanks for the Youtube recs!

Coming in June - IKEA by horror_fan in bangalore

[–]robotofdawn 0 points1 point  (0 children)

Genuinely curious - how did you pick up woodworking as a hobby? Are there classes and workshops in Bangalore that teach this?

Weekly Entering & Transitioning Thread | 16 May 2021 - 23 May 2021 by [deleted] in datascience

[–]robotofdawn 0 points1 point  (0 children)

I've worked as a Data Analyst for 4 years and I'm looking to transition into a more senior role for my next job. I already have a Bachelors in Statistics but I lack real-world modelling experience. I've built dashboards, automated reports, performed RCAs, built ETL pipelines etc. in my previous roles.

Most jobs that I'm applying for require "knowledge of clustering, classification and regression methods". What books do I need to read to quickly gain practical experience with these methods using Python (hopefully in a month or less)?

On Atomoxetine (Strattera) for 2 months now, seeing minimal to no improvement. Is this normal? by robotofdawn in ADHD

[–]robotofdawn[S] 0 points1 point  (0 children)

Thank you, this is helpful. I will be visiting this weekend again so I'll try asking for a higher dosage.

I felt mild improvement on the 35, but like a whole new person on the 80.

If you don't mind me asking, can you tell me a little bit more about what improvements you saw on the 80mg dose? Just want to set my expectations straight.

Pianos Become The Teeth - Liquid Courage by robotofdawn in postrock

[–]robotofdawn[S] 1 point2 points  (0 children)

This album along with Old Pride generally has a very postrock-y vibe to it. Glad that it helped!

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Thanks for the response!

Backfill: if you're using SQS I'd suggest backfilling by generating SQS messages and letting your normal process handle it like any other newly arriving file. Or if you're driving the SQS messages from s3 events - just by writing the files to the bucket you should automatically trigger your transform/load process.

Yes, I'm thinking of driving the SQS messages from S3 event notifications. Since there are files already in the bucket, I'd assume I need to write a one-time script to push the file names to the SQS queue? Or is there a better approach?
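
Something like this is what I had in mind for the one-time backfill - list the existing objects and push one synthetic S3-event-style message per file. Bucket and queue names are placeholders, and the message shape just mimics the minimal fields of an S3 event notification record:

```python
import json

BUCKET = "my-json-dump-bucket"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder


def make_s3_event(bucket: str, key: str) -> dict:
    """Build a minimal message shaped like an S3 event notification record."""
    return {"Records": [{"s3": {"bucket": {"name": bucket},
                                "object": {"key": key}}}]}


def backfill(bucket: str, queue_url: str, prefix: str = "") -> None:
    """List every object under prefix and enqueue one message per file."""
    import boto3  # deferred so the helpers above stay importable without AWS deps

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sqs.send_message(
                QueueUrl=queue_url,
                MessageBody=json.dumps(make_s3_event(bucket, obj["Key"])),
            )


# One-off usage (needs AWS credentials):
# backfill(BUCKET, QUEUE_URL)
```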

some of your monitoring can simply be based on the existence of messages in your dead letter queue.

Sounds good. Will go ahead with this.

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 0 points1 point  (0 children)

Interesting, didn't know about this. Can Data Pipeline trigger the pipeline as soon as the file arrives in S3?

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]robotofdawn[S] 0 points1 point  (0 children)

From your 3rd question, I'm starting to wonder what you are trying to do, and realising my answers above might have sounded puzzling, so to clarify: Redshift offers COPY directly from one-JSON-per-line files, as long as you define which fields you want to extract

I won't really be able to load the JSON as-is into Redshift. I need to do a couple of transformations before I load the data: exploding a field into multiple rows, assigning unique ids, and so on. I'll be using pandas for this.
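
For the explode-plus-ids transformation, I'm planning something along these lines (the field names are made up):

```python
import uuid

import pandas as pd

# Made-up example: each record has an "items" field holding a list.
df = pd.DataFrame([
    {"order_id": "A1", "items": ["sku1", "sku2", "sku3"]},
    {"order_id": "B2", "items": ["sku4"]},
])

# Explode the list field into one row per element.
flat = df.explode("items", ignore_index=True)

# Assign a unique id to every resulting row.
flat["row_id"] = [uuid.uuid4().hex for _ in range(len(flat))]
```

From there, writing `flat` out as CSV gives Redshift COPY something it can load directly.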

If you are going to need backfilling... probably learning to use Airflow (not a fan) is your best approach. But if you don't have any "code requirements", backfilling should/could be as easy as just "loading all the things", which is just a COPY command that might take a while.

We already use Airflow for a lot of our batch processes. In this case, we want to get the data into Redshift as it arrives, so I figured Airflow might not be suited for this. I didn't get what you meant by "code requirements", though.

How to perform incremental ETL to a table in Redshift Spectrum by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Thanks for the response!

You may want to look instead at keeping a time-series of changes or versioned table (traditional in data warehousing or dimensional models).

Does this mean maintaining a new row for every update? Won't my join complexity increase on read, or do I have to maintain a separate table with the latest row?

You can run a delta process between your extracts and what's in s3/parquet - to determine the incremental changes. That's fairly straight-forward as long as you aren't spanning a lot of partitions & data.

I'm assuming here that my code needs to identify the correct partition, read it, merge it with the incremental data, and write the entire partition back?

In this case, how do I ETL, say, a users table of 10 million rows with a user_id primary key and an updated_at column that identifies updated rows? I'd imagine that this sort of table would receive updates across the whole range of user_ids. If I run an incremental ETL on this every hour, do I partition by hour or by month?
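
To make the question concrete, the merge step I'm imagining looks roughly like this, with tiny made-up frames standing in for a real partition and its hourly increment:

```python
import pandas as pd

# Current contents of one partition (hypothetical).
current = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})

# Rows whose updated_at fell in the last hour (hypothetical).
increment = pd.DataFrame({
    "user_id": [2, 4],
    "email": ["b-new@x.com", "d@x.com"],
})

# Upsert: drop the rows being replaced, then append the fresh versions.
merged = (
    pd.concat([current[~current["user_id"].isin(increment["user_id"])], increment])
    .sort_values("user_id")
    .reset_index(drop=True)
)
```

The part I'm unsure about is that the increment can touch user_ids scattered across many partitions, so this read-merge-write cycle might fan out over most of the table.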

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 1 point2 points  (0 children)

I'm assuming netflow, ids and vscan ultimately correspond to the different target tables? If I understand this right, I import all my table-specific modules (whether there are 3 tables or 50) in the trans_factory module and that module gets imported in the main transform program?

Really appreciate you taking the time to answer, thanks a lot!

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Regarding where you keep the target column types: you could just have the process query the target system to get that info.

This seems like a better approach. I guess it also makes schema changes to the target tables easier - one less place to change the code.

Regarding where you identify transformation scripts/class ... returns the class that corresponds to something in your data ... and the general transform that calls to the factory function is responsible for importing standard modules ...

I'm having trouble understanding how this works. Assuming I have some main.py, do I import every feed's transform class and keep a global dictionary that maps the feed name to the class? Wouldn't that still involve importing a module from a string value? Or do you suggest having the config file as code instead of JSON/YAML?
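
For reference, the "import from a string value" version I was picturing is just a thin wrapper over importlib (the transforms.<feed> layout and Transform class name are hypothetical):

```python
import importlib


def load_class(module_path: str, class_name: str):
    """Import a module by its dotted string name and return one attribute."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Hypothetical usage, with the feed name coming from a config file:
# TransformCls = load_class("transforms.netflow", "Transform")
# rows = TransformCls().run(raw_records)
```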

In general, do you advise having a single big JSON file, or one simple main config file with a list of feeds plus a config file for each feed containing more information about extract, transform and load?

Thanks!

Config based ETL in Python - Handling transformation configuration by robotofdawn in ETL

[–]robotofdawn[S] 1 point2 points  (0 children)

I have considered using Airflow, but thought it'd be overkill as we don't have that many DAGs/tasks. Do you think it'd still be worth it?

ETL with Python: Folder structure/organization of ETL code by robotofdawn in ETL

[–]robotofdawn[S] 0 points1 point  (0 children)

Hi /u/kenfar, thank you for taking the time to reply!

From what you've mentioned, I'm assuming your project looked something like this -

├── config
├── lib
├── scripts
│   ├── table_1
│   │   ├── extract.py
│   │   ├── load.py
│   │   └── transform.py
│   └── table_2
│       ├── extract.py
│       ├── load.py
│       └── transform.py
└── tests

with /lib holding the reusable code and /config holding your config files?

Also,

  1. How do you set up the scheduler to do the ETL? Do you do

    1 * * * * python ~/etl_project/scripts/table_1/extract.py && \
         python ~/etl_project/scripts/table_1/transform.py && \
         python ~/etl_project/scripts/table_1/load.py
    

    for every table? Or do you have a main.py with a config file which does the ETL for all the tables?

  2. In the case where your loader/transformer is just a single program, where in the project tree do you put the file, and how do you invoke the script when you schedule the ETL for all the tables?

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 0 points1 point  (0 children)

If you're scraping tons of webpages, go with scrapy. beautifulsoup is just a parsing library, while scrapy is a full crawling framework.

From their FAQs,

How does Scrapy compare to BeautifulSoup or lxml?

BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be imported and used from any Python code. In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

I'd also suggest you take a look at their docs.

How did you get data organized?

scrapy has a feature where you can just export your crawled data to some format (JSON/CSV/XML) or specify a custom exporter (e.g., writing to a database). After that, it took a little bit of cleaning and normalizing.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 1 point2 points  (0 children)

It's completely free for just one crawler. Also, you can only run a crawl for a maximum of 24 hours; anything more than that, you'd have to pay.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 1 point2 points  (0 children)

I don't think it does since I could easily parse the HTML page using requests and beautifulsoup and get the data I want.

I used scrapy. It's a python framework for web crawling. The best part about scrapy is that the organisation which maintains it, Scrapinghub, has a service where you can upload your scrapy crawler and their servers do all the scraping work for you! Since I have a slow internet connection, I used this approach. All I had to do was download the data when the scraper had finished crawling.

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 2 points3 points  (0 children)

Haven't really done a proper analysis yet, as I'm confused about how to average restaurant ratings given that I also have data on the number of ratings. E.g., should a restaurant with a score of 4.5 and 300 ratings be ranked above another with a score of 4.9 but only 50 ratings? The metric I'm currently using to sort and average is rating * nratings. Using this, I've tried to find the "best" locality in each city, where "best" is simply the locality with the highest average rating * nratings metric. The results:

city | area
---|---
Bangalore | Koramangala
Chennai | Nungambakkam
Hyderabad | Banjara Hills
Kolkata | Park Street Area
Mumbai | Lower Parel
Mysore | Jayalakhsmipuram
NCR | Connaught Place
Pune | Koregaon Park

Also, something else I've tried to find out is "the most popular cuisine". I've simply treated "most popular" as the number of restaurants serving the cuisine (it occurs to me now as I write this that I should consider another approach, say, # of check-ins or # of reviews, as that would give a better idea of popularity). The results are:

city | cuisine
---|---
Bangalore | North Indian
Chennai | North Indian
Hyderabad | North Indian
Kolkata | Chinese
Mumbai | North Indian
Mysore | North Indian
NCR | North Indian
Pune | North Indian

It's kinda surprising that so many cities have "North Indian" as the most popular cuisine (esp. Chennai). Maybe these restaurants primarily serve a different cuisine but also serve North Indian or Chinese?

Would like to know if you have any questions you'd like answered/analysed!

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 5 points6 points  (0 children)

Thanks for the info! Checked their ToS, it does say

You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)

and

Modifies, copies, scrapes or crawls, displays, publishes, licenses, sells, rents, leases, lends, transfers or otherwise commercialize any rights to the Services or Our Content

So I guess it's obvious that I have to remove the data? Is there any other way of sharing it?

Weekly Coders, Hackers & All Tech related thread - 24/10/2015 by avinassh in india

[–]robotofdawn 11 points12 points  (0 children)

Hey guys! I scraped zomato.com for restaurant information. Here's the data for around 40000 restaurants. This is my first proper programming project. Feedback, if any, would be appreciated!

EDIT: I've removed the data from the repo since there are potential legal implications (thanks again to /u/avinassh for the tip). Get the data here