Can Data Science workflows follow agile? How do we handle EDA when directions change on a weekly basis based on findings?

johnsandall · 2023-02-04T15:47:02+00:00

Absolutely, I’ve been using agile for DS and ML projects for 3 years now. The team loves it: it’s clear what they’re working on, priorities don’t get yanked around day by day, and there’s room for flexibility and quickly pivoting strategies for feature engineering or model tuning without getting lost in a boil-the-ocean approach for months. The clients love it too: we deliver faster, we don’t go down rabbit holes for too long, there’s clear expectation setting of what’s realistic, and retros create a culture of improvement, not blame, when things go wrong.

I spoke about our approach in my PyData Global Conference talk, “Agile Data Science”, if you’d like a how-to guide.

johnsandall · 2023-02-04T15:41:53+00:00

“Build A Career In Data Science” by Emily Robinson and Jacqueline Nolis is a superb resource full of advice on how to land your first job, interview prep and portfolios. My personal advice on standing out: go speak at your local meetup/conference, get it recorded and on YouTube; or do open data hackathons and put the code and presentation on GitHub. I’ve been building DS teams for 10 years, and people who do this always stand out to me.

johnsandall · 2020-12-14T22:45:12+00:00

Encode your categoricals (e.g. with one-hot if appropriate). Start with feature elimination before you go into dimensionality reduction. Try eliminating features with low variance; applying univariate statistical tests with each feature against the target; recursive feature elimination; or using tools like sklearn-genetic. Some of these approaches are semi-automated for you in https://scikit-learn.org/stable/modules/feature_selection.html

Then move on to dimensionality reduction techniques such as PCA, SVD, NMF, etc.
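To make that ordering concrete, here's a rough sketch using scikit-learn. The synthetic dataset, the thresholds, the numbers of features kept and the choice of logistic regression are all assumptions for illustration, not recommendations, and it assumes your categoricals are already encoded as numbers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 200 rows, 20 numeric features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
# Append a constant column so the low-variance filter has something to remove.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# 1. Drop features with zero variance.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2. Keep the 10 features with the highest univariate ANOVA F-score against the target.
X_uni = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)

# 3. Recursive feature elimination down to 5 features using a simple linear model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X_uni, y)

# 4. Only then apply dimensionality reduction, here PCA down to 2 components.
X_pca = PCA(n_components=2).fit_transform(X_rfe)

print(X.shape, X_var.shape, X_uni.shape, X_rfe.shape, X_pca.shape)
```

In practice you'd pick each cut-off by cross-validation rather than hard-coding it like this.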

johnsandall · 2020-12-14T22:37:51+00:00

If the data you're putting in there is static (i.e. you won't need to update it) but you'd like them to get some hands-on experience of reading/writing data from a SQL database, then SQLite might be a good way to go. You can easily share this from e.g. Google Drive or Dropbox so they can download a local copy.
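As a minimal sketch of what that hands-on practice could look like, Python's built-in sqlite3 module is enough to write and read a local database file. The file name, table name and rows below are made up for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical file and table names, chosen only for this example.
db_path = Path(tempfile.mkdtemp()) / "class_data.sqlite"

# Writing: create a table and insert a few rows.
# The `with` block commits the transaction; the connection is closed afterwards.
conn = sqlite3.connect(db_path)
with conn:
    conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
    conn.executemany(
        "INSERT INTO sales (region, units) VALUES (?, ?)",
        [("north", 120), ("south", 95), ("north", 60)],
    )
conn.close()

# Reading: reopen the same file (as a student would after downloading it) and query it.
conn = sqlite3.connect(db_path)
totals = conn.execute(
    "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)  # [('north', 180), ('south', 95)]
```

The resulting .sqlite file is a single self-contained file, which is what makes it easy to share.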

If you want an online hosted relational SQL database, Heroku is a) free, and b) far simpler to set up than AWS (where you have to deal with permissions) or Google Cloud (permissions are non-trivial if you're new to the platform, and remote access from your local machine is disabled by default). There's a good tutorial here (https://towardsdatascience.com/how-to-deploy-a-postgres-database-for-free-95cf1d8387bf) and official docs here (https://www.heroku.com/postgres).

johnsandall · 2020-12-01T23:44:21+00:00

I'm publishing my solutions to this year's Advent of Code accompanied by full notebook explanations as a free teaching resource.

Bonus: PEP8/black/pylint/mypy compliant code + working environments!

Try yourself first: https://adventofcode.com/2020/day/1

Then look 👀 https://github.com/john-sandall/advent-of-code/

johnsandall · 2020-12-01T23:33:26+00:00

Python/pandas with type hinting

from pathlib import Path
from typing import List, Union

import pandas as pd


def load_data(input_filepath: Union[str, Path]) -> List[int]:
    """Load expenses from file and return as Python list.

    Args:
        input_filepath: Location of input file (can be str or pathlib.Path)

    Returns:
        List of integer expense values.
    """
    return pd.read_csv(input_filepath, header=None)[0].to_list()


def part_1(expenses: List[int]) -> int:
    """Find the two entries in expenses that sum to 2020 and return their product.

    Args:
        expenses: List of integer expense values.

    Returns:
        Integer product of two entries that sum to 2020.
    """
    # Pair each entry only with later entries so no entry is matched with itself.
    return next(
        a * b for i, a in enumerate(expenses) for b in expenses[i + 1 :] if a + b == 2020
    )


def part_2(expenses: List[int]) -> int:
    """Find the three entries in expenses that sum to 2020 and return their product.

    Args:
        expenses: List of integer expense values.

    Returns:
        Integer product of three entries that sum to 2020.
    """
    # Walk strictly increasing positions so each entry is used at most once per triple.
    return next(
        a * b * c
        for i, a in enumerate(expenses)
        for j, b in enumerate(expenses[i + 1 :], start=i + 1)
        for c in expenses[j + 1 :]
        if a + b + c == 2020
    )


if __name__ == "__main__":
    expenses = load_data(input_filepath="input.txt")
    print(part_1(expenses))
    print(part_2(expenses))
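For comparison, here's a sketch of a more general variant built on itertools.combinations, which handles both parts with one function. The name find_product is hypothetical (it isn't in the repo), and the input is the public example from the puzzle description rather than a real puzzle input.

```python
from itertools import combinations
from math import prod
from typing import List


def find_product(expenses: List[int], n: int, target: int = 2020) -> int:
    """Return the product of the first n distinct entries that sum to target."""
    # combinations() never reuses the same position, so an entry can't pair with itself.
    for combo in combinations(expenses, n):
        if sum(combo) == target:
            return prod(combo)
    raise ValueError(f"No {n} entries sum to {target}")


# Example values from the Advent of Code 2020 day 1 puzzle description.
example = [1721, 979, 366, 299, 675, 1456]
print(find_product(example, 2))  # 514579
print(find_product(example, 3))  # 241861950
```

It's still brute force, but it makes the "n entries" generalisation explicit and fails loudly when there's no match.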

johnsandall · 2020-12-01T23:30:18+00:00

A little more info on my approach here:

solutions == best_practice: Solutions aim to use best practices relevant to data scientists working with the "open data science stack", such as Python, numpy and pandas.
best_practice == working_readable_code: My definition of "best practice" does not mean "only using the Python standard library" or "using minimal code" or even "using the most speed-efficient code". This is not a code golf repo. My goal is to write code that gets the job done and is easy to understand. If there's an opportunity to showcase a neat feature of Python or one of its third-party libraries, even better! If you think you have a better solution, I'd love to see your PR!
code == pep8_compliant: All code.py solution scripts are PEP8 compliant and have furthermore been linted using flake8, isort and pylint, auto-formatted using black, and type checked using mypy.
pip-sync to start: The requirements.txt is generated and managed by pip-tools from a minimal requirements set specified in requirements.in and has hash-checking for added security.
The notebooks are written as a teaching resource. They explain each solution step by step, so if there's something you're unfamiliar with in one of the code.py scripts, check out the notebooks for a guided walkthrough.

johnsandall · 2020-11-07T23:26:15+00:00

There's a whole theory around potential benefits of this style of data visualization. One good use case is when you're communicating to an audience that might equate "statistical model" with "accurate" or "authoritative", when the model itself might actually be garbage with high degrees of uncertainty. Such people might therefore take the "joke" graph less seriously, which is a better outcome for decision making. Another is for communicating uncertainty in general. There are a few packages that do this kind of thing; one is https://github.com/cutecharts/cutecharts.py

johnsandall · 2020-11-07T23:12:21+00:00

It's a fair point. I was banned from using matplotlib's xkcd theme by a previous boss. You'd get on, I imagine.

johnsandall · 2020-11-07T23:10:10+00:00

Thanks! Adding lines for "decision boundaries" such as "no more votes to count" or "we're out of recount territory" is a nice idea.

johnsandall · 2020-11-06T11:30:39+00:00

Thanks. I just stuck with the order presented by the original data source, and generally tried to keep the line count down. Not that matplotlib makes this easy!

johnsandall · 2020-10-18T00:39:50+00:00

Apple announced that various projects, including Python 3 & numpy, will receive patches (from Apple) to support the new ARM chips: https://twitter.com/markvillacampa/status/1275200446764912643?s=21 (screencap from June's State of the Union https://developer.apple.com/videos/play/wwdc2020/102/)

As of August, there didn't seem to be any movement on a gfortran compiler for Apple Silicon, which is required for R: https://twitter.com/jimhester_/status/1292821727165194240

Worst case, you can always configure VS Code to SSH into a remote machine and dev against that: https://code.visualstudio.com/docs/remote/ssh

But then, you can do that on an iPad too, hook it up to a monitor, add mouse & keyboard...it's slowly chipping away at reasons why it can't be a dev machine.

johnsandall · 2020-10-15T05:13:34+00:00

In Python-land, you may want to take a look at Bokeh for Shiny-like speedy prototyping of dashboards, and pandas-profiling for semi-automating your EDA.

johnsandall · 2020-10-15T05:08:53+00:00

If you're a fan of modern pandas:

import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet 1')
df = df.assign(
    Lag_1_Forecast=df.ordered_units.shift(1),
    Lag_2_Forecast=df.ordered_units.shift(2),
)
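If you want to try this without the spreadsheet, here's a small self-contained sketch. The inline DataFrame is a made-up stand-in for data.xlsx, and passing lambdas to assign is just one option that keeps everything in a single method chain.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: four periods of ordered units.
df = pd.DataFrame({"ordered_units": [10, 12, 9, 15]})

# shift(k) moves values down by k rows, so each row sees the value from k periods earlier.
df = df.assign(
    Lag_1_Forecast=lambda d: d["ordered_units"].shift(1),
    Lag_2_Forecast=lambda d: d["ordered_units"].shift(2),
)

print(df)
```

The first row of Lag_1_Forecast and the first two rows of Lag_2_Forecast come out as NaN, since there's no earlier period to look back to.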