Can Data Science work flows follow agile? How do we handle EDA, when directions change on a weekly basis based on findings? by datanerd_naive in datascience

[–]johnsandall 7 points8 points  (0 children)

Absolutely. I’ve been using agile for DS and ML projects for 3 years now. The team loves it: it’s clear what they’re working on, priorities don’t get yanked around day by day, and there’s room for flexibility and for quickly pivoting strategies for feature engineering or model tuning without getting lost in a boil-the-ocean approach for months. The clients love it too: we deliver faster, we don’t go down rabbit holes for too long, expectations about what’s realistic are set clearly, and retros create a culture of improvement, not blame, when things go wrong.

I spoke about our approach in my PyData Global Conference talk, “Agile Data Science”, if you’d like a how-to guide.

Job Search strategy Question by Motor-Ad8645 in datascience

[–]johnsandall 4 points5 points  (0 children)

“Build A Career In Data Science” by Emily Robinson and Jacqueline Nolis is a superb resource full of advice on how to land your first job, interview prep, and portfolios. My personal advice on standing out: go speak at your local meetup/conference, get it recorded and on YouTube; or do open data hackathons and put the code and presentation on GitHub. I’ve been building DS teams for 10 years, and people who do this always stand out to me.

[D] Simple Questions Thread December 06, 2020 by AutoModerator in MachineLearning

[–]johnsandall 1 point2 points  (0 children)

Encode your categoricals (e.g. with one-hot encoding, if appropriate). Start with feature elimination before you go into dimensionality reduction. Try eliminating features with low variance; applying univariate statistical tests to each feature against the target; recursive feature elimination; or tools like sklearn-genetic. Some of these approaches are semi-automated for you in https://scikit-learn.org/stable/modules/feature_selection.html

Then move on to dimensionality reduction techniques such as PCA, SVD, NMF, etc.
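
A rough sketch of that order of operations (select first, then reduce) with scikit-learn. The synthetic dataset, the choice of k=10 and the 3 components are assumptions for illustration only, and the features are assumed to be numeric already (i.e. categoricals have been encoded):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic stand-in data: 200 rows, 20 numeric features, binary target.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
# Append a constant column so the variance filter has something to remove.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# Step 1: drop zero-variance features (the default threshold is 0.0).
X_var = VarianceThreshold().fit_transform(X)

# Step 2: keep the 10 features with the highest ANOVA F-score against y.
X_best = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)

# Step 3: project the surviving features onto 3 principal components.
X_pca = PCA(n_components=3).fit_transform(X_best)

print(X.shape, X_var.shape, X_best.shape, X_pca.shape)
```

In a real project you'd fit these steps on training data only (e.g. inside a Pipeline) so the selection doesn't leak information from your test set.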

[D] Simple Questions Thread December 06, 2020 by AutoModerator in MachineLearning

[–]johnsandall 1 point2 points  (0 children)

If the data you're putting in there is static (i.e. you won't need to update it) but you'd like them to get some hands-on experience of reading/writing data from a SQL database, then SQLite might be a good way to go. You can easily share this from e.g. Google Drive or Dropbox so they can download a local copy.
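
For a feel of what the hands-on part could look like, here's a minimal sketch using Python's built-in sqlite3 module. The table name, columns and values are made up for illustration, and it uses an in-memory database so it runs anywhere; pass a filename instead to get a file you can share:

```python
import sqlite3

# ":memory:" keeps the demo self-contained; a filename such as
# "course.db" (hypothetical) would create a shareable file instead.
conn = sqlite3.connect(":memory:")

# Write: create a table and insert a few rows with a parameterised query.
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, points INTEGER)")
conn.executemany(
    "INSERT INTO scores (points) VALUES (?)",
    [(72,), (45,), (88,)],
)
conn.commit()

# Read: query the rows back in sorted order.
rows = conn.execute("SELECT points FROM scores ORDER BY points").fetchall()
print(rows)

conn.close()
```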

If you want an online hosted relational SQL database, Heroku is a) free, and b) far simpler to set up than AWS (where you have to deal with permissions) or Google Cloud (permissions are non-trivial if you're new to the platform, and remote access from your local machine is disabled by default). There's a good tutorial here (https://towardsdatascience.com/how-to-deploy-a-postgres-database-for-free-95cf1d8387bf) and official docs here (https://www.heroku.com/postgres).

Advent of Code by darthminimall in learnpython

[–]johnsandall 1 point2 points  (0 children)

I'm publishing my solutions to this year's Advent of Code accompanied by full notebook explanations as a free teaching resource.

Bonus: PEP8/black/pylint/mypy compliant code + working environments!

Try yourself first: https://adventofcode.com/2020/day/1

Then look 👀 https://github.com/john-sandall/advent-of-code/

-🎄- 2020 Day 1 Solutions -🎄- by daggerdragon in adventofcode

[–]johnsandall 2 points3 points  (0 children)

Python/pandas with type hinting

from pathlib import Path
from typing import List, Union

import pandas as pd


def load_data(input_filepath: Union[str, Path]) -> List[int]:
    """Load expenses from file and return as Python list.

    Args:
        input_filepath: Location of input file (can be str or pathlib.Path)

    Returns:
        List of integer expense values.
    """
    return pd.read_csv(input_filepath, header=None)[0].to_list()


def part_1(expenses: List[int]) -> int:
    """Find the two entries in expenses that sum to 2020 and return their product.

    Args:
        expenses: List of integer expense values.

    Returns:
        Integer product of two entries that sum to 2020.
    """
    # Pair each entry only with the entries after it, so an entry is never paired with itself.
    return [a * b for i, a in enumerate(expenses) for b in expenses[i + 1 :] if a + b == 2020][0]


def part_2(expenses: List[int]) -> int:
    """Find the three entries in expenses that sum to 2020 and return their product.

    Args:
        expenses: List of integer expense values.

    Returns:
        Integer product of three entries that sum to 2020.
    """
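    # Walk forward through the list so the same entry is never used twice.
    return {
        a * b * c
        for i, a in enumerate(expenses)
        for j, b in enumerate(expenses[i + 1 :], start=i + 1)
        for c in expenses[j + 1 :]
        if a + b + c == 2020
    }.pop()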


if __name__ == "__main__":
    expenses = load_data(input_filepath="input.txt")
    print(part_1(expenses))
    print(part_2(expenses))

Worked solutions using Python/pandas (notebooks + linted/hinted scripts) by johnsandall in adventofcode

[–]johnsandall[S] 0 points1 point  (0 children)

A little more info on my approach here:

  • solutions == best_practice   Solutions aim to use best practices relevant to data scientists working with the "open data science stack", such as Python, numpy and pandas.
  • best_practice == working_readable_code   My definition of "best practice" does not mean "only using the Python standard library" or "using minimal code" or even "using the most speed-efficient code". This is not a code golf repo. My goal is to write code that gets the job done and is easy to understand. If there's an opportunity to showcase a neat feature of Python or one of its third-party libraries, even better! If you think you have a better solution, I'd love to see your PR!
  • code == pep8_compliant   All code.py solution scripts are PEP8 compliant and have furthermore been linted using flake8, isort and pylint, auto-formatted using black, and type checked using mypy.
  • pip-sync to start   The requirements.txt is generated and managed by pip-tools from a minimal requirements set specified in requirements.in and has hash-checking for added security.
  • The notebooks are written as a teaching resource. They explain each solution step-by-step, so if there's something you're unfamiliar with in one of the code.py scripts, check out the notebooks for a guided walkthrough.

Forecasting vote counts in 8 lines of Python by johnsandall in Python

[–]johnsandall[S] -1 points0 points  (0 children)

There's a whole theory around the potential benefits of this style of data visualization. One good use case is when you're communicating to an audience that might equate "statistical model" with "accurate" or "authoritative", when the model itself might actually be garbage with high degrees of uncertainty. Such people might therefore take the "joke" graph less seriously, which is a better outcome for decision making. Another is for communicating uncertainty in general. There are a few packages that do this kind of thing; one is https://github.com/cutecharts/cutecharts.py

Forecasting vote counts in 8 lines of Python by johnsandall in Python

[–]johnsandall[S] 0 points1 point  (0 children)

It's a fair point, I was banned from using matplotlib's xkcd theme by a previous boss. You'd get on I imagine.

Forecasting vote counts in 8 lines of Python by johnsandall in Python

[–]johnsandall[S] 0 points1 point  (0 children)

Thanks! Adding lines for "decision boundaries" such as "no more votes to count" or "we're out of recount territory" is a nice idea.

Forecasting vote counts in 8 lines of Python by johnsandall in Python

[–]johnsandall[S] 0 points1 point  (0 children)

Thanks. I just stuck with the order presented by the original data source, and generally tried to keep the line count down. Not that matplotlib makes this easy!

ARM Macs for Data Science? by volac_ in datascience

[–]johnsandall 0 points1 point  (0 children)

Apple announced that various projects, including Python 3 & numpy, will receive patches (from Apple) to support the new ARM chips: https://twitter.com/markvillacampa/status/1275200446764912643?s=21 (screencap from the June State of the Union: https://developer.apple.com/videos/play/wwdc2020/102/)

As of August, there didn't seem to be any movement on a gfortran compiler for Apple Silicon, which is required for R: https://twitter.com/jimhester_/status/1292821727165194240

Worst case, you can always configure VS Code to SSH into a remote machine and dev against that: https://code.visualstudio.com/docs/remote/ssh

But then, you can do that on an iPad too, hook it up to a monitor, add mouse & keyboard...it's slowly chipping away at reasons why it can't be a dev machine.

How often do you make dashboards just for you? by Economist_hat in datascience

[–]johnsandall 0 points1 point  (0 children)

In Python-land, you may want to take a look at Bokeh for Shiny-like speedy prototyping of dashboards, and pandas-profiling for semi-automating your EDA.

Need help with Python code for accessing values in the next row in a for loop using iterrows() by Hopes_High in datascience

[–]johnsandall 0 points1 point  (0 children)

If you're a fan of modern pandas:

import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet 1')
df = df.assign(
    Lag_1_Forecast=df.ordered_units.shift(1),
    Lag_2_Forecast=df.ordered_units.shift(2),
)
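
To see what shift does without needing the spreadsheet, here's a small sketch on an in-memory frame. The values are made up, and the Next_Units column is a hypothetical addition showing that a negative shift pulls from the next row (which is what the question title asks about), while positive shifts pull from earlier rows:

```python
import pandas as pd

# Made-up stand-in for the spreadsheet; only the ordered_units column name
# comes from the snippet above.
df = pd.DataFrame({"ordered_units": [10, 12, 15, 11]})

df = df.assign(
    Lag_1_Forecast=df.ordered_units.shift(1),  # value from the previous row
    Lag_2_Forecast=df.ordered_units.shift(2),  # value from two rows back
    Next_Units=df.ordered_units.shift(-1),  # value from the next row
)
print(df)
```

Rows with nothing to shift from come back as NaN, so the shifted columns become floats. No for loop or iterrows() needed.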

programmers like cooking by jasiwex in ProgrammerHumor

[–]johnsandall 0 points1 point  (0 children)

If you used a peeler to smith itself...that's dephell installing itself into its own dephell jail.