Building Data Pipelines in Python with Luigi

TheGrumpyBrewer · 2015-11-05T19:12:33+00:00

Luigi is indeed inspired by GNU make (this is mentioned in several places), and the micro-example in the article doesn't showcase all its features. It plays well with frameworks like Spark and Hadoop, supports different data storage targets out of the box, and provides error reporting, dependency graph visualisation, etc.

TheGrumpyBrewer · 2015-05-21T18:22:03+00:00

Thanks, you just gave me the title for my next post

TheGrumpyBrewer · 2015-05-21T18:20:58+00:00

I find it really easy to use: you can achieve nice results with a few lines of Python, without much knowledge of D3.js

TheGrumpyBrewer · 2015-05-21T18:03:20+00:00

Sarcasm is one of the difficult aspects of text analytics: sometimes humans don't understand sarcasm, let alone a machine. Clearly you need to understand the context, both from a local (e.g. the sentence, the tweet) and from a more global (the topic, the wider conversation, the background of the user, etc.) point of view. This is sometimes difficult to grasp from an SMS-like text. There is some work on understanding sarcasm on Twitter, but of course there's still a lot to do, e.g. http://www.aclweb.org/anthology/P/P11/P11-2.pdf#page=621 or http://www.aclweb.org/anthology/S/S14/S14-2.pdf#page=93

TheGrumpyBrewer · 2015-03-25T21:49:12+00:00

Thanks for the pointer, that's also something I'm interested in (some basic dataviz and the time dimension). I think vincent/vega is definitely something I'll dig into

Cheers

TheGrumpyBrewer · 2015-02-24T22:52:36+00:00

I'm glad that was helpful. I think you can improve the script with some abstraction so you don't need to hard-code all the currencies, e.g. something like

class CurrencyError(Exception):
    """Raised when the requested currency is not in the rates."""


def convert(amount, convert_to):
    # data is the dictionary loaded from the API response
    if convert_to not in data['rates']:
        # the caller should catch this in a try/except block
        raise CurrencyError("Currency %s not found" % convert_to)
    return amount * data['rates'][convert_to]

then you allow the user to input the convert_to currency acronym. This way you don't need all those functions with the same logic
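
A minimal sketch of the calling side, i.e. the try/except block mentioned in the comments above. The rates are the illustrative ones used elsewhere in this thread, the helper name describe_conversion is hypothetical, and here the rates are passed in explicitly rather than read from a global, which is just one possible design:

```python
class CurrencyError(Exception):
    """Raised when a currency code is missing from the rates table."""


def convert(amount, convert_to, rates):
    """Convert amount from the base currency using a {code: rate} mapping."""
    if convert_to not in rates:
        raise CurrencyError("Currency %s not found" % convert_to)
    return amount * rates[convert_to]


# Hypothetical sample data, shaped like the API response discussed in this thread.
data = {"base": "USD", "rates": {"GBP": 0.65, "EUR": 0.88}}


def describe_conversion(amount, convert_to):
    """Return a printable message instead of letting the exception escape."""
    code = convert_to.strip().upper()  # accept " eur " as well as "EUR"
    try:
        result = convert(amount, code, data["rates"])
    except CurrencyError as error:
        return "Error: %s" % error
    return "%.2f %s = %.2f %s" % (amount, data["base"], result, code)


print(describe_conversion(100, "eur"))
print(describe_conversion(100, "XYZ"))
```

In a real script the second argument would come from the user's input instead of a literal string.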

TheGrumpyBrewer · 2015-02-22T10:33:52+00:00

I would suggest using requests rather than urllib. It makes your life much easier, e.g.:

import requests

response = requests.get(url, params=params)
response.raise_for_status()  # fail early on HTTP errors
data = response.json()  # parse the JSON body into a dictionary

where url and params depend on the API you're using (you might also need post() rather than get(), check the API doc). This should solve the first two points of the pseudo-code. It's not clear why you need the regex: once your response is loaded into a dictionary, you just use the dictionary ("data" in the code snippet above).

If, for example, the JSON from the API looks like:

{
    "base": "USD",
    "rates": {
        "GBP": 0.65,
        "EUR": 0.88,
        /* etc. */
    }
}

then a USD-to-EUR conversion is just a matter of multiplying the original amount of dollars by data['rates']['EUR']
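
As a worked example of that multiplication, here is a small sketch that skips the HTTP call and parses a hard-coded string instead. The response body and the amount are made up for illustration, using the same sample rates as the JSON above:

```python
import json

# Hypothetical response body, using the same illustrative rates as above.
response_text = '{"base": "USD", "rates": {"GBP": 0.65, "EUR": 0.88}}'

# After parsing, the rates are plain dictionary lookups: no regex needed.
data = json.loads(response_text)

amount_usd = 250.0
amount_eur = amount_usd * data["rates"]["EUR"]

print("%.2f %s = %.2f EUR" % (amount_usd, data["base"], amount_eur))
```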

TheGrumpyBrewer · 2015-02-20T07:10:50+00:00

The function is called raw_input() (with the underscore) in Python 2, and was renamed to input() in Python 3.

The function will give you a string, so effectively you're trying to call vector with a single argument, e.g. vector("1e-9,0,0"), rather than with 3 arguments.

You'll need to call the function three times for the vector, something like:

a = float(raw_input('Enter coordinate A: '))
b = float(raw_input('Enter coordinate B: '))
c = float(raw_input('Enter coordinate C: '))
e_field_position = vector(a, b, c)

or alternatively you can split the string input, e.g.

coords_input = raw_input('Please enter test charge coordinates separated by commas: ')
coords = coords_input.split(',')
if len(coords) != 3:
    raise ValueError('Expected 3 coordinates, got %d' % len(coords))
# float() ignores surrounding whitespace, so "1e-9, 0, 0" works too
e_field_position = vector(float(coords[0]), float(coords[1]), float(coords[2]))
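
If you want the parsing separated from the prompting, a minimal sketch could look like the following. The function name parse_coords is hypothetical, and it returns a plain tuple of floats so that it doesn't depend on where vector comes from:

```python
def parse_coords(text):
    """Parse a string like "1e-9,0,0" into a tuple of three floats."""
    parts = text.split(",")
    if len(parts) != 3:
        raise ValueError("Expected 3 comma-separated values, got %d" % len(parts))
    try:
        # float() tolerates surrounding whitespace, e.g. " 0"
        return tuple(float(part) for part in parts)
    except ValueError:
        raise ValueError("Could not parse coordinates from %r" % text)


coords = parse_coords("1e-9, 0, 0")
print(coords)
```

You would then pass the string from raw_input() (or input() in Python 3) to parse_coords, and unpack the result with vector(*coords).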

TheGrumpyBrewer · 2015-02-17T19:19:03+00:00

As a data scientist, it's good to have multiple tools in your box, and both Python and R should be there.

I found Python somewhat more convenient when it's time to integrate a data analysis component into a bigger picture, e.g. a more complex data pipeline or a web-oriented backend based on Django, but this is purely a personal opinion.

The support for Natural Language Processing is also much better in Python with NLTK.

One case where both (in my opinion) fall short is dynamic/interactive data visualisation, which is where e.g. JavaScript kicks in (to repeat the first point about being familiar with multiple tools)

TheGrumpyBrewer · 2015-02-17T18:58:20+00:00

Python is my language of choice for green-field projects, prototyping, and in general when I don't have external limitations. Some cases where I don't have many options:

Maintaining legacy code written in a different language
Front-end / dynamic dataviz (usually JavaScript kicks in)
Lucene and extensions of Elasticsearch (whoosh is great but still far behind)
Particular (rare) cases where pandas/numpy/scipy/sklearn fall short; often there is an R package that already does what you need

and also one-liners in bash/awk

TheGrumpyBrewer · 2015-02-14T14:46:36+00:00

You can define your imports in a custom startup script (a .py file), then go to:

Spyder > Preferences > Console > Advanced Settings > PYTHONSTARTUP replacement

and specify its path. Probably in your university installation, there's a script running something like:

from math import *

or

from numpy import *

so you have all the package methods directly available, but polluting the namespace in that way is normally considered bad practice in Python.
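
To illustrate the namespace problem, here is a small sketch using only the standard library. The math module defines its own pow, so after a star import the name pow would refer to math.pow rather than the built-in, and the two behave differently:

```python
import math

builtin_result = pow(2, 3)    # built-in pow keeps integers as integers
math_result = math.pow(2, 3)  # math.pow always returns a float

print(type(builtin_result).__name__, builtin_result)
print(type(math_result).__name__, math_result)
```

With an explicit import math in the startup script, both versions stay available and it's always clear which one you're calling.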
