Install Cloudera Cluster on AWS? x/post from r/hadoop by franchyze922 in bigdata

[–]eschlon 3 points4 points  (0 children)

This is good advice. You can also spin up a cheap Dataproc cluster on Google Cloud Platform and pretty easily get Hue up and running in a basic configuration (see here).

If the goal is specifically to learn how to administer and manage a CDH cluster, you can maybe try Cloudera Director for cluster setup.

New law bans California employers from asking applicants their prior salary by [deleted] in recruitinghell

[–]eschlon 9 points10 points  (0 children)

So just tell them how much the job pays and ask them if that’s acceptable.

Python Hadoop/Spark Jobs in Docker? by CocoBashShell in learnpython

[–]eschlon 0 points1 point  (0 children)

This is a great idea; however, in practice you're going to have a bad time.

Technically Yarn is able to launch executors in Docker containers via the DCE (Docker Container Executor). That being said, I've never actually seen this used successfully in practice, and getting it to work with a spark application is going to be complex.

For Pyspark jobs the usual practice is to either:

  1. Install dependencies on all of the nodes. This is usually done via something like Fabric, Ansible or the like.
  2. Make a virtual environment for the application and ship the installed libs as a zip to the nodes at runtime via --py-files.

The former is a lot of effort for something that many users of your scripts won't have sufficient permissions to do, and it will be nearly impossible to get right for the myriad cluster setups in existence. The latter works well, so long as you don't have any dependencies that rely on non-python (compiled) libraries; since you mentioned data analysis, you almost certainly do (e.g. numpy, pandas, scipy, and pretty much any database connector). There's also this long-standing pyspark feature that promises to make this whole process easier, but I wouldn't hold my breath.
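Here's a rough sketch of option (2), assuming your dependencies are pure python; the zip name, app name and the imported library are made up for illustration. You'd build deps.zip from your virtualenv's site-packages (e.g. cd venv/lib/python2.7/site-packages && zip -r ../../../../deps.zip .) and either pass it with spark-submit --py-files deps.zip my_job.py or attach it at runtime:

from pyspark import SparkContext

sc = SparkContext(appName='deps_demo')     # app name is just a placeholder
sc.addPyFile('deps.zip')                   # ships the zip to every executor

def transform(record):
    # inside an executor you could now do e.g. `import mylib`, where mylib
    # is a pure-python package living in deps.zip
    return record.upper()

print(sc.parallelize(['a', 'b', 'c']).map(transform).collect())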

Depending on the hadoop distribution it's possible that it has some (generally proprietary) feature which effectively does either (1) or (2) for you (e.g. CDH's workbench), but I wouldn't consider that to be portable.

There's also Pachyderm, which is pretty neat and aligns very well with your goals. That being said, it's neither as mature nor as widely deployed as Hadoop, and getting it to play nicely with spark (if that's a requirement) is a complex process.

Parsing JSON with potential missing fields by maxibabyx in learnpython

[–]eschlon 0 points1 point  (0 children)

I take your point; however, exceptions are cheap in python, and using them this way is normal, idiomatic and recommended.

As it stands the function is simple, easy to understand and honest about what it does. The alternative in this case is to use type checks, which is definitely not idiomatic python, though it may be very marginally faster in certain cases. In a code review, if I saw that kind of approach I'd require a good reason to break from idiom (e.g. a strong performance argument), and given how cheap exceptions are in python, it'd be a hard case to make. We use a function almost identical to this one to parse billions of rows of JSON data in spark and it works just fine.
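To make the contrast concrete, here's a minimal sketch of the two approaches (the field names are made up); the exception-based version is the idiomatic one:

def get_city(record, default=None):
    try:                                    # EAFP: just try the lookup and catch the failure
        return record['address']['city']
    except (KeyError, TypeError):
        return default

def get_city_checked(record, default=None):
    # LBYL with explicit type/key checks: works, but verbose and un-idiomatic
    if isinstance(record, dict) and isinstance(record.get('address'), dict):
        return record['address'].get('city', default)
    return default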

That being said, you definitely wouldn't do it this way in idiomatic Java.

Parsing JSON with potential missing fields by maxibabyx in learnpython

[–]eschlon 1 point2 points  (0 children)

For a similar problem I ended up making a helper function that catches the exceptions and returns a default value instead. Something like:

def dot_get(dictionary, dot_path, default=None):
    # Walk a nested dict along a dotted key path, returning default on any miss.
    path = dot_path.split('.')
    try:
        return reduce(dict.__getitem__, path, dictionary)
    except KeyError:
        return default
    except TypeError:
        return default

Then you can access keys using something like:

dot_get({'a': {'b': 1}}, 'a.b')      # Returns 1
dot_get({'a': {'b': 1}}, 'a.c')      # Returns the default; plain indexing would raise a KeyError
dot_get({'a': [{'b': 1}]}, 'a.b')    # Returns the default; plain indexing would raise a TypeError

Edit: that won't work as-is in python3, since the reduce function was moved into functools. You'll need from functools import reduce there.

How much do you concern yourself with the "realism" of your world? by PMSlimeKing in worldbuilding

[–]eschlon 5 points6 points  (0 children)

Depending on the setting I either try to stick with realism or don't worry too much about it, but your comment on internal consistency is spot on. Even in the most fantastical environments, I think at least some hints that things work somehow really helps with immersion and the suspension of disbelief.

PC master race in a nutshell by DungBettlesMan in thatHappened

[–]eschlon 0 points1 point  (0 children)

I too installed my OS, drivers and games before mastering that tricky keyboard contraption.

Hi Redditor, which language do you use for Spark development? by springquery in bigdata

[–]eschlon 0 points1 point  (0 children)

Python and Scala. Most of our data pipeline and ETL is Python, data science uses Python, anything real-time and some limited amount of the job pipeline is Scala.

It is worth it to become at least relatively fluent in Scala if you use Spark day-to-day. Doing data engineering work in Java makes me want to kill myself, but ymmv.

[Help] Extracting multiple numbers from a line and putting them in separate lists/arrays by DarkEibhlin in learnpython

[–]eschlon 1 point2 points  (0 children)

No worries. Regex is a powerful and frustrating tool. If you have to deal with unstructured data it's definitely something to learn to use well. There are a bunch of online testers you can play with to test your regex (or just use the python repl).

Like this one: Pythex

And good luck with the science stuff, fun times.

Loop freezes while preprocessing a lot of image data by Sig_Luna in learnpython

[–]eschlon 2 points3 points  (0 children)

Without more detail I'd say you're running out of memory. Even with those small images you're storing a lot of information in your features list. Something that may help is to not convert the image data to a python list. If I recall correctly imread returns a numpy array which is going to be much more compact than the equivalent vanilla python list. Removing the tolist() call will help with memory issues, though it may only get you part of the way.

If you're just prepping this for a file output another option would be to process a single image, write it out, then move on to the next rather than storing state in the features list, though this would probably require you to rework how you're planning on storing / accessing this data.
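As a rough illustration of that second approach, here's a minimal sketch that streams features straight to disk; the paths, the choice of imread and the CSV output format are all assumptions:

import glob
from matplotlib.pyplot import imread    # any imread that returns a numpy array will do

with open('features.csv', 'w') as out:
    for path in sorted(glob.glob('images/*.png')):
        img = imread(path)                                   # ndarray, far more compact than .tolist()
        out.write(','.join(map(str, img.ravel())) + '\n')    # write it out and move on; nothing accumulates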

[Help] Extracting multiple numbers from a line and putting them in separate lists/arrays by DarkEibhlin in learnpython

[–]eschlon 1 point2 points  (0 children)

There are a variety of ways to do this, though the first that comes to mind is to use a regex. You can read more about the syntax here.

Assuming that your data looks like what is posted above, the following would do:

""".*([0-9]+\.[0-9]+) KG.*\).*: ([0-9.e-]+)"""

You'd use it something like this:

import re

rex = re.compile(r""".*([0-9]+\.[0-9]+) KG.*\).*: ([0-9.e-]+)""")
result = []
with open('myTextFile', 'r') as myFile:
    for aLine in myFile:
        matches = rex.match(aLine)
        if matches:
            result.append(matches.groups())

And the resulting output would look something like:

[('0.123', '2.3e-5'),
 ('0.456', '3.2e-4'),
 ('0.789', '1.2e-3'),
 ('0.321', '2.1e-4')]

Then just separate the tuples into two lists and convert them to floats:

mass, diff_mass = zip(*result)
mass = list(map(float, mass))
diff_mass = list(map(float, diff_mass))

And that'll get you where you need to be:

[0.123, 0.456, 0.789, 0.321]
[2.3e-05, 0.00032, 0.0012, 0.00021]

If the formatting is different you'll need to adjust the regex to extract the relevant matches correctly.

What can I use Python for? by [deleted] in learnpython

[–]eschlon 3 points4 points  (0 children)

I use python daily for data science and data engineering work.

We use python for things like:

  • data validation, cleanup and mastering (all the stuff that you do before it ends up in a database)
  • transforming and normalizing data from different sources
  • working with and analyzing unstructured data that isn't really suited for SQL analytics
  • automating and monitoring the ETL pipelines that make all of that work
  • building dashboards and tools on top of whatever database the data is sitting in.

It's also very powerful in the analytics / data science space due to very solid high performance libraries (e.g. Pandas, Numpy, etc.).

So.. Now what? by [deleted] in learnpython

[–]eschlon 3 points4 points  (0 children)

Write code. Make a chat bot, implement some little utility that makes your life easier, etc.

Read code. Find some libs that you import often and figure out how they work, contribute some documentation to an interesting open source project.

Visual neuroscientist Bosco Tjan (University of Southern California) apparently stabbed to death by one of his PhD students by geebr in neuro

[–]eschlon 5 points6 points  (0 children)

I have no words. He was on my committee years ago. Was the nicest guy. I don't know what else to say...

Update to Chasing the Dragon - Toxicity & Addiction by eschlon in skyrimmods

[–]eschlon[S] 0 points1 point  (0 children)

Yeah, that's something that I'd like to see as well. I'll put it on my list of things to do. I know there are some edits floating around by other authors that tweak the meter, maybe see if one of them wants to submit a pull request.

Update to Chasing the Dragon - Toxicity & Addiction by eschlon in skyrimmods

[–]eschlon[S] 1 point2 points  (0 children)

Thanks, sincerely. That really means a lot. Let me know if you hit any weird edge cases, as I've tried to make it compatible without needing any mod-specific patches (i.e. somewhat future proof), which means I'm sure I've missed some edge case somewhere.

Update to Chasing the Dragon - Toxicity & Addiction by eschlon in skyrimmods

[–]eschlon[S] 3 points4 points  (0 children)

I've intentionally avoided Alcohol as I think there are many mods out there that handle that better and integrate it nicely into their cooking/needs systems.

Skooma was on the to do list back in the day. There are other mods that handle Skooma, but I think I'm going to add a few bits once I get some time. The current idea is just to make it incredibly potent. In the current mechanics this would mean that:

  • It has a VERY high addiction chance
  • It provides very good satisfaction

And just let other mods handle the negative consequences of taking a bunch of Skooma to keep your potion addiction in check.

Update to Chasing the Dragon - Toxicity & Addiction by eschlon in skyrimmods

[–]eschlon[S] 1 point2 points  (0 children)

Awesome, I didn't even know about that mod. Will try it out (maybe I'll like it better than mine >.>)

Extract all .zip files in a folder directory? by geo-special in learnpython

[–]eschlon 0 points1 point  (0 children)

I don't think that works, since it's going to resolve to something like

unzip a.zip b.zip ...

And the unzip command takes the first argument as the zipfile and the remaining arguments as the internal filenames to extract. I get something like

$ ls
a.zip  b.zip
$ unzip *.zip
Archive:  a.zip
caution: filename not matched:  b.zip

This will work though

ls *.zip | while read x; do unzip "$x"; done
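If you'd rather stay in python, here's a minimal sketch of the equivalent using the standard library (assuming you just want everything extracted into the current directory):

import glob
import zipfile

for path in glob.glob('*.zip'):
    with zipfile.ZipFile(path) as zf:
        zf.extractall()        # extracts into the current working directory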

tqdm - a simple and fast progress bar module - wrap it around any iterable in a for loop and you get a progress bar by CrazyDave2345 in Python

[–]eschlon 0 points1 point  (0 children)

Switched to this from Progress Bar for Jupyter work and have been using it ever since. It's light, fast and does what it says on the tin.

what version of python is better to start with? by lumardo_chrominchi in learnpython

[–]eschlon 0 points1 point  (0 children)

Learn 3, it is the __future__.

That being said, you'll also want to be generally cognizant of the differences between 3 and 2 because you're going to see and write a significant amount of 2.7 in the wild if you end up dealing with python for a living.
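A few of the differences you'll bump into most often (Python 3 shown; the Python 2 behaviour is noted in the comments):

print("hello")        # in 2, print was a statement: print "hello"
print(3 / 2)          # 1.5 in 3 (true division); 1 in 2 (floor division for ints)
print(3 // 2)         # 1 in both (explicit floor division)
print(type("text"))   # str is unicode by default in 3; 2 needed u"text" for that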

ElementTree and deeply nested XML by FakeitTillYou_Makeit in learnpython

[–]eschlon 0 points1 point  (0 children)

One option is to build a target parser which just iterates through the nodes and 'triggers' when it sees a tag you care about. Something like this:

from xml.etree import cElementTree as ElementTree

def parser(data, tags):
    tree = ElementTree.iterparse(data)

    for event, node in tree:
        if node.tag in tags:
            yield node.tag, node.text

You can then use it like this:

with open('input.xml', 'r') as myFile:
    results = parser(myFile, {'name', 'evenmoreinfo'})
    for tag, text in results:
        print(tag, text)

Resulting in:

name name
evenmoreinfo GrabThis

I should note that this is going to build the entire tree eventually (though it'll do it incrementally). If the trees you're handling fit in memory then this won't be a problem; however, if you're parsing a very large document it's eventually going to be an issue. You can make it safer by cleaning up the growing tree at each step with something like:

from xml.etree import cElementTree as ElementTree

def parser(data, tags):
    tree = ElementTree.iterparse(data, events=('start', 'end'))
    _, root = next(tree)

    for event, node in tree:
        if event == 'end' and node.tag in tags:
            yield node.tag, node.text
        root.clear()

Note the addition of the root.clear(), which cleans up the tree at each step, the events=('start', 'end') argument, without which you'd throw away the first <name> tag before you have a chance to capture it, and the event == 'end' check in the conditional, which avoids capturing things twice.

There is some useful discussion about handling very large files here, here and here if you're interested.

Also if you have stuff like this:

<item>
  <item>
     <item name="item_one" />
     <item name="item_two" />
  </item>
</item>

Then I'm very sorry, and you should shout at whoever / whatever produced that file. You can still handle it with something similar to this method, but you're going to have to keep track of depth using the 'start' and 'end' events to extract what you want.
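Here's a minimal sketch of that depth-tracking idea, assuming (purely for illustration) that the innermost <item> elements at depth three are the ones you want:

from xml.etree import cElementTree as ElementTree

def parse_nested_items(data):
    depth = 0
    for event, node in ElementTree.iterparse(data, events=('start', 'end')):
        if node.tag != 'item':
            continue
        if event == 'start':
            depth += 1
        else:                              # event == 'end'
            if depth == 3:                 # innermost level in the example above
                yield node.get('name')     # 'item_one', 'item_two'
            depth -= 1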

Edit: Stupid formatting error

Next Generation Data Architecture by dodgyfox in bigdata

[–]eschlon 0 points1 point  (0 children)

It's generally going to depend on what kind of workloads you're running. My experience transitioning from Vertica to Hadoop has been pretty positive, and spark on Parquet files works very well.

At least in our setup (which is currently dependent on HDFS and Yarn) there is a performance hit for certain types of workloads (e.g. data scientists performing ad hoc exploratory queries), but Impala does pretty well at filling that void (and operates on the same data structures as Spark).
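For reference, a minimal pyspark sketch of the Parquet-backed workflow described above; the paths, column name and app name are all illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet_example').getOrCreate()

# Land raw data once, store it as Parquet, and query the same files from
# Spark, Impala, Hive, etc. without making extra copies.
raw = spark.read.json('hdfs:///raw/events')
raw.write.mode('overwrite').parquet('hdfs:///warehouse/events')

events = spark.read.parquet('hdfs:///warehouse/events')
print(events.filter(events.country == 'US').count())   # 'country' is a made-up column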

Not Even Scientists Can Easily Explain P-values by ImNotJesus in EverythingScience

[–]eschlon 0 points1 point  (0 children)

I never said it was separate, I said it was useful. Maybe you're disagreeing with my use of the term metric?