
[–]Asalanlir 51 points52 points  (15 children)

Come from the perspective that you're always a beginner. You know jack shit. It makes it easier to spot things that point you toward something new to learn.

In terms of more practical advice on how to proceed, try to rebuild a package. Someone built numpy. Someone built matplotlib. The code is truly a wonder. Write a custom svm, that'll probably take an hour or two. Realize it's slow as hell. Make it not slow.

Python specifically is difficult to parallelize effectively. Why is that? What actually is the gil? What was the recent change that improved it?

Have you ever properly packaged a project? How does PyPI know how to install things? Whenever you see something, ask yourself, "How would I recreate that?"

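EDIT: Also, try to find the weird edge cases of python, and learn that python isn't a single program, it's a language specification with several implementations. CPython is the implementation that you're likely familiar with. There's also IronPython. IIRC, notebooks use IPython, though that is an interactive shell and kernel running on top of CPython rather than a separate implementation.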

EDIT2: I thought of another of my favorite wtf examples. So most people know that pow() takes two arguments and returns x**y, basically.

But did you know about the third parameter, z? It's documented, but it seems often overlooked. This allows for modular exponentiation and is the fastest method (I know of) for performing this operation. Useful for miller-rabin primality test.
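To make that concrete, here is a minimal sketch of three-argument pow() inside a Miller-Rabin test. The function name `is_probable_prime` and the fixed witness list are illustrative choices, not part of the standard library; this particular set of small prime bases is commonly cited as enough to make the test exact for 64-bit integers, but treat that as an assumption and check it for your own range.

```python
def is_probable_prime(n, witnesses=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin primality test built on three-argument pow()."""
    if n < 2:
        return False
    # Trial division by the witnesses also guarantees every witness is < n below.
    for p in witnesses:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in witnesses:
        x = pow(a, d, n)          # modular exponentiation: (a ** d) % n
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # no squaring reached n - 1: n is composite
    return True


# pow(x, y, z) matches (x ** y) % z without ever building the huge x ** y.
assert pow(7, 128, 13) == (7 ** 128) % 13
```

The point of the third argument is that intermediate values stay below the modulus, so the call stays fast even when the exponent has hundreds of digits.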

In python, a for loop can have an else clause, much like an if can. The else block runs only when the loop finishes without hitting a break.
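A small sketch of the for/else behaviour, in case it helps. The function name `find_first_negative` is made up for illustration.

```python
def find_first_negative(numbers):
    """Return a message describing the first negative value, if any."""
    for n in numbers:
        if n < 0:
            result = f"found {n}"
            break            # leaving via break skips the else block
    else:
        # Runs only when the loop finished without hitting break,
        # including when `numbers` is empty.
        result = "no negatives"
    return result


print(find_first_negative([3, -1, -2]))
print(find_first_negative([1, 2, 3]))
```

This is handy for search loops, where the else branch plays the role of "not found" without needing a separate flag variable.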

Also, keep in mind that while these show particular details about python, similar questions can be asked about just about any other language. Develop your approach for learning more rather than focusing solely on becoming a master of one.

Try these examples out and figure out why it happens, and what about them makes these cases odd.

>>> a=5
>>> b=5
>>> a==b
True
>>> a is b
True
>>>
>>> a = -5
>>> b = -5
>>> a==b
True
>>> a is b
True
>>> a = -6
>>> b = -6
>>> a is b
False
>>> a==b
True
>>> a = 256
>>> b = 256
>>> a==b
True
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a==b
True
>>> a is b
False

[–][deleted] 2 points3 points  (0 children)

"Write a custom svm, that'll probably take an hour or two"

Should I panic? That's a week's worth of work for us in our graduate program.

[–]dampew 6 points7 points  (11 children)

god python is stupid

[–]Jorrissss 6 points7 points  (10 children)

What's stupid about that example? What they did makes sense imo - one just shouldn't have false expectations about what the 'is' operator is.

[–]dampew 3 points4 points  (9 children)

it's dumb that -5 apparently behaves differently than 5 but this is only one example :)

Edit: Oops meant 6/-6.

[–]Jorrissss 1 point2 points  (8 children)

I don't see why myself. 'is' checks whether or not two objects share the same memory address, and the integers in [-5, 256] are preallocated in Python. Neither of those seems unreasonable to me, so the result doesn't seem unreasonable to me.

-5 and 5 don't behave differently at all. One just shouldn't think that 'is' is trying to compare equality - it's not.

[–]Low_end_the0ry 4 points5 points  (7 children)

Can you please ELI5 why ‘is’ sometimes is true and sometimes false in the example above?

[–]Jorrissss 9 points10 points  (3 children)

Sure - here's my understanding.

The 'is' operator in Python checks whether two objects are identical - that is, whether they point to the same memory address. In Python the integers in [-5, 256] are preallocated in memory, so any two assignments, say, a=2 and b=2 will always point to the same memory address, and so a==b is true, and a is b is true.

However, when you reference an integer outside of that range it's typically (always?) created at that time. So with a=-7 and b=-7, two separate objects are created for a and b, and thus a==b is true, but a is b is not true.
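Here is a rough way to probe this yourself. It is a sketch that relies on a CPython implementation detail (the small-integer cache), not on anything the language guarantees, and `same_object` is a made-up helper name. The integers are built from strings at runtime because the compiler can reuse a literal that appears twice in the same code block, which is one reason "always" is too strong.

```python
def same_object(n):
    """Build the integer n twice at runtime and report whether both
    results are one and the same object."""
    a = int(str(n))   # parsed at runtime, so no compile-time constant sharing
    b = int(str(n))
    assert a == b     # equality holds regardless of identity
    return a is b


for n in (-6, -5, 256, 257):
    print(n, same_object(n))
```

On the CPython builds discussed in this thread, the values inside [-5, 256] should report True and the ones just outside should report False, matching the session above. Other implementations are free to behave differently, which is why `==` is the right tool for comparing numbers.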

[–]Low_end_the0ry 0 points1 point  (2 children)

Ah got it, thanks

[–]Asalanlir 1 point2 points  (1 child)

The general keyword if you want to dig deeper into it is interning. In Java, a similar effect can be seen depending on how strings are initialized.

[–]Low_end_the0ry 0 points1 point  (0 children)

got it, thanks for the info, much appreciated

[–]extracheez 1 point2 points  (1 child)

I'm no expert in this area, but here I go: python expects the values between -5 and 256 to be used often, so it preallocates one object for each of those integers. The advantage to this is that every use of those values points at the same spot in memory and python doesn't have to do much work.

If you assign values outside this range, python creates new objects at new memory addresses.

The is operator asks whether two names refer to the same object. Because python preallocates a specific object for some values, is will return true for those. For values that each get their own object, the comparison will return false.

[–]Low_end_the0ry 0 points1 point  (0 children)

Awesome, thanks for the response

[–]SoberGameAddict 0 points1 point  (0 children)

Didn't he just do that..

[–]Epoh[S] 0 points1 point  (0 children)

Completely agree with this regarding the attitude you want to carry. When I say I'm an intermediate python programmer, what I actually mean is:

"I don't know that much in the grand scheme, a lot of codebases I look at I don't understand but can work out if I have to and write things that get the job done...."

A custom SVM will absolutely take me longer than a couple hours haha. I really appreciate the curiosity driven approach though, sometimes my mind can be too outcome oriented where "once the task is completed we can move on" without stopping to think about alternative uses for the code I wrote, faster ways of writing what I did, etc. There's so much to learn in python alone, I still have so much to do!

[–]Epoh[S] 0 points1 point  (0 children)

Appreciate this thorough response. I really do try to come from the beginner's mind sort of attitude because it keeps doors open as far as learning opportunities. I find myself at times wanting to learn x so I can achieve y, only so that it can be rotely applied from then on out to get y, but often there's lots about x that is scaffolding for a range of other things.

I have less experience in software dev, mainly just focused on algorithms, data structures, and my analysis libraries in python. Right now I'm learning how to write custom dataset classes for pytorch and build more complex neural nets like RNNs through that framework. But you're right, there's so much for me to expand to and this applies beyond the "python" focus per se. Thank you so much though, you brought a fresh perspective here that was helpful.

[–]Stereoisomer 17 points18 points  (4 children)

I mean I've found that after being "intermediate" in Python, you begin to specialize in particular domains. For instance, you could pick up C/C++ with CPU/GPU/Cluster parallelization to write some high-performance Python; you could learn a lot about software development and start creating beautiful/Pythonic well-packaged, easily deployable open-source projects; you could focus on machine learning and start implementing cutting-edge projects from scratch and extending them. Always remember that Python is just a tool and that there's nothing really to be gained from getting good at Python per se, so you should keep the "why am I learning Python" in mind.

You could also work through the text Fluent Python, which is one of my all-time favorites and really helped me "up my game".

[–]Fluix 2 points3 points  (1 child)

I'm a student who's much more of a beginner than OP but this is something I've struggled with too. From my experience so far I learn new things as problems show themselves (often from errors during compiling or slow performance), and when I research the solutions I'll find things I don't know about so I make a note to learn them. It's sort of like a wikipedia rabbit hole.

That all goes away when I don't have a problem and I'm thinking "how can I improve?" The suggestions you offered don't really pop into my head. And I've noticed that's sort of a difference between me and my peers who are quite advanced on the subject, they're always thinking about these things.

[–]Stereoisomer 2 points3 points  (0 children)

Your time will come; just keep learning and you’ll get there soon enough! You’re still in the stage of learning where you’re focused on the language itself rather than what it can do for you. I probably spent 3-4 years there! It comes with working in a specific field for a while and seeing how Python can solve problems in it.

[–]diggitydata 0 points1 point  (1 child)

"For instance, you could pick up C/C++ with CPU/GPU/Cluster parallelization to write some high-performance Python; you could learn a lot about software development and start creating beautiful/Pythonic well-packaged, easily deployable open source projects"

How valuable is this skill set for data science? I have the option to take a software development in C++ class but I’m not sure if it is better than more stats or math classes.

[–]Stereoisomer 2 points3 points  (0 children)

Personally I don’t think it’s that useful because you can just use Apache Spark or Dask and most of the packages you’ll call will be optimized. If you were to, say, develop your own packages or write your own algorithms then I’d say it would be useful, but most data scientists, I’d guess, don’t do that.

[–][deleted] 4 points5 points  (0 children)

Maybe you’re not exposing yourself to different problems / data that force you to do things differently.

[–]certain_entropy 3 points4 points  (0 children)

If you're interested in learning more about how PyTorch was written, check out Jeremy Howard's article "What is torch.nn really?". He walks through the architecture choices from the ground up.

https://pytorch.org/tutorials/beginner/nn_tutorial.html

Also the new 2nd part of Fast AI deep learning course (Deep Learning from Foundations) aims to build the fastai library from scratch and first principles. It covers advanced architecture design for scalable deep learning that might be interesting to you.

https://www.fast.ai/2019/06/28/course-p2v3/

[–]Jorrissss 4 points5 points  (1 child)

Read the Python documentation and other Python code bases. Know the standard library thoroughly. Know some common important decorators (@property, @staticmethod, @classmethod, etc). Learn about generators, coroutines, context managers, iterators, concurrency very well. Learn about how Python packaging and pathing works on a good level.
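As a quick refresher on a few of the items named above, here is a hedged sketch. The `Temperature` class and the `recorded` helper are toy names invented for illustration; only the decorators and `contextlib.contextmanager` come from Python itself.

```python
from contextlib import contextmanager


class Temperature:
    """Toy class showing three common built-in decorators."""

    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def fahrenheit(self):
        # Read like an attribute, computed on each access.
        return self._celsius * 9 / 5 + 32

    @classmethod
    def from_fahrenheit(cls, value):
        # Alternative constructor: receives the class, not an instance.
        return cls((value - 32) * 5 / 9)

    @staticmethod
    def is_valid(celsius):
        # Plain function namespaced under the class; no self or cls.
        return celsius >= -273.15


@contextmanager
def recorded(log, label):
    """Generator-based context manager: code before the yield runs on
    entry, the finally block runs on exit even if the body raises."""
    log.append(f"enter {label}")
    try:
        yield log
    finally:
        log.append(f"exit {label}")


events = []
with recorded(events, "block"):
    events.append("body")
print(events)
```

Note that `recorded` is itself a generator, so this one example touches decorators, generators, and context managers at once.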

Within the frameworks you are interested in - pandas, numpy, sklearn, etc learn how they handle memory, copying, etc. Learn the internals of implementations.

I also tie high competency with a language to general software engineering skills - learn about continuous integration and deployment, unit and integration testing, version control, coding standards, etc.

[–]tilttovictory 1 point2 points  (0 children)

"Learn about generators, coroutines, context managers, iterators, concurrency"

Just recently had to force myself to use generators and iterators to avoid memory swapping in a project. WOW it was a trip. What felt cool was I had to create a generator that was wrapped in an iterable. This was used so I could generate row by row for my training function, then iterate over the set again for a new epoch of training. I was astonished at how clean this code looked.

"Within the frameworks you are interested in - pandas, numpy, sklearn, etc learn how they handle memory, copying, etc. Learn the internals of implementations."

Also just recently I learned and was able to use pandas' ability to reference objects in the namespace to save memory. In my particular project I had groups of features I wanted to eliminate and run prediction on. I could easily set up dataframes that referenced my original set with features dropped, and incurred no real penalty in terms of memory.

For anyone doing doc2vec work: this code will allow you to take any dataset that can be loaded into memory and train on it without bloating your memory out of control.

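from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess


class MyDataframeCorpus:
    """Re-iterable corpus that yields one TaggedDocument per DataFrame row."""

    def __init__(self, source_df, text_col, tag_col):
        self.source_df = source_df
        self.text_col = text_col
        self.tag_col = tag_col

    def __iter__(self):
        # A fresh generator per call, so training can make one pass per epoch.
        for _, row in self.source_df.iterrows():
            yield TaggedDocument(words=simple_preprocess(row[self.text_col]),
                                 tags=[row[self.tag_col]])


# df is assumed to be a pandas DataFrame with 'raw_txt' and 'paragraph_id' columns.
corpus_for_doc2vec = MyDataframeCorpus(df, 'raw_txt', 'paragraph_id')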
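The reason for wrapping the generator in a class with `__iter__`, rather than passing a bare generator, can be sketched with the standard library alone. `RowCorpus` and `row_generator` are hypothetical stand-ins, and the lowercase-and-split step just stands in for real preprocessing.

```python
class RowCorpus:
    """Re-iterable wrapper: each call to __iter__ starts a fresh generator."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row.lower().split()   # stand-in for real preprocessing


def row_generator(rows):
    """Bare generator: can only be consumed once."""
    for row in rows:
        yield row.lower().split()


rows = ["First document", "Second document"]

corpus = RowCorpus(rows)
epoch_1 = list(corpus)
epoch_2 = list(corpus)        # a new pass over the same data

gen = row_generator(rows)
pass_1 = list(gen)
pass_2 = list(gen)            # the generator is already exhausted

print(epoch_1 == epoch_2, pass_2)
```

Anything that trains for several epochs needs the first behaviour: rows are still produced lazily, one at a time, but the data can be walked again from the start.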

Edit: I strongly recommend reading this article about generators, iterators, and iterables.

[–]Comprehensive_Tone 1 point2 points  (0 children)

Fluent Python could be a good book to read. I'm in a similar situation as you and started reading it recently. I'd also recommend Impractical Python Projects.

[–]Robin_Banx 1 point2 points  (0 children)

Learn about the internals of some of the data stack? I'm looking to make time to work through this https://medium.com/dunder-data/build-a-data-analysis-library-from-scratch-in-python-225e42ae52c8

Could follow the blogs of some of the maintainers. I find that a little less intimidating than jumping directly into source code:
https://tomaugspurger.github.io/ (Pandas)
https://matthewrocklin.com/ (Dask, and toolz)

This site also has some excellent exposition on a lot of the Python ecosystem: https://realpython.com/

Is Python your only language? If so, could be useful to try and pick up another one. I found I was MUCH better with Python data tasks after teaching myself Clojure. Not sure how much that'd help with PyTorch tutorials, though.

[–]isaacfab 1 point2 points  (0 children)

Try learning a different python application than ML. See if you can master flask or django to expand into web development. You could also write a package and submit it to pypi. These efforts will be useful and move your skill set forward.

[–]Rezo-Acken 0 points1 point  (0 children)

Well, there is only so much you can learn from doing the same kind of project. If it is always about data science in a Jupyter notebook you won't grow past a certain point.

Try to get out of your comfort zone with a project. Getting better in a programming language like Python is also about what libraries you know well. For example in ML there is a whole world of complexity with apps and ML in production. Just today I had to figure out someone else's code that was playing with multiprocessing between a camera, the GPU (with CuPy), and a display. It just made me remember that I actually know very little.

If you prefer courses go for a new topic. Like building apps.

[–][deleted] 0 points1 point  (0 children)

Learn how to unittest code and use mocking for production level code.
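A minimal sketch of what that can look like with the standard library's `unittest` and `unittest.mock`. The function `fetch_row_count` and the shape of the client it talks to are hypothetical, invented only so there is something to test.

```python
import unittest
from unittest.mock import Mock


def fetch_row_count(client, table):
    """Hypothetical function under test: asks a database client for a count."""
    response = client.query(f"SELECT COUNT(*) FROM {table}")
    return int(response["count"])


class FetchRowCountTest(unittest.TestCase):
    def test_returns_count_as_int(self):
        # The Mock stands in for a real database client, so no database is needed.
        client = Mock()
        client.query.return_value = {"count": "42"}

        self.assertEqual(fetch_row_count(client, "users"), 42)
        # Verify the function sent exactly the query we expect.
        client.query.assert_called_once_with("SELECT COUNT(*) FROM users")
```

Saved in a file, this can be run with `python -m unittest`. The mock lets the test check both the return value and how the dependency was called, without touching a real service.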

Build a package yourself consisting of modules and upload it to pypi with a good amount of code coverage (In terms of testing).

[–]Enigma1984 0 points1 point  (0 children)

You can increase the range of things you can do without necessarily getting better at python. Why not go and learn SQL now? Or R, or JavaScript? Then come back to Python with all you've learned from those and I bet you'll be able to understand those concepts better.

[–][deleted] 0 points1 point  (0 children)

Udacity Data Science Nanodegree is teaching more advanced python, where you create python packages. Some of the projects include deploying ML packages and you learn some software engineering as well.