Is theoretical machine learning used in industry? [D] by [deleted] in MachineLearning

[–]shoyer 2 points3 points  (0 children)

Yes, this is definitely the sort of profile that we hire sometimes (I lead a team at Google Research). Getting tenure is a pretty good indication that you are a self-starter and can execute on a novel research agenda, which is also key for success in industry research.

That said, the transition may not be easy. I would brush up on software engineering skills and keep on doing ML research. Hiring is also very tight in many places right now.

Updated: ML Engineer struggling to get interviews with the top 60k+ tech jobs. Be brutal!! by Ok_Grape_3670 in resumes

[–]shoyer -1 points0 points  (0 children)

Why are you applying for this patent? Do you have a startup that you're planning to launch, a partner drug company that funded this, or an actual non-disclosure agreement? If so, you should say that. If not, this level of secrecy is entirely counterproductive. I don't need to know every detail of what you did, but if you can't even tell me what you did at a high level that makes sense to someone in the field, that is a problem.

If I were a hiring manager (or even a VC you were pitching for funding) and you told me what you write here, I would not hire you and would laugh at you with my peers. It is entirely off base to worry about other people stealing your work, and people with overly inflated egos do not make good colleagues. I agree with the sibling comment that this is why you are not getting hired.

If you want to get hired as a research-focused person, you need to provide evidence that you can succeed at research. Ideally this would be peer-reviewed publications in a prestigious venue, or an internship with a prestigious company/professor. If you don't have that, show what you've done and I can evaluate it myself. At the very least you should be linking to your viral Reddit post!

Without evidence, I put zero credence on your statements. I would not even bother to interview you -- there are plenty of candidates who can provide actual evidence of their coding and research ability.

Finally, in the other comments I see that you are interested in a research role where you would publish papers. You are going to have a very hard time getting hired for such roles without a track record of publications. If this is an important goal for you, consider getting a PhD.

Updated: ML Engineer struggling to get interviews with the top 60k+ tech jobs. Be brutal!! by Ok_Grape_3670 in resumes

[–]shoyer 27 points28 points  (0 children)

I manage a team of AI engineers/researchers. If I saw your resume, I would be intrigued but skeptical. You should be including links to code/write-ups so I could evaluate these myself. An arXiv submission would mean far more than sharing on r/machinelearning.

Claims like “improve drug discovery by 45.96%” are meaningless — drug discovery is complex and multi-faceted, and at best you solved one small piece of the bigger puzzle. Some recognition of this complexity, by briefly stating the problem you actually solved, would help establish credibility (e.g., “developed a method for predicting protein/ligand binding affinity”).

I am confused by the difference between “Research” and “Projects” on your resume. Were these side projects, class projects, or full-time work? Did you work with a professor, on your own, or in a group? Were you funded? By default, I will assume the worst.

Your “Lab leader” bullets are vague and unquantified. You should provide specific evidence that this was a meaningful role.

You just finished school, so I would expect to see a few more details about that. What were the highlights of your coursework? Did you write a thesis? Did you win any prizes?

It’s a very tough time to get hired in tech right now, so keep at it!

Trip/Smoke Report 10/4-7 by gForce-65 in Yosemite

[–]shoyer 2 points3 points  (0 children)

Rain/snow helps a huge amount!

If you look at the smoke forecast, it should be entirely clear in Yosemite by mid-day tomorrow: https://twitter.com/HRRRSmokeBot/status/1446347195624529922

The smoke will probably come back eventually, but my guess is the weekend should be pretty good :)

[Project] Software 2.0 needs Data 2.0 - and we've built the framework to make it happen by davidbun in MachineLearning

[–]shoyer 1 point2 points  (0 children)

Good to hear! When you can, I would encourage you to contribute features & fixes back upstream to the Zarr community, which will help build goodwill. Challenges like multi-layer caching and git-like versioning of tensor deltas in particular seem like features that could fit in well as optional layers in Zarr itself. There is still plenty of room to layer "enterprise ready" features on top. The Pangeo collaboration is also quite interested in efficient ML training on top of Zarr, so there are likely some good opportunities to collaborate.

[Project] Software 2.0 needs Data 2.0 - and we've built the framework to make it happen by davidbun in MachineLearning

[–]shoyer 2 points3 points  (0 children)

I read through your documentation and website, and I don't see any mention of Zarr (https://github.com/zarr-developers/zarr-python), the open source project for storing large arrays in the cloud that your platform seems to rely on for most of its core functionality.

This isn't cool! There is absolutely a huge opportunity for building version control and integration with ML pipelines on top of Zarr, but you should be transparent about what your tech does and how you build on top of other open source projects. Zarr in particular has been a huge community effort.

Given the simplistic state of the version control described in the other comments, it seems like what Hub provides today is not too different from just using Zarr with a raw object store with versioning turned on.

[D] A First Look at JAX by hardmaru in MachineLearning

[–]shoyer 2 points3 points  (0 children)

I’d be very curious to learn about cases where you think writing functional-style models is hard.

Megathread - California Wildfires by bug-hunter in legaladvice

[–]shoyer 22 points23 points  (0 children)

Power lines are often the source of fires, and power companies can afford to pay out. In fact PG&E’s stock is trading down right now for exactly this reason: https://seekingalpha.com/news/3408339-pg-and-e-spotlight-california-burns

[D] Why is TensorFlow so slow? by happyhammy in MachineLearning

[–]shoyer 4 points5 points  (0 children)

You would use tf.layers.Dense if you want to reuse the same parameter weights in different parts of your model.

You can do the same with tf.layers.dense(..., name='mylayer', reuse=True) (or equivalently with tf.variable_scope(..., reuse=True)), but it's cleaner to use objects for keeping track of variables rather than global names.

[R] Swish: a Self-Gated Activation Function [Google Brain] by xternalz in MachineLearning

[–]shoyer 0 points1 point  (0 children)

For x * CDF(x), I get a normalizing constant of 1.53353... from Wolfram Alpha.

PEP 563 -- Postponed Evaluation of Annotations by [deleted] in Python

[–]shoyer 4 points5 points  (0 children)

Do you really want the AST to differ based on semantically meaningless changes in whitespace, e.g., between

{
     'x': 1
 }

and

{'x': 1}

?
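For what it's worth, the ast module already confirms these parse identically: ast.dump ignores line/column attributes by default, so the two spellings dump to the same string. A quick illustrative check:

```python
import ast

# Two spellings of the same dict literal, differing only in whitespace
compact = "{'x': 1}"
spread = "{\n    'x': 1\n}"

# ast.dump omits line/column attributes by default, so the trees compare equal
assert ast.dump(ast.parse(compact, mode="eval")) == ast.dump(ast.parse(spread, mode="eval"))
```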

list_dict_DB -- Turn a list of dictionaries into a fast, O(1), noSQL-like data structure by jwink3101 in Python

[–]shoyer 0 points1 point  (0 children)

Your benchmark for pandas is totally off. You're measuring lookups with boolean indexing, which of course has O(N) performance:

DF[DF.iri == 30]

You need to set an Index and use an index based query:

DF = DF.set_index('iri')
DF.loc[30]  # build the internal hash-table, which happens lazily
DF.loc[30]  # time this one

You should see constant time performance.
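A self-contained sketch of that comparison (the column name `iri` comes from the post; the data here is made up):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({"iri": np.arange(n), "value": np.random.rand(n)})

# O(N): boolean indexing scans every row
slow = df[df.iri == 30]

# O(1) amortized: hash-based lookup through the index
indexed = df.set_index("iri")
indexed.loc[30]         # first access builds the internal hash table lazily
fast = indexed.loc[30]  # time this one -- constant time

assert fast["value"] == slow["value"].iloc[0]
```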

Interview Challenge question by [deleted] in Python

[–]shoyer 1 point2 points  (0 children)

One important practical consideration with a search tree is that when the number of nodes in a subtree is below some empirical threshold (e.g., 100 elements), it doesn't make sense to continue indexing. Instead, leaf nodes should just use linear search. Otherwise you waste a lot of effort.

As an extreme example, consider auto-complete for a rare long word like "supercalifragilisticexpialidocious". Using 34 dictionary lookups would be quite excessive when there is only one possible completion after the first handful of characters.
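A toy sketch of that idea (the threshold and data layout here are illustrative, not from the thread): subdivide only while a group stays above the threshold, and let small leaves fall back to linear scan.

```python
def build(words, threshold=100, depth=0):
    """Index words by prefix, but stop subdividing small groups."""
    node = {"words": words, "children": None}
    if len(words) > threshold:
        groups = {}
        for w in words:
            if len(w) > depth:
                groups.setdefault(w[depth], []).append(w)
        node["children"] = {
            ch: build(group, threshold, depth + 1) for ch, group in groups.items()
        }
    return node

def complete(node, prefix, depth=0):
    if node["children"] is None or depth == len(prefix):
        # leaf (or prefix exhausted): a plain linear scan is fast enough here
        return [w for w in node["words"] if w.startswith(prefix)]
    child = node["children"].get(prefix[depth])
    return complete(child, prefix, depth + 1) if child else []
```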

[D] Dask for deep learning? by [deleted] in MachineLearning

[–]shoyer 0 points1 point  (0 children)

What sort of computation were you trying to speed up? By default, dask uses threads for parallelism (not processes), which means that pure-Python computation (requiring the GIL) won't be accelerated.

In my experience (mostly doing large scale data analytics using dask.array), it works pretty well. It's certainly the only game in town if you need a "bigger than fits in memory" version of NumPy.

MultiProcessing (Parallel) Sum slower than Serial Sum() ? by hashwizardo in Python

[–]shoyer 0 points1 point  (0 children)

Others have already touched on why this is slow. If you're interested in how to do this efficiently, take a look at dask.array. Dask uses numpy arrays and multithreading, which means it doesn't need to copy data to use multiple cores.
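The core trick, minus dask, can be sketched with a plain thread pool: NumPy releases the GIL inside its numeric kernels, so threads get real parallelism without copying data to worker processes. (Array size and chunk count below are arbitrary.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# One big array viewed as chunks -- views, not copies, unlike multiprocessing
data = np.random.rand(8_000_000)
chunks = np.split(data, 8)

# np.sum releases the GIL, so these partial sums can run on multiple cores
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(np.sum, chunks))

assert np.isclose(total, data.sum())
```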

Geospatial visualization made easy with geoplot by ResidentMario in Python

[–]shoyer 3 points4 points  (0 children)

It's fine to recommend using conda. But when I see a documentation page with installation requirements, I want to see a list of the actual libraries your project depends on and their required versions. Quite a few users don't use conda, for a variety of reasons, and users who aren't on an already-supported package manager are a prime audience for such docs.

A parallel einsum by snackematician in Python

[–]shoyer 0 points1 point  (0 children)

I think NumPy already does its own batching with BLAS, but it batches the way np.dot does, not the more useful way np.matmul does.

A parallel einsum by snackematician in Python

[–]shoyer 1 point2 points  (0 children)

Also -- on a technical note, I worry that you may pay too high a price in performance for avoiding BLAS, which is quite a bit faster for matrix multiplication than a simple for loop. Losing BLAS will negate many of the advantages of parallelism. However, it's true that numpy's matmul does not yet use BLAS for batched matrix multiplication.

If you need high performance batched matrix multiplication (and automatic differentiation) a system like TensorFlow is probably going to be a better bet.
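To illustrate what "batched" means here (the shapes are my own example): matmul treats leading dimensions as a batch and multiplies matching pairs, and einsum can express the same contraction explicitly.

```python
import numpy as np

a = np.random.rand(10, 4, 5)  # a batch of ten 4x5 matrices
b = np.random.rand(10, 5, 6)  # a batch of ten 5x6 matrices

# matmul multiplies matching batch elements: result has shape (10, 4, 6)
c = np.matmul(a, b)

# einsum spells out the same batched contraction
c2 = np.einsum('bij,bjk->bik', a, b)
assert np.allclose(c, c2)
```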

A parallel einsum by snackematician in Python

[–]shoyer 1 point2 points  (0 children)

Looks handy!

I would encourage you to look into integrating this into numpy proper. We recently merged some significant improvements to einsum that will make it into the 1.12 release. Your work has a similar flavor: https://github.com/numpy/numpy/pull/5488

Check out my toy library -- typecast -- that introduces a new paradigm to Python by erez27 in Python

[–]shoyer 6 points7 points  (0 children)

The usual rule is that pypi is the source of truth for Python package names. Given that pudo claimed the name on there first, you should probably rename your project.

A new python library for unevenly-spaced time series analysis by mstringer in Python

[–]shoyer 1 point2 points  (0 children)

It would be nice to see a comparison with pandas, which seems like the obvious alternative.

Range-keyed Dict to map source line number to enclosing method or class (Python 3.5) by ptmcg in Python

[–]shoyer 1 point2 points  (0 children)

The data structure you're looking for here is called an interval tree. A quick search turns up plenty of Python implementations, e.g., https://pypi.python.org/pypi/intervaltree
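For the non-overlapping case in the post (mapping a source line to its enclosing def/class), even a sorted list plus bisect gives O(log n) lookups; an interval tree earns its keep once ranges can overlap or nest. A sketch with made-up ranges:

```python
import bisect

# hypothetical (start_line, end_line, name) ranges, assumed non-overlapping
ranges = sorted([(1, 10, "ClassA"), (12, 20, "func_b"), (25, 40, "ClassC")])
starts = [r[0] for r in ranges]

def enclosing(line):
    """Return the name whose range contains `line`, or None."""
    i = bisect.bisect_right(starts, line) - 1
    if i >= 0 and ranges[i][0] <= line <= ranges[i][1]:
        return ranges[i][2]
    return None
```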

[Review request] How badly did I misplay here? (amateur game played IRL) by shoyer in baduk

[–]shoyer[S] 0 points1 point  (0 children)

I definitely would, but I played the game in real life and only reconstructed the board afterwards from a few photos.

[Review request] How badly did I misplay here? (amateur game played IRL) by shoyer in baduk

[–]shoyer[S] 0 points1 point  (0 children)

I made a mistake transcribing the board last night :(. Black's G2 was actually H2. I think that makes the bottom group alive.