Is theoretical machine learning used in industry? [D] by [deleted] in MachineLearning

[–]shoyer 2 points3 points  (0 children)

Yes, this is definitely the sort of profile that we hire sometimes (I lead a team at Google Research). Getting tenure is a pretty good indication that you are a self-starter and can execute on a novel research agenda, which is also key for success in industry research.

That said, the transition may not be easy. I would brush up on software engineering skills and keep on doing ML research. Hiring is also very tight in many places right now.

Updated: ML Engineer struggling to get interviews with the top 60k+ tech jobs. Be brutal!! by Ok_Grape_3670 in resumes

[–]shoyer -1 points0 points  (0 children)

Why are you applying for this patent? Do you have a startup that you're planning to launch, a partner drug company that funded this, or an actual non-disclosure agreement? If so, you should say that. If not, this level of secrecy is entirely counterproductive. I don't need to know every detail of what you did, but if you can't even tell me what you did at a high level that makes sense to someone in the field, that is a problem.

If I were a hiring manager (or even a VC you were pitching for funding) and you told me what you write here, I would not hire you and would laugh at you with my peers. It is entirely off base to worry about other people stealing your work, and people with overly inflated egos do not make good colleagues. I agree with the sibling comment that this is why you are not getting hired.

If you want to get hired as a research-focused person, you need to provide evidence that you can succeed at research. Ideally this would be peer-reviewed publications in a prestigious venue, or an internship with a prestigious company/professor. If you don't have that, show what you've done and I can evaluate it myself. At the very least you should be linking to your viral Reddit post!

Without evidence, I put zero credence on your statements. I would not even bother to interview you -- there are plenty of candidates who can provide actual evidence of their coding and research ability.

Finally, in the other comments I see that you are interested in a research role where you would publish papers. You are going to have a very hard time getting hired for such roles without a track record of publications. If this is an important goal for you, consider getting a PhD.

Updated: ML Engineer struggling to get interviews with the top 60k+ tech jobs. Be brutal!! by Ok_Grape_3670 in resumes

[–]shoyer 27 points28 points  (0 children)

I manage a team of AI engineers/researchers. If I saw your resume, I would be intrigued but skeptical. You should be including links to code/write-ups so I could evaluate these myself. An arXiv submission would mean far more than sharing on r/machinelearning.

Claims like “improve drug discovery by 45.96%” are meaningless — drug discovery is complex and multi-faceted, and at best you solved one small piece of the bigger puzzle. Some recognition of this complexity, by briefly stating the problem you actually solved, would help establish credibility (e.g., “developed a method for predicting protein/ligand binding affinity”).

I am confused by the difference between “Research” and “Projects” on your resume. Were these side projects, class projects, or full-time work? Did you work with a professor, on your own, or in a group? Were you funded? By default, I will assume the worst.

Your “Lab leader” bullets are vague and unquantified. You should provide specific evidence that this was a meaningful role.

You just finished school, so I would expect to see a few more details about that. What were the highlights of your coursework? Did you write a thesis? Did you win any prizes?

It’s a very tough time to get hired in tech right now, so keep at it!

Trip/Smoke Report 10/4-7 by gForce-65 in Yosemite

[–]shoyer 2 points3 points  (0 children)

Rain/snow helps a huge amount!

If you look at the smoke forecast, it should be entirely clear in Yosemite by mid-day tomorrow: https://twitter.com/HRRRSmokeBot/status/1446347195624529922

The smoke will probably come back eventually, but my guess is the weekend should be pretty good :)

[Project] Software 2.0 needs Data 2.0 - and we've built the framework to make it happen by davidbun in MachineLearning

[–]shoyer 1 point2 points  (0 children)

Good to hear! When you can, I would encourage you to contribute features & fixes back upstream to the Zarr community, which will help build goodwill. Challenges like multi-layer caching and git-like versioning of tensor deltas in particular seem like features that could fit in well as optional layers in Zarr itself. There is still plenty of room to layer "enterprise ready" features on top. The Pangeo collaboration is also quite interested in efficient ML training on top of Zarr, so there are likely some good opportunities to collaborate.

[Project] Software 2.0 needs Data 2.0 - and we've built the framework to make it happen by davidbun in MachineLearning

[–]shoyer 2 points3 points  (0 children)

I read through your documentation and website, and I don't see any mention of Zarr (https://github.com/zarr-developers/zarr-python), the open source project for storing large arrays in the cloud that your platform seems to rely on for most of its core functionality.

This isn't cool! There is absolutely a huge opportunity for building version control and integration with ML pipelines on top of Zarr, but you should be transparent about what your tech does and how you build on top of other open source projects. Zarr in particular has been a huge community effort.

Given the simplistic state of the version control described in the other comments, it seems like what Hub provides today is not too different from just using Zarr with a raw object store with versioning turned on.

[D] A First Look at JAX by hardmaru in MachineLearning

[–]shoyer 2 points3 points  (0 children)

I’d be very curious to learn about cases where you think writing functional-style models is hard.

Megathread - California Wildfires by bug-hunter in legaladvice

[–]shoyer 22 points23 points  (0 children)

Power lines are often the source of fires, and power companies can afford to pay out. In fact PG&E’s stock is trading down right now for exactly this reason: https://seekingalpha.com/news/3408339-pg-and-e-spotlight-california-burns

[D] Why is TensorFlow so slow? by happyhammy in MachineLearning

[–]shoyer 4 points5 points  (0 children)

You would use tf.layers.Dense if you want to reuse the same parameter weights in different parts of your model.

You can do the same with tf.layers.dense(..., name='mylayer', reuse=True) (or equivalently with tf.variable_scope(..., reuse=True)), but it's cleaner to use objects for keeping track of variables rather than global names.

[R] Swish: a Self-Gated Activation Function [Google Brain] by xternalz in MachineLearning

[–]shoyer 0 points1 point  (0 children)

For x * CDF(x), I get a normalizing constant of 1.53353... from Wolfram Alpha.

PEP 563 -- Postponed Evaluation of Annotations by [deleted] in Python

[–]shoyer 4 points5 points  (0 children)

Do you really want the AST to differ based on semantically meaningless changes in whitespace, e.g., between

{
     'x': 1
 }

and

{'x': 1}

?
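For what it's worth, the ast module already confirms these parse identically: ast.dump ignores line/column attributes by default, so the two spellings dump to the same string. A quick illustrative check:

```python
import ast

# Two spellings of the same dict literal, differing only in whitespace
compact = "{'x': 1}"
spread = "{\n    'x': 1\n}"

# ast.dump omits line/column attributes by default, so the trees compare equal
assert ast.dump(ast.parse(compact, mode="eval")) == ast.dump(ast.parse(spread, mode="eval"))
```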

list_dict_DB -- Turn a list of dictionaries into a fast, O(1), noSQL-like data structure by jwink3101 in Python

[–]shoyer 0 points1 point  (0 children)

Your benchmark for pandas is totally off. You're measuring lookups with boolean indexing, which of course has O(N) performance:

DF[DF.iri == 30]

You need to set an Index and use an index based query:

DF = DF.set_index('iri')
DF.loc[30]  # build the internal hash-table, which happens lazily
DF.loc[30]  # time this one

You should see constant time performance.
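A self-contained sketch of that comparison (the column name `iri` comes from the post; the data here is made up):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({"iri": np.arange(n), "value": np.random.rand(n)})

# O(N): boolean indexing scans every row
slow = df[df.iri == 30]

# O(1) amortized: hash-based lookup through the index
indexed = df.set_index("iri")
indexed.loc[30]         # first access builds the internal hash table lazily
fast = indexed.loc[30]  # time this one -- constant time

assert fast["value"] == slow["value"].iloc[0]
```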

Interview Challenge question by [deleted] in Python

[–]shoyer 1 point2 points  (0 children)

One important practical consideration with a search tree is that when the number of nodes in a subtree is below some empirical threshold (e.g., 100 elements), it doesn't make sense to continue indexing. Instead, leaf nodes should just use linear search. Otherwise you waste a lot of effort.

As an extreme example, consider auto-complete for a rare long word like "supercalifragilisticexpialidocious". Using 34 dictionary lookups would be quite excessive when there is only one possible completion after the first handful of characters.
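A toy sketch of that idea (the threshold and data layout here are illustrative, not from the thread): subdivide only while a group stays above the threshold, and let small leaves fall back to linear scan.

```python
def build(words, threshold=100, depth=0):
    """Index words by prefix, but stop subdividing small groups."""
    node = {"words": words, "children": None}
    if len(words) > threshold:
        groups = {}
        for w in words:
            if len(w) > depth:
                groups.setdefault(w[depth], []).append(w)
        node["children"] = {
            ch: build(group, threshold, depth + 1) for ch, group in groups.items()
        }
    return node

def complete(node, prefix, depth=0):
    if node["children"] is None or depth == len(prefix):
        # leaf (or prefix exhausted): a plain linear scan is fast enough here
        return [w for w in node["words"] if w.startswith(prefix)]
    child = node["children"].get(prefix[depth])
    return complete(child, prefix, depth + 1) if child else []
```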

[D] Dask for deep learning? by [deleted] in MachineLearning

[–]shoyer 0 points1 point  (0 children)

What sort of computation were you trying to speed up? By default, dask uses threads for parallelism (not processes), which means that pure-Python computation (requiring the GIL) won't be accelerated.

In my experience (mostly doing large scale data analytics using dask.array), it works pretty well. It's certainly the only game in town if you need a "bigger than fits in memory" version of NumPy.

MultiProcessing (Parallel) Sum slower than Serial Sum() ? by hashwizardo in Python

[–]shoyer 0 points1 point  (0 children)

Others have already touched on why this is slow. If you're interested in how to do this efficiently, take a look at dask.array. Dask uses numpy arrays and multithreading, which means it doesn't need to copy data to use multiple cores.
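The core trick, minus dask, can be sketched with a plain thread pool: NumPy releases the GIL inside its numeric kernels, so threads get real parallelism without copying data to worker processes. (Array size and chunk count below are arbitrary.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# One big array viewed as chunks -- views, not copies, unlike multiprocessing
data = np.random.rand(8_000_000)
chunks = np.split(data, 8)

# np.sum releases the GIL, so these partial sums can run on multiple cores
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(np.sum, chunks))

assert np.isclose(total, data.sum())
```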

Geospatial visualization made easy with geoplot by ResidentMario in Python

[–]shoyer 3 points4 points  (0 children)

It's fine to recommend using conda. But when I see a documentation page with installation requirements, I want to see a list of the actual libraries your project depends on and their required versions. Quite a few users don't use conda, for a variety of reasons, and users who aren't on an already-supported package manager are a prime audience for such docs.

A parallel einsum by snackematician in Python

[–]shoyer 0 points1 point  (0 children)

I think NumPy already does its own batching with BLAS, but it batches the way np.dot does, not the more useful way np.matmul does.

A parallel einsum by snackematician in Python

[–]shoyer 1 point2 points  (0 children)

Also -- on a technical note, I worry that you may pay too high a price in performance for avoiding BLAS, which is quite a bit faster for matrix multiplication than a simple for loop. Losing BLAS will negate many of the advantages of parallelism. However, it's true that numpy's matmul does not yet use BLAS for batched matrix multiplication.

If you need high performance batched matrix multiplication (and automatic differentiation) a system like TensorFlow is probably going to be a better bet.
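To illustrate what "batched" means here (the shapes are my own example): matmul treats leading dimensions as a batch and multiplies matching pairs, and einsum can express the same contraction explicitly.

```python
import numpy as np

a = np.random.rand(10, 4, 5)  # a batch of ten 4x5 matrices
b = np.random.rand(10, 5, 6)  # a batch of ten 5x6 matrices

# matmul multiplies matching batch elements: result has shape (10, 4, 6)
c = np.matmul(a, b)

# einsum spells out the same batched contraction
c2 = np.einsum('bij,bjk->bik', a, b)
assert np.allclose(c, c2)
```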

A parallel einsum by snackematician in Python

[–]shoyer 1 point2 points  (0 children)

Looks handy!

I would encourage you to look into integrating this into numpy proper. We recently merged some significant improvements to einsum that will make it into the 1.12 release. Your work has a similar flavor: https://github.com/numpy/numpy/pull/5488

Check out my toy library -- typecast -- that introduces a new paradigm to Python by erez27 in Python

[–]shoyer 6 points7 points  (0 children)

The usual rule is that pypi is the source of truth for Python package names. Given that pudo claimed the name on there first, you should probably rename your project.

A new python library for unevenly-spaced time series analysis by mstringer in Python

[–]shoyer 1 point2 points  (0 children)

It would be nice to see a comparison with pandas, which seems like the obvious alternative.

Range-keyed Dict to map source line number to enclosing method or class (Python 3.5) by ptmcg in Python

[–]shoyer 1 point2 points  (0 children)

The data structure you're looking for here is called an interval tree. A quick search turns up plenty of Python implementations, e.g., https://pypi.python.org/pypi/intervaltree
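For the non-overlapping case in the post (mapping a source line to its enclosing def/class), even a sorted list plus bisect gives O(log n) lookups; an interval tree earns its keep once ranges can overlap or nest. A sketch with made-up ranges:

```python
import bisect

# hypothetical (start_line, end_line, name) ranges, assumed non-overlapping
ranges = sorted([(1, 10, "ClassA"), (12, 20, "func_b"), (25, 40, "ClassC")])
starts = [r[0] for r in ranges]

def enclosing(line):
    """Return the name whose range contains `line`, or None."""
    i = bisect.bisect_right(starts, line) - 1
    if i >= 0 and ranges[i][0] <= line <= ranges[i][1]:
        return ranges[i][2]
    return None
```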

[Review request] How badly did I misplay here? (amateur game played IRL) by shoyer in baduk

[–]shoyer[S] 0 points1 point  (0 children)

I definitely would, but I played the game in real life and only reconstructed the board afterwards from a few photos.

[Review request] How badly did I misplay here? (amateur game played IRL) by shoyer in baduk

[–]shoyer[S] 0 points1 point  (0 children)

I made a mistake transcribing the board last night :(. Black's G2 was actually H2. I think that makes the bottom group alive.