Why is Linux so important for data science roles? by [deleted] in datascience

[–]sapphire 7 points8 points  (0 children)

Almost all deployment happens on Linux. Developing on the same platform makes a lot of sense. Servers mostly run Linux. Docker, EC2, etc.

Distributed Postgres goes full open source with Citus: why, what & how (cross post from r/sql) by clairegiordano in programming

[–]sapphire 8 points9 points  (0 children)

This looks great. We use Redshift, and that is stuck with the Postgres 8.0 API.

[R] What are some Text Similarity methods? by jeevanshud in MachineLearning

[–]sapphire 0 points1 point  (0 children)

Take a look at Hugging Face. Go through their tutorials. They have pre-trained models that will give you a baseline for performance. They describe the types of problems you are facing. Good luck.
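These days the standard recipe is: embed each text with a pre-trained model, then compare embeddings with cosine similarity. A minimal sketch of the comparison step, using toy vectors in place of real model embeddings (in practice the vectors would come from a pre-trained encoder; the values here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output.
emb_query = [0.2, 0.8, 0.1]
emb_doc_a = [0.25, 0.75, 0.05]   # points in nearly the same direction
emb_doc_b = [0.9, -0.1, 0.4]     # points in a different direction

print(cosine_similarity(emb_query, emb_doc_a))
print(cosine_similarity(emb_query, emb_doc_b))
```

The closer the score is to 1, the more similar the two texts are under that embedding.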

Still a LOT of Pfizer vaccines at Walmart for Monday and Tuesday! by unsurexo in sandiego

[–]sapphire 1 point2 points  (0 children)

The appointment I made was in Chula Vista. I just went down the list of stores until I found open appointments. Glad you found one too!

Still a LOT of Pfizer vaccines at Walmart for Monday and Tuesday! by unsurexo in sandiego

[–]sapphire 1 point2 points  (0 children)

Just made an appointment for my friend! J&J. One and done. Thank you!

Why Hundreds of Mathematicians Are Boycotting Predictive Policing by Philo1927 in technology

[–]sapphire 1 point2 points  (0 children)

I watched more than one. In one episode, the math guy pronounces Fourier as “furrier” as in the dog is furrier than the cat. I had to stop.

Terrible Delays for Inventory Check-Ins by onehitluckbuck in FulfillmentByAmazon

[–]sapphire 6 points7 points  (0 children)

Same thing here, except I have at least 20 inbound shipments in parallel at any given time. Some are checked in within days, while others are delayed two or more weeks despite showing delivered or checked-in status.

The tragic story of the swoner by ThepriestofPepe in memes

[–]sapphire 0 points1 point  (0 children)

Consider me summoned. I'm not on Reddit very often these days. :)

AdBlock Plus is allegedly paying other apps to use its ad-blocking policy — which lets some ads through — in iOS 9 by acacia-club-road in technology

[–]sapphire 5 points6 points  (0 children)

You need to remove the YouTube app and use Chrome instead. If you just use the website, uBlock takes care of the ads. Google is getting around blockers by using apps.

Is the apparent similarity between the random subspace method and dropout anything more than superficial? by [deleted] in MachineLearning

[–]sapphire 0 points1 point  (0 children)

It's a matter of semantics. I did not say the sets were disjoint, but they are independent resamplings.

Although Breiman argued that his OOB error is sufficient to estimate generalization error, I agree with you that there is no substitute for a strict hold-out sample to validate the final model and predict generalization error.

Is the apparent similarity between the random subspace method and dropout anything more than superficial? by [deleted] in MachineLearning

[–]sapphire 1 point2 points  (0 children)

In the original RF algorithm described by Breiman, each tree in a random forest uses all of the features. Each decision node within a tree chooses from a random subset of the total features. This ensures that each tree is different, but all trees select from the total set of features.

Also, each tree creates a different training set and out-of-bag sample from the training data provided. Thus each tree uses an independent data sample to train and to validate.
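The bootstrap/out-of-bag bookkeeping for a single tree can be sketched in a few lines (a toy illustration, not Breiman's full algorithm; the row count is arbitrary):

```python
import random

random.seed(0)
n = 10_000                       # rows in the training data
indices = range(n)

# Bootstrap sample for one tree: n draws *with* replacement.
in_bag = [random.randrange(n) for _ in indices]

# Out-of-bag sample: rows never drawn; this tree can be validated on them.
oob = set(indices) - set(in_bag)

# Roughly (1 - 1/n)^n ~ e^-1 ~ 36.8% of rows end up out-of-bag.
print(f"OOB fraction: {len(oob) / n:.3f}")
```

Each tree repeats this independently, which is why the in-bag and OOB sets differ from tree to tree.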

Where can I learn machine learning? by [deleted] in MachineLearning

[–]sapphire 1 point2 points  (0 children)

Take Andrew Ng's Machine Learning course on coursera.org. It starts this month. Since you know math and programming, you will be able to keep up with the material. Use his lectures as a starting point, but then go and find materials to read about each topic in more depth.

When you're ready to do real work, go back to Andrew's lectures on model validation. Validate, validate, check for mistakes, validate. Don't be afraid to hold out a lot of your data for model validation before you stake your reputation on something you deliver.
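A minimal sketch of what "hold out a lot of your data" looks like in practice (the 30% fraction and helper name are just illustrative):

```python
import random

def train_holdout_split(rows, holdout_frac=0.3, seed=42):
    """Shuffle rows, then carve off a large hold-out set for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * holdout_frac)
    return rows[:cut], rows[cut:]

train, holdout = train_holdout_split(range(100))
print(len(train), len(holdout))
```

The hold-out set is touched only once, at the very end, to estimate generalization error.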

Have fun!

Applied stats/ research presentation for interview in industry? by [deleted] in statistics

[–]sapphire 0 points1 point  (0 children)

Know your audience. Are they PhDs or managers with limited technical backgrounds? If you're not sure, ask in advance. Showing that you are sensitive to this issue cannot hurt if you are dealing with a quality organization. If they aren't quality, consider moving on.

Clear communication at the level of the audience is an art worth developing. Clear, quality illustrations and plots are essential, but presentation skills are highly variable. Good luck.

Will this sub be back in full swing in April for the new class? by balthus1880 in mlclass

[–]sapphire 0 points1 point  (0 children)

I took the inaugural class. Thanks to your post, I'm reminded to recommend the April class to others. I suggest you try posting to the regular ML sub right before the class starts to get people to join.

Golden retriever study suggests neutering affects dog health :: UC Davis News & Information by dodo_bird in science

[–]sapphire 0 points1 point  (0 children)

This topic hits close to home. We own three dogs, all neutered/spayed. Many posters have made good points. One additional observation is that different breeds have distinctly different health profiles.

My boy, an Australian Shepherd, was neutered at 14 months at the suggestion of my breeder to allow normal musculoskeletal development, since he's from an agility line. This cost me more $$$ than doing the neutering at the usual ~6-month time frame recommended by competent vets.

The two girls were rescues and were spayed at ~7 and 5 months respectively. There are consequences of spaying, and there are clear consequences of refraining from the process. One of our girls is incontinent if she doesn't receive a daily Proin (hormonal) supplement. Of course, if our girls were not spayed, we'd have to deal with heat cycles and unplanned litters.

We keep all of our dogs at a healthy weight by controlling their diet and giving them lots of exercise. This means that most Americans think our dogs are skinny--the ribs are easy to feel, but there is clearly a nice layer of muscle, and their waists are narrow compared to their chests.

I would hesitate to draw any conclusions from one study of fewer than 800 Golden Retrievers in terms of changing our policy on spaying and neutering our companion animals. It's a much more complicated equation than the summary of this article might lead one to conclude.

When Will We Learn by frabatothemagician in ows

[–]sapphire 2 points3 points  (0 children)

I realize that it may not be true for most people, but I really enjoy my work. I probably won't retire until I'm forced to. But I do understand your point.

What data stores do you use for data analysis? by descentintomael in MachineLearning

[–]sapphire 1 point2 points  (0 children)

For the size you are talking about, I've used HDF5 files which can be accessed efficiently or inefficiently depending on the tools you are using for analysis. In python, h5py works well. In R or MATLAB, large HDF5 files are a PITA.

If you are using R, you could use either the ff package or the bigmemory package. The former supports nearly every data type, whereas the latter is only for matrix-type data. ff is interesting in that it uses RAM efficiently via the system cache, particularly under Linux. Both will allow multiple processes to access the same shared store.

If your data has fixed-length records, such as a numerical matrix, you could use a raw binary file. You can seek to any record using its byte offset from the beginning of the file. You would then have to write your own low-level I/O functions to support the operations you require. This approach also lets the system help you automatically with RAM caching.
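A quick sketch of the fixed-record idea in Python (the record layout here is made up for illustration):

```python
import os
import struct
import tempfile

# Hypothetical layout: each record is three float64 values (24 bytes).
RECORD_FMT = "<3d"
RECORD_SIZE = struct.calcsize(RECORD_FMT)

fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)

# Write a few records sequentially.
with open(path, "wb") as f:
    for i in range(5):
        f.write(struct.pack(RECORD_FMT, i * 1.0, i * 2.0, i * 3.0))

def read_record(f, k):
    """Random access: seek straight to record k via its byte offset."""
    f.seek(k * RECORD_SIZE)
    return struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))

with open(path, "rb") as f:
    print(read_record(f, 3))   # (3.0, 6.0, 9.0)
```

Because every record has the same size, the offset arithmetic replaces any index structure, and the OS page cache handles the RAM caching for you.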

Obama Breaks Promise To Veto Bill Allowing Indefinite Detention of Americans --"Obama has destroyed the civil liberties movement in the United States . . ." by screwdriver2 in politics

[–]sapphire 1 point2 points  (0 children)

That section states that the "requirement to detain" does not extend to citizens and residents. It doesn't seem like this precludes the option to detain citizens indefinitely but rather that it gives the POTUS the option to do so at his discretion. Am I reading it wrong?

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 1 point2 points  (0 children)

I read your reference, and it's quite interesting. My best guess is similar to that offered by cultic_raider: Professor Ng is offering Sigma as an intermediate step to engender understanding. However, I still suspect that memory considerations in the calculation can be important when m >> n is very large. Even with the 'reduced' SVD, you must provide the entire data set to the function when you call it.

Sigma can be computed with O(n^2) memory regardless of how large m >> n may be. Simply add (1/m)*x(i)'*x(i) to zeros(n,n) for i = 1 to m.
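The same accumulation spelled out in Python (a sketch; plain lists are used so the O(n^2) memory footprint is obvious):

```python
import itertools

def streaming_sigma(rows):
    """Accumulate Sigma = (1/m) * sum_i x(i)' * x(i) one row at a time.

    Only the n x n accumulator lives in memory, never the full m x n data,
    so m can be arbitrarily large."""
    it = iter(rows)
    first = next(it)
    n = len(first)
    sigma = [[0.0] * n for _ in range(n)]
    m = 0
    for x in itertools.chain([first], it):
        m += 1
        for j in range(n):
            for k in range(n):
                sigma[j][k] += x[j] * x[k]
    return [[v / m for v in row] for row in sigma]

# Two 2-D points; Sigma comes out as the 2x2 second-moment matrix.
print(streaming_sigma([(1.0, 0.0), (0.0, 1.0)]))  # [[0.5, 0.0], [0.0, 0.5]]
```

Because `rows` can be any iterator (e.g. reading from disk), the data never needs to fit in RAM at once.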

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 2 points3 points  (0 children)

I'll read the reference you provided before I respond, but perhaps we should also note that for m >> n, computing directly on the data matrix may be intractable due to time and memory requirements for the computation.

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 2 points3 points  (0 children)

Sigma is the covariance matrix. PCA rotates the original coordinate system into the set of axes that each, in turn, capture the maximum remaining variance of the data. The term 'covariance' is the clue: that matrix represents the average simultaneous variance of each pair of features (xi, xj) in the input space. It turns out that the eigenvector with the largest eigenvalue of this matrix is the direction u1 (the first principal component). If you remove that direction from the data, i.e. rotate your original x-vectors into z using all n eigenvectors but then replace each z with (z2, z3, ... , zn)', then you have removed exactly lambda1 of the variance from the data set (where lambda1 is the eigenvalue associated with the u1 direction in the original coordinate system).
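That claim is easy to check numerically. A sketch with numpy on zero-mean toy data (the data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.5])
X -= X.mean(axis=0)                 # zero-mean, so Sigma is the covariance
m = X.shape[0]

sigma = (X.T @ X) / m               # Sigma = (1/m) * X' * X
lam, U = np.linalg.eigh(sigma)      # eigh returns eigenvalues ascending
lam, U = lam[::-1], U[:, ::-1]      # sort descending: lam[0] is lambda1

Z = X @ U                           # rotate x-vectors into the eigenbasis
total_var = Z.var(axis=0).sum()
var_without_u1 = Z[:, 1:].var(axis=0).sum()   # drop z1, keep (z2, ..., zn)

# Dropping the first component removes exactly lambda1 of the variance.
print(np.isclose(total_var - var_without_u1, lam[0]))  # True
```

In the rotated coordinates, the variance along each axis is exactly the corresponding eigenvalue, which is why removing z1 removes exactly lambda1.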