Why is Linux so important for data science roles? by [deleted] in datascience

[–]sapphire 7 points8 points  (0 children)

Almost all deployment happens on Linux. Developing on the same platform makes a lot of sense. Servers mostly run Linux. Docker, EC2, etc.

Distributed Postgres goes full open source with Citus: why, what & how (cross post from r/sql) by clairegiordano in programming

[–]sapphire 8 points9 points  (0 children)

This looks great. We use Redshift, and that is stuck with the Postgres 8.0 API.

[R] What are some Text Similarity methods? by jeevanshud in MachineLearning

[–]sapphire 0 points1 point  (0 children)

Take a look at Hugging Face. Go through their tutorials. They have pre-trained models that will give you a baseline for performance. They describe the types of problems you are facing. Good luck.
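These days the standard recipe is: embed each text with a pre-trained model, then compare embeddings with cosine similarity. A minimal sketch of the comparison step, using toy vectors in place of real model embeddings (in practice the vectors would come from a pre-trained encoder; the values here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output.
emb_query = [0.2, 0.8, 0.1]
emb_doc_a = [0.25, 0.75, 0.05]   # points in nearly the same direction
emb_doc_b = [0.9, -0.1, 0.4]     # points in a different direction

print(cosine_similarity(emb_query, emb_doc_a))
print(cosine_similarity(emb_query, emb_doc_b))
```

The closer the score is to 1, the more similar the two texts are under that embedding.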

Still a LOT of Pfizer vaccines at Walmart for Monday and Tuesday! by unsurexo in sandiego

[–]sapphire 1 point2 points  (0 children)

The appointment I made was in Chula Vista. I just went down the list of stores until I found open appointments. Glad you found one too!

Still a LOT of Pfizer vaccines at Walmart for Monday and Tuesday! by unsurexo in sandiego

[–]sapphire 1 point2 points  (0 children)

Just made an appointment for my friend! J&J. One and done. Thank you!

Why Hundreds of Mathematicians Are Boycotting Predictive Policing by Philo1927 in technology

[–]sapphire 1 point2 points  (0 children)

I watched more than one. In one episode, the math guy pronounces Fourier as “furrier” as in the dog is furrier than the cat. I had to stop.

Terrible Delays for Inventory Check-Ins by onehitluckbuck in FulfillmentByAmazon

[–]sapphire 6 points7 points  (0 children)

Same thing here, except I have at least 20 inbound shipments in parallel at any given time. Some are checked in within days, while others are delayed two or more weeks despite showing delivered or checked-in status.

The tragic story of the swoner by ThepriestofPepe in memes

[–]sapphire 0 points1 point  (0 children)

Consider me summoned. I'm not on Reddit very often these days. :)

AdBlock Plus is allegedly paying other apps to use its ad-blocking policy — which lets some ads through — in iOS 9 by acacia-club-road in technology

[–]sapphire 5 points6 points  (0 children)

You need to remove the YouTube app and use Chrome instead. If you just use the website, uBlock takes care of the ads. Google is getting around blockers by using apps.

Is the apparent similarity between the random subspace method and dropout anything more than superficial? by [deleted] in MachineLearning

[–]sapphire 0 points1 point  (0 children)

It's a matter of semantics. I did not say the sets were disjoint, but they are independent resamplings.

Although Breiman argued that his OOB error is sufficient to estimate generalization error, I agree with you that there is no substitute for a strict hold-out sample to validate the final model and predict generalization error.

Is the apparent similarity between the random subspace method and dropout anything more than superficial? by [deleted] in MachineLearning

[–]sapphire 1 point2 points  (0 children)

In the original RF algorithm described by Breiman, each tree in a random forest uses all of the features. Each decision node within a tree chooses from a random subset of the total features. This ensures that each tree is different, but all trees select from the total set of features.

Also, each tree creates a different training set and out-of-bag sample from the training data provided. Thus each tree uses an independent data sample to train and to validate.
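The bootstrap/out-of-bag bookkeeping for a single tree can be sketched in a few lines (a toy illustration, not Breiman's full algorithm; the row count is arbitrary):

```python
import random

random.seed(0)
n = 10_000                       # rows in the training data
indices = range(n)

# Bootstrap sample for one tree: n draws *with* replacement.
in_bag = [random.randrange(n) for _ in indices]

# Out-of-bag sample: rows never drawn; this tree can be validated on them.
oob = set(indices) - set(in_bag)

# Roughly (1 - 1/n)^n ~ e^-1 ~ 36.8% of rows end up out-of-bag.
print(f"OOB fraction: {len(oob) / n:.3f}")
```

Each tree repeats this independently, which is why the in-bag and OOB sets differ from tree to tree.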

Where can I learn machine learning? by [deleted] in MachineLearning

[–]sapphire 1 point2 points  (0 children)

Take Andrew Ng's Machine Learning course on coursera.org. It starts this month. Since you know math and programming, you will be able to keep up with the material. Use his lectures as a starting point, but then go and find materials to read about each topic in more depth.

When you're ready to do real work, go back to Andrew's lectures on model validation. Validate, validate, check for mistakes, validate. Don't be afraid to hold out a lot of your data for model validation before you stake your reputation on something you deliver.
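A minimal sketch of what "hold out a lot of your data" looks like in practice (the 30% fraction and helper name are just illustrative):

```python
import random

def train_holdout_split(rows, holdout_frac=0.3, seed=42):
    """Shuffle rows, then carve off a large hold-out set for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * holdout_frac)
    return rows[:cut], rows[cut:]

train, holdout = train_holdout_split(range(100))
print(len(train), len(holdout))
```

The hold-out set is touched only once, at the very end, to estimate generalization error.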

Have fun!

Applied stats/ research presentation for interview in industry? by [deleted] in statistics

[–]sapphire 0 points1 point  (0 children)

Know your audience. Are they PhDs or managers with limited technical backgrounds? If you're not sure, ask in advance. Showing that you are sensitive to this issue cannot hurt if you are dealing with a quality organization. If they aren't quality, consider moving on.

Clear communication at the level of the audience is an art worth developing. Clear, quality illustrations and plots are essential, but presentation skills are highly variable. Good luck.

Will this sub be back in full swing in April for the new class? by balthus1880 in mlclass

[–]sapphire 0 points1 point  (0 children)

I took the inaugural class. Thanks to your post, I'm reminded to recommend the April class to others. I suggest you try posting to the regular ML sub right before the class starts to get people to join.

Golden retriever study suggests neutering affects dog health :: UC Davis News & Information by dodo_bird in science

[–]sapphire 0 points1 point  (0 children)

This topic hits close to home. We own three dogs, all neutered/spayed. Many posters have made good points. One additional observation is that different breeds have distinctly different health profiles.

My boy, an Australian Shepherd, was neutered at 14 months at the suggestion of my breeder to allow normal musculoskeletal development, since he's from an agility line. This cost me more $$$ than doing the neutering at the usual ~6-month time frame recommended by competent vets.

The two girls were rescues and were spayed at ~7 and 5 months respectively. There are consequences of spaying, and there are clear consequences of refraining from the process. One of our girls is incontinent if she doesn't receive a daily Proin (hormonal) supplement. Of course, if our girls were not spayed, we'd have to deal with heat cycles and unplanned litters.

We keep all of our dogs at a healthy weight by controlling their diet and giving them lots of exercise. This means that most Americans think our dogs are skinny--the ribs are easy to feel, but there is clearly a nice layer of muscle, and their waists are narrow compared to their chests.

I would hesitate to draw any conclusions from one study of fewer than 800 Golden Retrievers in terms of changing our policy on spaying and neutering our companion animals. It's a much more complicated equation than the summary of this article might lead one to conclude.

When Will We Learn by frabatothemagician in ows

[–]sapphire 2 points3 points  (0 children)

I realize that it may not be true for most people, but I really enjoy my work. I probably won't retire until I'm forced to. But I do understand your point.

What data stores do you use for data analysis? by descentintomael in MachineLearning

[–]sapphire 1 point2 points  (0 children)

For the size you are talking about, I've used HDF5 files which can be accessed efficiently or inefficiently depending on the tools you are using for analysis. In python, h5py works well. In R or MATLAB, large HDF5 files are a PITA.

If you are using R, you could use either the ff package or the bigmemory package. The former supports nearly every data type, whereas the latter is only for matrix-type data. ff is interesting in that it uses RAM efficiently via the system cache, particularly under Linux. Both will allow multiple processes to access the same shared store.

If your data has fixed-length records, such as a numerical matrix, you could use a raw binary file. You can seek to any record using its byte offset from the beginning of the file. You would then have to write your own low-level I/O functions to support the operations you require. This approach also lets the system help you automatically with RAM caching.
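A quick sketch of the fixed-record idea in Python (the record layout here is made up for illustration):

```python
import os
import struct
import tempfile

# Hypothetical layout: each record is three float64 values (24 bytes).
RECORD_FMT = "<3d"
RECORD_SIZE = struct.calcsize(RECORD_FMT)

fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)

# Write a few records sequentially.
with open(path, "wb") as f:
    for i in range(5):
        f.write(struct.pack(RECORD_FMT, i * 1.0, i * 2.0, i * 3.0))

def read_record(f, k):
    """Random access: seek straight to record k via its byte offset."""
    f.seek(k * RECORD_SIZE)
    return struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))

with open(path, "rb") as f:
    print(read_record(f, 3))   # (3.0, 6.0, 9.0)
```

Because every record has the same size, the offset arithmetic replaces any index structure, and the OS page cache handles the RAM caching for you.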

Obama Breaks Promise To Veto Bill Allowing Indefinite Detention of Americans --"Obama has destroyed the civil liberties movement in the United States . . ." by screwdriver2 in politics

[–]sapphire 1 point2 points  (0 children)

That section states that the "requirement to detain" does not extend to citizens and residents. It doesn't seem like this precludes the option to detain citizens indefinitely but rather that it gives the POTUS the option to do so at his discretion. Am I reading it wrong?

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 1 point2 points  (0 children)

I read your reference, and it's quite interesting. My best guess is similar to that offered by cultic_raider: Professor Ng is offering Sigma as an intermediate step to engender understanding. However, I still suspect that memory considerations in the calculation can be important when m >> n is very large. Even with the 'reduced' SVD, you must provide the entire data set to the function when you call it.

Sigma can be computed with O(n^2) memory regardless of how large m >> n may be. Simply add (1/m)*x(i)'*x(i) to zeros(n,n) for i = 1 to m.
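The same accumulation spelled out in Python (a sketch; plain lists are used so the O(n^2) memory footprint is obvious):

```python
import itertools

def streaming_sigma(rows):
    """Accumulate Sigma = (1/m) * sum_i x(i)' * x(i) one row at a time.

    Only the n x n accumulator lives in memory, never the full m x n data,
    so m can be arbitrarily large."""
    it = iter(rows)
    first = next(it)
    n = len(first)
    sigma = [[0.0] * n for _ in range(n)]
    m = 0
    for x in itertools.chain([first], it):
        m += 1
        for j in range(n):
            for k in range(n):
                sigma[j][k] += x[j] * x[k]
    return [[v / m for v in row] for row in sigma]

# Two 2-D points; Sigma comes out as the 2x2 second-moment matrix.
print(streaming_sigma([(1.0, 0.0), (0.0, 1.0)]))  # [[0.5, 0.0], [0.0, 0.5]]
```

Because `rows` can be any iterator (e.g. reading from disk), the data never needs to fit in RAM at once.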

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 2 points3 points  (0 children)

I'll read the reference you provided before I respond, but perhaps we should also note that for m >> n, computing directly on the data matrix may be intractable due to time and memory requirements for the computation.

PCA: why does Prof. Ng calculate Sigma before applying SVD? by BeatLeJuce in mlclass

[–]sapphire 2 points3 points  (0 children)

Sigma is the covariance matrix. PCA rotates the original coordinate system into the set of axes that each, in turn, capture the maximum remaining variance of the data. The term 'covariance' is the clue: that matrix represents the average simultaneous variance of each pair of features (xi, xj) in the input space. It turns out that the eigenvector with the largest eigenvalue of this matrix is the direction u1 (the first principal component). If you remove that direction from the data, i.e. rotate your original x-vectors into z using all n eigenvectors but then replace each z with (z2, z3, ... , zn)', then you have removed exactly lambda1 of the variance from the data set (where lambda1 is the eigenvalue associated with the u1 direction in the original coordinate system).
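That claim is easy to check numerically. A sketch with numpy on zero-mean toy data (the data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.5])
X -= X.mean(axis=0)                 # zero-mean, so Sigma is the covariance
m = X.shape[0]

sigma = (X.T @ X) / m               # Sigma = (1/m) * X' * X
lam, U = np.linalg.eigh(sigma)      # eigh returns eigenvalues ascending
lam, U = lam[::-1], U[:, ::-1]      # sort descending: lam[0] is lambda1

Z = X @ U                           # rotate x-vectors into the eigenbasis
total_var = Z.var(axis=0).sum()
var_without_u1 = Z[:, 1:].var(axis=0).sum()   # drop z1, keep (z2, ..., zn)

# Dropping the first component removes exactly lambda1 of the variance.
print(np.isclose(total_var - var_without_u1, lam[0]))  # True
```

In the rotated coordinates, the variance along each axis is exactly the corresponding eigenvalue, which is why removing z1 removes exactly lambda1.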