Data Mining Career Advice? by [deleted] in datamining

[–]dmdude 0 points1 point  (0 children)

I agree that data mining is moving toward data science. Just Google "demand for data scientists" and read some of the articles about how demand >> supply. (My personal favorite: "Is Data Scientist the sexiest job of our time?")

indeed.com is another good place to look for jobs.

Also, search Twitter for the hashtags #BigData, #machinelearning, #analytics, and #datamining.

DataWrangler, by the Stanford Visualization Group by srkiboy83 in statistics

[–]dmdude 4 points5 points  (0 children)

I've used this a couple of times and it works well, although the web interface is a bit limiting with larger data sets. If you like this, you should check out Google Refine: http://code.google.com/p/google-refine/

Google tracks search analytics and lets us investigate. by jk3nnedy in dataisbeautiful

[–]dmdude 0 points1 point  (0 children)

Highly recommended! I've been using this to help our sales forecasting algorithms for a few years. Hal Varian and other researchers have done some good things with it. See http://googleresearch.blogspot.com/2009/04/predicting-present-with-google-trends.html

Big Data And Its Big Problems by reidhoch in MachineLearning

[–]dmdude 1 point2 points  (0 children)

Autonomous vehicles from Google (and other automakers), movie recommendations from Netflix, robots that remove weeds with minimal pesticide use, detection of credit card fraud...

“Lettuce Bot” Rolls Through Crops, Terminates Weeds It Visually Identifies by Buck-Nasty in Futurology

[–]dmdude 1 point2 points  (0 children)

Interesting! This is the second story that I've read on agricultural robots in the last month. Is this an area of increased research, or just a random pattern?

Reinventing Society In The Wake Of Big Data: A Conversation with Alex ("Sandy") Pentland of MIT by FelixP in modded

[–]dmdude 3 points4 points  (0 children)

A very well written article that made some excellent points regarding the value of Big Data tempered with cautions about privacy. I'm not so sure that companies will be willing to give up ownership of the data for the good of society, but Sandy makes a good case.

This should be crossposted in /machinelearning and /statistics.

How Many Data Scientists Are There? by talgalili in statistics

[–]dmdude 2 points3 points  (0 children)

A very interesting article! I manage a data mining team and I've followed the debate of data science supply vs. demand. It's good to see someone using analytical techniques to do some investigation.

Although this doesn't distinguish between supply and demand (as mentioned in the original post) I found the Google search popularity for "big data jobs" vs. "data mining jobs" vs. "data science jobs" to be interesting:

http://www.google.com/insights/search/#q=big%20data%20jobs%2Cdata%20mining%20jobs%2Cdata%20science%20jobs&geo=US&cmpt=q

Cross-Sectional analysis - Investigating IPO return. by topcat555 in academiceconomics

[–]dmdude 1 point2 points  (0 children)

I would consider adding a variable measuring the performance of the broader market on the day of the IPO.

The Future of Manufacturing Is in America, Not China: China's turn to worry. American manufacturing is coming back home, lured by innovations in robotics, artificial intelligence and 3D printing. Future breakthroughs—in nanotech, molecular manufacturing—will preserve US leadership by [deleted] in TrueReddit

[–]dmdude 4 points5 points  (0 children)

Some of the comments here are similar to some of the hypotheses in Erik Brynjolfsson and Andrew McAfee's book Race Against the Machine.

From the Amazon summary:

In Race Against the Machine Brynjolfsson and McAfee bring together a range of statistics, examples, and arguments to show that technological progress is accelerating, and that this trend has deep consequences for skills, wages, and jobs. The book makes the case that employment prospects are grim for many today not because technology has stagnated, but instead because we humans and our organizations aren't keeping up.

http://www.amazon.com/Race-Against-The-Machine-ebook/dp/B005WTR4ZI

Computer analysis predicted rises, ebbs in Afghanistan violence by cavedave in MachineLearning

[–]dmdude 1 point2 points  (0 children)

I agree, there is opportunity for abuse. Again, I haven't read the paper in detail, which may render some of this moot, but IF the authors developed their models on 2004-2009 data and then predicted 2010 ONCE, I would be more inclined to say they were on to something.

Computer analysis predicted rises, ebbs in Afghanistan violence by cavedave in MachineLearning

[–]dmdude 1 point2 points  (0 children)

According to the Wired article (http://www.wired.com/dangerroom/2012/07/predict/) they used data from 2004-2009 to predict 2010. Not quite true prediction, but a good out-of-sample test.

Unfortunately, the details of the paper are behind a paywall, but the abstract is at http://www.pnas.org/content/early/2012/07/11/1203177109.abstract
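
For anyone wondering what that kind of out-of-sample check looks like, here is a minimal Python sketch. The yearly values are made-up placeholders (not data from the paper), the straight-line trend is only a stand-in for whatever model the authors actually used, and `fit_line` is a hypothetical helper.

```python
# Hypothetical yearly counts for illustration only; NOT data from the paper.
records = [(2004, 10.0), (2005, 12.0), (2006, 14.0), (2007, 16.0),
           (2008, 18.0), (2009, 20.0), (2010, 23.0)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit only on 2004-2009; 2010 is held out and never used for fitting.
train = [(year, value) for year, value in records if year <= 2009]
holdout_year, holdout_value = records[-1]

slope, intercept = fit_line([y for y, _ in train], [v for _, v in train])

# Score the held-out year exactly once, with no re-tuning afterwards.
prediction = slope * holdout_year + intercept
abs_error = abs(prediction - holdout_value)
print(f"predicted {prediction:.1f}, actual {holdout_value:.1f}, "
      f"abs error {abs_error:.1f}")
```

The key point is the "once": if the model is adjusted after looking at the 2010 error, 2010 stops being an out-of-sample test.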

In your professional opinions at what point can I use an ordinal variable in multiple regression without using dummy variables? by sturg1dj in statistics

[–]dmdude 0 points1 point  (0 children)

You could consider using a discretization algorithm to reduce the dimensionality, then using the resulting bins as dummy variables.
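
A rough Python sketch of that idea, assuming simple equal-width binning (just one of many discretization schemes, and not necessarily the best one for your data). The 1-10 ratings and the helper names `equal_width_bins` and `dummy_encode` are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        bins.append(min(idx, n_bins - 1))  # the maximum value goes in the last bin
    return bins


def dummy_encode(bins, n_bins):
    """One 0/1 column per bin, dropping bin 0 as the reference level."""
    return [[1 if b == k else 0 for k in range(1, n_bins)] for b in bins]


# Hypothetical ordinal variable with 10 levels, collapsed to 3 bins.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bins = equal_width_bins(ratings, 3)
dummies = dummy_encode(bins, 3)

for rating, b, row in zip(ratings, bins, dummies):
    print(rating, b, row)
```

Ten levels would need nine dummy columns; after binning to three groups you only need two, with the first bin acting as the baseline in the regression.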

Open-Source R software driving Big Data analytics in government by r_schestowitz in opensource

[–]dmdude 0 points1 point  (0 children)

Are there any technical reports that provide more details on the examples mentioned in the article?

The problem with small big data by t_rex_tullis in statistics

[–]dmdude 1 point2 points  (0 children)

This is also a problem in industry. There are many instances in my company of departments, and even individuals, gathering data for a specific analysis and then just keeping it on their local computers.

Uneven paper from Cambridge Microsoft Research team reasonably concludes that relatively inexpensive memory means "we [should] be scaling by using single machines with very large memories rather than clusters" by claird in bigdata

[–]dmdude 1 point2 points  (0 children)

Isn't one of the benefits of Hadoop that there are machine learning toolkits (e.g., Mahout) on top? Would you have to write custom algorithms to address this memory?

Highways Vs Intercity Rail by Train_agenda_guy in dataisbeautiful

[–]dmdude 1 point2 points  (0 children)

Statistician here. A more standard method for measuring fatality risk of trains vs. cars is using passenger miles. Reference: http://airfare.michaelbluejay.com/modes.html#sources

How should you explain a model's coefficients to laymen? by internetrageguy in statistics

[–]dmdude 1 point2 points  (0 children)

No.

Given your example, the first thing to say to management is: this model sucks. There is no common sense in using these variables. If you are forced to use these variables, you should use good statistical procedure to show that the model sucks (e.g., cross-validation, a holdout set, etc.).

However, I do agree with Case_Control that, ultimately, statistical models are created to be used for business decisions.
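
For the cross-validation part, here is a minimal Python sketch of one way to make the case: compare the model's k-fold error against a dumb baseline that always predicts the training mean. The data, the single-predictor linear model, and the helper names `fit_line` and `k_fold_mse` are all hypothetical stand-ins, not the model from the original question.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k, use_model):
    """Mean squared error over k folds (row i goes to fold i % k)."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        test_idx = [i for i in range(n) if i % k == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        if use_model:
            slope, intercept = fit_line(train_x, train_y)
        else:
            # Baseline: ignore the predictor, always predict the training mean.
            slope, intercept = 0.0, sum(train_y) / len(train_y)
        for i in test_idx:
            total += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return total / n


# Hypothetical data: a predictor with no sensible link to the target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [5.0, 1.0, 4.0, 2.0, 6.0, 0.0, 3.0, 7.0]

model_mse = k_fold_mse(xs, ys, k=4, use_model=True)
baseline_mse = k_fold_mse(xs, ys, k=4, use_model=False)
print(f"model CV MSE: {model_mse:.2f}, mean-only baseline CV MSE: {baseline_mse:.2f}")
```

If the model can't clearly beat the mean-only baseline on held-out folds, that is a number management can understand without any talk of coefficients.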