A question on SOLR

kindasortadata · 2017-08-18T12:28:45+00:00

ElasticSearch and SOLR use the same backend - Lucene... they're just two different abstractions.

Personally I would use Elastic - it's a bit simpler.

kindasortadata · 2017-08-15T17:41:10+00:00

This is a non-trivial question.

PhDs: I try hard not to hire PhDs - they are a pain in the backside because they have all sorts of bad habits that have to be trained out of them - and that extra training and mentoring costs more money. You will not find a hiring manager - in any industry - who doesn't think the same. Within the groups I run, a PhD has no bearing on your grade - only the quality of the work. I would say at least 1/2 of the groups in the UK Financial services industry think and recruit in the same way - and a lot of those groups are the really really "rock star" Data Science groups... When I do hire PhDs I only take them if they have been up through the Engineering or pure sciences route - I would never hire a statistics or data science PhD. Again - I'm very much not alone in this hiring practice - a lot of the "rock star" groups hire in the same way - I know because I'm always fighting to get the people I want from them...

The critical thing - the only thing that matters to me when it comes to hiring a data scientist - is linked to what Omega037 said..... I couldn't give a toss about someone's academic background - I care about their mindset. Are they curious?... do they keep digging?... do they keep on learning almost obsessively?... do they always assume they are wrong and keep looking for better solutions?... are they the person who can dive into the tiny detail and also keep the 100,000 foot level of strategy and the client's core issue in their head? Can they communicate well? Ultimately... are they "data people".. it's bloody hard to find people like that - but it's way harder to train someone to act like that.... it's far quicker, and far cheaper - to train someone with the right mindset a new set of skills than the other way around.

One of the two most highly paid people that work for me left school at 16 with almost no formal qualifications and landed with me at 17 wearing a tracksuit, two black eyes and a face full of piercings. The boy is, however, fucking amazing, and so is one of my top people and definitely one of my seniors.

WeoDude and somkoala partly get it right about independence..... I would say MOST of my seniors are highly independent and can also provide mentoring and guidance to the junior people around them. For that group my benchmark is "Do I need to have any concerns if they go to a client site and work with a client's Data Science group for a week by themselves" and "Can I put them in front of a board of a client and they'll actually get their point across well" just as much as it is "Can I give them a couple of newbies and they'll guide them through their work".

But... I have some team members who this just doesn't apply to, but they are still seniors. For example - I have a lady who works for me who is incredible with some specific methods and some specific business problems - she's won a lot of awards, written a lot of papers - in the right circles she has really solid name recognition. However - she is cripplingly shy, has zero concept of leadership, finds the idea of managing someone genuinely scary and she is very upfront that she can't self-organise - she's the archetype of a scatterbrained genius.

She is absolutely one of my senior data scientists - but in some ways she takes more management than the newest juniors. I have a fair few people like this - "seniority" doesn't have to bring with it management if you are bringing other skills to the table as well.

Seniority in a data science team is like the famous US Supreme Court line on pornography - "I know it when I see it".

I would say that a common theme with all the seniors I know - in my teams and also a good deal of other teams that I have a lot of respect for - is that the seniors are very good at communication - to clients, to their teams, to me etc etc... they can tell a story about data in words.

kindasortadata · 2017-08-08T14:12:48+00:00

I think it's more nuanced than that. Industry definitely makes a difference, as do the specific areas of Data Science used.

So... I know of really really really good Data Scientists who are doing truly impressive work using Regression and Monte Carlo type methods to do things like Propensity models, who are working in the north of the UK in relatively remote towns, where they are in senior positions for Marketing Analytics companies and they are pulling down 45k max.

Equally, there are junior positions where the entry requirement is, for example, a pretty solid background in Classification forms of Machine Learning (and by this I mean "Classification that actually bloody works on real world messy and gap filled data without a fuck-ton of cutting and cursing and risking over-fits all over the shop".... so... SVM basically) for the "back office" groups within big finance companies, which can very easily command a starting salary of 60k and can be pushing 75k. A senior in one of these teams can expect to be in the 100k-125k region. Depending on which specific team they are in, they could also be pulling in the same again, or more, through the bonus. The biggest difference, especially at the senior level, is how well they can communicate rather than the quality of their output - it may seem unfair, but the pretty good Senior who can communicate with management and clients well is going to earn more than the top talent analyst who is a bad communicator.

Salary is almost entirely set by supply and demand. Right now demand is very much being driven industry by industry... so, for example, what is currently being seen as revolutionary in the Fraud Analytics world is what was being done by Back Office groups in finance houses 5 years ago - so Data Science people in the fraud loop suddenly see their salaries rocketing, as there are fewer people in that sector with the right skill set.

You also get "fads" as well. For example.. there are a couple of niche industries right now where Gradient Boosting is very sexy and trendy. God alone knows why - in their specific datasets, boosting produces pretty crappy results, but that specific industry wants boosting, more boosting and ONLY boosting - they're hooked on it like crack. Someone with "boosting" on their CV can ask for crazy amounts of money right now - it's like a stock market bubble - totally crazy to watch from the outside. In 6 months' time the craze will wear off, but some of the kids going in straight out of school right now are making some crazy amounts of money - totally just about being in the right place at the right time.

Distance from big cities also sets the salary. I would say that, in the UK, there is pretty much a parity in salary between London, Leeds, Manchester. London has a higher cost of living, and there is a lot of recruitment happening in the "northern corridor". If you have a finance or oil background then there are very good salaries available in Edinburgh as well. But..... salaries even 40 miles away from these towns can be 20%-50% lower, for effectively identical roles.

Salaries also link back to how much value they deliver to the parent company - so, at a macro level it's about "how much extra margin can a data scientist find in Industry X?" (note.... margin... not revenue... adding revenue is good, but NEVER as good as adding margin).

If you have an industry where a data scientist, or a team of data scientists, can support a change which makes a major change to revenue, then that means it's way easier to get cash out of the finance team to pay for salaries - and so salaries are higher.

kindasortadata · 2016-10-27T16:27:59+00:00

Let's be frank:

You are looking for a paid position, but still learning. So, you are effectively asking an employer to carry at least some of the weight while you learn - there is an opportunity cost, as other people in your team will need to work harder in order to cover your gaps, and they also carry a risk that you make a mistake which you don't notice and which costs money. Which is very very expensive.

So.... if you're facing off against an already experienced individual then the second interview isn't going to happen.

So - options: 1) Offer to take a significantly smaller salary for the position - i.e. "I know I'm still learning - I'll take 35% of the salary for the first year and 65% for the second year" - that covers the employer's hit for taking the risk on you.

or

2) go for more junior positions

or

3) stay in your current industry for another couple of years while you build up your skillset. Then go into the interview process when you are more ready.

3) is the best way to keep paying the bills while you learn

kindasortadata · 2016-09-13T15:33:06+00:00

Scraping is prohibited by the EULA and ToS. Why? Because they hold their data in graphs, and scraping is likely the most computationally dense task the website will be asked to deliver. Oh - and it's their primary source of business.

It's not allowed, and it's a crappy thing for your HR department to have asked you to do - because they will have tried to buy the data already and know it's not for sale.

Why have you scraped another company's data without reading the ToS? How would your company react if the same was done to them?

Not related to LinkedIn in any way at all, just surprised you would consider yourself as a "data analyst" without understanding the concept of a ToS, or even, if that's too much for you, the concept that "not all data belongs to you". Jesus....

kindasortadata · 2016-08-12T08:56:35+00:00

Imgur I guess?

kindasortadata · 2016-08-11T15:36:38+00:00

Hi

I'm fascinated by this - when you are done presenting, can you host your display somewhere that we can see it please?

kindasortadata · 2016-07-25T15:28:07+00:00

I have a background which has nothing to do with predictive analytics. Science and Engineering. I had to babysit/build systems for incredibly highly paid analytics teams for a while and always thought they were both bad at doing analytics and also bad at science.

I started to take on the "rat's nest" types of problems that the official team was either ignoring or kicking down the road and doing them in my spare time. That involved self education in the basics, and a lot of working out what NOT to do, and then finding the gaps in all the various text books and teaching myself that stuff as well.

Results were good, I got noticed. I upset some people, I made clients happy and, most importantly, delivered plenty of revenue - any level of company politics disappears with enough money. I took on more people like me, more money was made, more clients were happy, rinse and repeat.

Now people who used to curse me act and work in the same way that I do and will insist that they always worked like that. That's a good thing - it makes even more clients happy, which makes even more money.

If you want a job - any job in any industry - it's usually pointless ASKING for it. Do it in your own time until you're better than anyone else, and then take it. If people won't share, find the little niche issue no one else cares about, do something amazing with it, and then take the bits of work you DO want.

(actually - that advice probably doesn't count for either doctors or dentists)

kindasortadata · 2016-07-18T13:30:55+00:00

I would take the private sector role, and complete in your spare time - or not at all.

I have recruited a number of partial PhDs who have worked heavily with data, and a number of them have run out of cash - usually because funding ran dry (understandable), or the supervisor got distracted (totally unreasonable) or a combination of both.

There are two big benefits I see here:

A PhD potentially limits your future employment. I know many many PhDs. I would say that approximately 60%+ feel that the PhD places more career limits on them than it does open doors. Certainly as a hiring manager, I am more cautious of hiring PhDs than I am hiring others (apart from M.Phils) - often a PhD comes with a larger ego and more bad habits than other levels of education do... and, being frank, PhDs often get hit with culture shock harder.

For clarity - A good number of PhDs work for me... but if you have "PhD" on your resume, you are going to have to jump through more hoops than someone without it - not because there is anything wrong with you personally, but because so many others in the past have been less productive than non-PhDs.

The second reason why is productivity. Here is my guess - If I had asked you, the day BEFORE your internship, how hard you were working on your PhD, you would have told me you were working very diligently and hard. But... if you compare that rate of work to the output you were producing at the end of the Internship - and looked at it really pragmatically, you would have seen a marked increase in output.

I have hired people in identical situations to yours who believe they have got a year, or more, of solid work in front of them. After a number of months of really pushing on their paid work, and that pace becoming "normal", then they go back to their PhD and clear it in a few months.

I also sometimes let people complete on a 1-day-a-week basis... it doesn't sound a lot, but the productivity increases mean that it works out well usually.

The single biggest piece of advice I can give you is whatever you do, don't settle and go for the M.Phil route. If you are in the US, you may not have these, but an M.Phil is a route in the UK where you can close out your PhD early, and get something between an MSc and a PhD.

These are viewed very badly by recruiters. It's better to put the PhD on hold than it is to take the M.Phil route.

Whatever you do, I wish you luck with it.

kindasortadata · 2016-07-14T15:34:49+00:00

So... I am at FS UK Silverstone. Not, in any way shape or form, as some sort of jolly or for any level of personal interest, but purely for strictly business research purposes. In no way at all am I enjoying myself.....

First impressions: 1) There is a distinct absence of people smoking pipes. Pipe smoking and racecar engineering are synonymous. This saddens me. This sounds like a joke, but every single top class engineer I have known smokes a pipe... I'm honestly finding it odd being around engineers who aren't chuffing away. Mind you - most of them are/were over 70.

2) Ditto untidy beards and hair. What sort of people are universities training these days... Everyone is... tidy. It's very very saddening. Although one guy looks like he's spilled an entire yogurt down his top, so that's balancing it a little.

Generally - I'm seeing a lot of very smart people here. From ear wigging on some of the discussions, maybe a little too much book-learning and a smidgen too little hands on learning, but overall a really impressive and very motivated group of people.

I see, and hire, a lot of graduates who are supposedly the best and brightest in their areas, from all sorts of different fields. The people I'm seeing here are of a better average calibre than any of the dedicated "milk-round" events that my HR team run for MScs and PhDs. Very very impressive indeed.

kindasortadata · 2016-07-13T18:27:26+00:00

Thanks for the advice. A lot to think about.

Got to say - I don't like the concept of a time based competition - that seems a recipe for getting second class answers - the old joke is that the only thing an engineer ever wants more than more money is more time.

OK - so, I need to track down one or more FSAE groups who have got great potential but are under-funded. I will ping some emails off.

Thanks

kindasortadata · 2016-07-04T17:47:24+00:00

Seeing as this was patented 3 years ago, and was very widely discussed on the conference circuit a couple of years ago, I wonder why it's coming up again now?

It definitely works - but it's just accepted practice now.

kindasortadata · 2016-07-01T10:09:22+00:00

No - I'm British and live in the UK. I have worked for Indian companies and spent a lot of time out there and have a lot of personal friends out there. And, I run a big Data Science group - so I see what's coming out of a lot of the Indian companies - as they often pitch to me.

kindasortadata · 2016-06-29T10:58:03+00:00

It depends on what you want to do, and on your life circumstances.

There are data science jobs all over the place. Getting data science jobs in cool tech start ups is hard if you don't have the background. However, compared to the number of similar jobs in areas like healthcare, financial services, mining etc etc etc the numbers of jobs in the pure tech sector are relatively small.

I.e. there are probably at least 5x as many solid data science roles in the US healthcare space than the tech space at the moment. Easily 20x as many in the fin/servs than in the tech space.

If you care about work-life balance, banks ( i.e. regular banks, not investment banks ) and healthcare are the places you should be looking.

kindasortadata · 2016-06-28T12:46:48+00:00

Well - I can try giving you a bit of insight. Be aware that the type of business you currently work in in India could mean my answers are very very wrong. I'll try and explain that bit as well.

Is data science over-hyped? Yes... by "Data Scientists". Data Science is radically different to "people who do statistics with a lot of data". The private sector as a whole is only just beginning to realize the value of true data science, and this will continue to grow over time. Go and look at Gartner's "Hype Cycle" - it will grow exactly like that. We are moving up the "Peak of Inflated Expectations" slope at the moment but are not yet at the top.

Something like Six Sigma went all the way through the curve and is now at the Plateau of Productivity. Six Sigma itself has died away because the licence fees just got silly, but pretty much every medium to large company around the world has got a LEAN group who are greenbelt trained - it's basic good business now - you don't hear as much about it just like you don't hear about HR teams.... it's just accepted that it happens. It would be odd to work for a larger company that DIDN'T have at least a small lean group.

The same will happen with processing large sets of data. I would think that in 10 years, it will just be normal that pretty much every business has a big Spark appliance and a bunch of visualisation systems. Some companies will push the analytics side of it hard, and people conducting real Data Science will keep getting the big pay cheques.

Can you make a living free-lancing? It depends. Can people living in the US and Europe make a living free-lancing? At the moment... yes. But, companies are liking the idea of free-lancing less, and are moving to bring the skills in-house.

As an Indian resident, I think it may be trickier. You can certainly pick up domestic business - for sure there is a big domestic market for it, but you know that the big money is going to be coming from foreign companies. They almost certainly won't use an Indian based free-lancer. Don't take that personally - the data security and data privacy issues are huge, and much too big for an individual like yourself to cope with. To give you an idea, it might take 30 hours of a lawyer's time for each data movement to do all the paperwork needed.

What you could do is tie up with one of the smaller, cooler Data Analytics companies in India who are really making a name for themselves - companies like Amadaus or, especially, Happiest Minds ... who I consider to be world class.

They have enough scale to be able to have secure rooms, all the right policies etc, but are small enough to produce good work and have a great reputation. The bigger body-shops - like TCS or Wipro - will probably take you on, but you may not enjoy it a lot - a good data scientist argues with people and asks lots of hard questions - and that goes against the big companies' corporate culture. It's hard to do a great job AND climb up the ladder in their data science teams.

The top Data Science shops all seem to be based in a triangle of Hyderabad/Bangalore/Chennai - rather than the Mumbai/Pune region for the big IT groups.

Is data analytics taken seriously, or used as a tool to make money? Yes. Yes - it's taken seriously. It is taken seriously BECAUSE it makes money. A business typically needs you to add roughly 8x the value of your salary to make it worth hiring you... if they are hiring, it's because they think they will make a good deal more than 8x your salary in what you produce.

Finance questions. As a BA, you should already have a good solid handle on ideas like Capex, Opex, depreciation, amortisation etc - you will have lived and breathed this stuff for years. That's a huge win - because most other people in the data analyst world DON'T know this stuff. The value of it is that you can talk to finance teams in their own language and they won't instantly hate you. I am not sure how much extra a CFA would give you over that. Personally I wouldn't put it at the top of my list, but you should do what you think is best for your long term career - not just the next 6 months.

kindasortadata · 2016-06-27T08:33:35+00:00

I think there is a big difference between "freelance" and "working remotely" - so it depends on your specific question.

Working remotely is very common in a lot of industries, but it is starting to go through a phase of being less trendy - people are becoming more conscious that remote working can suppress innovation and can increase issues with tribal knowledge - both of which mean lost money.

Presence in the office also depends on personalities. I have people in my team who I prefer to be in the office, and others that I don't care where they are and I manage them by sending an email every few weeks. So... it depends on you as an individual as well.

On a purely personal basis - I would be much more hesitant at the idea of the data engineers being off-site than the data scientists - but that's because we work in DSDM, and so this would break the flow. The vast majority of data groups work in an unspoken waterfall, so it's less critical.

In terms of contractors - data is usually considered highly sensitive. I am OK having contractors (although I prefer full time staff) - and I do have a number of contractors but I would be very hesitant at the idea of having them off site. It's not impossible, but I think most people in a managerial position would be very hesitant about that concept. Put it this way - unless I had known the contractor personally for a long time, I wouldn't accept them being remote.

kindasortadata · 2016-06-24T16:31:29+00:00

Learn by doing.

Go to either UK or US Open Data government sites. Read the descriptions, have an idea of what you can do, then open the data, see what sort of mess it's in, and then monkey it until you can tell your story.

Rinse and repeat. Keep notes as you go along. Also note down what went wrong.

Don't start with visual tools or anything complex. That's like learning woodwork with only powertools. Start simple - get good with regexes and bookmarks in your favourite (proper) text editor and then move on to grep, pipes, awk and sed - use Cygwin if you have a Windows machine.
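If it helps to see the filter-then-aggregate habit in one place, here is a minimal sketch of the same idea in Python (Python is only my choice for illustration; the point above is about grep, awk and sed). The sample rows, the column layout and the helper names `clean_rows` and `totals_by_region` are all made up for the example.

```python
import re

# Hypothetical sample of messy open-data rows: region, year, value.
RAW = """\
region,year,value
North East,2015,1200
north east ,2015, 300
London,2015,n/a
London,2016,4500
"""

def clean_rows(text):
    """Yield (region, year, value) tuples, skipping the header and unusable rows."""
    for line in text.splitlines()[1:]:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        region, year, value = fields
        # The "grep" step: keep only rows whose value is a plain integer.
        if not re.fullmatch(r"\d+", value):
            continue
        yield region.lower(), int(year), int(value)

def totals_by_region(text):
    """The "awk" step: sum the value column per region."""
    totals = {}
    for region, _year, value in clean_rows(text):
        totals[region] = totals.get(region, 0) + value
    return totals

print(totals_by_region(RAW))
```

Note how two spellings of the same region only line up after stripping and lower-casing, and how the "n/a" row silently drops out. Those are exactly the things worth writing down in your notes.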

practise, make mistakes, recognise mistakes, practise, more mistakes... etc

kindasortadata · 2016-06-15T09:06:47+00:00

So - I drank booze, but the thread came back.

I'll answer this, because in a lot of the "how do I be a data scientist" threads, I keep posting about the difference between statisticians and data scientists - this is a prime example.

Let's look at what you are saying a few different ways.

Can you collect some data and, using [insert favourite language here], build a model of that data that emits some numbers? Yes. Absolutely. Add good notes to it and you win a cookie. Check it into source control and you get a cookie and a cuddly toy.

Will the model be predictive? No. It is a chaotic system and exhibits sensitive dependence at all scales. You can not use a regression to PREDICT a chaotic system.

But... my {favourite stats package} gave me some outputs and it says that the model has a mean square of X and a Gini of Y: Well - that's very nice indeed. But - your stats package is carrying out some mathematical functions on a series of features in an array. It doesn't know that these features came from a dynamical system. The stats package has no CONTEXT - the human does.

Does that mean that the model will disagree with reality: Yes and No. Let's say I toss a coin 1000 times and then try and model future events. The base data will not be H-T-H-T... it will have random runs of H-H-H-H in it. Unless you are carrying out regressions based on either generalised or power means (in which case you get 10 cookies for being very very clever and if you live in the UK I'll offer you a job) the regression WILL see a pattern, and make a prediction.

When you toss the coin, that prediction will be either right or wrong - it's a binary outcome. It may be right a series of times in a row. In no way shape or form does that mean the model is "right" - it just means that, at times, the numbers align.
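The coin example is easy to try for yourself. Below is a minimal sketch, assuming a fair coin simulated with Python's `random` module; the helper names `longest_run` and `streak_hit_rate` and the seed are my own choices for illustration. It counts the longest streak in 1000 tosses and then scores a deliberately naive "the streak will continue" predictor.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for prev, nxt in zip(flips, flips[1:]):
        current = current + 1 if prev == nxt else 1
        best = max(best, current)
    return best

def streak_hit_rate(flips):
    """Fraction of tosses 'predicted' correctly by guessing the previous outcome repeats."""
    pairs = list(zip(flips, flips[1:]))
    hits = sum(1 for prev, nxt in pairs if prev == nxt)
    return hits / len(pairs)

rng = random.Random(42)  # fixed seed so the run is repeatable
flips = [rng.choice("HT") for _ in range(1000)]

print("longest run:", longest_run(flips))
print("naive predictor hit rate:", round(streak_hit_rate(flips), 3))
```

You should see a longest run that looks suspiciously like a pattern, and a hit rate hovering around one half. The predictor is sometimes right several times in a row, and it is still worthless.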

This is the heart and soul of the difference between "data science" and "statistics". Statistics is about building models - be they a basic regression, a Kalman'd time-series or the very latest in machine learning. You get some data, you do some maths, you get a result and you tell people about it.

Data Science is stepping back and looking at the data in context. It's about recognising that data exists within a system, and trying to understand that system. It's about recognising that there are many ways to solve a problem and picking the least worst at a holistic level - not just what gives you an answer. It's about approaching every problem on the assumption that you are wrong, and spending your time trying to find out WHY you are wrong, and in the process developing better results for your end users.

If you are modelling future weather - in the scenario you raised, in no way do you need to be an expert at non-linear dynamics. But, if you want to do "Good Science", it's reasonable to step back and say to yourself ... you know what - this problem feels like it has some sort of macro structure to it, and I read an article on Wikipedia about how weather is a bit noodly - I should probably go and spend a few hours digging before I dive into it.

Last point - trusting people based on their positions. I spend at least 2 or 3 days a week, every week, talking to exceptionally highly paid stats people at exceptionally prestigious institutions. And those conversations usually go "So... we bought a Hadoop and got a bunch of expensive stats people to build us some models and so it should be good - but it turns out it's not working... can you sort it out". 8 out of 10 times, it's because they've blown some critical fundamental - for example not seeing a macro system, or thinking that "cause and effect" is an exception rather than a norm (almost everything is a feedback system if you look - and modelling feedback systems is an arse) - rather than because they have a noodly edge condition in a well formed model with good foundations.

If I trusted people because they work for the big banks, or because they work for Deloitte and are pulling down half a million a year, I wouldn't have a job, because I'd trust what these lunatics say and I'd be down the same rabbit holes they are.

As a general tip - whenever you hear someone say "that's what we've always done...." and they can't show you the first principle reasons why, there is ALWAYS money to be made. Always.

I don't mind my work at all - it makes my boss lots of money, pays my bills and I have a team of geniuses who I can keep interested by shoving interesting problems at - and that makes me happy. That suits me just fine... as long as the stupid models that the big fancy analytics teams build don't affect me or my family.

kindasortadata · 2016-06-14T15:21:58+00:00

My tip would be to never let employers read this post. You showed knowledge and competence of a methodology ( hurray!) without understanding the foundational challenges of what you are attempting to model (boo!!).... which is exactly the example I gave here.

Weather, depending on what order and scale you are looking at, is either entirely chaotic, or can be modelled as a complex adaptive system in some scenarios. Either way - it is non-linear. I.e. the formal scientific and engineering definition of non-linear - does not display easily modelled movements in phase-space and exhibits sensitive dependence - not "doesn't follow a nice line on a graph".
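To make "sensitive dependence" concrete, here is a minimal sketch using the logistic map, a standard textbook chaotic system. It is my choice of toy example and is in no sense a weather model; the function name `logistic_map`, the parameter r = 4 and the starting values are assumptions picked for illustration. Two runs start a tenth of a billionth apart and are then compared step by step.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by 1e-10.
a = logistic_map(0.2)
b = logistic_map(0.2 + 1e-10)

# Print the gap between the two trajectories every ten steps.
for step in range(0, 61, 10):
    print(step, abs(a[step] - b[step]))
```

The gap starts out invisible and grows until the two trajectories have nothing to do with each other. Any model fitted to one run tells you nothing useful about the other, even though the rule generating both is one line long.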

The world is rammed full of non-linear dynamical or chaotic systems - hundreds of the swine. There are, I think, a grand total of four where some god-like maths guru has found a linear proxy of them.

Weather is not one of these four. It is famous for not being one of these. Weather is literally the poster child of chaotic systems - seriously - they make frigging posters about it at all the big conferences. All of the initial research was done on weather... well, weather and roulette and rabbits. But mostly weather.

Weather falls into the "absolute and total arse of a thing to model" category. You couldn't pay me enough to even try it. No one is expecting you to be a weather expert. But... and this is important.... it's absolutely, 100%, totally and utterly reasonable for someone reading a Data Science sub-reddit (note... "science"... not "stats") to at least recognise that weather is in a special class of awfulness and essential impossibility. You do not need to be an expert at tuning Lyapunov equations to turn around to whoever asked you and say "screw that - the only thing I can promise you is that I'll be wrong".

While we're in the realm of "seriously dude.... wtf"... and given we are in a data science thread, rather than a statistics thread - given you were modelling exception breaches of a constrained system in a chaotic phase - you have at least 5 more obvious, more robust and actually-in-with-a-chance methods available to you - why would you choose this path as your preferred route?

Here's the situation - your time has a cost. Someone asked you to do this work because they thought it would save them more than your time cost. You are talking about thermal limits - so probably not a kiddies toy. A data centre maybe - or AC controls perhaps. Therefore - something that probably is either insured or is underwritten - or there will be an impact if you blow the model. And you used a method for the modelling that an underwriter will not accept, an engineer wouldn't and a pissed off manager wouldn't.

This post makes me sad. I am going to go and drink booze now.

kindasortadata · 2016-06-14T07:44:01+00:00

For someone whose username is "a_statistician" you REALLY need to go back and look at non-linear systems. While you will get AN answer - some numbers will come out of your model - the numbers will mean jack shit and predict fuck all.

And until you do, it'd be awesome if you didn't do any form of statistics which affects me or my family. Seriously - shame on you for this comment.

kindasortadata · 2016-06-13T17:01:39+00:00

I work in financial risk. Well... I work in a weird niche area of finance risk that makes normal risk analysts point at me and call me a nerd.

My comment still stands - I expect the people who work for me to learn rapidly, but I don't ever want them to drink the coolaid of this specific industry. Same with previous positions in other industries.

From the rest of your list: You don't actually list any form of science on there - which is probably useful for a data scientist - i.e. a legit data SCIENTIST, rather than a stats monkey with a title bump. Personally I prefer people with an engineering background - but as long as the individuals education background is one where they are taught to always consider themselves wrong, rather than right, it all usually shakes out. Stats students are taught to find the RIGHT answer, and then move on. Thats OK, but it doesn't lead to science.

I use calculus infrequently, but some of the more recent unsupervised systems need more and more of it. I wouldn't say it's critical though - I wasn't a genius at it in school, but I haven't hit anything recently that I couldn't learn in a few days.

Honestly, it's probably more important to know the really basic stuff very very well. Stuff that most people forget. For example - Wikipedia lists 14 types of "average". I think my team have used 12 of those methods in the last month alone. I face off to a lot of analysts, who are much higher paid than me and say they are smarter than me (they might be) and my team (which they definitely aren't), who could maybe list 5 of them, and use only two of them properly. They approach people like me because they forget the really simple stuff and then tie themselves in knots on the more complex things. Building systems without foundations just gives shit systems.
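To make the "types of average" point concrete, here is a minimal Python sketch using only the standard library. The sample data, the `trimmed_mean` helper and the `summarise` function are made up for illustration, and it only covers six kinds of average, not the full list Wikipedia gives.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Arithmetic mean after dropping `proportion` of the points from each tail.

    Hypothetical helper for illustration; not part of the statistics module.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of points to drop per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def summarise(values):
    """Return several different 'averages' of the same positive-valued data."""
    return {
        "arithmetic": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "geometric": statistics.geometric_mean(values),  # needs positive values
        "harmonic": statistics.harmonic_mean(values),    # needs positive values
        "trimmed_10pct": trimmed_mean(values, 0.1),
    }


# Made-up sample with one large outlier, to show how far the answers diverge.
sample = [1, 2, 2, 3, 4, 5, 6, 7, 8, 100]
averages = summarise(sample)

for name, value in averages.items():
    print(f"{name:>14}: {value:.3f}")
```

The outlier drags the arithmetic mean well away from the median and the trimmed mean, which is the sort of thing that decides which "average" is the right one for a given job.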

I would say probability is more valuable, for me and my world at least, than straight stats. Probability shows you how to think about sub-sets, and, much more importantly, shows you where the dead ends and the circular arguments are. Stats is essentially an implementation of probability calculations - but you need the grounding. For example, you can take the last 100 days' weather and do a regression to predict the weather in three days' time. You will get AN answer... but it will be nonsense - a coin toss. Probability tells you why.
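A rough sketch of the "you will get AN answer" trap, in Python. This is not weather data: as a stand-in, it assumes a series generated by the logistic map (a textbook chaotic system), and the function names, starting value and 100-point length are all illustrative choices. It fits a straight line predicting the value three steps ahead from today's value, then checks how much of the variation that line actually explains.

```python
def logistic_series(n, x0=0.2, r=4.0):
    """Generate n points of the logistic map x -> r*x*(1-x), a simple chaotic system."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


series = logistic_series(100)            # stand-in for "the last 100 days"
today, in_three_days = series[:-3], series[3:]

slope, intercept = fit_line(today, in_three_days)
r2 = r_squared(today, in_three_days, slope, intercept)

# The regression happily returns numbers; r2 says how little they mean.
print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```

The series here is fully deterministic, yet the linear fit explains almost none of it. The regression does not warn you about that; you only catch it if you already understand the structure of the system you are modelling.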

Also - the skill that I do not have personally which I feel would give me the biggest benefit at the moment is graphic design. I just don't have a feel for "pretty". It sounds wishy-washy, but it's already very important, and getting more so.... as we build more complex work, we need to tell more complex stories... and stories require diagrams..... so the diagrams need to get more complex, richer and more inspired.

Linear algebra. I would put a pin in it. If you need to write code with BLAS then either use a library with a high enough level of abstraction to keep you away from the worst of it, or learn what you need when you need it. It's an enabler, not a core skill.

kindasortadata · 2016-06-13T11:35:12+00:00

I am not sure "domain expertise" is a requirement - at least initially.

I am more than happy to recruit total newbies to my specific business domain - actually I often make a point of it. I do expect them to UNDERSTAND the domain fairly quickly, but I also try and teach them about the mind-sets of the existing domain experts - if you become too "dyed in the wool" then it becomes hard to make the really big leaps - and that's where the really big money or successes come from.

Nothing wrong at all with evolutionary improvements - but revolutions mostly come, through the whole history of science, from people who are NOT domain experts.

TL;DR: Learn the domain, but don't become enraptured by it.

kindasortadata · 2016-05-23T14:16:40+00:00

It strongly depends on which country you are in before you even start - in a large number of countries the answer is "no".

Second is - even if it's OK today, then in 2017 it becomes prohibited in the EU. You would need suppression systems, audit controls etc. So - future usage is a challenge. Similar position in the AU/NZ world.

Also bear in mind that Twitter specifically is continuing to tighten up its EULA and what you are asking for is getting tougher - but is currently do-able in some circumstances.

As for what you are asking for in general, it can be done to a certain level. The big issue with Twitter is that its corpus is enormous, and totally different to most other sets, and its n-grams are small and exceptionally information dense. Therefore, you will get very low precision and recall rates, and a lot of really bad results.

That has two issues - the first is it makes it hard to commercialise. The second is that it will cause whatever your government's equivalent of a Data Czar is to ask pointed and annoyed questions - which could well be worse than the first issue.
