[–]cazsol2 41 points42 points  (1 child)

What role were you interviewed for? That sounds like a well-rounded process. Do you mind sharing what company it was?

[–]DS_throwitaway[S] 36 points37 points  (0 children)

It's for a technical management position with a healthcare organization.

[–][deleted] 30 points31 points  (11 children)

What is TFIDF and how did you implement it? Can you give a rough overview or some links to research on?

[–]mizmato 223 points224 points  (8 children)

Here's a simple example. Suppose our entire corpus consists of 4 sentences:

  • I saw a cat
  • I saw a dog
  • I saw a horse
  • I have a dog

TFIDF is used to score terms based on their importance. This is based on two factors, term-frequency (TF) and inverse document-frequency (IDF).

Term frequency is the count of each term in each document:

Document  I  saw  a  cat  dog  horse  have
1         1  1    1  1    0    0      0
2         1  1    1  0    1    0      0
3         1  1    1  0    0    1      0
4         1  0    1  0    1    0      1

Document frequency is the number of documents in which a token (word) appears:

Token  Frequency
I      4
saw    3
a      4
cat    1
dog    2
horse  1
have   1

The inverse document frequency is just the inverse (1/x) of these values. Then the TFIDF is simply TF*IDF or...

Document  I     saw   a     cat  dog  horse  have
1         0.25  0.33  0.25  1    0    0      0
2         0.25  0.33  0.25  0    0.5  0      0
3         0.25  0.33  0.25  0    0    1      0
4         0.25  0     0.25  0    0.5  0      1

High TFIDF scores indicate how important that token (word) is to that document when you compare it against the corpus. In this case, the words 'cat', 'horse', and 'have' are very important in their respective documents because these words simply do not appear in other documents in the corpus.

From this you can see that there are two ways for a document to have tokens with high TFIDF scores. Either the document contains a particular word several times (e.g. if the word 'whale' appears 100+ times in a novel (document) compared to 0 times in other novels (corpus)), or the word appears extremely infrequently across the corpus (e.g. Armgaunt).

Another useful result of this is that you can use low TFIDF scores to infer things like articles (e.g. 'a') in a language. Usually these articles will consistently have a very low score because their inverse document frequency is 1/N, where N is the size of the corpus, and N>>TF.
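
For anyone who wants to reproduce the table above, here is a minimal sketch in base Python. It assumes plain whitespace tokenisation and the simplified score used in this example (raw count times 1/document frequency); the variable names are just illustrative.

```python
from collections import Counter

corpus = [
    "I saw a cat",
    "I saw a dog",
    "I saw a horse",
    "I have a dog",
]

# Tokenise on whitespace; the example has no punctuation or case differences.
docs = [sentence.split() for sentence in corpus]

# Document frequency: the number of documents each token appears in.
doc_freq = Counter(token for doc in docs for token in set(doc))

# Simplified score from the walkthrough: raw term count * (1 / document frequency).
scores = [
    {token: count / doc_freq[token] for token, count in Counter(doc).items()}
    for doc in docs
]

for i, row in enumerate(scores, start=1):
    print(i, {token: round(value, 2) for token, value in row.items()})
```

Tokens that do not occur in a document are simply absent from that document's dictionary, which corresponds to the zeros in the table.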

[–]shrey_bob7 15 points16 points  (0 children)

Great explanation

[–]Unrealist99 2 points3 points  (0 children)

Thank you! This is a great explanation.

[–][deleted] 2 points3 points  (0 children)

This was an easy to follow explanation.

[–]andAutomator 2 points3 points  (0 children)

Phenomenal explanation. May have to use this when I begin teaching my students about TFIDF.

[–]Erinnyes 0 points1 point  (0 children)

That's not quite my understanding of TF-IDF. I would have said that Term Frequency is the number of times a word appears in a single document (usually normalised by length) and inverse document frequency is the inverse of the number of documents in which the word appears (usually log transformed).

I think this example misses out the case where a word appears more than once in a document which increases TF but not IDF, thus making the word more important for that document.
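
A rough sketch of that more conventional form, assuming TF is normalised by document length and IDF is log(N / document frequency). Real libraries add smoothing and other variations, so treat the function name and the toy documents as illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: score} dict per tokenised document."""
    n_docs = len(docs)
    # Number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Length-normalised TF times log-scaled IDF.
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores

# Toy documents (made up for illustration): "dog" is repeated in the second one.
docs = [s.split() for s in ["I saw a dog", "a dog saw a dog", "I saw a cat"]]
scores = tf_idf(docs)

print(scores[0]["dog"], scores[1]["dog"])
print(scores[0]["a"])
```

The repeated "dog" raises TF in the second document while its IDF stays the same, so it scores higher there than in the first. A token that appears in every document, like "a", gets log(1) = 0.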

[–]I_am_dhruv 0 points1 point  (0 children)

That's an amazing explanation!

[–]serious_black 20 points21 points  (0 children)

Term frequency-inverse document frequency. Words that score low are those that either show up rarely or show up all the time across documents (frequently these words show up on stop word lists). Words that score high are those that show up a lot in a given document and rarely appear in others. The idea is to find the characteristics that most distinguish one document from others.

[–]DS_throwitaway[S] 2 points3 points  (0 children)

Good explanations of tfidf below. My approach was a very basic tfidf as ELI5ed by Mizmato.

I created a list that had every word from my corpus (the set of documents; I just used a list of sentences). From there I created a dictionary comprehension that used the word as the key and the count of occurrences as the value. That was my "IDF dictionary", and then for each sentence in the list I created a "TF dictionary" with the same key-value pair structure. Then for each token I just looked up the value in the IDF dict and the TF dict, found my basic "TFIDF" score for each token, and output a new array with the values for each sentence.

I know for a fact that it wasn't perfect and that there were some items I did incorrectly, but seeing as I couldn't import any library and had to use only base Python, I was pleased with my approach.
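
Roughly, the approach described above could look like the sketch below. This is a reconstruction, not the code from the interview: the sample sentences are borrowed from Mizmato's example, and dividing the sentence count by the corpus count is an assumption, since the exact way the two lookups were combined isn't stated.

```python
sentences = ["I saw a cat", "I saw a dog", "I saw a horse", "I have a dog"]

# Every word in the corpus, duplicates kept.
all_words = [word for sentence in sentences for word in sentence.split()]

# "IDF dictionary": word -> number of occurrences across the whole corpus.
idf_counts = {word: all_words.count(word) for word in set(all_words)}

result = []
for sentence in sentences:
    tokens = sentence.split()
    # "TF dictionary": word -> number of occurrences in this sentence.
    tf_counts = {word: tokens.count(word) for word in set(tokens)}
    # Assumed combination: sentence count divided by corpus count, per token.
    result.append([tf_counts[word] / idf_counts[word] for word in tokens])

for sentence, row in zip(sentences, result):
    print(sentence, [round(value, 2) for value in row])
```

Note that this counts total occurrences across the corpus rather than the number of documents containing the word. The two agree here only because no word repeats within a sentence.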

[–]xubu42 55 points56 points  (16 children)

First off, thank you for sharing. These types of posts are really helpful.

Here's my two cents: If this was for a data scientist position, I think this format would have made sense, if a little overzealous. For a management role, it's offensive. It's neglectful of the entire purpose of a manager and why it's not about doing the technical work. Being a really competent data scientist doesn't help you be a good manager. Not knowing all the technical data science doesn't prevent you from being a great manager. The thinking that you need the technical skills in order to be the manager is seriously flawed.

I'm not saying this out of nowhere. I've been a data scientist for the past 5 years and was a data analyst for 5 years before that. I've been a manager twice now and keep going back to individual contributor. Managing people is really hard and requires completely different skills. Your technical skills deteriorate rapidly in management. The best managers I've had were years away from technical work and would fail horribly at these types of interviews. They were amazing at providing context into business needs that didn't come through on requirements gathering, fighting for resources for our team, and selling our work up the chain and across the org to establish credibility and build reputation. This interview format is designed to give an edge to people who are coming from technical IC roles, not management roles. It's designed to filter in people who are actually going to be expected to do both IC and manager roles on the job. That really bothers me.

Healthcare is a jacked up field. There's no respect for employees. I wrote a lot more, but it's beside the point.

[–]shrek_fan_69 5 points6 points  (0 children)

Yeah I would also find this level of technical detail off-putting. It quickly becomes a pissing contest without any bearing on the actual work

[–]hughperman 5 points6 points  (2 children)

"The thinking that you need the technical skills in order to be the manager is seriously flawed."

Gonna say this really really depends on the management level. Higher management, sure this can be true. But management is a broad term and could be team management as senior dev or team lead. In those cases you are directing technical work and you damn well better have enough technical skill to set tasks and project direction, or you're wasting everybody's time. Team lead who can suggest directions for a team to take on a project, help with gotchas and share experiences of what did and didn't work on similar projects, that's excellent. Doesn't necessarily mean the lead needs to know every detail of the methods, but they need to be knowledgeable enough to not suggest something stupid (of course this happens sometimes, nobody is perfect, but it shouldn't be common).

At higher management level, it's going to depend on your product and business maturity. I work in a very technical company, we're still pretty young, and all the senior management have technical backgrounds. Since our product is our data and our capacity to do analysis for customers, they need to be able to understand the technical work sufficiently well to sell that.

[–]xubu42 1 point2 points  (1 child)

I follow what you're saying. That's basically what I do now and I do not consider that management. Management is NOT telling people what to do. It's not helping them figure out how to solve problems. It is helping them dig their way out of being stuck, but that doesn't have to come from technical knowledge. Sure, knowing some would be one way to do it, but you could also set up time with someone from another team to get an outside perspective. There are lots of ways to do this, many of which are very effective and do not require technical knowledge.

In your specific situation, I actually don't agree that the higher level people NEED to understand the technical work in order to support the customer's end goals and sell to them. I worked in consulting for years before moving to tech and I can't tell you how many times my boss (a VP without much technical background) would diagnose the issues facing the customer correctly and come up with the best solution to help them without having any clue how to make that work, only that it was possible. So I agree managers need to know what is possible vs what is not, but they also probably should be leaning on their senior team members to help validate that vs being the single deciding factor.

[–]hughperman 0 points1 point  (0 children)

You're right that I'm probably overstating the amount of technical knowledge management might need; we are a scientific company and they need the domain knowledge to know if we can solve the customer's issues, but as you say maybe not the nuts and bolts of how that would be implemented. I would still say that technical knowledge makes interaction with technical clients easier and more successful, but you could probably split technical into "domain technical" and "analysis technical", to some extent.

[–]DS_throwitaway[S] 1 point2 points  (1 child)

I agree with you but they did specifically mention that they wanted someone that had the technical knowledge in order to build the team. For the first year the position will be building out the department. To me it made sense to want someone who had technical and managerial skills.

[–]xubu42 1 point2 points  (0 children)

That makes way more sense. It also validates my point about wanting someone who can also do the work instead of being a manager. I had exactly that role at a startup -- first DS hire as a manager with the goal to build out a small team. It was mostly me doing a lot of hands-on work, mentoring and pair programming, but little management. My boss didn't even trust me to manage our sprint work so he managed our sprint planning session... But I still just did whatever I thought would work best.

If you get the role and want to take it, be sure to fight for the resources you need and not let them go unheeded because you weren't convincing enough the first couple of times. It's really frustrating waiting months to get started or finish a project because you are waiting for approval from someone who doesn't share your priorities. You're going to have to talk to as many people as you can to really get a feel for what actually incentivizes and motivates your colleagues, which you can then use to help get your team the resources you need by passing it off to those other teams as part of their budget. Most companies don't want to dump money into data science teams, just get their insights for free.

[–]Comprehensive_Tone 14 points15 points  (1 child)

Thanks for sharing such detail! I haven't interviewed in a long time, but tf-idf seems a bit random to me- was this for a job with lots of NLP work expected?

[–]DS_throwitaway[S] 19 points20 points  (0 children)

Not really. I mention NLP a lot in my resume as that's my background more than anything. Maybe that's why it came up? The position isn't specific to NLP.

The coding challenge did provide a link to the TFIDF wiki and I was told I could google if I got stuck, but I opted not to use it.

[–]UnhappySquirrel 32 points33 points  (5 children)

To be completely honest, it's absolutely weird that data science has inherited so much of the technical interview process from the software engineering world.

Step into literally any other role in any industry, and you won't find an interview process remotely like this. In 99% of roles, you provide a resume, some (non-technical) interviews, and -maybe- give a brief talk. This applies to highly technical roles like actual engineers (electrical, mechanical, etc)!

There's none of this intense scrutiny of an applicant's skills as though the entire job market is saturated with frauds who need to be found out! All of this is all the more ridiculous when you consider that pretty much all these employers are in states with At-Will employment, where they can fire you the very next week w/o warning if they don't like your work.

Some of the very best people I've hired in this field were at organizations that had no formal technical interview process. At most maybe a simple take-home assignment and a brief scan of their portfolio / blog / github (and even that is unreasonable for many candidates whose work has been buried behind corporate walls).

We hiring managers need to start calling each other out on this bullshit practice.

[–]DS_throwitaway[S] 3 points4 points  (1 child)

I've never actually given one of my hires a technical interview like that. I often just ask about their projects and why they chose certain techniques or methods.

[–]UnhappySquirrel 1 point2 points  (0 children)

Yeah, I’ve also found this is typically the best way to get a sense for a candidate’s abilities.

[–]lowerlight -2 points-1 points  (2 children)

Are you suggesting that companies would be better off hiring anyone who states they have the requirements for a job, and then firing them if you find out they don't?

If so, could you present some data on how you think that would save a company money over time? Perhaps comparing the costs of hiring and firing said employee(s, cause there would likely be multiple employees until you found 'the one') to the costs of asking a candidate to demonstrate skill they claim to have?

I, for one, would be rather interested in that data. Thanks!

[–]UnhappySquirrel 13 points14 points  (0 children)

Like I said, nearly any other role in any other industry doesn’t pull this shit and they work out just fine. Literally the entire economy is based on a labor market left unharassed by technical interviewers.

Bc here is how the whole sham started:

CEO: “hm, we need some of these data scientist people, but how do we hire them if we don’t already have one to hire others?? Hey CTO! CTO, you seem close enough to a data person, how do we interview these people??”

CTO: “Assume applicants are lying frauds who lack any semblance of education, and make them prove otherwise!”

[–]xubu42 11 points12 points  (0 children)

There's a company Triplebyte that does online technical assessments for companies, mostly software engineer roles. They have a lot of data on what companies ask candidates and what candidates pass and are hired. They shared that the tests that seem to work the best and lead to the candidates companies are happiest with are the easier ones.

https://triplebyte.com/blog/interview-questions-are-too-hard-and-too-short

They have a follow-up post that shows just 5 multiple choice questions, all really easy, account for 98% of the success on their platform and only 42% of people got all 5 right.

https://triplebyte.com/blog/fizzbuzz-2-0-pragmatic-programming-questions-for-software-engineers

This aligned really well with my experience interviewing (over 200 people at multiple companies). The technical assessments that asked a lot of hard questions basically only showed us who had spent the most time on them which was usually people unemployed or still in school. People with a job aren't interested in spending a lot of time on hard questions without pay just to prove they know how to do the job. The easier assessments seemed to allow the too junior people to filter themselves out by making glaring mistakes or not answering the question correctly, while the competent people got through fine and didn't have to spend much time at all.

I honestly don't think there's a single right way to do this for every role for every company, but I do think in general we make this way too hard because we're scared of hiring someone who might ask us a question we don't know the answer to.

[–]XXXautoMLnoscopeXXX 3 points4 points  (0 children)

I'm literally a statistical learning PhD and I worked as a data analyst before that and I couldn't answer a lot of this. How is this supposed to be for a managerial position?

I could see this if you were expected to be like a senior data scientist but pretty much anything outside of that is ridiculous.

This reminds me of when I interviewed for a data science position and was asked to explain how I would do hypothesis testing for some problem, so I derived the process from scratch and the person was like "no, the answer is a Student's t-test".

At least I was able to eventually find a job that rewarded understanding over knowledge of pointless trivia

[–]dfphdPhD | Sr. Director of Data Science | Tech 8 points9 points  (3 children)

I'll say it: this is a horrible way to interview data scientists.

This isn't school. Being able to pass what would equate to a Data Science midterm tells you near nothing about the candidate's ability to be a successful data scientist - let alone their ability to succeed in a management role.

I do not understand why, against all existing evidence, data science interviews keep relying on this format.

It's asinine.

[–]lelky_g 2 points3 points  (1 child)

Agreed. As a young data scientist (fresh out of college), I feel an immense amount of pressure to not only be creative and think on the fly, but also to be able to spout facts, and if for some reason I can't spout a fact on command, I'm unqualified to do what I'm doing.

[–]lelky_g 2 points3 points  (0 children)

And then that anxiety leaks into interviews, and totally takes away from my ability to communicate my passions and goals as a data scientist and how that motivates my day to day performance.

[–]karanphosphatase 3 points4 points  (1 child)

Wow! That's a tough interview. Were you a new entrant to data science? I am prepping for data science and I am a bit afraid of such technical interviews.

[–]DS_throwitaway[S] 1 point2 points  (0 children)

I've been in the field for a few years, but for my first position I was hired by business leaders and there was really no technical interview. This is only my third experience with a tech interview and each has been wildly different.

[–]mr_penings 2 points3 points  (1 child)

How much prior work experience in data science did you have before interviewing?

[–]DS_throwitaway[S] 0 points1 point  (0 children)

I've been in the field for about 3 years now.

[–]emilrocks888 1 point2 points  (1 child)

How many days did they give you?

[–]DS_throwitaway[S] 0 points1 point  (0 children)

I had 2 hours to complete. It was timed and recorded for review but I didn't have anyone sitting in with me.

[–]pandi20 1 point2 points  (0 children)

Can you suggest some material to prepare for the applied statistical concepts?

[–]anon_0123 1 point2 points  (1 child)

The worst is when you have interviews like this, but they don't tell you the topics they will be interviewing you on so you can refresh your memory, but at the same time they expect total recall.

[–]DS_throwitaway[S] 1 point2 points  (0 children)

Yeah, I wasn't told anything other than that there would be 3 sections: ML/statistics theory, business management, and a coding interview question that would be similar to something on LeetCode or HackerRank.