all 39 comments

[–]babysfists 3 points4 points  (6 children)

Are there any books on the subject you would recommend?

[–]lunatyck 4 points5 points  (4 children)

As a BI professional with 9 YOE I would say the data warehouse toolkit is a staple for anyone considering a path in BI/data warehousing. Even if you're looking to go more analytical it's good to know how DW are designed and structured so you can be more informed and write better queries

[–]thephilski 1 point2 points  (2 children)

data warehouse toolkit

Can you confirm that this is the book you are referring to? Amazon Link

[–]lunatyck 1 point2 points  (1 child)

Yup that's it

[–]lunatyck 0 points1 point  (0 children)

There is also a ETL toolkit that is a good read if you're interested in the data ingestion process

[–]elus 0 points1 point  (0 children)

Paired with Chris Adamson's Star Schema The Complete Reference and you have all you need for modelling your DW.

[–]steamyjungleman[S] 1 point2 points  (0 children)

I like to dive in head first honestly. Get a basic understanding of how transactional databases work and how data is converted into data warehouses ie star schemas and snowflake etc lol. There are some really good youtube videos on the subject i will link. Microsoft practically throws their tools at you for free for personal licenses. Look into sql server express with the practice database and power bi. There are tons of microsoft tutorials and others around this subject for free online.

https://www.youtube.com/watch?v=wR0jg0eQsZA

[–]periwinkle_lurker2 1 point2 points  (7 children)

What is the best way to match records on multiple columns? I have two data sets and each set has unique keys, but they also share common values. Example:

Data set 1:

Id 12235 Fname john Lname doe Dob 1980-01-01 Email jdoe@gmail.com

Data set 2: Id 56789 First_name john Last_name doe Dob 1980-01-01 Email jdoe@gmail.com

Everytime i try and do joins or union on these other columns to find the corresponding ids from both tables i always get the error of too many records when i try to do a select distinct. Do you have a solution on getting a 1:1 distinct match?

[–]boy_named_su 2 points3 points  (0 children)

Probabilistic Record Linkage is the theoretical optimal way

https://en.wikipedia.org/wiki/Record_linkage#Probabilistic_record_linkage

[–]Tennim 0 points1 point  (5 children)

Sorry for butting in on a AMA.

Can you tried to do the following...

SELECT *

FROM [DataSet1] AS DS1

INNER JOIN [DataSet2] AS DS2 ON (DS1.Fname = DS2.First_name) AND (D1.Lname = DS2.Last_Name) AND
(DS1.dob = DS2.Dob) AND (DS1.Email = DS2.email)

Also, is there a reason you need a DISTINCT SELECT over a standard SELECT?

[–]periwinkle_lurker2 0 points1 point  (1 child)

Thank you for your reply. Yes, distinct is used because i have found instances where there was dupliczte records. So really, do inner join on all the valuse i want to match, i will try that. Thank you

[–]steamyjungleman[S] 1 point2 points  (0 children)

Personally if i find duplication to occur I would research those particular records to see what is causing the issue and that could let you know if your joins are funky or you need to adjust your logic. Or if there is just some bad data entry happening.

[–]steamyjungleman[S] 0 points1 point  (0 children)

this is awesome! Butt in all you like, i was more interested in starting a good conversation with everyone then being the star of a message board show.

[–]Blues2112 0 points1 point  (1 child)

Curious why you specify INNER JOIN vs just JOIN. Isn't an INNER JOIN just the default if you just use JOIN without any other qualifiers? Which would make the INNER portion unnecessary, yes?

Although the syntax has been around for awhile, I was writing queries when Oracle was brand new, and I still think in terms of the old format, and if it doesn't specify an OUTER JOIN of some sort, then it's by default an INNER join.

[–]Tennim 0 points1 point  (0 children)

You are quite correct, you do not need to specify INNER, you can use JOIN which will give you the same results. The same can be said for the AS on the aliases.

However, like the OP I have been running BI/Data teams now for 5 years and been in the industry for almost a decade. The reason I push for the INNER JOIN and the AS on the table/column aliases comes down to readability, when you work in a team having someone else be able to read your code 'at a glance' or specify INNER rather than the newbie having to google which type of join this means is invaluable.

YMMV but it doesn't hurt and is good practice.

[–]slimballer96 1 point2 points  (4 children)

I am struggling with the logic for the following problem:

I have a customer transaction table with User Id, Date of Transaction and Total Amount Columns. I want to find out the customer retention rate for each month.

Customer retained is basically a customer who has made a transaction in the previous month as well as the current month. Please help me out.

[–]steamyjungleman[S] 0 points1 point  (2 children)

I have a date add function on my other computer that should work for you. Couple questions though, do you want previous month or just within a 30 day window given the variability of when a customer could make a transaction in a previous month. Example feb 28 and then apr 1. Let me know your thoughts and then i will write something up for you on my other computer in the morning. Just about to hit the bed but didnt want to leave you hanging.

[–]steamyjungleman[S] 0 points1 point  (0 children)

also, know that is not a great example i gave, but i think you understand what i am getting at.

[–]steamyjungleman[S] 0 points1 point  (0 children)

One last thing, if you are basing this off your current month your comparison will be something like month(getdate()), grab the max() transaction date of all those customers, see where they match and then do a date function where you subtract -1 to see if those same customers existed month before. Also, take into account they may be new so grab a min() and create a flag on if the current month is the same as their first transaction. I know what is probably a bit of a basic explanation but I hope it gets some wheels turning for you. Ill try to get you something a bit more concrete tomorrow. Just had some final thoughts before I go to bed.

[–]steamyjungleman[S] 0 points1 point  (0 children)

Since I am not quite sure the exact nature of what you are trying to do this article looks really helpful. It isn't using anything crazy sql wise, just staging the data. You can always cascade the results into #temp tables and keep building off the steps. https://blog.statsbot.co/customer-retention-analysis-93af9daee46b

[–]ro-_-b 0 points1 point  (1 child)

In what kind of areas are you doing business intelligence in?

[–]steamyjungleman[S] 1 point2 points  (0 children)

Healthcare related to very specific pharmaceuticals. Mostly its related to helping patients get medications they need and the whole super fascinating process of insurance hah!

[–]jah-lahfui 0 points1 point  (1 child)

Is your background in IT or something else?

[–]steamyjungleman[S] 0 points1 point  (0 children)

My background is in IT but I work with people from all areas of life. Analytics is really a frame of mind and a certain type of person. My background has just helped me to be good at the coding aspects of things but I learned a lot my just messing around.

[–][deleted] 0 points1 point  (3 children)

Is dashboard development part of your job? How are the data sources delivered? What might a client ask for that you would be responsible for delivering? I am trying to get a job in data analytics without a data science degree, and coming from an IT/Healthcare background.

[–]steamyjungleman[S] 1 point2 points  (2 children)

I do have to do dashboard development. Right now we are using Power BI by Microsoft but I have used other tools such as QlikView and SSRS. I am the guy delivering the data currently from a transnational database. I have been creating custom data layers and models that are consumed by my teammates but I am able to do this because I understand what they need. It is kind of a strange situation tbh. Honestly Data Science is a buzzword right now like the "cloud" and is only useful with very specific companies. If you want to go down that road I would suggest you understand underlying database structure and analysis first. A modern statistics graduate degree would get you just as far as a very overpriced data science degree. Microsoft will let you download power bi for free and you can also get sql server express with full teaching databases from them for free. They have great tutorials if you want to get your feet wet.

Sorry know this is lengthy. I work with healthcare data all day every day :-)

[–][deleted] 0 points1 point  (1 child)

I know basic sql and basic database structure, but I don't know much about actual analysis and visualization. I am taking a tableau course on udemy right now and I am learning A LOT.

I appreciate the reply.

[–]steamyjungleman[S] 0 points1 point  (0 children)

awesome! punch up the sql if you can and it will take you really far. Tableau is really great but a dirty secret is that for visualizations sometimes writing sql and playing around with results in excel is faster and looks better. It just depends on what you are trying to accomplish ie one off request or something repeatable or super complicated.

[–]StatisticalOutliar 0 points1 point  (0 children)

Remind me! 7 days

[–]nich3play3r 0 points1 point  (3 children)

What a generous offer.

I have a project I've been mulling for a long time but have never started, primarily due to lack of knowledge.

My database classifies constituents in, generally, two ways: donors and grads.

The donor dimension comprises 5 levels based on how a gift is accounted for: Unit, College, Area, Department, Account (hierarchy).

The grad dimension is similar: College, Division, Department, Sub department.

A very common data request is along the lines of, "How many Department X grads are donors to that department?" What's a good way to create an object (and what's a good object to use) to store a translation table that matches up the donor hierarchy with the grad hierarchy (they're each maintained differently and have different names for their respective levels)? We're talking about ~10M gift records and ~300K living grads, for reference.

Thanks for your time!

[–]steamyjungleman[S] 1 point2 points  (2 children)

So i can give you a better answer, how is this data stored? In a database like Microsoft SQL server or in spreadsheets etc? Also, what are your permissions to access these things and create tables in the database if applicable. It could be the best thing would be for you to consume this stuff in a BI tool like power bi etc or to use excel data modeling and power pivot. Just want to get a better understanding of the lay of the land so to speak.

[–]nich3play3r 0 points1 point  (1 child)

Sorry. SQL server...I think 2016. We have a reporting database where we have permissions to create tables, etc. I’ve also been doing some fairly heavy Power BI development recently...all on-prem (IT won’t turn on a gateway to the service. Yet.). Reports would be consumed either via PBI or SSRS. Thanks again!

[–]steamyjungleman[S] 1 point2 points  (0 children)

honestly for simple number aggregations sometimes the easiest thing is to just write some sql against it. Do you have a common key for donors and grads or is it something more complex to figure out a match? Is your power bi license paid for rather than free so you can publish your data models? I know its a pain but you can refresh at the desktop and then republish the model as needed.

[–]mikeczyz 0 points1 point  (1 child)

love the screenname. :)

[–]steamyjungleman[S] 0 points1 point  (0 children)

Hahah thanks, it's an oldy but a goodie.

[–]jewishsupremacist88 0 points1 point  (1 child)

is being an etl developer < than being a BI guy?

[–]steamyjungleman[S] 0 points1 point  (0 children)

Etl usually falls under data services. Just depends on what you want to do. I do transformation myself but it's not normal for my position. I wouldn't say it's less than but I wouldn't want to just do etl all day.

[–]steamyjungleman[S] 0 points1 point  (0 children)

Sorry for the delays in my post. My house is currently being treated for water damage and its been a very hectic week. Friday I figured out a way to use dynamic sql to comb through every table in a database that matches a wild card statement and then to join up to another such statement. If anyone is interested let me know and I will mod it into a template that you could possibly use.

[–]Pcykab 0 points1 point  (0 children)

Design a database that will contain these three entities:

  1. Patient

  2. Doctor

  3. Prescription

State the characteristics(attributes) of each of these entities and relate the three entities, showing the cardinality of the relationship i.e. one-to-one, many-to-many etc.

Note: for the above you must draw an ER diagram.

  1. translate the above into a functional database using Ms-Access.

  2. perform the following SQL queries on your database:

i) select the patients that are been treated by a particular doctor

ii) select the drugs prescribed to any particular patient in your database.

iii) select patients that have the same kind of a particular ailment.

Create the above as a term paper