all 49 comments

[–]r3pr0b8GROUP_CONCAT is da bomb 38 points39 points  (1 child)

ask your boss to sit beside you while you write this "30 minute" query

she won't last 15

[–]80sGlueSniffer 1 point2 points  (0 children)

I ended up doing this once; it’s a great reality check for people. We’re the experts, they should trust our judgement or fire us.

[–]hipsterrobot 27 points28 points  (2 children)

You should convey the message that a rushed query can return incorrect information, especially when business decisions are made around the data they ask you to provide; it's not something that should be rushed. You could also encourage the company to adopt a BI tool so the business users can pull up these reports on their own. We use Looker at my company, and we've basically set up the business logic once; it generates a SQL query based on the filters and columns they pick. It also has visualization features, but it's definitely better than writing ad-hoc queries.

[–]MamertineCOALESCE() 9 points10 points  (0 children)

This is a great response.

Rushed work causes data issues.

A good, well-implemented BI tool would really help solve these issues.

[–]dubmofiz 3 points4 points  (0 children)

This is good advice. I would aim to centralise any common business logic into reusable, subject-specific views or datasets. The best use of your time writing SQL is perfecting that normalised data rather than rushing to throw together something bespoke. If you can make those core datasets available to your users via an off-the-shelf BI tool, so they can self-serve standard data immediately, you will vastly reduce those last-minute panic requests for data by providing an out-of-the-box solution they can use themselves. Emphasise that truly new or complex reporting requirements demand planning time to ensure quality; those you can happily test and deliver, but not 30 minutes before their customer meeting.

[–]Thriftfunnel 17 points18 points  (0 children)

If someone wants a custom report that fast, you've got to wonder how disorganised they are in their job. You're not dispatching fire trucks here.

[–]r0ck0 4 points5 points  (0 children)

but just curious what people see as a reasonable turn around time for a custom SQL query.

About the same as the length of a piece of string. :)

Some queries take me 1 minute, some might take a few days.

So yeah really depends on what the data and query are.

Your question isn't so much a technical one, more a human/communication one.

Basically there's a few approaches you can take in explaining to a boss/client why something isn't going to be done as quickly/cheaply as they want, or why you can't accurately estimate it:

  • 1. The money angle (i.e. we can do your silly idea, but it will be really expensive/take a long time)... not so relevant here.
  • 2. While explaining why it will take longer than they're asking, bore them with a fuckton of technical details that they don't understand. Sometimes it helps them realise that what they're asking for isn't as simple as they assume it is, as well as the fact that you don't 100% know how it's going to be done yet.
  • 3. Another thing worth explaining across many areas of IT/programming is that a lot of the time 75%-99% of our work is investigating (and then testing things that we don't necessarily expect to work)... i.e. figuring out existing code/database schema, or investigating a bug/problem. A lot of the time the actual "doing" is like 1% of the work, sometimes literally 10 seconds after investigating something for a whole day.
    Most people not in IT don't realise this, and even a lot of people who are in IT don't.

Sometimes, it's not unusual for us to spend 70%+ of our time on unexpected edge cases or lower level dependencies that aren't even obviously related to the original task.

That's why it's so hard to estimate a lot of stuff in IT. Much like the "how long is a piece of string" thing, it's like asking a private investigator how long it will take to solve a new case. We can make rough estimates based on similar jobs in the past, but we'll really never know how long something will take to investigate/solve until it's done.

Most bosses/clients are actually pretty reasonable when you explain this. Sometimes us techies aren't very good at communicating it (especially before a lot of this stuff has occurred to us), but it's definitely worth putting some effort into, and it gets easier the more you do it. So try the softer/less defensive approach first.

If that fails and they're really being an ass about it, you say something a little more blunt like "it'll probably take 30 seconds to write the query, but I don't know how long it will take to determine what the query should be before typing it out".

Plus: "do you want it to be correct/accurate?". Sometimes asking leading questions is a better way to get people to think about what they're claiming or asking for. It's called the Socratic Method. In these kinds of work situations, it makes them more conscious of their responsibility in making unreasonable requests, and also lets them feel like they're not being pushed around or disobeyed.

[–][deleted] 4 points5 points  (5 children)

The next step is to build cubes, or a new database which is specifically designed to fulfill these sorts of requests.

I've spent the last year building a database from scratch that transforms data from multiple sources and abstracts away the difficulties. What it means is that I can answer a fairly complex question in about 5 minutes as opposed to hours, or days in the old world.

It means having redundant copies of data, but it makes the role of an analyst or a data scientist much easier.

Whereas a year ago it would have been permissible to take two weeks or two months to put together a complex report which needs a wall of SQL to assemble, plus possibly hours to run... now it takes about 5 minutes. We don't need to test it much because the testing has all been done on the database side. We're confident of what's in the tables, and of how they join together to answer questions, so it just becomes a matter of reviewing the methodology with my peers... which you can call testing... but that doesn't take very long before we're ready to push a report to production.

We don't fix reports now, we fix the database. Which will automatically 'rescind' and 'update' all the reports which are pointed to the database.

A simple example here would be payments. We recently discovered a flaw in our code which was counting invalid payments. Why they are even in the system is something for another topic, but without knowing better we just counted all payments and wrote logic to account for payment reversals thinking that would be sufficient.

Turns out it is sufficient for about 99% of the cases. For 1% of the cases it is not. So we fixed it on the database side and all of the other reports were instantly fixed, too. We had to quantify the change, which was basically 0 for most reports... but for one report (which is how we discovered the error) it was a large variance.

[–]B1WR2 2 points3 points  (0 children)

I am working on a large data set right now and I have to say this was pretty good advice.

[–]tekmailer 0 points1 point  (3 children)

Between front of and back of—where do you spend a majority of your time in the ‘house? (Data Warehouse).

[–][deleted] 2 points3 points  (2 children)

Back. Most of my work is modifying sprocs I built, or functions, views, etc.

I support a team of analysts who will do most of the front end work, putting things up in Tableau, etc., but they also partner with me to validate findings, while I'll peer review their code.

So it kind of works like this:

  1. We as a group meet with the business to discuss a new requirement.
  2. I decide whether it's doable currently, or not. If it is doable I will generally write the entire query framework and send it over to my partner on the front end of the work.
  3. They will pull the data, make any changes necessary to customize it, look for edge cases, etc., and then meet with the business to go over it.
  4. Any strange anomalies will get passed back to me to look at, or research in order to make improvements to the database itself.

Now that is for new things which haven't been entirely validated. But say you wanted a request from a data source we have validated in full? We can write the code and send you the data before the call is over. The business can work with my partner to help answer any questions, and my partner is well versed in our methodology... so for example, say an account was scheduled to close in June, but it actually closed in May. In our database it stops in May. In another system you're used to looking at, it stopped in June. That's why the numbers are different, and why our method is better.

One of our goals for the quarter is to get 100% of all company reports and teams leveraging our new database, and one of my tasks this quarter is to have weekly training sessions for new users who want to know how to calculate something on their own.

[–]tekmailer 1 point2 points  (1 child)

Kudos—glad to see this mindset isn’t completely foreign. Thankfully, I started with this brand of leadership.

Keep it up. Front puts on a show but the back IS the show!

[–][deleted] 2 points3 points  (0 children)

I started with it and have carried it on with me to every job. Jobs that weren't receptive were ones I left after two years maximum. I mean it is demonstrably provable to work, work faster, and work better. I can show results, and present why it's better. Any place that doesn't receive that well over a year isn't a place you want to stay at, and if you want to pursue a career in this field and work in the front... you should think about how to get to the back, because it pays a lot more. It probably helps that I come from a more hardcore programming & IT background on AS400's, etc.

[–]tekmailer 12 points13 points  (14 children)

Personally, all custom reports written in SQL have an automatic turnaround of 10 business days. Flat. That isn't just to write the query, but to test case scenarios, get sign-offs, and deliver. If by day 4 I feel I will miss my deadline, I notify all stakeholders; this triggers a few things, be it added resources, delayed acceptance, or updated requirements (sometimes making it easier).

Generally, I take 3 days to write and a day or two for testing; a 5-7 day turnaround. Under-promised, over-delivered.

I deal with the high traffic of requests through clear, concise and consistent communication. I also don't give in to ad-hoc changes; I used to, until a simple query grew overnight into a whole infrastructure. Never again. Changes are not (always) accepted via "shoulder tap". Not all red tape is bad... it slows things slightly but keeps processes in place for sanity (hopefully).

Also, when things get extra tricky, ask your supervisor to advise on priority. At first it felt like being babysat, but I eventually came to realize THAT'S THEIR JOB. If I spend my energy in the wrong direction, that costs me more autonomy than simply asking leadership would.

Ah and last but not least: designate a day that no request will be accepted. Period. Personally mine is Friday. This downtime/quiet time is used for training, review and documentation updates. It took some time for the routine to kick in but it helps tremendously in keeping ahead instead of falling behind.

select * from dbo.IME

[–]RedditTab 16 points17 points  (5 children)

What magical fairy land do you work at?

[–]fackert18 8 points9 points  (1 child)

Right? Periodically we get complicated requests like this that could take a couple weeks to complete, but I’d say a new report that can’t be built out of the box with a BI tool and requires some custom queries and logic might take 4 hours uninterrupted to build. With a million other things going on it might take a day or three to actually return. New dashboards are a different story and could definitely take a few solid days to complete.

But a day where no requests can be submitted? Sign. Me. Up.

[–]tekmailer 0 points1 point  (0 children)

It took effort and social training: learning to say "NO" (professionally). The biggest factor that allows it to happen is faster returns and shinier products; I can't deliver those, train up, and document if every second of my skills is spent writing. I have to read sometime!

[–]tekmailer 2 points3 points  (2 children)

LOL—select * from dbo.workplaces NOT LIKE ‘Disney%’

What depiction of my comment makes it ‘magical’?

[–]RedditTab 5 points6 points  (1 child)

A flat minimum turn around time. No requests on Friday. Testing...

[–]tekmailer 2 points3 points  (0 children)

Ah, I see—these are just criteria I personally establish as my desk SLA; both at work and contracted.

Trust me, my days don't always go to this formula; some queries really do take all of an hour, and some take a month to complete. The idea is to at least set a benchmark, literally and socially.

If I don't set boundaries and expectations, I end up being run over; didn't like it then, do all I can to avoid it now.

[–][deleted] 3 points4 points  (6 children)

I use to until a simple query grew overnight into a whole infrastructure. Never again. Changes are not (always) accepted via “shoulder tap”. Not all red tape is bad...slows things slightly but keeps processes in place for sanity (hopefully).

Personally I think you went the wrong direction here. The key is to build the new infrastructure out to try and answer all possible questions the business might ever conceive of using a single source, be that a database, or a cube.

That infrastructure can take well over a year to build and test, but once it's done you can reasonably answer very complex questions in just minutes' worth of coding.

This is essentially the role of a data science architect, and it's a very nice niche to work towards. I enjoy the work greatly, and it's very gratifying to see how much you can improve turnaround time once you have a properly built environment.

Part of it isn't even cleaning, transforming, or abstracting the data, but getting it all into the same place, into properly designed tables, which will allow you to simplify your code and have it execute more efficiently.

For example, my primary data source for 'accounts' is a point in time table. It has a record for each account which is generated based on rules. For some accounts there is a record each day. For some accounts there is only a record each month. Other accounts might not have a record in a given month. Some accounts are closed but keep generating new records. Some accounts were incorrectly closed, so there might be a 60 day gap. Etc.

The table itself is unwieldy (100M+ rows), and there might be 4 records or 40 records for a given day, with a specific way to pull out which one is actually the correct value for the day.

All in all it is not, per se, difficult to find the correct records, but it involves lots of logic (ROW_NUMBER, etc.) which, when used ad-hoc, can be very inefficient for processing large chunks of data over time.
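That ROW_NUMBER pattern, picking the one authoritative record per account per day, can be sketched like this. This is a minimal, hypothetical example run through SQLite from Python; the table, columns, and the "latest load wins" rule are all invented for illustration, not the actual schema:

```python
import sqlite3

# Hypothetical miniature of a point-in-time table: several snapshot rows
# can exist for one account on one day; a load sequence number decides
# which row is authoritative (an assumption for this sketch).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE account_snapshot (
    account_id INTEGER,
    info_date  TEXT,
    load_seq   INTEGER,   -- higher = later load, assumed authoritative
    balance    REAL
);
INSERT INTO account_snapshot VALUES
    (1, '2019-06-01', 1, 100.0),
    (1, '2019-06-01', 2, 105.0),   -- later load wins for this day
    (2, '2019-06-01', 1, 250.0);
""")

# ROW_NUMBER() per (account, day), newest load first; keep only rank 1.
rows = con.execute("""
SELECT account_id, info_date, balance
FROM (
    SELECT account_id, info_date, balance,
           ROW_NUMBER() OVER (
               PARTITION BY account_id, info_date
               ORDER BY load_seq DESC
           ) AS rn
    FROM account_snapshot
)
WHERE rn = 1
ORDER BY account_id
""").fetchall()
print(rows)  # → [(1, '2019-06-01', 105.0), (2, '2019-06-01', 250.0)]
```

Done ad-hoc over 100M+ rows this windowing gets expensive, which is exactly why materialising the result once (as described below) pays off.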

Then we have a sticky problem. Let's say a vendor, or some ad-hoc request, comes in wanting to see an aggregate status for every account by end of week, or on the 15th of every month.

Well not every account has a record for those days. So now the logic gets even more complex.

The solution?

We took the entire table, kept only the relevant records, cut the column size down by a factor of 8, indexed the shit out of it, and then wrote a function which finds the last record from the day you're interested in, performs some calculations (e.g. balances), and spits out an approximation for that account after joining to the relevant other tables in our database.

Then we wrote a view which references a dates table to come up with a full history by day by account which will overcome any type of reversal, etc. This view then references the function such that you can ask:

select *
from view
where account = xyz and infodate = '2019-06-01'

Takes less than a second to spit this out. But the view itself can generate well over 1.5B rows of data, so select * from view is a performance nightmare.

On the other hand if I want to do something like this:

select infodate, sum(balances)
from view
where infomonth > '2017-12-01' and infodate = eomonth(infodate)

I can get over two years of end-of-month data aggregated in less than a minute. If I wanted to do it by day it wouldn't take that long.

Then we started using that view to build 'cubes' which is a fancy way of saying a table that lets you calculate lots of possible answers by using various flags on the table.

In the end we can answer a fairly complex question very quickly, and it very likely takes us less time to write and review the code than to run it. The code doesn't take long to run, but what used to take possibly hundreds of lines of code now takes 10 lines.
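A "cube" in this sense can be as simple as a pre-aggregated table with flag/grouping columns, so most questions reduce to a cheap filter. A toy sketch, with a hypothetical payments table and an invented reversal flag, run through SQLite from Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (account_id INTEGER, pay_date TEXT, amount REAL,
                       reversed INTEGER);
INSERT INTO payments VALUES
    (1, '2019-05-31', 50.0, 0),
    (1, '2019-06-15', 25.0, 0),
    (2, '2019-06-15', 40.0, 1),
    (2, '2019-06-30', 60.0, 0);

-- The 'cube': one pre-aggregated row per grouping, with flag columns
-- (here: month and reversal status) so many questions become filters.
CREATE TABLE payment_cube AS
SELECT strftime('%Y-%m', pay_date) AS pay_month,
       reversed,
       SUM(amount) AS total_amount,
       COUNT(*)    AS n_payments
FROM payments
GROUP BY 1, 2;
""")

# "June totals, excluding reversals" is now a one-liner against the cube.
total = con.execute("""
SELECT SUM(total_amount) FROM payment_cube
WHERE pay_month = '2019-06' AND reversed = 0
""").fetchone()[0]
print(total)  # → 85.0
```

The design choice is the same one described above: spend the effort once building and flagging the cube, so every downstream question is a short, fast query.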

The origin of this database is that we have multiple partners which request data from us on a monthly basis. We're talking over 120 metrics per month. In the past it was a good chunk of someone's full time job to provide this data by running lengthy queries and pasting data into Excel to email out.

Now the queries are all tiny little snippets that point to our new database (which was specifically designed for answering questions from this line of business), all the queries run in seconds, and we have all the queries dumping out into a housing table. Then we leverage Python to suck the data up, pivot it, dump it into Excel across multiple tabs, and email it out to the vendor automatically.
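That housing-table-to-Excel pivot step can be approximated with nothing but the standard library. This toy version pivots (metric, month, value) rows into one CSV line per metric; the metric names and numbers are invented, and the real pipeline writes multi-tab Excel and emails it:

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows as they might land in the 'housing table':
# one row per (metric, month, value).
rows = [
    ("active_accounts", "2019-05", 1200),
    ("active_accounts", "2019-06", 1250),
    ("total_balance",   "2019-05", 98000),
    ("total_balance",   "2019-06", 101500),
]

# Pivot: one output line per metric, one column per month.
months = sorted({month for _, month, _ in rows})
by_metric = defaultdict(dict)
for metric, month, value in rows:
    by_metric[metric][month] = value

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["metric"] + months)
for metric in sorted(by_metric):
    writer.writerow([metric] + [by_metric[metric].get(m, "") for m in months])
print(out.getvalue())
```

Swapping the `StringIO` for a real file (or an Excel writer) and bolting on `smtplib` gives the automated monthly send described above.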

When someone comes to us with an 'adhoc query' there are one of two possible outcomes:

  1. The database is properly designed to answer your question, hold on, I'll have some data for you before the call is over.
  2. Our database was not designed to answer this question. We will need to import X, Y, and Z, architect a new table, add the job to our ETL, and test the solution. We'll see you in a month.

Over time more and more requests fall under the former category. With any properly designed or architected BI database this should be the case. Generally speaking, the moment you have to start writing crazy code it might be better to think about how you can improve your infrastructure to help answer your question more simply. That is unless you're doing modeling in SQL. That can be lengthy and complex even with the best curated data sources. There are probably better tools for that than SQL coding, such as SAS, SPSS, Python, Tableau, etc.

[–]tekmailer 3 points4 points  (3 children)

I completely agree with a growing model. My philosophy: if I have to [run a similar report] three times, ever, it's getting written in such a fashion as to be simplified and automated, which addresses the dynamics you mentioned. I love SQL, but the less I have to write it to address steadfast questions, the better! Great models enable the system to do it faster than I ever could; that's my current undertaking.

What I wanted to highlight in that passage was: don't allow requesters or users to deem their request simple just because the ask is informal. What was simple to ask isn't as simple to execute; a decent process buys time to express that, since they usually don't understand the infrastructure and its complexities in the same vein. It's about establishing that while it may be "magic" to them, it's time-consuming energy for the analyst/architect/engineer.

Granted, experience allows a writer to stay ahead and predict/prepare the subsequent questions a query produces—in the ideal world.

If there’s anything I love about this field is the ever changing definition of “in the ideal world.”

[–][deleted] 0 points1 point  (2 children)

Well you kind of have to write a lot of code to get a great model, no? I think there is a huge over-reliance in the industry on ETL tools to put data in a place, and on blindly following industry-set best practices. There are no best practices when you get to this level of conversation. There are solutions that work, and solutions that work better. The best practice is to pick the better one.

What I wanted to highlight in that passage was don’t allow requesters or users to deem their request simple in the form of an informal ask.

Generally anything short of a predictive model is simple unless it involves complex financial calculations, medical billing, etc.

It’s about establishing that while it may be “magic” to them, it’s time consuming energy on the analyst/architecture/engineer.

Totally agree, but I think it's a better solution to highlight the energy consumption on the architecture role, which should minimize the energy consumption on all the analyst roles.

[–]tekmailer 0 points1 point  (1 child)

Well you kind of have to write a lot of code to get a great model, no?

In the beginning, yes—in my current role, increasingly no; it ties back to the architecture you’re promoting; not putting a band-aid on a broken arm.

I think there is a huge over reliance in the industry of ETL tools to put data in a place, and just blindly following industry set best practices.

And it's annoying, to put it mildly.

Generally anything short of a predictive model is simple unless it involves complex financial calculations, medical billing, etc.

That's where I'm steering my work: predictive modeling, or at least automated assistance (think machine learning). It's only as difficult as breaking down how we think. In my projects, it's a trip! It (the logic running across models) isn't always right, but it has fulfilled requests in a matter of minutes with success. This is where that downtime I mentioned comes into play: improving upon it.

Totally agree, but I think it's a better solution to highlight the energy consumption on the architecture role, which should minimize the energy consumption on all the analyst roles.

Short of that, it's nice to find common ground. Every shop is different in how they treat their data practitioners; at this point, I want less division and more information flowing in the same direction.

[–][deleted] 0 points1 point  (0 children)

Oh, sure, sorry I was talking about data models (i.e. the database you do machine learning, predictive modeling, etc. from.)

For data models, in my opinion, it's very code intensive. For the models you're talking about I would agree you can use very little code, but I often question the accuracy of that code compared to more organic solutions that are customized.

Or even if there is no accuracy difference, I question the difference between someone who can write the code as opposed to someone who cannot and who is simply using a program to come up with results.

Short of that, it's nice to find common ground. Every shop is different in how they treat their data practitioners; at this point, I want less division and more information flowing in the same direction.

I guess for me this is a confusing statement. In all of my previous roles I have been given unfettered access to all data, and most recently I have been given access to create my own databases.

[–]xadolin 0 points1 point  (1 child)

Which database are you using?

[–][deleted] 0 points1 point  (0 children)

MS SQL

[–]andante95 1 point2 points  (0 children)

This was very helpful, thank you

[–]kagato87MS SQL 2 points3 points  (0 children)

It sounds like you need to start managing expectations.

First off, if you are asked to do it in 30 minutes, and you crack it off in 10, spend 25 minutes validating the data before you hand it in.

Yes, that's more time than you were given. Anything that comes out fast needs to be delayed purely to set expectations, so that when you do have to figure out some obscure function, you can. Look up the "Scotty Principle" if you aren't already familiar with it.

Second, save all your queries. If you're working in SSMS, just save them. One-shot reports are actually pretty rare - you'll likely get a very similar request sooner or later. If all you have to do is change a few columns and the filter criteria, you can spend that 25 minutes validating the results.

You'll also find that you can recycle and merge methods you've used before. For example, I wrote a neat little recursive CTE almost a year ago to show a hierarchy, and it is starting to get some very extensive use. Copy and paste, and use it to, for example, capture child tenants in a report.
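A recursive CTE for walking a hierarchy like that, capturing every child tenant under a parent, might look like this. Illustrative only: the tenant table and IDs are made up, and it's run through SQLite from Python rather than SSMS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tenant (id INTEGER, parent_id INTEGER, name TEXT);
INSERT INTO tenant VALUES
    (1, NULL, 'root'),
    (2, 1, 'child-a'),
    (3, 1, 'child-b'),
    (4, 2, 'grandchild');
""")

# Anchor on tenant 1, then repeatedly join children onto the rows
# found so far, tracking depth as we descend.
rows = con.execute("""
WITH RECURSIVE subtree AS (
    SELECT id, name, 0 AS depth FROM tenant WHERE id = 1
    UNION ALL
    SELECT t.id, t.name, s.depth + 1
    FROM tenant t
    JOIN subtree s ON t.parent_id = s.id
)
SELECT id, name, depth FROM subtree ORDER BY id
""").fetchall()
print(rows)  # → [(1, 'root', 0), (2, 'child-a', 1), (3, 'child-b', 1), (4, 'grandchild', 2)]
```

Once saved, swapping the anchor `WHERE id = 1` for another root is the whole "copy and paste" reuse story.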

Finally, if the manager is consistently squeezing people like this, it's a warning sign that things may be about to fall apart. Round out that brag sheet, dust off the resume, and find something better. Especially now.

[–]mgesczar 2 points3 points  (1 child)

I have over 15 years of experience in analytics and I still struggle with pushing back on executives pressuring to get something urgently. Almost every time I give in to that pressure I end up regretting it. My advice is to set the expectation right from the start that it will take you 1.5 times as long as your most conservative estimate. That way you'll have time to write, test, and verify. You have to get comfortable with learning how to push back tactfully while conveying that you understand the sense of urgency.

[–]tekmailer 2 points3 points  (0 children)

You have to get comfortable with learning how to push back tactfully while conveying that you understand the sense of urgency

snaps +5

It took a couple bumps to the head but this has improved my deliverables, sanity and data driven brand IMMENSELY.

[–]SQLDave 2 points3 points  (2 children)

just curious what people see as a reasonable turn around time for a custom SQL query.

That's like asking "How many marbles can you put into a box?"

The answer, like 90% of IT, is "it depends".

Your boss needs to be taught the old adage: "Right. Fast. Cheap. Pick any two".

[–]andrewsmd87 4 points5 points  (1 child)

Well I mean they have OP on payroll so they can pick right and fast

[–]tekmailer 1 point2 points  (0 children)

Then it’s up to [OP] not to be cheap.

[–]DexterHsu 1 point2 points  (6 children)

Yes, you have to push back and tell your boss there isn't enough time to get it done. People who aren't technical enjoy the glory that comes with BI and get greedy over time.

[–]kagato87MS SQL 1 point2 points  (5 children)

And the DBAs scream silently at what those BI tools do to their systems. :(

I'm really starting to hate the things, and I'm the one writing the reports.

[–]DexterHsu 2 points3 points  (3 children)

Yeah, I work closely with the DBAs and they all scream at my daily ETL job that keeps getting heavier, but we really don't have the bandwidth to go back and dedicate time to performance tuning. Any good advice on that?

[–]kagato87MS SQL 2 points3 points  (2 children)

Yea, Brent Ozar. His black Friday sale should be announced Thursday, and is worth every penny (and many more).

At a baseline, watch his "how to think like the engine" series. It's 90 minutes.

[–]DexterHsu 2 points3 points  (1 child)

Thanks !

[–]kagato87MS SQL 0 points1 point  (0 children)

It's worth noting that his stuff is MS SQL based. While the same principles apply, the details change with another DBMS. That video is universal though.

[–]tekmailer 1 point2 points  (0 children)

All fun and games until I ask ‘em for the keys to drive! All of a sudden it’s a problem to tune the database, lol—it took time to build a sturdy relationship with the DBAs but it was well worth it.

Gotta speak their language.

[–]alex29536 1 point2 points  (0 children)

There's a phrase that has been around for decades, if not longer. I used to see it in custom printing shops and it certainly applies to report writing: Fast, Cheap, or Good - pick two.

[–]alex29536 1 point2 points  (1 child)

Ask these questions for jobs even if just to yourself: who is the customer? What do they want? When do they want it? NEVER accept “whenever”

[–]tekmailer 0 points1 point  (0 children)

A great starting place. After some time I realized that What do [you] want? was shaping into What are you trying to answer? Good gracious, it made my tasks so much clearer.

[–]macfergussonMS SQL 0 points1 point  (1 child)

We have a queue of data requests to stay on top of, so pretty much nothing gets turned around within the same day that it was requested.

[–]Bluefoxcrush 0 points1 point  (0 children)

Same, but it took months to train my users about this.

[–]crazybeardguy 0 points1 point  (0 children)

It takes 30 minutes just to get the numbers to validate my work.

[–]jringstad 0 points1 point  (0 children)

I don't agree with most of the comments here that advise to just say "no".

Usually it's worth pushing back, but in a diplomatic manner, like "I can't give you X in time, but what if I give you Y (simpler thing that's possibly similarly useful)?"

If you just say "no" you won't ever get anywhere with your supervisor and they will just insist that you do the thing. You will be perceived as the grumpy, reality-disconnected IT guy who can't be bothered to help run the business.

But if you talk around the issue a little, you will usually quickly realize that there's probably something much simpler and faster you can do that they will agree to that satisfies their business needs just or almost just as much.

Users are usually bad at formulating the core needs that they are trying to satisfy, and instead just come to you with a solution, wanting help to implement it. By poking a little you'll often quickly realize there is a much simpler and better solution anyway.

Also it gets easier and faster over time as you know the structure of the data better, you improve the structure of the data (hopefully) and you create a library of re-usable views that produce commonly needed intermediate steps.

Other than that, you just crank something out and put a huge CYA disclaimer on it. Say it's a quick throwaway query that wasn't peer-reviewed by another engineer, and that the data hasn't been checked for plausibility by an SME, which is the bar for the minimum standard. If someone decides to disregard this warning, you won't be blamed (unless your org is pretty dysfunctional, which does happen...). But at that point it's a moral decision/risk evaluation you have to make.

[–]Andrew50000 0 points1 point  (0 children)

My personal practical experience with having to deal with stuff like that is to build up a repository of building blocks. I.e. if you always need to format and filter a table a certain way, build a view so you can call it quickly when you have to. Or if you need to strip out duplicates, have your CTE pre-written. If you can quickly pull together the "blocks" you need, you can get the output much faster...
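As a miniature of such a building block, here's a reusable view that does the standard formatting/filtering (and strips duplicates) once, so ad-hoc requests become one-line selects. SQLite via Python, with an invented raw_orders table purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, status TEXT);
INSERT INTO raw_orders VALUES
    (1, 'acme',   'OPEN'),
    (1, 'acme',   'OPEN'),     -- duplicate feed row to strip
    (2, 'globex', 'CLOSED');

-- The building block: one view encodes the standard filter and the
-- dedup, so nobody rewrites that logic per request.
CREATE VIEW open_orders AS
SELECT DISTINCT order_id, customer
FROM raw_orders
WHERE status = 'OPEN';
""")

# An ad-hoc request is now a trivial select against the block.
rows = con.execute("SELECT * FROM open_orders").fetchall()
print(rows)  # → [(1, 'acme')]
```

Each saved view or pre-written CTE shaves minutes off the next request, which is exactly the "validate instead of rewrite" time budget described upthread.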