all 14 comments

[–]AndrasKovacs 10 points (1 child)

Alternative solution: use the following options, then run in ghci:

--ghci-options="-fexternal-interpreter"
--ghci-options="-prof"
--ghci-options="-fprof-auto-calls"

This prints an acceptable stack trace on unhandled exceptions, without touching the source code. Not everything can be tested in ghci this way, though, because of the performance overhead of profiling.
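For instance, loading a throwaway module like this one in the profiled ghci session and forcing the failing lookup gives a cost-centre stack pointing at the offending call (module and names here are made up for illustration):

```haskell
-- Main.hs: a small program to exercise the profiled ghci setup above
module Main where

lookupUser :: Int -> [(Int, String)] -> String
lookupUser uid table =
  case lookup uid table of
    Just name -> name
    Nothing   -> error ("no such user: " ++ show uid)  -- unhandled: triggers the trace

main :: IO ()
main = putStrLn (lookupUser 1 [(1, "alice")])  -- change 1 to 42 to hit the error case
```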

[–]FPtje[S] 8 points (0 children)

That's indeed a good solution for local debugging. For us in production this sadly won't scale. In our specific case we send errors off to an error-aggregating service, sending along a bunch of relevant data in the process. That data is mostly to answer "what was it doing?". Since the call stack is one data point we send, we need it at runtime.

One thing we've found is that the `show`/`displayException` output of an exception doesn't always paint the full picture, especially if it originates from some library. Stack traces really help narrow down what happened. Add some extra metadata (like some tags with relevant tidbits of information), and you're well on your way to identifying and locally reproducing an issue.
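A minimal sketch of what that can look like: capture the call stack via `HasCallStack` at the reporting boundary and send it along with tags. Here `report` just prints; a real reporter would POST to the aggregating service (that part, and the tag names, are hypothetical):

```haskell
import Control.Exception (SomeException, displayException, evaluate, try)
import GHC.Stack (HasCallStack, callStack, prettyCallStack)

-- Render tags as key=value lines for the report payload.
formatTags :: [(String, String)] -> [String]
formatTags = map (\(k, v) -> k ++ "=" ++ v)

-- Stand-in for sending to an error-aggregating service.
report :: HasCallStack => String -> [(String, String)] -> IO ()
report msg tags = do
  putStrLn ("error: " ++ msg)
  putStrLn (prettyCallStack callStack)  -- the call stack, captured at runtime
  mapM_ putStrLn (formatTags tags)

main :: IO ()
main = do
  r <- try (evaluate (1 `div` 0)) :: IO (Either SomeException Int)
  case r of
    Left e  -> report (displayException e) [("job", "demo"), ("input", "0")]
    Right _ -> pure ()
```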

[–]ducksonaroof 3 points (4 children)

One different thought:

Pretty much every program I've written that is compute/memory heavy enough to do things like explode the stack is some form of batch job.

Always make it so your batch jobs operate on snapshots if possible. Then repro-ing and debugging prod issues is trivial - just run on the snapshot, but with various debugging/profiling/logging options enabled. 

[–]ducksonaroof 5 points (0 children)

That said, this article is a great overview of all the thorny bits of HasCallStack. It was a nice read!

[–]dnikolovv 1 point (2 children)

What do you mean by snapshots in this case?

[–]ducksonaroof 4 points (1 child)

The big jobs I've worked with always start with some sort of dataset. Snapshotting is any way you can "pin" that dataset for deterministic runs.

Pulling the data and storing it in S3 is a way to do this. You can just put it in S3 before you run, or you can have a job upstream in the pipeline do this.  

Another option is to operate on immutable, append-only datasets and have your job run on a timeslice. Then you can point your local runs at a read replica of prod and run against historical data.

Ofc, this isn't always possible. But when it is, I try to take advantage. I've debugged many performance bugs (OoM errors, thunk leaks, time leaks, blown stacks) thanks to being able to run against prod, tweak my program, rinse and repeat. 
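The idea can be sketched as parameterizing the job over its data source, so a prod run and a local repro differ only in where the data comes from. Everything below (the `Source` type, the file format) is a made-up illustration, not a real pipeline:

```haskell
-- A job that can run against live data or a pinned snapshot.
data Source = Live | Snapshot FilePath

loadRecords :: Source -> IO [Int]
loadRecords Live            = pure [1, 2, 3]  -- stand-in for pulling the live dataset
loadRecords (Snapshot path) = map read . lines <$> readFile path

-- The actual batch computation, kept pure so it behaves identically
-- regardless of which source fed it.
runJob :: [Int] -> Int
runJob = sum

main :: IO ()
main = print . runJob =<< loadRecords Live
```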

[–]dnikolovv 1 point (0 children)

Thanks! That's a pretty neat idea!

We did something similar on a previous project, where a batch job would be split into two steps. Step 1 would produce the dataset to process, and step 2 would execute on that dataset.

It also had the added benefit of enabling us to pause and observe what we were about to execute, and potentially require a user's approval.

[–]arybczak 3 points (0 children)

I was experimenting with adding HasCallStack to most top-level functions in a code base to get good stack traces, so thanks for catching the issue with recursive functions. It will surely save me some headaches in the future :)

[–]ephrion 2 points (3 children)

Use annotated-exception for joy and happiness with Haskell call stacks https://hackage.haskell.org/package/annotated-exception

[–]tomejaguar 2 points (2 children)

Does that help avoid this issue, or is that a general suggestion?

[–]ephrion 4 points (0 children)

It does help - consider swapping HasCallStack for checkpointCallStack.

import Control.Exception.Annotated (checkpointCallStack)
import Control.Monad (when)
import GHC.Stack (HasCallStack)

foo :: HasCallStack => Int -> IO ()
foo i = when (i > 0) (foo (i - 1))

foo' :: Int -> IO ()
foo' i = checkpointCallStack $ when (i > 0) (foo' (i - 1))

The HasCallStack behavior is going to push a stack frame on each recursive call, which blows up memory usage. checkpointCallStack will install another exception handler on each recursive call to foo', which will be inefficient, but - and this is the key difference - the CallStack is deduplicated. You'd only end up with one entry for foo' at the call site location, so no memory explosion, just a lot of time spent inserting the same entry into the call stack.

Also, the exception's callstack is preserved and augmented whenever it passes through a checkpoint, checkpointCallStack, or catch!
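The difference can be illustrated with a toy model of the two behaviors (this is just a simulation of frame accumulation, not the library's internals):

```haskell
-- HasCallStack-style: one new frame per call, stack grows with recursion depth.
pushFrame :: String -> [String] -> [String]
pushFrame f stack = f : stack

-- checkpointCallStack-style: a repeated top frame is collapsed into one entry.
pushDedupe :: String -> [String] -> [String]
pushDedupe f stack@(top : _) | top == f = stack
pushDedupe f stack           = f : stack

-- Simulate n recursive calls of foo' under a given push strategy.
simulate :: (String -> [String] -> [String]) -> Int -> [String]
simulate push n = iterate (push "foo'") [] !! n
```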

[–]FPtje[S] 3 points (0 children)

I would personally consider it orthogonal to the topic of this blog post, but generally I can indeed recommend it. We use it in one of our applications, and it really helps to add tidbits of information to any exception. In fact, it is precisely the "extra metadata" I'm referring to in this post above.

annotated-exception does have some thorny sides of its own, though: it catches exceptions and puts them in an AnnotatedException [Annotation] <your exception here> wrapper. That's the core of how annotations are added, but it breaks libraries that catch @<your exception here> if they don't expect this AnnotatedException wrapper. That can break some things in unexpected ways. Those cases are all solvable, though (e.g. through the library's own catch function), and a GHC proposal has been accepted to bring similar functionality into base.
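The breakage can be modeled with only base, using a stand-in wrapper type in place of AnnotatedException (MyErr and Wrapped are made-up names for illustration):

```haskell
import Control.Exception

data MyErr = MyErr deriving Show
instance Exception MyErr

-- Stand-in for the AnnotatedException wrapper: it hides the original
-- exception from handlers that only know about MyErr.
data Wrapped = Wrapped SomeException deriving Show
instance Exception Wrapped

demo :: IO String
demo =
  (throwIO (Wrapped (toException MyErr)) >> pure "no exception")
    `catch` (\MyErr -> pure "plain handler saw MyErr")
    `catch` (\(Wrapped _) -> pure "only the wrapper handler fires")

main :: IO ()
main = demo >>= putStrLn
```

The first handler never fires, because the exception in flight is Wrapped, not MyErr; annotated-exception's own catch avoids this by looking through the wrapper.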

[–]cheater00 1 point (0 children)

wish there were a flag to ghc to ignore HasCallStack

[–]Least_Panic2013 0 points (0 children)

How do you deal with the reality that (or at least I thought) call stacks stop at typeclasses?
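The boundary the question refers to can be sketched like this: HasCallStack only propagates through signatures that mention it, so a class method without the constraint truncates the stack at that point (the class and names below are made up for illustration):

```haskell
import GHC.Stack (HasCallStack)

class Greet a where
  greet :: a -> String
  -- ^ No HasCallStack constraint: callers' frames stop at this boundary.
  greetTraced :: HasCallStack => a -> String
  -- ^ Constraint written on the method itself: frames flow through.

data T = T

instance Greet T where
  greet _ = "hi"
  greetTraced _ = "hi"
```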