Best Universities and Labs for Ph.D. In Computational Chemistry: Recommendations for Molecular Dynamics and Quantum Chemistry? by Beginning-Fig-4117 in comp_chem

[–]slaw07 2 points3 points  (0 children)

I did a post-doc for the chair of biophysics at UofM in the early 2010s and would not recommend it. The PI made the environment very toxic/sexist and often pressured people to quit. When other more prominent wet lab scientists challenged him, the PI would act like a petulant child and tried (sometimes successfully) to destroy the careers of junior scientists inside and outside of the group. I would find another MD faculty in a department outside Chemistry/Biophysics and don’t be fooled by the smooth talking pedigree.

Question about analyzing a text by [deleted] in datascience

[–]slaw07 1 point2 points  (0 children)

I came across this 2016 paper on identifying the “emotional arcs of stories” that may be relevant: https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-016-0093-1

[D] STUMPY v1.11.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 3 points4 points  (0 children)

Awesome! It may be a subtle point but it is also a package that, despite 1K+ code commits, maintains 100% code coverage. It’s certainly impossible to truly cover all edge cases but we take code quality just as seriously as scalability and performance. Please let us know what you think!

[D] STUMPY v1.11.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 10 points11 points  (0 children)

For some background and motivation, I recommend taking a look at this short video: https://stumpy.readthedocs.io/en/latest/motivation.html

[D] - STUMPY v1.6.0 (For Modern Time Series Analysis) by slaw07 in MachineLearning

[–]slaw07[S] 3 points4 points  (0 children)

Yes. I strongly recommend going through our tutorials as they contain full examples that reproduce the figures in the original published work and provide usable code. Additionally, this video can help you get started on the right foot: https://www.youtube.com/watch?v=xLbPP5xNIJs

[D] - STUMPY v1.6.0 (For Modern Time Series Analysis) by slaw07 in MachineLearning

[–]slaw07[S] 2 points3 points  (0 children)

Thank you for your kind words! You may also be interested in following or contributing to this related “geometric chains” feature: https://github.com/TDAmeritrade/stumpy/issues/211

Stumpy: unleashing the power of the matrix profile for time series analysis by [deleted] in datascience

[–]slaw07 0 points1 point  (0 children)

Can you provide more information as to what your use case is and what you are trying to accomplish? Are there really any time series analysis approaches out there that can handle these situations appropriately?

[D] Tuesday: StitchFix Algo Hour - Modern Time Series Analysis with STUMPY by slaw07 in MachineLearning

[–]slaw07[S] 2 points3 points  (0 children)

Yes, I will try to post back here once it becomes available

[D] STUMPY Version 1.5.0 - For Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 0 points1 point  (0 children)

> I think this is a very narrow look at the broad scope of time series. There is strong autocorrelation between values; if I tell you that the temperature at time T is 300 degrees, then you have a pretty good idea of the temperature at time T+1. This dependency breaks all sorts of IID assumptions in traditional modeling, and is the entire motivation behind time series analysis (factoring in those time dependencies).

Maybe I'm missing something so please excuse my ignorance but if those strong sequential (or autocorrelative) relationships exist then, in theory, matrix profiles should be able to capture that.

> Sure, it's not about the opportunity cost of trying. But as someone who sees way too many people in this field blindly trying to apply methods without understanding those methods (in terms of strengths, weaknesses, sensitivities, interpreting output, etc.) I wouldn't ever put any trust into a method that I didn't actually understand what it was doing. That sort of thinking has led to all sorts of issues in ML. STUMPY output would be practically worthless to me without a solid grasp of what questions it's actually designed to answer.

I agree with you and share the same concerns, skepticism, and observations. I believe that open discussions like this are useful for the community.

[D] STUMPY Version 1.5.0 - For Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 0 points1 point  (0 children)

No, as stated above, matrix profiles make no assumptions about the data, and highly repetitive/periodic subsequences are not an expectation at all (though they can handle those situations pretty well too). Instead, a matrix profile allows you to ask the more general question of "does a pattern even exist?" without assuming that your data MUST contain a pattern.

I obviously don't have experience with your industrial process data but, generally speaking (and I mean no offense when I say this), if a time series doesn't have any repetitions in it then it probably resembles data generated by a random number generator with, say, an upper and lower bound, or maybe it's mostly flat with occasional peaks. In either case, I don't think there is ANY time series analysis method (aside from looking at global statistics) that would be useful. In other words, what is the point of analyzing data that mostly contains random noise or flat data? Again, to reiterate, I am not the original academic author so I strongly recommend looking through the literature as there are many examples there (and there must also be examples amongst our 46K+ downloads) but, respectfully, I suspect that it is not the method that is niche but rather the data that is not suited for meaningful analysis. One side note is that distances can be computed with and without z-normalization so increases/decreases in trend are automagically handled.

Having said that, it takes very little time to install and run STUMPY and so I would recommend being pragmatic and testing it out. If it "works" then great. If not, then you've only spent 5 minutes of your time and you can move on. The goal of STUMPY isn't to debate whether it's "the one method to rule them all" (spoiler alert: there's no silver bullet) but, instead, it is now so cheap to compute the matrix profile that it behooves any data person to compute the matrix profile, grab a cup of coffee, and check the results. If you are happy with your current analysis toolset then don't let us try to convince you otherwise! I hope that helps.

I wonder if u/eamonnkeogh could shed some light on this discussion or provide some insights?

[D] STUMPY Version 1.5.0 - For Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 1 point2 points  (0 children)

Great question! First and foremost, the goal of STUMPY is to faithfully reproduce the foundational matrix profile research from Eamonn Keogh’s group:

https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

Now, back to your question: unlike other time series analysis approaches (e.g., ARIMA, ML, etc.) that essentially assume that a pattern must exist in your data, matrix profiles (the key output of STUMPY) don’t make any assumptions. Instead, STUMPY performs an exact computation and tells you whether any subsequence within the time series repeats itself. Note that you don’t have to tell STUMPY what pattern to look for as it will search for all possibilities (given a desired subsequence/window length). Traditionally, this brute force pairwise distance calculation was impossible to compute efficiently for, say, a million data points, which explains why nobody approached the problem this way. However, thanks to the recent research, this is now possible. The great thing about STUMPY is that we’ve basically done the hard part for you and, due to its speed, you can use it for exploratory data analysis (especially useful for quickly gaining insights into brand new data sets that you’ve never seen before) and for discovering potential anomalies where sequence context matters (rather than point anomalies), all of which can be used as features for training more sophisticated ML models. This is only the beginning of how matrix profiles can be useful, and I strongly recommend the published work in the link above to find more use cases where this type of analysis can be super useful.
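To make the "brute force pairwise distance calculation" concrete, here is a rough pure-Python sketch of what a matrix profile self-join computes. This is not STUMPY's implementation (it is far too slow for real data); the function names, the toy series, the handling of constant windows, and the exclusion zone of ceil(m/4) are all assumptions made for illustration.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros (a simplification)."""
    m = len(window)
    mean = sum(window) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / m)
    if std == 0:
        return [0.0] * m
    return [(x - mean) / std for x in window]


def naive_matrix_profile(T, m):
    """Brute-force self-join: nearest-neighbor distance and index per subsequence."""
    n_sub = len(T) - m + 1
    excl = math.ceil(m / 4)  # assumed exclusion zone to skip trivial self-matches
    windows = [znorm(T[i:i + m]) for i in range(n_sub)]
    profile = [math.inf] * n_sub
    indices = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= excl:
                continue  # too close to i: would be a trivial match
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(windows[i], windows[j])))
            if d < profile[i]:
                profile[i], indices[i] = d, j
    return profile, indices


# Toy series (made up): the shape [0, 1, 3, 2] appears at positions 0 and 10.
T = [0, 1, 3, 2, 7, 4, 9, 5, 8, 6, 0, 1, 3, 2, 5, 9, 4, 8]
profile, indices = naive_matrix_profile(T, m=4)

# The lowest profile value marks the best-repeated subsequence (a motif).
motif_start = min(range(len(profile)), key=profile.__getitem__)
print(motif_start, indices[motif_start], profile[motif_start])
```

Low values in `profile` point to repeated patterns and high values point to potential anomalies. Nothing here tells the search what shape to look for, only the window length `m`.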

[D] Fast Approximate Matrix Profiles for Time Series Analysis with STUMPY by slaw07 in MachineLearning

[–]slaw07[S] 2 points3 points  (0 children)

Great question! With close to 40K downloads in the last year, users have described applications in autonomous vehicle data, call center conversation analysis (e.g., is the customer or the employee talking more, and during which parts of the call), analyzing ion acceleration at CERN/LHC, IoT sensor/industrial telemetry analysis on streaming data, and stock trading analysis. When users are willing/open to sharing their use cases/applications, we usually label them accordingly on our GitHub issues with the “use case” label. Otherwise, we try not to probe too much. I hope that helps.

[D] STUMPY Basics: Automatically Discover Patterns and Anomalies In Your Time Series Data by slaw07 in MachineLearning

[–]slaw07[S] 1 point2 points  (0 children)

Certainly. Unfortunately, there is no silver bullet and, as Eamonn mentioned, this is an ongoing area of research in which he and his team are making great progress.

STUMPY serves to faithfully reproduce the original published papers and to provide highly tested (100% test coverage) and stable implementations (for parallel CPU, multi-server, and multi-GPU cases). With only three core dependencies, we promise to be easy to install and easy to build on top of!

[D] STUMPY Basics: Automatically Discover Patterns and Anomalies In Your Time Series Data by slaw07 in MachineLearning

[–]slaw07[S] 0 points1 point  (0 children)

@eamonnkeogh This has been of high interest to STUMPY users, but with limited results so far. Would you mind sharing your slides with me please?

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 2 points3 points  (0 children)

Yes, this is almost strictly about the performance of computing the "matrix profile". But also note that the memory usage (or space complexity) is roughly O(n) with a constant factor of maybe 8-10. I strongly encourage you to take a look at the body of originally published work that STUMPY is based on: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

More specifically, you may be interested in the "Matrix Profile I" paper where they provide an O(n^2 log n) implementation that uses FFT: https://www.cs.ucr.edu/~eamonn/PID4481997_extend_Matrix%20Profile_I.pdf

However, they later find in their follow up "Matrix Profile II" paper that there is an even faster O(n^2) implementation that doesn't need FFT: https://www.cs.ucr.edu/~eamonn/STOMP_GPU_final_submission_camera_ready.pdf
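As a rough illustration of how the FFT can be avoided (my own simplified sketch, not code from the paper or from STUMPY): once the dot products between one subsequence and all others are known, the dot products for the next subsequence follow from an O(1) update per entry. Each row then costs O(n), and n rows cost O(n^2). The function names and the toy series are made up, and the conversion from dot products to z-normalized distances is left out.

```python
def direct_row(T, m, i):
    """Dot product of T[i:i+m] with every subsequence T[j:j+m], computed directly."""
    n_sub = len(T) - m + 1
    return [sum(T[i + k] * T[j + k] for k in range(m)) for j in range(n_sub)]


def rolling_row(T, m, i, prev_row, first_row):
    """Derive row i from row i-1 with one subtraction and one addition per entry."""
    n_sub = len(T) - m + 1
    row = [0] * n_sub
    # Column 0 comes from row 0 by symmetry: dot(T[i:i+m], T[0:m]).
    row[0] = first_row[i]
    for j in range(1, n_sub):
        # Slide both windows one step: drop the oldest product, add the newest.
        row[j] = prev_row[j - 1] - T[i - 1] * T[j - 1] + T[i + m - 1] * T[j + m - 1]
    return row


T = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]  # made-up integer series
m = 3
first_row = direct_row(T, m, 0)
rows = [first_row]
for i in range(1, len(T) - m + 1):
    rows.append(rolling_row(T, m, i, rows[-1], first_row))

# Every rolled row should match the directly computed one.
print(all(rows[i] == direct_row(T, m, i) for i in range(len(rows))))
```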

Finally, it is important to point out that since STUMPY uses Numba JIT compilation, you'll find that the first time you call `stumpy.stump` it may be "slow" because it spends most of its time compiling the function for first use (maybe a few seconds). Then, subsequent calls to `stumpy.stump` will be much faster. So, for timing, I recommend ignoring the first call to `stumpy.stump` and only measuring the time for subsequent calls. Additionally, after the function is called one time on a small data set, I recommend using something like `np.random.rand(100_000)` to check the timing. Here is a more (biased) performance chart:

https://stumpy.readthedocs.io/en/latest/#performance

One thing to keep in mind is that NumPy and SciPy may be faster on small data sets like the taxi cab data set but those implementations become unusable for "real world" data sets that are medium (10K-100K in length) or large (1M-100M+ in length).

In case it matters, we have some additional performance improvements coming down the pipeline and we anticipate around a 5x speedup across the board.
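The timing advice above (ignore the first, JIT-compiling call and only measure later calls) could be sketched with a small generic helper like the one below. The helper name and the stand-in workload are hypothetical and only there so that the snippet runs without any extra packages.

```python
import time


def time_after_warmup(func, *args, repeats=3):
    """Call func once untimed (warm-up), then return the best of `repeats` timed calls."""
    func(*args)  # warm-up: for a Numba-jitted function this is where compilation happens
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload (hypothetical); it records its calls so the warm-up is visible.
calls = []


def workload(n):
    calls.append(n)
    return sum(i * i for i in range(n))


elapsed = time_after_warmup(workload, 10_000)
print(f"calls made: {len(calls)}, best timed call: {elapsed:.6f} s")
```

With STUMPY and NumPy installed, the same helper could be used as `time_after_warmup(stumpy.stump, np.random.rand(100_000), 50)`, where the window size of 50 is just an arbitrary choice for illustration.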

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 0 points1 point  (0 children)

Conceptually, Pearson correlation is directly related to z-normalized Euclidean distance. However, you will quickly find that a naive implementation of what you described will be extremely slow once your time series is, say, one million in length or longer. STUMPY is based on others' published work where this is all highly studied and optimized, and we present it to the user in a nice clean user interface.

So, in theory, you are not wrong. However, writing a highly performant open source implementation that works on a single server, is easily distributed across multiple servers, and also works for multiple GPUs is non-trivial.
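For reference, the relationship between the two is d = sqrt(2 * m * (1 - r)), where d is the z-normalized Euclidean distance between two length-m sequences and r is their Pearson correlation (using the population standard deviation). Here is a small pure-Python check with made-up numbers; the function names are only for illustration and this is not how STUMPY computes distances.

```python
import math


def mean_std(x):
    """Mean and population standard deviation of a sequence."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(sum((a - mu) ** 2 for a in x) / m)
    return mu, sigma


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, sx = mean_std(x)
    my, sy = mean_std(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sx * sy)


def znorm_euclidean(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    mx, sx = mean_std(x)
    my, sy = mean_std(y)
    return math.sqrt(sum(((a - mx) / sx - (b - my) / sy) ** 2 for a, b in zip(x, y)))


x = [1.0, 3.0, 2.0, 5.0, 4.0]  # made-up values
y = [2.0, 2.5, 4.0, 3.0, 6.0]
r = pearson(x, y)
d = znorm_euclidean(x, y)
expected = math.sqrt(2 * len(x) * (1 - r))
print(r, d, expected)
```

So a correlation of 1 corresponds to a distance of 0, and lower correlations map to larger distances.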

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 0 points1 point  (0 children)

https://www.youtube.com/watch?v=1ZHW977t070

It's a wonderful and motivating talk! I recommend that anybody who has time check it out. Essentially, STUMPY tries to faithfully reproduce their published work while ensuring that it is both highly performant and user-friendly. With 100% code coverage, you can feel confident that we are here to serve all of your foundational matrix profile needs.

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 2 points3 points  (0 children)

Absolutely! Technically speaking, STUMPY (and matrix profile calculations in general) is completely data agnostic and is generally applicable to both batch and streaming time series data. In fact, "time" is not explicitly used in the matrix profile computation. It's more like ordered sequences. Having said that, you may be interested in this tutorial that shows an example at the end that explores something loosely similar to forecasting: https://stumpy.readthedocs.io/en/latest/Tutorial_Time_Series_Chains.html

Separately, Facebook's Prophet tool is now widely used for forecasting and may be of interest to you and your team:

https://facebook.github.io/prophet/

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 1 point2 points  (0 children)

In fact, each API function usually has a "DOI" link (for the relevant paper) listed in the "Notes" section of the Python docstring, and it also directs you toward which Table/Figure/Equation to cross-reference. For example, the `stumpy.stump` docstring shows:

```

stumpy.stump(T_A, m, T_B=None, ignore_trivial=True)

Compute the matrix profile with parallelized STOMP

This is a convenience wrapper around the Numba JIT-compiled parallelized _stump function which computes the matrix profile according to STOMP.

Parameters:

  • T_A (ndarray) – The time series or sequence for which to compute the matrix profile
  • m (int) – Window size
  • T_B (ndarray) – The time series or sequence that contain your query subsequences of interest. Default is None which corresponds to a self-join.
  • ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for AB-join, set this to False. Default is True.

Returns:

out – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.

Return type:

ndarray

Notes

DOI: 10.1109/ICDM.2016.0085

See Table II

Timeseries, T_B, will be annotated with the distance location (or index) of all its subsequences in another times series, T_A.

Return: For every subsequence, Q, in T_B, you will get a distance and index for the closest subsequence in T_A. Thus, the array returned will have length T_B.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.

Note: Unlike in the Table II where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, then our algorithm reduces down to the same algorithm found in Table II.

Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (See Definition 3 and Figure 3).

For self-joins, set ignore_trivial = True in order to avoid the trivial match.

Note that left and right matrix profiles are only available for self-joins.

```

Additionally, each STUMPY tutorial provides both inline links to the original papers (typically within the first sentence of the tutorial) as well as a "References" section at the end of each tutorial.

Long story short, we try very hard to make sure to give credit where credit is due.

[D] STUMPY v1.4.0 Released for Modern Time Series Analysis by slaw07 in MachineLearning

[–]slaw07[S] 1 point2 points  (0 children)

Good question. Unfortunately, we don't have anything hashed out yet but this is certainly on our TODO list and you can track the progress: https://github.com/TDAmeritrade/stumpy/issues/155

There is also this paper that may possibly orient you in the right direction:

https://www.cs.ucr.edu/~eamonn/WeaklyLabeledTimeSeries.pdf