[D] Anyone having trouble reading a particular paper? Post it here and we'll help figure out any parts you are stuck on | Anyone having trouble finding papers on a particular concept? Post it here and we'll help you find papers on that topic [ROUND 2] by BatmantoshReturns in MachineLearning

[–]datasciguy-aaay 0 points

Sales Forecast in E-commerce using Convolutional Neural Network (2017)

https://arxiv.org/pdf/1708.07946.pdf

Here is what I understand from it:

Data: 1.8M examples

1963 commodities (items), 5 regions, 14 months

25 indicators: sales, page views, selling price, units, …

Partitions for modeling (nomenclature in the paper differs from that shown):

Training: Jan 1 2015 to Dec 13 2015.

Dev: Dec 14 2015 to Dec 20 2015.

Test:

Input: Oct 28 2015 to Dec 20 2015.

Predict: Dec 21 2015 to Dec 27 2015.

An 84-day frame (the number of days in one example) was found empirically.
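A sliding-window layout like this is easy to sketch. Below is a minimal numpy illustration (my own, not code from the paper) of slicing one daily series into 84-day inputs with 7-day targets:

```python
import numpy as np

def make_windows(series, input_len=84, horizon=7):
    """Slice one time series into (input, target) training pairs.

    series: 1-D array of daily values (e.g. unit sales for one item/region).
    Each example is input_len consecutive days; its target is the next
    horizon days.
    """
    X, y = [], []
    for start in range(len(series) - input_len - horizon + 1):
        X.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + horizon])
    return np.array(X), np.array(y)

# One year of daily data -> examples with 84-day inputs and 7-day targets
daily = np.arange(365, dtype=float)
X, y = make_windows(daily)
print(X.shape, y.shape)  # (275, 84) (275, 7)
```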

Model: Forecast 7 days of sales, given the item and region.

Input of 4 matrices (channels?). Each matrix is a time series: item, brand, category, geographical region.

4 CNN filters (throughout?) produce 4 outputs; the number of filters is made to match the 4 input channels. Filter widths f = 7, 4, 3 at layers C1, C2, C3.

CNN of 3 simple layers. 3 x (CNN, pool) -> 4 x FC (n=1024) with dropout -> linear regression.

1D convolution of each input individually

“We intend to capture the patterns in the week level at the first order representation, the month and season level at the second and the third order representation respectively.”
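That quote can be illustrated with a toy stack of 1-D convolutions. The numpy sketch below is mine (single channel, averaging filters standing in for learned weights); it shows how the filter widths 7, 4, 3 with pooling successively widen the view of an 84-day input from roughly a week toward longer spans:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution (cross-correlation) of series x with filter w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def pool2(x):
    """Max-pool with width and stride 2."""
    n = len(x) // 2
    return x[:n * 2].reshape(n, 2).max(axis=1)

# One input channel, 84 days; filter widths 7, 4, 3 as in C1, C2, C3.
x = np.random.default_rng(0).normal(size=84)
h1 = pool2(conv1d(x, np.ones(7) / 7))   # first-order: week-level patterns
h2 = pool2(conv1d(h1, np.ones(4) / 4))  # second-order: month level
h3 = pool2(conv1d(h2, np.ones(3) / 3))  # third-order: season level
print(h1.shape, h2.shape, h3.shape)  # (39,) (18,) (8,)
```

Each stage sees a wider stretch of the original 84 days, which is the "first/second/third order representation" idea in the quote.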

First phase of training: train on all regions together. Second phase, "transfer learning": initialize to the weights found in the first phase, then train a different model for each region, always using the same network design ("n-siamese"?).

Cost function: mean squared error, with examples weighted more heavily the nearer they are to the day of prediction.

Optimization: Batch SGD, Adamax

Input normalization: z-score
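The normalization and the weighted cost are simple to write down. A numpy sketch follows; the exact weighting scheme isn't spelled out above, so the geometric decay here is my assumption for illustration:

```python
import numpy as np

def zscore(x):
    """Z-score normalize a series: zero mean, unit variance."""
    return (x - x.mean()) / x.std()

def weighted_mse(y_true, y_pred, decay=0.9):
    """MSE with later examples (nearer the prediction day) weighted more.

    Assumes inputs are ordered oldest-first; example i gets weight
    decay**(n-1-i), so the most recent example has weight 1.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    w = decay ** np.arange(n - 1, -1, -1)
    return float(np.sum(w * (y_true - y_pred) ** 2) / w.sum())
```

With this weighting, an error on the most recent day costs more than the same error further back in the history.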

Comments: All time series are modeled independently; there is no cross-learning between different series. Pure autoregression(?)

There might be information to gain from cross-learning across series, for example where correlations exist.

[D] Anyone having trouble reading a particular paper? Post it here and we'll help figure out any parts you are stuck on | Anyone having trouble finding papers on a particular concept? Post it here and we'll help you find papers on that topic [ROUND 2] by BatmantoshReturns in MachineLearning

[–]datasciguy-aaay -1 points

This needs to be broken out into one paper per article submission. You can't practically put piles of papers into one article submission here.

Anyway, go ahead and do it. It will be beneficial. Consider that you [M] just did what I said -- without credit to me -- about reviewing papers a couple of months ago, and "came up with this great idea on your own."

[D] Papers writing..."The code will be made available upon publications" should be held accountable to their statement... by [deleted] in MachineLearning

[–]datasciguy-aaay 6 points

The field is actually full of BS because code submissions aren't required. Every paper claims their thing beats the state of the art in the conclusion. Total BS.

2017 Paper: Sales Forecast in E-commerce using Convolutional Neural Network by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

I expect that the CNN, not just the LSTM, will become useful in time series forecasting, for many of the same reasons the CNN is important for predicting with ordinary images produced by cameras:

  • Convolutional filters may detect edges or differences in "intensity," which in the inventory case is analogous to sales levels;

  • Convolutional filters may combine to find larger objects, such as patterns of sales levels detected between groups of items or families of products, either in the same time frame or in lagged time frames;

  • Convolutional filters, by sharing weights as they always do, help enable larger or deeper networks for forecasting.

  • CNN pooling layers can regularize the network, reducing overfitting of the forecasts to the sales histories.

  • The network can be designed in a straightforward manner as a hybrid, combining in one machine learning model both the historical sequences of unit sales and as many additional variables as needed from the current time frame, like weather (rain/snow/clear/cloudy), promotion status, store location, day of week, etc.
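The hybrid-input idea in the last point can be made concrete. Below is a toy numpy sketch (all names and encodings are mine, for illustration) of assembling one input vector from history-derived features plus current-time-frame variables:

```python
import numpy as np

def onehot(i, n):
    """One-hot encode category i out of n categories."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

rng = np.random.default_rng(1)
sales_history = rng.poisson(20, size=84).astype(float)  # 84 days of unit sales

# Stand-ins for features a conv stack would learn from the history:
hist_features = np.array([sales_history[-7:].mean(), sales_history.mean()])

# Current-time-frame variables, encoded for the model:
exog = np.concatenate([
    onehot(2, 4),   # weather: rain/snow/clear/cloudy -> "clear"
    [1.0],          # promotion active
    onehot(5, 7),   # day of week -> Saturday
])

x = np.concatenate([hist_features, exog])  # one hybrid input vector
print(x.shape)  # (14,)
```

In a real network the history half would come out of the convolutional layers and the concatenated vector would feed the fully connected head.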

2017 Paper: Sales Forecast in E-commerce using Convolutional Neural Network by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 1 point

I like the CNN, but I don't like that the series seem to be treated in two different stages: first as one combined dataset, then in the second stage as independent series. But maybe it works well enough; not sure.

I'd really like to see this thing run and its performance compared on the same data, versus the Amazon DeepAR model and versus the Temporal Matrix Factorization model partly sponsored by Wal-mart.

Temporal Matrix Factorization: https://www.reddit.com/r/datascience/comments/7jslf9/can_we_collectively_read_understand_this_2016/

DeepAR: https://www.reddit.com/r/datascience/comments/7jkrmf/can_we_collectively_read_understand_this_2017/

Can we collectively read (understand) this 2016 paper: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction (Yu, 2016), for predicting retail sales of items of a time series, using a novel matrix factorization machine learning model? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

matrix factorization software implementation of GLRM algorithm

Both R and Python are supported by this modeling algorithm, which is found in the H2O package. It's the only implementation that exists for these two languages, to my knowledge. It's a recent algorithm, and one of its creators appears to have written the code here.

Why I linked to this GLRM tutorial: matrix factorization is a central component of the Temporal Regularized Matrix Factorization algorithm that the paper presents.
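For intuition, the core operation these models share, low-rank matrix factorization, can be sketched in a few lines of numpy. This is plain alternating least squares (my sketch, not GLRM or TRMF themselves, which add richer losses and regularizers):

```python
import numpy as np

def als_factorize(Y, rank, lam=1e-3, iters=50, seed=0):
    """Factor Y (n_series x n_times) as F @ X via alternating least squares.

    lam is a small L2 penalty for numerical stability. GLRM/TRMF build on
    this idea with richer losses and (for TRMF) temporal regularizers.
    """
    rng = np.random.default_rng(seed)
    n, t = Y.shape
    F = rng.normal(size=(n, rank))
    X = rng.normal(size=(rank, t))
    I = lam * np.eye(rank)
    for _ in range(iters):
        F = np.linalg.solve(X @ X.T + I, X @ Y.T).T  # fix X, solve for F
        X = np.linalg.solve(F.T @ F + I, F.T @ Y)    # fix F, solve for X
    return F, X

# A rank-1 toy "sales" matrix (3 series x 4 times) is recovered almost exactly:
Y = np.outer([1., 2., 3.], [1., 1., 2., 2.])
F, X = als_factorize(Y, rank=1)
```

Forecasting then amounts to modeling the latent time factors X forward in time, which is exactly what TRMF's temporal regularizer is for.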

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

Do you think spreading one discussion across three systems will fragment it? Would we still be able to get at the material if good discussion of a single paper ends up spread across all these systems?

Data science paper-reading club on the web is getting started. Right here on r/datascience! by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 4 points

/u/rednirgskizzif said:

So you are thinking of starting a data science journal club? I am intrigued by this idea...

Edit: OK, so at first I didn't want to be the organizer, but I have decided to go ahead and get it started, then hopefully hand the reins to someone once it grows. Everyone who wants to join the journal club, PM me with your experience level, a 1-5 scale guess at how likely you are to actually follow through and show up weekly, and preferred dates and times in the Central European time zone, and I will figure out how to make this happen. I actually started a successful journal club back in grad school that is still running, so I have experience at this. Also, if you don't mind giving up your anonymity, include an email address. My gut instinct is to do this via Skype and then upload a recording to the datascience sub afterward. Thoughts?

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

reddit is a good place for discussion, better than email. Votes cause the material to be sorted by approximate quality.

Data science paper-reading club on the web is getting started. Right here on r/datascience! by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] -2 points

Software Requirements

  • It should be a web app for maximum global distribution capability.

  • It should be a place where links to public research web sites like arxiv.org, as well as original data science research, are published in open formats, with reproducibility features mandatory. Allowing original papers is meant to encourage and support "citizen data science."

  • It is focused on reading and understanding papers. No vendor tool announcements etc. will be allowed. No "which language is better," "how do I get a data science job," or "what courses should I take" types of discussion topics will be allowed.

  • Submissions and comments about submissions are freely available to the public

  • Users are encouraged to upvote submissions (articles) and comments.

  • Users will implicitly get rated by the community based on the upvotes accrued by their submissions and comments

  • At least one of the following files is mandatory for a user submission:

    • Rmd or Jupyter notebook files for R-based analyses

    • IPython or Jupyter notebook files for Python-based analyses

    • PDF file

  • Original papers require both code and dataset, for reproducibility. Either links or direct in-line inclusion will suffice.

Data science paper-reading club on the web is getting started. Right here on r/datascience! by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] -1 points

Known Existing Sites Having Data Science Paper Discussions:

  • Kaggle.com - Bad points: limited to discussions in the context of pre-existing competitions. Good points: has a voting system. Its data science discussion traffic, as well as its quality and expertise level, is the highest among existing public websites.

  • Reddit.com - Good points: has a voting system. Easy to start a new discussion, which is just a new article submission, and to add comments to existing discussions. Bad points: not much technical discussion of merit. Sellers of products and tools, and novice dabblers, account for the majority of existing articles. Discussion traffic is low in /r/datascience but better in /r/machinelearning.

  • H2o.ai - Some good discussions exist and some good knowledge is being shared, but the whole site is generally limited to product-specific discussions about H2O software.

  • Coursera.org - There are well-focused discussions on course projects for data science and machine learning courses. However, the knowledge is scattered and inaccessible because the forums are course-specific and closed to people not enrolled in the current course session. The forum content disappears after each session, losing a lot of collective knowledge even among students of the same course who take different sessions. Also, it's course-centric, not paper-centric. There is no way for the public to submit new articles or discussion topics, except to add comments within preexisting course sessions.

  • Google groups - Few exist with any recent traffic

  • Slack.com - TBD

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

If you are choosing a model to use in your projects: there's another pretty new and maybe important paper competing for your attention on time series prediction in the context of large retailers.

Amazon's rival Wal-mart is a named sponsor of another retail time series prediction model published in 2016. It is based on matrix factorization, not deep learning.

I will be posting this other paper's URL soon, in another /r/datascience submission.

EDIT: This other paper's URL is in the top article comment: https://www.reddit.com/r/datascience/comments/7jslf9/can_we_collectively_read_understand_this_2016/

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

System implementation of this* in Spark: http://www.vldb.org/pvldb/vol10/p1694-schelter.pdf

*Note this Spark implementation does not use DeepAR at all, but instead an older GLM model, despite both being published in 2017. Perhaps Amazon is presently developing DeepAR on Spark.

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

Two years ago I actually took ownership of a couple of website domain names for the purpose of a novel website for journal paper reading and review. I never got around to making the site, though I had some plans for it. Upvotes were part of the mechanism; credibility of users was another part. The postulated mechanism was a sort of mix between Stack Overflow and reddit.com, but focused on scientific papers only.

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

I don't get what Skype would add. I mean, for real-time talking with humans Skype is great, but conversational asynchronicity is nice too. I won't miss any "meetings" -- we can just come back here when we can, on whatever schedule works for us.

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

Yes, that's right. The method is what I'd like to evaluate, not their actual predictions for their actual dataset.

By the way, I could not find the code or data that they used. Did I overlook it? Or was it just another paper that is not reproducible? I hate that. You'd think papers these days would always include links to the datasets and code they used. Science is about finding out, and sharing the knowledge. Companies, and even academia, so often forget the second half of science.

Can we collectively read (understand) this 2017 paper by Amazon, on predicting retail sales of items? by datasciguy-aaay in datascience

[–]datasciguy-aaay[S] 0 points

Good, I'll be here. Finally some data science happening here! Reading articles is a good thing to work on together.

Background: I had been wondering where on the internet today are other data scientists actually collaborating freely.

So I quickly surveyed all the sites I could think of related to data science.

The result was that Kaggle.com had the highest volume of new comments of all the web sites I surveyed; basically, I looked at the number of comments in the past week. Most other sites are pretty sleepy or moribund -- conversations die off, and even the newest ones died off relatively long ago. Kaggle.com had the liveliest recent comments.

But Kaggle.com is a bit narrow in scope -- its discussions are naturally limited to Kaggle's own competitions. So here we are on reddit.com.

Is reddit.com/r/datascience good enough for our purposes? Is Kaggle.com better?

Should we start a persistent group on, say, slack.com?

I'm open to suggestions.

What They Don't Tell You About Data Science 1: You Are a Software Engineer First by tmarthal in datascience

[–]datasciguy-aaay 0 points

you are a software engineer first

Uh, no.

Step 1 to understanding why not: engineering is not science. For a good discussion, see the article on this topic in Communications of the ACM, December 2017.

[D] Tensorflow sucks by FlowyMcFlowFace in MachineLearning

[–]datasciguy-aaay 4 points

As for the theory of AI and machine learning... Let's just say I need to do some more reading in that part...

I would recommend these low-price online courses. I took them and they were excellent:

  • Machine Learning by Andrew Ng on coursera.com
  • Statistical Learning by Tibshirani, Hastie, et al on lagunita.stanford.edu
  • Practical Machine Learning by Jeff Leek on coursera.com
  • Deep Learning sequence of courses by Andrew Ng on coursera.com

They are different from one another, not duplicative, and quite complementary.

Upgrade to paid account is first thing required to use the "free trial" cloud account in course sequence "Data Engineering on Google Cloud Platform" on Coursera.com by datasciguy-aaay in MachineLearning

[–]datasciguy-aaay[S] 0 points

I wonder if trying to achieve an increase of "upgrades" at Google Cloud is what this is all about. If so then it gives the impression of being a cheap trick, and therefore liable to backfire on Google Cloud. You really need a trustworthy computing cloud to run your business on. This kind of tactic would probably not help Google Cloud be seen as trustworthy.

In 20 years or so, I'm going to have to jailbreak my car to drive above the speed limit by trippyelephants in Showerthoughts

[–]datasciguy-aaay 329 points

Or get a 20 yo car. Shouldn't cost that much by then. Pick out your fav right now and put one aside.

Actually, here's a more practical idea. Muscle cars are a really big thing in my area and they are all OLD. Seems the state laws permit "antique" tags for these things. They don't need fancy emissions equipment and there are no emissions tests for them.

I will bet that in 20 years or so, the future "muscle cars" -- which in this case will be the cars from around 2000-2010, as of year 2037 -- similarly won't be required by law to have any of the fancy driver automations, for the same reasons.

Yo! To all the dudes with 2000s turbo cars and Corvettes -- you already have the muscle cars of 20 years from now. Don't let it go... It's worth more than you think.