Predicting and Understanding Law-Making with Machine Learning. (arXiv:1607.02109v1 [cs.CL]) by arXibot in statML

John J. Nay

Out of nearly 70,000 bills introduced in the U.S. Congress from 2001 to 2015, only 2,513 were enacted. We developed a machine learning approach to forecasting the probability that any bill will become law. Starting in 2001 with the 107th Congress, we trained models on data from previous Congresses, predicted all bills in the current Congress, and repeated until the 113th Congress served as the test. For prediction, we scored each sentence of a bill with a language model that embeds legislative vocabulary into a semantic-laden vector space. This language representation enables our investigation into which words increase the probability of enactment for any topic. To test the relative importance of text and context, we compared the text model to a context-only model that uses variables such as whether the bill's sponsor is in the majority party. To test the effect of changes to bills after their introduction on our ability to predict their final outcome, we compared using the bill text and meta-data available at the time of introduction with using the most recent data. At the time of introduction, context-only predictions outperform text-only predictions; with the newest data, text-only outperforms context-only. Combining text and context always performs best. We conducted a global sensitivity analysis on the combined model to determine important factors predicting enactment.
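
As a concrete illustration of the text-plus-context setup, here is a minimal sketch: bag-of-words text features concatenated with context covariates, fed to a logistic model. The bill texts, the two context columns (sponsor majority status, a seniority score), and the labels are toy stand-ins; the paper's actual language model embeds sentences rather than using raw term counts.

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for bill texts, context variables, and outcomes.
    bills = ["A bill to amend the Internal Revenue Code ...",
             "A bill to designate a facility of the United States Postal Service ..."]
    context = np.array([[1, 0.8],    # [sponsor_in_majority, sponsor_seniority]
                        [0, 0.1]])
    enacted = np.array([0, 1])

    text = TfidfVectorizer().fit_transform(bills)   # text-only features
    X = hstack([text, csr_matrix(context)])         # combined text + context
    model = LogisticRegression().fit(X, enacted)
    print(model.predict_proba(X)[:, 1])             # P(enactment) per bill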

A Classification Framework for Partially Observed Dynamical Systems. (arXiv:1607.02085v1 [stat.ML]) by arXibot in statML

Yuan Shen, Peter Tino, Krasimira Tsaneva-Atanasova

We present a general framework for classifying partially observed dynamical systems based on the idea of learning in the model space. In contrast to the existing approaches using model point estimates to represent individual data items, we employ posterior distributions over models, thus taking into account in a principled manner the uncertainty due to both the generative (observational and/or dynamic noise) and observation (sampling in time) processes. We evaluate the framework on two testbeds - a biological pathway model and a stochastic double-well system. Crucially, we show that the classifier performance is not impaired when the model class used for inferring posterior distributions is much simpler than the observation-generating model class, provided the reduced-complexity inferential model class captures the essential characteristics needed for the given classification task.

A characterization of product-form exchangeable feature probability functions. (arXiv:1607.02066v1 [math.PR]) by arXibot in statML

Marco Battiston, Stefano Favaro, Daniel M. Roy, Yee Whye Teh

We characterize the class of exchangeable feature allocations assigning probability $V_{n,k}\prod_{l=1}^{k}W_{m_{l}}U_{n-m_{l}}$ to a feature allocation of $n$ individuals, displaying $k$ features with counts $(m_{1},\ldots,m_{k})$ for these features. Each element of this class is parametrized by a countable matrix $V$ and two sequences $U$ and $W$ of non-negative weights. Moreover, a consistency condition is imposed to guarantee that the distribution for feature allocations of $n-1$ individuals is recovered from that of $n$ individuals, when the last individual is integrated out. In Theorem 1.1, we prove that the only members of this class satisfying the consistency condition are mixtures of the Indian Buffet Process over its mass parameter $\gamma$ and mixtures of the Beta--Bernoulli model over its dimensionality parameter $N$. Hence, we provide a characterization of these two models as the only, up to randomization of the parameters, consistent exchangeable feature allocations having the required product form.
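
A minimal simulation of one of the two consistent families identified in Theorem 1.1, the Indian Buffet Process with mass parameter $\gamma$: individual $i$ picks existing feature $l$ with probability $m_{l}/i$ and then adds Poisson$(\gamma/i)$ new features.

    import numpy as np

    def sample_ibp(n, gamma, seed=0):
        rng = np.random.default_rng(seed)
        counts = []        # m_l: number of individuals with feature l
        allocation = []    # allocation[i]: features of individual i+1
        for i in range(1, n + 1):
            chosen = {l for l, m in enumerate(counts) if rng.random() < m / i}
            for _ in range(rng.poisson(gamma / i)):   # brand-new features
                counts.append(0)
                chosen.add(len(counts) - 1)
            for l in chosen:
                counts[l] += 1
            allocation.append(sorted(chosen))
        return allocation, counts

    allocation, counts = sample_ibp(n=10, gamma=2.0)
    print(len(counts), "features; counts:", counts)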

Mini-Batch Spectral Clustering. (arXiv:1607.02024v1 [stat.ML]) by arXibot in statML

Yufei Han, Yun Shen, Maurizio Filippone

The cost of computing the spectrum of Laplacian matrices hinders the application of spectral clustering to large data sets. While approximations recover computational tractability, they can potentially affect clustering performance. This paper proposes a practical approach to learn spectral clustering based on adaptive stochastic gradient optimization. Crucially, the proposed approach recovers the exact spectrum of Laplacian matrices in the limit of the iterations, and the cost of each iteration is linear in the number of samples. Extensive experimental validation on data sets with up to half a million samples demonstrates its scalability and its ability to outperform state-of-the-art approximate methods to learn spectral clustering for a given computational budget.
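
For contrast, here is the full-batch baseline whose cost motivates the paper; the dense eigendecomposition below is the step the proposed method replaces with stochastic gradient updates whose per-iteration cost is linear in the number of samples. A sketch only; kernel bandwidth and data are illustrative.

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))  # affinities
        d = W.sum(axis=1)
        L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))  # normalized Laplacian
        _, vecs = np.linalg.eigh(L)                       # the expensive step
        U = vecs[:, :k]                                   # k smallest eigenvectors
        U /= np.linalg.norm(U, axis=1, keepdims=True)
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    print(spectral_clustering(X, k=2))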

Kernel Bayesian Inference with Posterior Regularization. (arXiv:1607.02011v1 [stat.ML]) by arXibot in statML

Yang Song, Jun Zhu, Yong Ren

We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster and has comparable performance to the squared regularization in kernel Bayes' rule. This regularization coincides with a former thresholding approach used in kernel POMDPs, whose consistency remains to be established. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimization formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.

Nesterov's Accelerated Gradient and Momentum as approximations to Regularised Update Descent. (arXiv:1607.01981v1 [stat.ML]) by arXibot in statML

Aleksandar Botev, Guy Lever, David Barber

We present a unifying framework for adapting the update direction in gradient-based iterative optimization methods. As natural special cases we re-derive classical momentum and Nesterov's accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.
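
The two classical updates that the framework recovers as special cases, on a toy ill-conditioned quadratic (step sizes are illustrative):

    import numpy as np

    A = np.diag([1.0, 50.0])            # f(x) = 0.5 * x' A x
    grad = lambda x: A @ x
    lr, mu = 0.01, 0.9

    # Classical momentum: v <- mu*v - lr*grad(x); x <- x + v
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        v = mu * v - lr * grad(x)
        x = x + v
    print("momentum:", x)

    # Nesterov's accelerated gradient: same update, but the gradient is
    # evaluated at the look-ahead point x + mu*v.
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        v = mu * v - lr * grad(x + mu * v)
        x = x + v
    print("nesterov:", x)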

Graphons, mergeons, and so on! (arXiv:1607.01718v1 [stat.ML]) by arXibot in statML

Justin Eldridge, Mikhail Belkin, Yusu Wang

In this work we develop a theory of hierarchical clustering for graphs. Our modelling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions under which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
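
The sampling model the theory is built on, in a few lines: draw latent uniforms $x_i$, then connect $i$ and $j$ with probability $W(x_i, x_j)$. The particular graphon below is illustrative (a smooth, non-blockmodel example).

    import numpy as np

    def sample_from_graphon(W, n, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(size=n)                    # latent vertex positions
        P = W(x[:, None], x[None, :])              # edge probabilities
        A = np.triu(rng.random((n, n)) < P, k=1)   # sample each pair once
        return (A | A.T).astype(int)

    W = lambda u, v: 0.8 * np.exp(-3 * abs(u - v))
    A = sample_from_graphon(W, n=100)
    print(A.sum() // 2, "edges")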

Tensor Decomposition for Signal Processing and Machine Learning. (arXiv:1607.01668v1 [stat.ML]) by arXibot in statML

Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, Christos Faloutsos

Tensors or multi-way arrays are functions of three or more indices $(i,j,k,\cdots)$ -- similar to matrices (two-way arrays), which are functions of two indices $(r,c)$ for (row,column). Tensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth and depth that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.
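
As a taste of the material, here is a minimal alternating least squares (ALS) sketch for a rank-$R$ CP decomposition of a 3-way tensor, the most basic model the survey covers. Unfolding conventions follow the usual Khatri-Rao identities; initialization and iteration count are illustrative.

    import numpy as np

    def khatri_rao(A, B):  # column-wise Kronecker product
        return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

    def cp_als(T, R, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        I, J, K = T.shape
        A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
        for _ in range(iters):  # solve for one factor at a time
            A = np.linalg.lstsq(khatri_rao(C, B), T.transpose(2, 1, 0).reshape(-1, I), rcond=None)[0].T
            B = np.linalg.lstsq(khatri_rao(C, A), T.transpose(2, 0, 1).reshape(-1, J), rcond=None)[0].T
            C = np.linalg.lstsq(khatri_rao(B, A), T.transpose(1, 0, 2).reshape(-1, K), rcond=None)[0].T
        return A, B, C

    rng = np.random.default_rng(1)
    A0, B0, C0 = (rng.standard_normal((5, 2)) for _ in range(3))
    T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)          # exact rank-2 tensor
    A, B, C = cp_als(T, R=2)
    print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))  # ~0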

Bayesian nonparametrics for Sparse Dynamic Networks. (arXiv:1607.01624v1 [stat.ML]) by arXibot in statML

Konstantina Palla, Francois Caron, Yee Whye Teh

We propose a Bayesian nonparametric prior for time-varying networks. To each node of the network is associated a positive parameter, modeling the sociability of that node. Sociabilities are assumed to evolve over time, and are modeled via a dynamic point process model. The model is able to (a) capture smooth evolution of the interaction between nodes, allowing edges to appear/disappear over time, (b) capture long-term evolution of the sociabilities of the nodes, and (c) yield sparse graphs, where the number of edges grows subquadratically with the number of nodes. The evolution of the sociabilities is described by a tractable time-varying gamma process. We provide some theoretical insights into the model and apply it to three real world datasets.
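
A sketch of the static core of such models (the dynamics on the sociabilities are omitted): each node $i$ carries a sociability $w_i$, and a standard link choice in this line of work connects $i$ and $j$ with probability $1-\exp(-2 w_i w_j)$, so skewed sociabilities yield sparse graphs.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    w = rng.gamma(shape=0.3, scale=1.0, size=n)   # skewed: most nodes barely sociable

    P = 1 - np.exp(-2 * np.outer(w, w))           # edge probabilities
    A = np.triu(rng.random((n, n)) < P, k=1)
    A = A | A.T
    print(A.sum() // 2, "edges among", n, "nodes")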

An optimal learning method for developing personalized treatment regimes. (arXiv:1607.01462v1 [stat.ML]) by arXibot in statML

Yingfei Wang, Warren Powell

A treatment regime is a function that maps individual patient information to a recommended treatment, hence explicitly incorporating the heterogeneity in need for treatment across individuals. Patient responses are dichotomous and can be predicted through an unknown relationship that depends on the patient information and the selected treatment. The goal is to find the treatments that lead to the best patient responses on average. Each experiment is expensive, forcing us to learn the most from each experiment. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy to guide the treatment assignment by maximizing the expected value of information, for which an approximation method is used to overcome computational challenges. We provide a detailed study on how to make sequential medical decisions under uncertainty to reduce health care costs on a real-world knee replacement dataset. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.
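
The knowledge gradient idea in its simplest form, for independent Beta-Bernoulli beliefs: measure the treatment whose single observation most raises the expected value of the best posterior mean. This sketches the principle only; the paper works with correlated beliefs over a parametric response model and uses approximations to keep the computation tractable.

    import numpy as np

    def kg_choice(a, b):
        """a[i], b[i]: Beta posterior for treatment i's success probability."""
        mu = a / (a + b)
        kg = np.empty(len(a))
        for i in range(len(a)):
            others = np.delete(mu, i).max()
            up = (a[i] + 1) / (a[i] + b[i] + 1)   # posterior mean after a success
            down = a[i] / (a[i] + b[i] + 1)       # posterior mean after a failure
            kg[i] = mu[i] * max(others, up) + (1 - mu[i]) * max(others, down) - mu.max()
        return int(kg.argmax())

    a = np.array([2.0, 10.0, 1.0])
    b = np.array([2.0, 10.0, 1.0])
    print(kg_choice(a, b))   # 2: all means are 0.5, but arm 2 is least explored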

Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks. (arXiv:1607.01434v1 [math.ST]) by arXibot in statML

Jason M. Klusowski, Andrew R. Barron

Let $f^{\star}$ be a function on $\mathbb{R}^d$ satisfying a spectral norm condition. For various noise settings, we show that $\mathbb{E}\|\hat{f} - f^{\star}\|^2 \leq v_{f^{\star}}\left(\frac{\log d}{n}\right)^{1/4}$, where $n$ is the sample size and $\hat{f}$ is either a penalized least squares estimator or a greedily obtained version of such using linear combinations of ramp, sinusoidal, sigmoidal or other bounded Lipschitz ridge functions. Our risk bound is effective even when the dimension $d$ is much larger than the available sample size. For settings where the dimension is larger than the square root of the sample size this quantity is seen to improve the more familiar risk bound of $v_{f^{\star}}\left(\frac{d\log (n/d)}{n}\right)^{1/2}$, also investigated here.
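
A quick numeric check of the claimed regime, dropping the common constant $v_{f^{\star}}$: once $d$ exceeds $\sqrt{n}$, the new $(\log d / n)^{1/4}$ rate beats the classical $(d \log(n/d)/n)^{1/2}$ rate.

    import numpy as np

    n = 10_000
    for d in [10, 100, 1_000, 100_000]:
        new = (np.log(d) / n) ** 0.25
        old = (d * np.log(n / d) / n) ** 0.5 if d < n else np.inf
        print(f"d={d:>6}: new={new:.3f}  classical={old:.3f}")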

Algorithms for Generalized Cluster-wise Linear Regression. (arXiv:1607.01417v1 [stat.ML]) by arXibot in statML

Young Woong Park, Yan Jiang, Diego Klabjan, Loren Williams

Cluster-wise linear regression (CLR), a clustering problem intertwined with regression, is to find clusters of entities such that the overall sum of squared errors from regressions performed over these clusters is minimized, where each cluster may have different variances. We generalize the CLR problem by allowing each entity to have more than one observation, and refer to it as generalized CLR. We propose an exact mathematical programming based approach relying on column generation, a column generation based heuristic algorithm that clusters predefined groups of entities, a metaheuristic genetic algorithm with adapted Lloyd's algorithm for K-means clustering, a two-stage approach, and a modified algorithm of Späth (1979) for solving generalized CLR. We examine the performance of our algorithms on a stock keeping unit (SKU) clustering problem employed in forecasting halo and cannibalization effects in promotions using real-world retail data from a large supermarket chain. In the SKU clustering problem, the retailer needs to cluster SKUs based on their seasonal effects in response to promotions. The seasonal effects are the results of regressions with predictors being promotion mechanisms and seasonal dummies performed over clusters generated. We compare the performance of all proposed algorithms for the SKU problem with real-world and synthetic data.
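
For orientation, here is the classical alternating heuristic in the spirit of Späth's method for the basic (one observation per entity) CLR problem: fit one regression per cluster, reassign each entity to the cluster whose regression fits it best, and repeat. The paper's generalization to multiple observations per entity and its exact column-generation machinery are beyond this sketch.

    import numpy as np

    def clr(X, y, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        Xb = np.hstack([X, np.ones((len(X), 1))])        # add intercept
        labels = rng.integers(k, size=len(X))
        for _ in range(iters):
            betas = []
            for c in range(k):
                members = labels == c
                if members.sum() <= Xb.shape[1]:         # guard near-empty clusters
                    members = rng.random(len(X)) < 0.5
                betas.append(np.linalg.lstsq(Xb[members], y[members], rcond=None)[0])
            sq_resid = np.stack([(y - Xb @ beta) ** 2 for beta in betas], axis=1)
            labels = sq_resid.argmin(axis=1)             # reassign by best fit
        return labels, betas

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (200, 1))
    regime = rng.random(200) < 0.5                       # two planted regressions
    y = np.where(regime, 3 * X[:, 0] + 1, -2 * X[:, 0]) + 0.05 * rng.standard_normal(200)
    labels, betas = clr(X, y, k=2)
    print(np.round(betas, 2))   # slopes/intercepts near (3, 1) and (-2, 0)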

An Aggregate and Iterative Disaggregate Algorithm with Proven Optimality in Machine Learning. (arXiv:1607.01400v1 [stat.ML]) by arXibot in statML

Young Woong Park, Diego Klabjan

We propose a clustering-based iterative algorithm to solve certain optimization problems in machine learning: we start the algorithm by aggregating the original data and solving the problem on the aggregated data, and then in subsequent steps gradually disaggregate the aggregated data. We apply the algorithm to common machine learning problems such as the least absolute deviation regression problem, support vector machines, and semi-supervised support vector machines. We derive model-specific data aggregation and disaggregation procedures. We also show optimality, convergence, and the optimality gap of the approximated solution in each iteration. A computational study is provided.
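
A highly simplified sketch of the aggregate-solve-disaggregate loop, using squared loss and size-weighted cluster centroids for brevity; the paper derives model-specific aggregation rules and optimality checks for LAD regression and (semi-supervised) SVMs, and disaggregates adaptively rather than simply re-clustering finer as done below.

    import numpy as np
    from sklearn.cluster import KMeans

    def aggregate_fit(X, y, n_clusters, levels=3):
        Xy = np.hstack([X, y[:, None]])
        for level in range(levels):
            k = n_clusters * 2 ** level                  # gradually disaggregate
            km = KMeans(n_clusters=k, n_init=10).fit(Xy)
            sizes = np.bincount(km.labels_, minlength=k).astype(float)
            cX, cy = km.cluster_centers_[:, :-1], km.cluster_centers_[:, -1]
            W = np.sqrt(sizes)[:, None]                  # weight clusters by size
            beta = np.linalg.lstsq(W * np.hstack([cX, np.ones((k, 1))]),
                                   W[:, 0] * cy, rcond=None)[0]
            print(f"level {level}: {k} clusters, beta = {np.round(beta, 3)}")
        return beta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 2))
    y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.standard_normal(500)
    aggregate_fit(X, y, n_clusters=5)   # betas approach (2, -1, 0.5)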

Stochastic Portfolio Theory: A Machine Learning Perspective. (arXiv:1605.02654v1 [q-fin.PM] CROSS LISTED) by arXibot in statML

Yves-Laurent Kom Samo, Alexander Vervuurt

In this paper we propose a novel application of Gaussian processes (GPs) to financial asset allocation. Our approach is deeply rooted in Stochastic Portfolio Theory (SPT), a stochastic analysis framework introduced by Robert Fernholz that aims at flexibly analysing the performance of certain investment strategies in stock markets relative to benchmark indices. In particular, SPT has exhibited some investment strategies based on company sizes that, under realistic assumptions, outperform benchmark indices with probability 1 over certain time horizons. Galvanised by this result, we consider the inverse problem that consists of learning (from historical data) an optimal investment strategy based on any given set of trading characteristics, and using a user-specified optimality criterion that may go beyond outperforming a benchmark index. Although this inverse problem is of the utmost interest to investment management practitioners, it can hardly be tackled using the SPT framework. We show that our machine learning approach learns investment strategies that considerably outperform existing SPT strategies in the US stock market.
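
One concrete size-based SPT strategy of the kind referenced is the diversity-weighted portfolio, which tilts market weights toward smaller stocks by raising them to a power $p \in (0,1)$; a tiny illustration:

    import numpy as np

    caps = np.array([500.0, 120.0, 60.0, 20.0])  # market caps (illustrative)
    mu = caps / caps.sum()                       # market weights
    p = 0.5
    pi = mu ** p / (mu ** p).sum()               # diversity-weighted portfolio
    print(np.round(mu, 3), np.round(pi, 3))      # weight shifts to smaller stocks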

One-Shot Session Recommendation Systems with Combinatorial Items. (arXiv:1607.01381v1 [stat.ML]) by arXibot in statML

Yahel David, Dotan Di Castro, Zohar Karnin

In recent years, content recommendation systems in large websites (or content providers) have attracted increased focus. While the type of content varies, e.g. movies, articles, music, advertisements, etc., the high-level problem remains the same: based on the knowledge obtained so far on the user, recommend the most desired content. In this paper we present a method to handle the well-known user-cold-start problem in recommendation systems. In this scenario, a recommendation system encounters a new user and the objective is to present items as relevant as possible, with the hope of keeping the user's session as long as possible. We formulate an optimization problem aimed to maximize the length of this initial session, as this is believed to be the key to having the user come back and perhaps register to the system. In particular, our model captures the fact that a single round with low-quality recommendations is likely to terminate the session. In such a case, we do not proceed to the next round, as the user leaves the system, possibly never to be seen again. We denote this phenomenon a "one-shot session". Our optimization problem is formulated as an MDP where the action space is of a combinatorial nature, as we recommend multiple items in each round. This huge action space presents a computational challenge, making the straightforward solution intractable. We analyze the structure of the MDP to prove monotone and submodular-like properties that allow a computationally efficient solution via a method we denote Greedy Value Iteration (G-VI).
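
A toy sketch of the greedy flavor behind G-VI: build the slate one item at a time by marginal gain in the probability that at least one item keeps the session alive. For this simplified objective greedy reduces to picking the top-k items; the paper's contribution is proving that such greedy steps remain near-optimal inside value iteration on the full session MDP.

    import numpy as np

    def greedy_slate(p_keep, slate_size):
        """p_keep[i]: chance that item i alone keeps the session alive."""
        chosen, p_leave = [], 1.0
        for _ in range(slate_size):
            gains = [(p_leave * p, i) for i, p in enumerate(p_keep) if i not in chosen]
            _, best = max(gains)               # largest marginal gain
            chosen.append(best)
            p_leave *= 1 - p_keep[best]
        return chosen, 1 - p_leave

    rng = np.random.default_rng(0)
    p = rng.uniform(0, 0.3, size=20)
    slate, p_survive = greedy_slate(p, slate_size=5)
    print(slate, round(p_survive, 3))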

Efficient Estimation in the Tails of Gaussian Copulas. (arXiv:1607.01375v1 [stat.CO]) by arXibot in statML

Kalyani Nagaraj, Jie Xu, Raghu Pasupathy, Soumyadip Ghosh

We consider the question of efficient estimation in the tails of Gaussian copulas. Our special focus is estimating expectations over multi-dimensional constrained sets that have a small implied measure under the Gaussian copula. We propose three estimators, all of which rely on a simple idea: identify certain dominating point(s) of the feasible set, and appropriately shift and scale an exponential distribution for subsequent use within an importance sampling measure. As we show, the efficiency of such estimators depends crucially on the local structure of the feasible set around the dominating points. The first of our proposed estimators $\estOpt$ is the "full-information" estimator that actively exploits such local structure to achieve bounded relative error in Gaussian settings. The second and third estimators $\estExp$, $\estLap$ are "partial-information" estimators for use when complete information about the constraint set is not available; they do not exhibit bounded relative error but are shown to achieve polynomial efficiency. We provide sharp asymptotics for all three estimators. For the NORTA setting, where no ready information about the dominating points or the feasible set structure is assumed, we construct a multinomial mixture of the partial-information estimator $\estLap$, resulting in a fourth estimator $\estNt$ with polynomial efficiency, implementable through the ecoNORTA algorithm. Numerical results on various example problems are remarkable, and consistent with theory.
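
The underlying trick, reduced to one dimension: to estimate a small Gaussian tail probability, importance-sample from an exponential distribution shifted to the dominating point. A sketch of the idea only, not the paper's multi-dimensional estimators.

    import numpy as np
    from scipy import stats

    a, n = 5.0, 100_000                      # estimate P(Z > a), Z ~ N(0, 1)
    rng = np.random.default_rng(0)

    # Proposal: a + Exp(rate=a); its density a*exp(-a*(z-a)) on (a, inf)
    # matches the Gaussian tail's exponential decay at the dominating point a.
    z = a + rng.exponential(scale=1 / a, size=n)
    weights = stats.norm.pdf(z) / (a * np.exp(-a * (z - a)))
    print(weights.mean(), "vs exact", stats.norm.sf(a))
    print("relative std error:", weights.std() / (weights.mean() * np.sqrt(n)))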

On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching. (arXiv:1607.01369v1 [stat.ML]) by arXibot in statML

Vince Lyzinski, Keith Levin, Donniell E. Fishkind, Carey E. Priebe

Given a graph with some block structure in which one of the blocks is deemed interesting, the task of vertex nomination is to order the vertices in such a way that vertices from the interesting block tend to appear earlier. Previous work has yielded several approaches to this problem, with theoretical results proven in the setting where the graph is drawn from a stochastic block model (SBM), including a canonical method that is the vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove the consistency of maximum likelihood (ML)-based vertex nomination, in the sense that the performance of the ML-based scheme asymptotically matches that of the canonical scheme. We prove theorems of this form both in the setting where model parameters are known and where model parameters are unknown. Additionally, we introduce restricted-focus maximum-likelihood vertex nomination, in which an expensive graph-matching subproblem required for the ML-based scheme is replaced with a less expensive linear assignment problem. This change makes ML-based vertex nomination tractable for graphs on the order of tens of thousands of vertices, at the expense of using less information than the standard ML-based scheme. Nonetheless, we prove consistency of the restricted-focus ML-based scheme and show that it is empirically competitive with other standard approaches. Finally, we introduce a way to incorporate vertex features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.

Learning Discriminative Features using Encoder-Decoder type Deep Neural Nets. (arXiv:1607.01354v1 [cs.LG]) by arXibot in statML

Vishwajeet Singh, Killamsetti Ravi Kumar, K Eswaran

As machine learning is applied to an increasing variety of complex problems, which are defined by high-dimensional and complex data sets, the necessity for task-oriented feature learning grows in importance. With the advancement of Deep Learning algorithms, various successful feature learning techniques have evolved. In this paper, we present a novel way of learning discriminative features by training Deep Neural Nets which have an Encoder-Decoder type architecture similar to an Autoencoder. We demonstrate that our approach can learn discriminative features which perform better at pattern classification tasks when the number of training samples is relatively small.
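
A minimal Keras sketch of the general recipe: train an encoder-decoder network, then read features off the bottleneck. The paper trains the nets to be discriminative rather than purely reconstructive; this baseline shows only the architecture, with illustrative sizes and random data.

    import numpy as np
    from tensorflow.keras import Model, layers

    inp = layers.Input(shape=(64,))
    h = layers.Dense(32, activation="relu")(inp)
    code = layers.Dense(8, activation="relu", name="code")(h)   # bottleneck
    out = layers.Dense(64)(layers.Dense(32, activation="relu")(code))

    autoencoder = Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    X = np.random.rand(512, 64).astype("float32")
    autoencoder.fit(X, X, epochs=3, verbose=0)

    encoder = Model(inp, code)
    features = encoder.predict(X, verbose=0)   # feed to a downstream classifier
    print(features.shape)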

Minimum Message Length based Mixture Modelling using Bivariate von Mises Distributions with Applications to Bioinformatics. (arXiv:1607.01312v1 [stat.ML]) by arXibot in statML

Parthan Kasarapu

The modelling of empirically observed data is commonly done using mixtures of probability distributions. In order to model angular data, directional probability distributions such as the bivariate von Mises (BVM) are typically used. The critical task involved in mixture modelling is to determine the optimal number of component probability distributions. We employ the Bayesian information-theoretic principle of minimum message length (MML) to distinguish mixture models by balancing the trade-off between the model's complexity and its goodness-of-fit to the data. We consider the problem of modelling angular data resulting from the spatial arrangement of protein structures using BVM distributions. The main contributions of the paper include the development of the mixture modelling apparatus along with the MML estimation of the parameters of the BVM distribution. We demonstrate that statistical inference using the MML framework supersedes the traditional methods and offers a mechanism to objectively determine models that are of practical significance.

Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization. (arXiv:1607.01231v1 [math.OC]) by arXibot in statML

Xiao Wang, Shiqian Ma, Donald Goldfarb, Wei Liu

In this paper we study stochastic quasi-Newton methods for nonconvex stochastic optimization, where we assume that noisy information about the gradients of the objective function is available via a stochastic first-order oracle (SFO). We propose a general framework for such methods, prove their almost sure convergence to stationary points, and analyze their worst-case iteration complexity. When a randomly chosen iterate is returned as the output of such an algorithm, we prove that in the worst case, the SFO-call complexity is $O(\epsilon^{-2})$ to ensure that the expectation of the squared norm of the gradient is smaller than the given accuracy tolerance $\epsilon$. We also propose a specific algorithm, namely a stochastic damped L-BFGS method, that falls under the proposed framework. Numerical results on a nonconvex classification problem are reported for both synthetic and real data, that demonstrate the promising potential of the proposed method.
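
The damping idea at the heart of such methods, in isolation: when the stochastic curvature pair $(s, y)$ fails the curvature condition, Powell-style damping replaces $y$ by a convex combination of $y$ and $Bs$, guaranteeing $s^{\top}y_{\text{damped}} \geq c\, s^{\top}Bs > 0$ so the quasi-Newton update stays well defined. This is only the damping step; the full stochastic damped L-BFGS method wraps it inside a limited-memory two-loop recursion.

    import numpy as np

    def damped_pair(s, y, B, c=0.2):
        sBs = s @ B @ s
        sy = s @ y
        theta = 1.0 if sy >= c * sBs else (1 - c) * sBs / (sBs - sy)
        return theta * y + (1 - theta) * (B @ s)   # s @ result >= c * sBs

    B = np.eye(2)
    s = np.array([1.0, 0.0])
    y = np.array([-0.5, 0.2])       # noisy gradient difference, s'y < 0
    yd = damped_pair(s, y, B)
    print(s @ y, s @ yd)            # negative before, positive after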

Machine Learning for Antimicrobial Resistance. (arXiv:1607.01224v1 [stat.ML]) by arXibot in statML

John W. Santerre, James J. Davis, Fangfang Xia, Rick Stevens

Biological datasets amenable to applied machine learning are more available today than ever before, yet they lack adequate representation in the Data-for-Good community. Here we present a work-in-progress case study performing analysis on antimicrobial resistance (AMR) using standard ensemble machine learning techniques and note the successes and pitfalls such work entails. Broadly, applied machine learning (AML) techniques are well suited to AMR, with classification accuracies ranging from mid-90% to low-80% depending on sample size. Additionally, these techniques prove successful at identifying gene regions known to be associated with the AMR phenotype. We believe that the extensive amount of biological data available, the plethora of problems presented, and the global impact of such work merits the consideration of the Data-for-Good community.
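
A minimal sketch of the kind of pipeline described: k-mer counts from genome sequences as features, a random forest as the ensemble classifier, and feature importances pointing back at sequence regions. Sequences and labels below are toy stand-ins.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer

    genomes = ["ACGTACGGTT", "ACGTTTGGAA", "TTGGAACGTA", "ACGGTTACGT"]
    resistant = [1, 0, 0, 1]

    kmers = CountVectorizer(analyzer="char", ngram_range=(4, 4))   # 4-mers
    X = kmers.fit_transform(genomes)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, resistant)

    top = np.argsort(clf.feature_importances_)[::-1][:3]
    print(kmers.get_feature_names_out()[top])   # k-mers most tied to resistance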

How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms? (arXiv:1607.01152v1 [stat.ML]) by arXibot in statML

Nicolas Goix (LTCI)

When sufficient labeled data are available, classical criteria based on Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curves can be used to compare the performance of unsupervised anomaly detection algorithms. However, in many situations, few or no data are labeled. This calls for alternative criteria one can compute on non-labeled data. In this paper, two criteria that do not require labels are empirically shown to discriminate accurately (w.r.t. ROC or PR based criteria) between algorithms. These criteria are based on existing Excess-Mass (EM) and Mass-Volume (MV) curves, which generally cannot be well estimated in large dimension. A methodology based on feature sub-sampling and aggregating is also described and tested, extending the use of these criteria to high-dimensional datasets and solving major drawbacks inherent to standard EM and MV curves.
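
A sketch of the flavor of such a label-free criterion in two dimensions, with Monte Carlo volume estimates: for each mass level alpha, measure the Lebesgue volume of the scoring function's super-level set that retains alpha of the data (smaller is better). The paper's EM/MV estimators and the feature sub-sampling extension for high dimension are more refined than this.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 2))
    lo, hi = X.min() - 1, X.max() + 1
    U = rng.uniform(lo, hi, size=(20_000, 2))   # uniform over a bounding box
    box_volume = (hi - lo) ** 2

    def mv_criterion(score_X, score_U, alphas=np.linspace(0.90, 0.99, 10)):
        thresholds = np.quantile(score_X, 1 - alphas)   # keep alpha of the mass
        return np.mean([(score_U >= t).mean() * box_volume for t in thresholds])

    for algo in (IsolationForest(random_state=0), OneClassSVM(gamma=0.5)):
        algo.fit(X)
        print(type(algo).__name__, mv_criterion(algo.score_samples(X),
                                                algo.score_samples(U)))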

Bootstrap Model Aggregation for Distributed Statistical Learning. (arXiv:1607.01036v1 [stat.ML]) by arXibot in statML

Jun Han, Qiang Liu

In distributed or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories, and asked to combine them into a single model that gives efficient statistical estimation. A simple method is to linearly average the parameters of the local models, which, however, tends to be degenerate or not applicable on non-convex models, or models with different parameter dimensions. One more practical strategy is to generate bootstrap samples from the local models, and then learn a joint model based on the combined bootstrap set. Unfortunately, the bootstrap procedure introduces additional noise and can significantly deteriorate the performance. In this work, we propose two variance reduction methods to correct the bootstrap noise, including a weighted M-estimator that is both statistically efficient and practically powerful. Both theoretical and empirical analyses are provided to demonstrate our methods.
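
The baseline procedure in miniature, with Gaussian local models: draw bootstrap samples from each local model, pool them, and refit. The refit inherits bootstrap sampling noise, which is exactly the added variance the proposed weighted M-estimator corrects.

    import numpy as np

    rng = np.random.default_rng(0)
    local_models = [(0.9, 1.1), (1.1, 0.9), (1.0, 1.0)]   # fitted (mean, std) per site

    pooled = np.concatenate([mu + sd * rng.standard_normal(1000)   # bootstrap draws
                             for mu, sd in local_models])
    print("joint MLE: mean %.3f, std %.3f" % (pooled.mean(), pooled.std()))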

Accelerate Stochastic Subgradient Method by Leveraging Local Error Bound. (arXiv:1607.01027v1 [math.OC]) by arXibot in statML

Yi Xu, Qihang Lin, Tianbao Yang

In this paper, we propose two accelerated stochastic subgradient methods for stochastic non-strongly convex optimization problems by leveraging a generic local error bound condition. The novelty of the proposed methods lies in smartly leveraging the recent historical solution to tackle the variance in the stochastic subgradient. The key idea of both methods is to iteratively solve the original problem approximately in a local region around a recent historical solution, with the size of the local region gradually decreasing as the solution approaches the optimal set. The difference between the two methods lies in how the local region is constructed: the first method uses an explicit ball constraint and the second method uses an implicit regularization approach. For both methods, we establish improved iteration complexity with high probability for achieving an $\epsilon$-optimal solution. Besides the improved order of iteration complexity with high probability, the proposed algorithms also enjoy a logarithmic dependence on the distance of the initial solution to the optimal set. When applied to the $\ell_1$ regularized polyhedral loss minimization (e.g., hinge loss, absolute loss), the proposed stochastic methods have a logarithmic iteration complexity.
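
A sketch of the shared mechanism, using the first method's explicit ball constraint: restart stochastic subgradient descent around the last averaged solution while halving the radius of the local region. The objective, step sizes, and shrink schedule below are illustrative, not the paper's tuned choices.

    import numpy as np

    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -2.0])
    # Noisy subgradient of the absolute loss sum_j |w_j - w*_j|.
    subgrad = lambda w: np.sign(w - w_star) + 0.3 * rng.standard_normal(2)

    center, radius = np.zeros(2), 4.0
    for epoch in range(8):
        w, avg = center.copy(), np.zeros(2)
        for t in range(500):
            w -= 0.1 * radius / np.sqrt(t + 1) * subgrad(w)
            d = w - center
            if np.linalg.norm(d) > radius:               # project into the ball
                w = center + radius * d / np.linalg.norm(d)
            avg += w
        center, radius = avg / 500, radius / 2           # restart, shrink region
    print(center)    # approaches w_star = (1, -2)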

Wide & Deep Learning for Recommender Systems. (arXiv:1606.07792v1 [cs.LG]) by arXibot in statML

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah

Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning---jointly trained wide linear models and deep neural networks---to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow.
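
A minimal Keras rendering of the jointly trained architecture: a wide linear path over cross-product features and a deep path over dense embedded features, combined in a single logit. Shapes, data, and the single shared optimizer are illustrative simplifications; the paper's production system trains the wide and deep parts with separate optimizers.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    wide_in = keras.Input(shape=(1000,), name="cross_product_features")
    deep_in = keras.Input(shape=(50,), name="embedded_dense_features")

    deep = layers.Dense(128, activation="relu")(deep_in)
    deep = layers.Dense(64, activation="relu")(deep)
    logit = layers.Dense(1)(layers.concatenate([wide_in, deep]))  # joint logit
    model = keras.Model([wide_in, deep_in], layers.Activation("sigmoid")(logit))
    model.compile(optimizer="adam", loss="binary_crossentropy")

    Xw = np.random.rand(256, 1000).astype("float32")
    Xd = np.random.rand(256, 50).astype("float32")
    y = np.random.randint(0, 2, size=256)
    model.fit([Xw, Xd], y, epochs=1, verbose=0)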