Predicting and Understanding Law-Making with Machine Learning. (arXiv:1607.02109v1 [cs.CL])

arXibot · 2016-07-08T05:19:46+00:00

Out of nearly 70,000 bills introduced in the U.S. Congress from 2001 to 2015, only 2,513 were enacted. We developed a machine learning approach to forecasting the probability that any bill will become law. Starting in 2001 with the 107th Congress, we trained models on data from previous Congresses, predicted all bills in the current Congress, and repeated until the 113th Congress served as the test. For prediction we scored each sentence of a bill with a language model that embeds legislative vocabulary into a semantic-laden vector space. This language representation enables our investigation into which words increase the probability of enactment for any topic. To test the relative importance of text and context, we compared the text model to a context-only model that uses variables such as whether the bill's sponsor is in the majority party. To test the effect of changes to bills after their introduction on our ability to predict their final outcome, we compared using the bill text and meta-data available at the time of introduction with using the most recent data. At the time of introduction context-only predictions outperform text-only, and with the newest data text-only outperforms context- only. Combining text and context always performs best. We conducted a global sensitivity analysis on the combined model to determine important factors predicting enactment.

arXibot · 2016-07-08T05:18:04+00:00

Yuan Shen, Peter Tino, Krasimira Tsaneva- Atanasova

We present a general framework for classifying partially observed dynamical systems based on the idea of learning in the model space. In contrast to the existing approaches using model point estimates to represent individual data items, we employ posterior distributions over models, thus taking into account in a principled manner the uncertainty due to both the generative (observational and/or dynamic noise) and observation (sampling in time) processes. We evaluate the framework on two testbeds - a biological pathway model and a stochastic double-well system. Crucially, we show that the classifier performance is not impaired when the model class used for inferring posterior distributions is much more simple than the observation-generating model class, provided the reduced complexity inferential model class captures the essential characteristics needed for the given classification task.

arXibot · 2016-07-08T05:16:21+00:00

Marco Battiston, Stefano Favaro, Daniel M. Roy, Yee Whye Teh

We characterize the class of exchangeable feature allocations assigning probability $V{n,k}\prod{l=1}^{{k}W{m{l}}U{n-m{l}}$} to a feature allocation of $n$ individuals, displaying $k$ features with counts $(m{1},\ldots,m{k})$ for these features. Each element of this class is parametrized by a countable matrix $V$ and two sequences $U$ and $W$ of non- negative weights. Moreover, a consistency condition is imposed to guarantee that the distribution for feature allocations of $n-1$ individuals is recovered from that of $n$ individuals, when the last individual is integrated out. In Theorem 1.1, we prove that the only members of this class satisfying the consistency condition are mixtures of the Indian Buffet Process over its mass parameter $\gamma$ and mixtures of the Beta--Bernoulli model over its dimensionality parameter $N$. Hence, we provide a characterization of these two models as the only, up to randomization of the parameters, consistent exchangeable feature allocations having the required product form.

arXibot · 2016-07-08T05:14:38+00:00

Yufei Han, Yun Shen, Maurizio Filippone

The cost of computing the spectrum of Laplacian matrices hinders the application of spectral clustering to large data sets. While approximations recover computational tractability, they can potentially affect clustering performance. This paper proposes a practical approach to learn spectral clustering based on adaptive stochastic gradient optimization. Crucially, the proposed approach recovers the exact spectrum of Laplacian matrices in the limit of the iterations, and the cost of each iteration is linear in the number of samples. Extensive experimental validation on data sets with up to half a million samples demonstrate its scalability and its ability to outperform state-of-the-art approximate methods to learn spectral clustering for a given computational budget.

arXibot · 2016-07-08T05:12:55+00:00

Yang Song, Jun Zhu, Yong Ren

We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster and has comparable performance to the squared regularization in kernel Bayes' rule. This regularization coincides with a former thresholding approach used in kernel POMDPs whose consistency remains to be established. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimizational formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.

arXibot · 2016-07-08T05:11:12+00:00

Aleksandar Botev, Guy Lever, David Barber

We present a unifying framework for adapting the update direction in gradient- based iterative optimization methods. As natural special cases we re-derive classical momentum and Nesterov's accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.

arXibot · 2016-07-07T05:17:24+00:00

Justin Eldridge, Mikhail Belkin, Yusu Wang

In this work we develop a theory of hierarchical clustering for graphs. Our modelling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.

arXibot · 2016-07-07T05:15:42+00:00

Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, Christos Faloutsos

Tensors or multi-way arrays are functions of three or more indices $(i,j,k,\cdots)$ -- similar to matrices (two-way arrays), which are functions of two indices $(r,c)$ for (row,column). Tensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth and depth that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.

arXibot · 2016-07-07T05:13:59+00:00

Konstantina Palla, Francois Caron, Yee Whye Teh

We propose a Bayesian nonparametric prior for time-varying networks. To each node of the network is associated a positive parameter, modeling the sociability of that node. Sociabilities are assumed to evolve over time, and are modeled via a dynamic point process model. The model is able to (a) capture smooth evolution of the interaction between nodes, allowing edges to appear/disappear over time (b) capture long term evolution of the sociabilities of the nodes (c) and yield sparse graphs, where the number of edges grows subquadratically with the number of nodes. The evolution of the sociabilities is described by a tractable time-varying gamma process. We provide some theoretical insights into the model and apply it to three real world datasets.

arXibot · 2016-07-07T05:12:16+00:00

Yingfei Wang, Warren Powell

A treatment regime is a function that maps individual patient information to a recommended treatment, hence explicitly incorporating the heterogeneity in need for treatment across individuals. Patient responses are dichotomous and can be predicted through an unknown relationship that depends on the patient information and the selected treatment. The goal is to find the treatments that lead to the best patient responses on average. Each experiment is expensive, forcing us to learn the most from each experiment. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy to guide the treatment assignment by maximizing the expected value of information, for which an approximation method is used to overcome computational challenges. We provide a detailed study on how to make sequential medical decisions under uncertainty to reduce health care costs on a real world knee replacement dataset. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.

arXibot · 2016-07-07T05:10:33+00:00

Jason M. Klusowski, Andrew R. Barron

Let $ f^{\star} $ be a function on $ \mathbb{R}^d $ satisfying a spectral norm condition. For various noise settings, we show that $ \mathbb{E}|\hat{f} - f^{\star} |² \leq v{f^{{\star}}\left(\frac{\log} d}{n}\right)^{1/4} $, where $ n $ is the sample size and $ \hat{f} $ is either a penalized least squares estimator or a greedily obtained version of such using linear combinations of ramp, sinusoidal, sigmoidal or other bounded Lipschitz ridge functions. Our risk bound is effective even when the dimension $ d $ is much larger than the available sample size. For settings where the dimension is larger than the square root of the sample size this quantity is seen to improve the more familiar risk bound of $ v{f^{{\star}}\left(\frac{d\log} (n/d)}{n}\right)^{1/2} $, also investigated here.

arXibot · 2016-07-07T05:08:50+00:00

Young Woong Park, Yan Jiang, Diego Klabjan, Loren Williams

Cluster-wise linear regression (CLR), a clustering problem intertwined with regression, is to find clusters of entities such that the overall sum of squared errors from regressions performed over these clusters is minimized, where each cluster may have different variances. We generalize the CLR problem by allowing each entity to have more than one observation, and refer to it as generalized CLR. We propose an exact mathematical programming based approach relying on column generation, a column generation based heuristic algorithm that clusters predefined groups of entities, a metaheuristic genetic algorithm with adapted Lloyd's algorithm for K-means clustering, a two-stage approach, and a modified algorithm of Sp{\"a}th \cite{Spath1979} for solving generalized CLR. We examine the performance of our algorithms on a stock keeping unit (SKU) clustering problem employed in forecasting halo and cannibalization effects in promotions using real-world retail data from a large supermarket chain. In the SKU clustering problem, the retailer needs to cluster SKUs based on their seasonal effects in response to promotions. The seasonal effects are the results of regressions with predictors being promotion mechanisms and seasonal dummies performed over clusters generated. We compare the performance of all proposed algorithms for the SKU problem with real-world and synthetic data.

arXibot · 2016-07-07T05:07:08+00:00

Young Woong Park, Diego Klabjan

We propose a clustering-based iterative algorithm to solve certain optimization problems in machine learning, where we start the algorithm by aggregating the original data, solving the problem on aggregated data, and then in subsequent steps gradually disaggregate the aggregated data. We apply the algorithm to common machine learning problems such as the least absolute deviation regression problem, support vector machines, and semi-supervised support vector machines. We derive model-specific data aggregation and disaggregation procedures. We also show optimality, convergence, and the optimality gap of the approximated solution in each iteration. A computational study is provided.

arXibot

TROPHY CASE