Examining the impact of data quality and completeness of electronic health records on predictions of patients risks of cardiovascular disease. (arXiv:1911.08504v1 [stat.AP])

The objective is to assess how much the data quality and completeness of electronic health records vary, and how this variation affects the robustness of risk predictions of incident cardiovascular disease (CVD), using a risk prediction tool that is based on routinely collected data (QRISK3). The study design is a longitudinal cohort study with a setting…

Gromov-Wasserstein Factorization Models for Graph Clustering. (arXiv:1911.08530v1 [cs.LG])

We propose a new nonlinear factorization model for graphs with topological structures and, optionally, node attributes. This model is based on a pseudometric called Gromov-Wasserstein (GW) discrepancy, which compares graphs in a relational way. It estimates observed graphs as GW barycenters constructed from a set of atoms with different weights. By minimizing the…

A Framework for Challenge Design: Insight and Deployment Challenges to Address Medical Image Analysis Problems. (arXiv:1911.08531v1 [stat.AP])

In this paper we aim to refine the concept of grand challenges in medical image analysis, based on statistical principles from quantitative and qualitative experimental research. We identify two types of challenges based on their generalization objective: 1) a deployment challenge and 2) an insight challenge. A deployment challenge’s generalization objective is to find algorithms…

Robust Learning of Discrete Distributions from Batches. (arXiv:1911.08532v1 [cs.LG])

Let $d$ be the lowest $L_1$ distance to which a $k$-symbol distribution $p$ can be estimated from $m$ batches of $n$ samples each, when up to $\beta m$ batches may be adversarial. For $\beta<1/2$, Qiao and Valiant (2017) showed that $d=\Omega(\beta/\sqrt{n})$ and requires $m=\Omega(k/\beta^2)$ batches. For $\beta<1/900$, they provided a $d$ and $m$ order-optimal algorithm…
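To make the quantities in this abstract concrete, here is a minimal sketch of the $L_1$ distance between a $k$-symbol distribution and a naive pooled estimate built from batches. It is not the algorithm of the paper or of Qiao and Valiant (2017); the function names `empirical_distribution` and `l1_distance` and the toy batches are assumptions made for illustration only.

```python
from collections import Counter

def empirical_distribution(samples, k):
    """Empirical frequencies of symbols 0..k-1 in a list of samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(symbol, 0) / n for symbol in range(k)]

def l1_distance(p, q):
    """L1 distance between two distributions over the same k symbols."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical toy data: k = 2 symbols, m = 3 batches of n = 4 samples each.
true_p = [0.5, 0.5]
good_batches = [[0, 1, 0, 1], [1, 0, 1, 0]]
adversarial_batch = [[0, 0, 0, 0]]  # one corrupted batch

# Naive estimator: pool every sample, trusting all batches equally.
pooled = [s for batch in good_batches + adversarial_batch for s in batch]
naive_estimate = empirical_distribution(pooled, k=2)
error = l1_distance(true_p, naive_estimate)
```

In this toy case a single corrupted batch moves the pooled estimate away from the true distribution, which is the kind of error a robust estimator is meant to limit.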

Heterogeneous Deep Graph Infomax. (arXiv:1911.08538v1 [cs.LG])

Graph representation learning aims to learn universal node representations that preserve both node attributes and structural information. The derived node representations can be used to serve various downstream tasks, such as node classification and node clustering. When a graph is heterogeneous, the problem becomes more challenging than the homogeneous graph node learning problem. Inspired by…

Prediction Focused Topic Models for Electronic Health Records. (arXiv:1911.08551v1 [cs.LG])

Electronic Health Record (EHR) data can be represented as discrete counts over a high dimensional set of possible procedures, diagnoses, and medications. Supervised topic models present an attractive option for incorporating EHR data as features into a prediction problem: given a patient’s record, we estimate a set of latent factors that are predictive of the…

Towards Reducing Bias in Gender Classification. (arXiv:1911.08556v1 [cs.LG])

Societal bias towards certain communities is a major problem that affects many machine learning systems. This work aims to address the racial bias present in many modern gender recognition systems. We learn race-invariant representations of human faces with an adversarially trained autoencoder model. We show that such representations help us achieve less…

Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation. (arXiv:1911.08567v1 [cs.LG])

An automated metric to evaluate dialogue quality is vital for optimizing data driven dialogue management. The common approach of relying on explicit user feedback during a conversation is intrusive and sparse. Current models to estimate user satisfaction use limited feature sets and employ annotation schemes with limited generalizability to conversations spanning multiple domains. To address…

Representation Learning with Multisets. (arXiv:1911.08577v1 [cs.LG])

We study the problem of learning permutation invariant representations that can capture “flexible” notions of containment. We formalize this problem via a measure theoretic definition of multisets, and obtain a theoretically-motivated learning model. We propose training this model on a novel task: predicting the size of the symmetric difference (or intersection) between pairs of multisets.…
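The training target named in this abstract, the size of the symmetric difference (or intersection) between two multisets, can be computed exactly for small examples. The sketch below uses Python's `collections.Counter` as the multiset type; the function names are assumptions for illustration, and this shows only the target quantity, not the learning model the paper proposes.

```python
from collections import Counter

def multiset_intersection_size(a, b):
    """Size of the multiset intersection: per-element minimum of counts."""
    return sum((Counter(a) & Counter(b)).values())

def multiset_symmetric_difference_size(a, b):
    """Size of the multiset symmetric difference.

    Counter subtraction keeps only positive counts, so (ca - cb) holds the
    excess of a over b and (cb - ca) holds the excess of b over a.
    """
    ca, cb = Counter(a), Counter(b)
    return sum((ca - cb).values()) + sum((cb - ca).values())

# Element order does not matter, so both functions are permutation invariant.
x = ["a", "a", "b", "c"]
y = ["a", "b", "b", "d"]
inter = multiset_intersection_size(x, y)
sym_diff = multiset_symmetric_difference_size(x, y)
```

The two quantities are linked by the identity |A Δ B| = |A| + |B| − 2|A ∩ B|, which offers a quick consistency check on any pair of inputs.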