Kevin Hu Dissertation Defense

Automating Data Visualization through Recommendation

Demand for data visualization has exploded in recent years with the increasing availability and use of data across domains. Traditional visualization techniques require users to manually specify visual encodings of data through code or clicks. While manual specification is necessary to create bespoke visualizations, it renders visualization inaccessible to those without technical backgrounds. As a result, visualization recommender systems, which automatically generate results for users to search and select from, have gained popularity. Here, I present systems, methods, and data repositories to contextualize and improve visualization recommender systems.

The first contribution is DIVE, a publicly available, open-source system that combines rule-based recommender systems with manual specification. DIVE integrates state-of-the-art data model inference, visualization, statistical analysis, and storytelling capabilities into a unified workflow. In a controlled experiment, we show that DIVE significantly improves task performance among a group of 67 professional data scientists. Over 15K users have uploaded 7.5K datasets to DIVE since its release.

To address the limitations of rule-based recommender systems, I present VizML, a machine learning-based method for visualization recommendation. VizML uses neural networks trained on a large corpus of dataset–visualization pairs to predict visualization design choices, such as visualization type and axis encoding, with an accuracy of over 85%, exceeding that of base rates and baseline models. Benchmarking against a crowdsourced test set, we show that our model achieves human-level performance when predicting consensus visualization type.
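As a point of reference for the comparison against base rates, the sketch below shows one common reading of a base rate: the accuracy of always predicting the most frequent class. The function name and the toy labels are illustrative assumptions, not taken from VizML or its corpus.

```python
from collections import Counter


def base_rate_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the majority class.
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical visualization-type labels, for illustration only.
toy_labels = ["bar", "bar", "line", "scatter"]
toy_base_rate = base_rate_accuracy(toy_labels)
```

A learned model is only informative to the extent that its accuracy exceeds this kind of trivial predictor on the same label distribution.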

To support learned visualization systems, I introduce VizNet, a large-scale visualization learning and benchmarking repository consisting of over 31M real-world datasets. To demonstrate VizNet's utility as a platform for conducting crowdsourced experiments with ecologically valid data, we replicate a prior perceptual effectiveness study and show how a metric of visualization effectiveness can be learned from the experimental results. Our results suggest a promising method for efficiently crowdsourcing the annotations necessary to train and evaluate machine learning-based visualization recommendation at scale.

Enabled by the availability of real-world data, Sherlock is a deep learning approach to semantic type detection. We train Sherlock on 686K data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. A multi-input neural network achieves a support-weighted F1 score of 0.89, exceeding that of a decision tree baseline, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
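For readers unfamiliar with the metric, the following is a minimal sketch of a support-weighted F1 score: the F1 of each class, averaged with weights proportional to how often that class appears in the ground truth. The function name and the toy semantic-type labels are assumptions for illustration; this is not Sherlock's evaluation code.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its count in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive since n >= 1.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Hypothetical column labels, for illustration only.
toy_true = ["city", "city", "city", "name"]
toy_pred = ["city", "city", "name", "name"]
toy_score = support_weighted_f1(toy_true, toy_pred)
```

Weighting by support means frequent semantic types contribute more to the overall score than rare ones, which matters when type frequencies are imbalanced.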

I conclude by discussing three opportunities for future research. The first describes design considerations for mixed-initiative interactions in AI-infused visualization systems such as DIVE. The second reviews recent work on the statistical validity of insights derived from visualization recommenders, which is an especially important consideration with learned systems such as VizML. Lastly, I assess the benefits of learning visualization design from non-experts, then present experimental evidence toward measuring the gaps between expert and non-expert judgment.

Committee members:

César Hidalgo, Associate Professor, MIT Program in Media Arts and Sciences

Tim Kraska, Associate Professor, MIT CSAIL

Arvind Satyanarayan, Assistant Professor, MIT CSAIL
