Victor Dibia

12 tweets, 36 reads, May 04, 2023

🧵 Excited to share some findings from building LIDA - a tool for automatic data exploration, visualization and infographics!
We are only scratching the surface of how LLMs (#chatgpt #gpt4) can revolutionize data visualization.
microsoft.github.io #GenerativeAI

microsoft.github.io/lida/

LIDA [Beta] | LIDA: Automated Visualizations with LLMs

LIDA is a tool to automatically generate visualizations and infographics from data using large langu...

2/n How it works
LIDA casts visualization/infographics generation as a multi-stage code generation problem using LLMs. It accomplishes this via four modules: a summarizer, a goal explorer, a vizgenerator and an infographics generator.

Systems that support users in the automatic creation of visualizations must address several subtasks - understand the semantics of data, enumerate relevant visualization goals and generate visualization specifications.
In this work, we pose visualization/infographic generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are well suited to addressing these tasks. We present LIDA, a novel tool for generating grammar-agnostic visualizations and infographics. LIDA comprises 4 modules: a SUMMARIZER that converts data into a rich but compact natural language summary, a GOAL EXPLORER that enumerates visualization goals given the data, a VISGENERATOR that generates, refines, evaluates, repairs, executes and filters visualization code, and an INFOGRAPHER module that yields data-faithful stylized graphics using IGMs.

3/n Step 1: Data Summarization
The LLM needs a compact but rich representation of the data as context.
We use rules (col types, properties) + LLM enrichment (col descriptions, semantic type).
Impact: ~7% reduction in visualization error rate.
microsoft.github.io

A system that generates visualizations from data should have some “familiarity” with the data. However, we cannot offer the entirety of the data to the model (context limitations). Rather, we need a compact but information-dense representation of the data that the model can use as “grounding context” in addressing visualization tasks. The SUMMARIZER module in LIDA achieves this in two stages. First, we construct a dictionary with (i) extracted atomic types (e.g., integer, string, boolean) based on the pandas library, (ii) general data field properties (e.g., number of unique samples, max and min, range) and (iii) an illustrative non-null list of n samples from each column. This summary is then optionally enriched by an LLM, or by a user via the LIDA UI, to include a semantic description of the dataset, field descriptions and a predicted semantic type for each field.
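
The rule-based first stage described above can be sketched in a few lines of pandas. This is a minimal illustration, not LIDA's actual code: the function name `summarize` and the dictionary keys (`num_unique_values`, `samples`, and so on) are assumptions chosen for readability, and the LLM enrichment stage is left out.

```python
import pandas as pd


def summarize(df: pd.DataFrame, n_samples: int = 3) -> dict:
    """Build a compact, rule-based summary of each column (no LLM enrichment)."""
    fields = []
    for name in df.columns:
        col = df[name]
        non_null = col.dropna()
        props = {
            # Atomic type as reported by pandas.
            "dtype": str(col.dtype),
            # General field property: number of distinct non-null values.
            "num_unique_values": int(non_null.nunique()),
            # A few illustrative non-null values from the column.
            "samples": non_null.head(n_samples).tolist(),
        }
        # Min and max are only recorded for numeric columns.
        if pd.api.types.is_numeric_dtype(col):
            props["min"] = float(non_null.min())
            props["max"] = float(non_null.max())
        fields.append({"column": str(name), "properties": props})
    return {"n_rows": len(df), "fields": fields}


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy = pd.DataFrame({"price": [10.0, None, 30.0], "city": ["Oslo", "Lima", "Oslo"]})
    print(summarize(toy, n_samples=2))
```

The resulting dictionary stays small regardless of the number of rows, which is the point: it can be serialized into the prompt as grounding context without hitting context limits.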

microsoft.github.io/lida/gallery/?…

LIDA [Beta] | Gallery

4/n Step 2: Goal Exploration
With the right data summary and prompt, LLMs do really well at generating data-grounded questions, each with a rationale.
EDA for “free”.

This module generates data exploration goals, given the data summary. In our implementation, we express this as a multitask generation problem for an LLM to solve. For each goal, the LLM must generate a question (hypothesis), a visualization that addresses the question, and a rationale. Requiring the model to produce a rationale tends to lead to more meaningful goals.

[Prompt Excerpt: You are a skilled data analyst. Given the data summary provided, generate a set of n goals that fit the data. … Your response must use the following format]
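
A rough sketch of how such a goal-exploration step could be wired up is shown here. Only the opening sentence of the prompt comes from the excerpt above; the JSON response format, the function names `build_goal_prompt` and `explore_goals`, and the `llm` callable (any function mapping a prompt string to a response string) are assumptions made for illustration, not LIDA's real prompt or API.

```python
import json
from typing import Callable, List

# Keys every goal must carry: question, visualization, rationale.
REQUIRED_KEYS = {"question", "visualization", "rationale"}


def build_goal_prompt(summary: dict, n: int) -> str:
    """Assemble a prompt from the data summary (format instruction is hypothetical)."""
    return (
        "You are a skilled data analyst. Given the data summary provided, "
        f"generate {n} goals that fit the data.\n"
        "Respond with a JSON list of objects with the keys "
        '"question", "visualization" and "rationale".\n'
        f"Data summary: {json.dumps(summary)}"
    )


def explore_goals(summary: dict, n: int, llm: Callable[[str], str]) -> List[dict]:
    """Ask the model for goals and keep only the well-formed ones."""
    raw = llm(build_goal_prompt(summary, n))
    goals = json.loads(raw)
    # Drop any goal that is missing the question, visualization or rationale.
    valid = [g for g in goals if isinstance(g, dict) and REQUIRED_KEYS.issubset(g)]
    return valid[:n]
```

Validating the response before using it matters in practice, since a goal without a rationale or visualization cannot feed the next stage.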

5/n Step 3: Grammar-Agnostic Automated Visualization
LLMs are quite adept at writing code and can be tasked to generate visualizations in any language/grammar as long as it can be represented as code. R, Python, C? GGPlot, Seaborn, Matplotlib? All possible.

This image was generated by an LLM directly from data. The LLM generated both the question and the code for an associated visualization.
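
Because the visualization is just code, running it is a generic step that does not depend on the plotting library. The sketch below assumes a hypothetical contract in which the generated Python snippet defines a function `plot(data)` that returns a chart object; the name `execute_chart_code` and that contract are illustrative assumptions, not LIDA's interface.

```python
import traceback
from typing import Any, Optional, Tuple


def execute_chart_code(code: str, data: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Run generated code and return (chart, None) on success or (None, error text)."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code; real systems should sandbox this step.
        exec(code, namespace)
        # Assumed contract: the generated code defines plot(data).
        chart = namespace["plot"](data)
        return chart, None
    except Exception:
        # Syntax errors, a missing plot(), and runtime errors all count as failures.
        return None, traceback.format_exc()
```

The captured error text is useful twice over: it marks the chart as a failure for error-rate tracking, and it can be fed back to the model as context for a repair attempt.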

6/n Step 4: Infographic Generation
Takes raster images provided by the LIDA pipeline and generates stylized "data-faithful" infographics. Many applications in personalization, data story generation.

This module is tasked with generating stylized graphics based on output from the VIZGENERATOR module. It implements a library of visual styles described in natural language that are applied directly to visualization images. Note that the style library is editable by the user. These styles are applied in generating infographics using the text-conditioned image-to-image generation capabilities of diffusion models [5], implemented using the Peacasso library API. An optional post-processing step is then applied to improve the resulting image (e.g., replacing axes with correct values from the visualization, removing grid lines, and sharpening edges).

7/n VizOps - Visualization Explanation and Accessibility
When we represent visualizations as code, we can apply many operations to this representation, including natural-language-based refinement, explanation (accessibility descriptions) and self-evaluation.

8/n VizOps - Self-Evaluation and Automatic Repair
Do LLMs encode visualization best practices? Are they calibrated to self-evaluate across multiple visualization quality dimensions? GPT-4 shows very compelling results! Best of all, we can self-evaluate and self-repair.
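
One way to picture the self-evaluate and self-repair cycle is a bounded loop, sketched below. Everything here is an assumption for illustration: `evaluate` and `repair` stand in for LLM calls, and the score scale, the `threshold` and the `max_rounds` limit are hypothetical values rather than anything taken from LIDA.

```python
from typing import Callable, Tuple


def evaluate_and_repair(
    code: str,
    evaluate: Callable[[str], Tuple[float, str]],  # code -> (score, feedback)
    repair: Callable[[str, str], str],             # (code, feedback) -> new code
    threshold: float = 7.0,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Repair the code until its self-evaluated score passes or rounds run out."""
    score, feedback = evaluate(code)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        # Feed the critique back to the model and re-score the revised code.
        code = repair(code, feedback)
        score, feedback = evaluate(code)
        rounds += 1
    return code, score
```

Capping the number of rounds keeps the number of LLM calls predictable, at the cost of sometimes returning a chart that still scores below the threshold.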

9/n Evaluation
Wait up, how do we evaluate LIDA? We are currently using two metrics: visualization error rate (VER) and self-evaluated visualization quality (SEVQ, via GPT-4).
VER has been critical in informing prompt/scaffold design.

The chart shows error rates for 2280 charts created by LIDA using datasets from the vega datasets repository, under 4 conditions: (i) no_summary: only the dataset filename is provided to the model (worst performance); (ii) schema: only the list of field names and the file name are provided to the model; (iii) no_enrich: a data summary is provided to the model but without enrichment such as a data description, field descriptions or semantic types; (iv) enrich: a summary is provided with LLM-based enrichment.
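
As a sketch of how a per-condition error rate like this could be tallied, assume each generated chart is recorded as a (condition, succeeded) pair, where "succeeded" means the chart code ran without error. The function name and the record format are assumptions for illustration, and the sample data is a toy input, not the study's results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def visualization_error_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the percentage of failed charts for each condition."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for condition, succeeded in results:
        totals[condition] += 1
        if not succeeded:
            failures[condition] += 1
    return {c: 100.0 * failures[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Toy records, invented for this example.
    toy = [("schema", True), ("schema", False), ("enrich", True), ("enrich", True)]
    print(visualization_error_rate(toy))
```

Tracking the rate per condition is what makes the metric useful for prompt and scaffold design: a change is worth keeping only if it lowers the failure rate for the condition it targets.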

10/n Design Reflections
LIDA aims to be reliable (always provide a valid visualization), accurate (always provide a high quality visualization), and fast (as few LLM calls as possible).
While this is constantly being improved, the scaffolds and prompt engineering are critical.

10/n Learn more in the paper.
LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
arxiv.org

arxiv.org/abs/2303.02927

11/n Gallery
A gallery of example visualization goals and visualizations created with LIDA microsoft.github.io

microsoft.github.io/lida/gallery/
