PyGWalker: Tableau-like exploration inside Jupyter

March 16, 2024


by Florian Narr

PyGWalker embeds Graphic Walker — an open-source Tableau alternative — directly into Jupyter notebooks. Pass it a DataFrame, get an interactive drag-and-drop visualization UI inline in the cell output. No server to spin up, no exports to a separate tool.

Why I starred it

The pitch is simple, but the implementation is interesting. Most notebook visualization tools are wrappers around matplotlib or plotly — you call a function, it renders a static chart. PyGWalker does something different: it ships an entire React-based BI frontend as a bundled JavaScript blob, injects it into the notebook output cell, and then maintains a two-way communication channel between the frontend and the Python kernel so that larger datasets can be queried server-side via DuckDB as you interact with the UI.

The data size detection that automatically decides whether to ship the full dataset to the browser or fall back to kernel-computed queries is the part worth looking at.

How it works

The entry point is pyg.walk(df), which constructs a PygWalker instance in pygwalker/api/pygwalker.py. The constructor checks self.data_parser.data_size against JUPYTER_BYTE_LIMIT and sets kernel_computation accordingly if you haven't specified it explicitly:

# pygwalker/api/pygwalker.py
suggest_kernel_computation = self.data_parser.data_size > JUPYTER_BYTE_LIMIT
self.kernel_computation = suggest_kernel_computation if kernel_computation is None else kernel_computation
self.origin_data_source = self.data_parser.to_records(500 if self.kernel_computation else None)

When kernel_computation=True, only 500 sample rows go to the browser for initial render. Subsequent queries from the drag-and-drop UI get translated from a JSON payload into SQL and executed via DuckDB in the kernel.
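The round trip can be sketched with sqlite3 standing in for DuckDB (the table and query below are made up for illustration; the real code registers the DataFrame with DuckDB inside the kernel):

```python
import sqlite3

# Sketch of the kernel-side query path. sqlite3 stands in for DuckDB here;
# the real code runs the translated SQL against the DataFrame in the kernel.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.0)],
)

# The kind of SQL a drag-and-drop interaction might translate into:
rows = conn.execute("SELECT region, SUM(amount) FROM t GROUP BY region").fetchall()
# Only these aggregated rows travel back to the browser, not the raw dataset.
```

The point of the design: interaction cost scales with the result set, not the source data.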

The payload-to-SQL translation is delegated to gw_dsl_parser, a separate package:

# pygwalker/utils/payload_to_sql.py
def get_sql_from_payload(table_name, payload, field_meta=None):
    from gw_dsl_parser import get_sql_from_payload as __get_sql_from_payload
    sql = __get_sql_from_payload(table_name, payload, field_meta)
    return sql

That's a thin shim. The real DSL parsing lives in gw_dsl_parser, which is published separately and not in this repo — so debugging the query translation requires going one layer deeper into a different package.
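To make the shape of that translation concrete, here is a toy version. This is not gw_dsl_parser's actual grammar, which is far richer; it handles only a flat dimensions-plus-aggregations payload:

```python
def toy_payload_to_sql(table: str, payload: dict) -> str:
    """Toy stand-in for gw_dsl_parser: flat group-by + aggregation only."""
    dims = payload.get("dimensions", [])
    measures = payload.get("measures", [])
    select_parts = [f'"{d}"' for d in dims]
    select_parts += [f'{m["agg"].upper()}("{m["field"]}")' for m in measures]
    sql = f'SELECT {", ".join(select_parts)} FROM "{table}"'
    if dims:
        sql += " GROUP BY " + ", ".join(f'"{d}"' for d in dims)
    return sql

toy_payload_to_sql(
    "sales",
    {"dimensions": ["region"], "measures": [{"field": "amount", "agg": "sum"}]},
)
# → 'SELECT "region", SUM("amount") FROM "sales" GROUP BY "region"'
```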

The frontend itself is bundled as pygwalker/templates/dist/pygwalker-app.iife.js and gets zlib-compressed + base64-encoded at import time in pygwalker/services/render.py:

with open(os.path.join(ROOT_DIR, 'templates', 'dist', 'pygwalker-app.iife.js'), 'r') as f:
    GWALKER_SCRIPT = f.read()
    GWALKER_SCRIPT_BASE64 = compress_data(GWALKER_SCRIPT)

The compressed script is injected into a Jinja2 template on every render. That bundle is large — this is a full React + Vega-Lite application, not a widget.
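The compression step is presumably just zlib plus base64, so the bundle can live inside an HTML template as text. A sketch of what compress_data likely looks like (the exact implementation may differ):

```python
import base64
import zlib

def compress_data(script: str) -> str:
    # zlib-compress the JS bundle, then base64-encode it so it can be
    # embedded as a plain string in the Jinja2 HTML template.
    return base64.b64encode(zlib.compress(script.encode("utf-8"))).decode("utf-8")

def decompress_data(blob: str) -> str:
    # The inverse, as a browser-side shim would do before eval'ing the bundle.
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")
```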

The bidirectional communication between the Python kernel and the browser frontend goes through HackerCommunication in pygwalker/communications/hacker_comm.py. The name isn't marketing — it's genuinely a hack: it uses hidden ipywidgets.Text widgets as a message bus, serializing JSON into .value and using .observe() to trigger the receive handler. A Lock and a time.sleep(0.1) prevent concurrent message collisions. The comment in the source is honest about it:

def send_msg_async(self, action: str, data: Dict[str, Any], rid: Optional[str] = None):
    """
    To transmit messages through a widget,
    there will be problems during concurrency,
    because the timing of front-end rendering is not sure,
    so a sleep is temporarily added to solve it violently
    """

Five hidden Text widgets handle inbound messages (one per slot, indexed 0–4), and one handles outbound. It works, and it sidesteps the need for a WebSocket server, but it's not something you'd call elegant.
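The pattern is easier to see with a plain-Python stand-in for the widget (this mimics the mechanism; it is not the actual HackerCommunication code, and the class names are mine):

```python
import json
import threading
import time

class TextSlot:
    """Stand-in for a hidden ipywidgets.Text: a string value plus
    observers fired on change, mimicking .observe()."""
    def __init__(self):
        self.value = ""
        self._observers = []

    def observe(self, handler):
        self._observers.append(handler)

    def set_value(self, new_value):
        old, self.value = self.value, new_value
        for handler in self._observers:
            handler({"old": old, "new": new_value})

class WidgetBus:
    """Messages are JSON-serialized into the slot's value; the receiver
    reacts via the observer. A lock plus a short sleep paces senders,
    echoing the 'violent' concurrency fix quoted above."""
    def __init__(self):
        self.slot = TextSlot()
        self._lock = threading.Lock()
        self.received = []
        self.slot.observe(self._on_change)

    def _on_change(self, change):
        if change["new"]:
            self.received.append(json.loads(change["new"]))

    def send(self, action, data):
        with self._lock:
            self.slot.set_value(json.dumps({"action": action, "data": data}))
            time.sleep(0.01)  # crude pacing instead of a real acknowledgement
```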

The data parsers in pygwalker/data_parsers/ cover pandas, polars, modin, Spark, and a database connector. The pandas parser infers field types by checking dtype.kind: 'fcmiu' maps to quantitative, 'M' (and strings that look like dates) maps to temporal, and everything else falls to nominal. There's an is_geo_field() check that catches column names like latitude, longitude, lat, and lng and forces them to dimension/quantitative.
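A sketch of that dtype.kind mapping (not the exact source; the real parser also sniffs date-like strings for the temporal case):

```python
import pandas as pd

GEO_FIELD_NAMES = {"latitude", "longitude", "lat", "lng"}  # names per the geo check above

def infer_semantic_type(series: pd.Series) -> str:
    # Sketch of the mapping described above, not the actual parser code.
    kind = series.dtype.kind
    if kind in "fcmiu":   # float, complex, timedelta, int, unsigned int
        return "quantitative"
    if kind == "M":       # datetime64
        return "temporal"
    return "nominal"      # objects, strings, booleans, categories, ...
```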

Using it

Basic usage:

import pandas as pd
import pygwalker as pyg

df = pd.read_csv("data.csv")
walker = pyg.walk(df)

For larger datasets, enable DuckDB-backed kernel computation and persist the chart state:

walker = pyg.walk(
    df,
    spec="./chart_meta.json",
    kernel_computation=True,
)

The spec file saves your chart configuration. You save manually via the UI — there's no auto-save yet, though it's flagged in the README.

For Streamlit:

from pygwalker.api.streamlit import StreamlitRenderer
import streamlit as st

@st.cache_resource
def get_renderer():
    df = pd.read_csv("data.csv")
    return StreamlitRenderer(df, spec="./gw_config.json", spec_io_mode="rw")

renderer = get_renderer()
renderer.explorer()

The @st.cache_resource decorator matters here — without it you'll re-instantiate the renderer on every script rerun, which is expensive.

You can also export charts programmatically after saving in the UI:

walker.save_chart_to_file("Chart 1", "chart1.svg", save_type="svg")
png_bytes = walker.export_chart_png("Chart 1")

Rough edges

The gw_dsl_parser dependency being a separate opaque package is the biggest friction point. If a query translates incorrectly, you can't read the source in this repo to understand why — you have to go digging in a different package. The payload_to_sql.py shim that wraps it doesn't help with debugging.

The HackerCommunication sleep-based message locking is a known limitation. It works in practice but means rapid UI interactions can queue up messages in ways that feel sluggish on slower machines.

The spec file for chart persistence requires a manual save click in the UI. Sessions that close before you save lose chart state.

No test suite worth speaking of — "ci: add pytest timeout and hang diagnostics" was one of the most recent commits, suggesting the test infrastructure is still being sorted out. Conda users need pip for gw_dsl_parser, which the README documents, but it's still a rough edge.

The bundle size is real. Shipping a compressed React + Vega-Lite application into a notebook cell output has weight. On large notebooks with many cells, this accumulates.

Bottom line

If you're doing exploratory data analysis in Jupyter and want to skip writing matplotlib code to answer "what does this distribution look like by category," PyGWalker is a fast shortcut. The DuckDB kernel mode makes it genuinely useful for datasets that don't fit in browser memory. It's not a replacement for a real BI tool when you need to share dashboards — but for in-notebook EDA, the tradeoff of a heavy JS bundle for an interactive drag-and-drop interface is often worth it.

Kanaries/pygwalker on GitHub