Connecting LanceDB to PuppyGraph: Building the Proxy That Should Not Exist

PuppyGraph connects to data sources over JDBC. LanceDB does not speak JDBC. These two facts should have ended the project immediately.

They did not.

The Problem Worth Solving

PuppyGraph is a graph analytics engine that treats any SQL-accessible data source as a graph. You define which rows are vertices and which rows are edges, and it handles multi-hop traversals without you moving your data anywhere.

LanceDB is a vector database built on Apache Arrow. It stores embeddings alongside structured columns and supports approximate nearest neighbor search natively.

The use case is straightforward: you want to traverse a product graph and run vector similarity search in the same query. Find products similar to this one, then walk two hops through the ingredient graph to find related items. Neither system alone does this well. Together they could.

The challenge is the connection layer. PuppyGraph speaks Postgres wire protocol through JDBC. LanceDB has a Python API and an Arrow interface. There is no overlap.

buenavista as the Bridge

buenavista is a Python library that wraps a query backend in a Postgres-wire-compatible server. You give it something that can run SQL, and it exposes that as a Postgres endpoint any client can connect to.

DuckDB has a LanceDB extension that lets you register LanceDB tables and query them with SQL. So the plan was:

PuppyGraph (JDBC) --> buenavista (Postgres wire) --> DuckDB --> LanceDB

In theory this is clean. In practice, PuppyGraph’s JDBC driver runs schema discovery queries on startup that broke at step three.

The Bug

When PuppyGraph connects over JDBC, the first thing its driver does is call getColumns to discover the table schema. That call translates to a SQL query that does a LEFT JOIN against pg_type, the Postgres system catalog table that maps type OIDs to type names.

DuckDB has partial Postgres catalog compatibility but pg_type is not fully implemented. The LEFT JOIN against it returned nothing, PuppyGraph got an empty schema, and the connection failed silently.

The fix was to intercept that specific query pattern before it hit DuckDB and rewrite it to something DuckDB could handle:

import re

class PatchedDuckDBSession(DuckDBSession):
    PG_TYPE_PATTERN = re.compile(
        r'LEFT\s+JOIN\s+pg_catalog\.pg_type', re.IGNORECASE
    )

    def execute_sql(self, sql: str, parameters=None):
        if self.PG_TYPE_PATTERN.search(sql):
            sql = self._rewrite_getcolumns(sql)
        return super().execute_sql(sql, parameters)

    def _rewrite_getcolumns(self, sql: str) -> str:
        # Replace pg_type join with a literal type mapping subquery
        # DuckDB has the column info, we just need to satisfy the join
        return sql.replace(
            "LEFT JOIN pg_catalog.pg_type",
            "LEFT JOIN (SELECT 0 AS oid, '' AS typname) AS pg_type"
        )

This is not elegant. It is a surgical patch that makes one specific query pattern return something instead of nothing. But it worked.

The Setup

Once the connection issue was resolved, the full setup looks like this:

import duckdb
import lancedb
from buenavista.backends.duckdb import DuckDBSession
from buenavista.server import BuenaVistaServer

# Create LanceDB connection and load a table
db = lancedb.connect("./beauty_data")
table = db.open_table("products")

# Create DuckDB connection
conn = duckdb.connect()
conn.execute("ATTACH ':memory:' AS beauty")
conn.execute("USE beauty")

# Register LanceDB table as a DuckDB view
conn.register("products", table.to_arrow())

# Start buenavista Postgres server
server = BuenaVistaServer(
    host="0.0.0.0",
    port=5433,
    session_class=PatchedDuckDBSession,
    connection=conn
)
server.serve_forever()

PuppyGraph then connects to localhost:5433 as if it were Postgres. The schema JSON maps products to a vertex label and the relationships table to an edge label.

The Demo

The dataset is OpenBeautyFacts, which has products, ingredients, and brand relationships. After loading it into LanceDB and registering it in DuckDB, you can run a Gremlin traversal like:

g.V().hasLabel('Product')
  .has('category', 'moisturizer')
  .out('CONTAINS_INGREDIENT')
  .in('CONTAINS_INGREDIENT')
  .hasLabel('Product')
  .dedup()
  .limit(10)
  .values('name')

This finds all moisturizers that share an ingredient with any other moisturizer. On a raw SQL join this is a self-join with a dedup step. In Gremlin it reads like what it actually is: a two-hop traversal.

The vector search side adds a similarity query before the traversal: find the ten products most similar to a given embedding, then walk the ingredient graph from those results.

What This Is Not

The connection works. The traversals work. But there are rough edges.

The pg_type rewrite is fragile. A different version of PuppyGraph’s JDBC driver could generate a different query pattern and the patch would not cover it. The right fix is a proper Postgres catalog implementation in DuckDB, which is a much larger project.

Performance is also limited by the Arrow conversion step. Every time PuppyGraph pulls data from the LanceDB table, it goes through to_arrow() and then back through DuckDB’s query engine. For large tables this adds meaningful latency.

Both of these are known issues. The goal was a working proof of concept that demonstrates the integration is possible at all, and that goal is met.

The full code is on GitHub.