← all posts

PuppyGraph: Graph Analytics Without the Overhead

Most data engineers I talk to have a rough mental model of graph databases: Neo4j, property graphs, Cypher queries. You store nodes and edges, you traverse them, you find relationships that SQL joins struggle with.

That model is fine until you ask: where does the data actually live?

With traditional graph databases, the answer is “inside the graph database.” You ingest data from your warehouse, transform it, and store it in a specialized format. Now you have two copies of your data, a sync problem, and a whole new operational surface to manage.

PuppyGraph takes a different approach. It does not store your data at all.

What PuppyGraph Actually Is

PuppyGraph is a graph query engine that sits on top of existing data sources. You point it at a Postgres database, a DuckDB file, a Parquet table on S3, and it lets you run graph traversals against those tables without moving the data anywhere.

You define a schema that says: “these rows in this table are vertices, these rows in that table are edges.” PuppyGraph handles the traversal logic and translates it into SQL queries against your existing stores.

The query language is Gremlin or openCypher, both standard graph query languages. Your data stays where it is.

Why This Matters for Data Engineers

If you have ever tried to operationalize graph analytics on warehouse data, you know the pain. The usual path is:

  1. Export your tables to CSV
  2. Load them into Neo4j or a similar graph DB
  3. Keep them in sync as the source data changes
  4. Manage two query surfaces for your users

Step 3 is where most projects die. The sync job is always slightly behind, always slightly broken, and the moment you need to debug something you are not sure which copy of the data you are actually looking at.

PuppyGraph removes steps 1 through 3. The graph layer is just a query interface. The data stays in your warehouse.

What It Looks Like in Practice

You write a schema JSON that describes how your tables map to a graph. For a simple product dataset it looks something like this:

{
  "vertices": [
    {
      "label": "Product",
      "tableSource": {
        "schema": "public",
        "table": "products"
      },
      "id": {
        "fields": ["id"]
      }
    }
  ],
  "edges": [
    {
      "label": "CONTAINS_INGREDIENT",
      "fromVertex": "Product",
      "toVertex": "Ingredient",
      "tableSource": {
        "schema": "public",
        "table": "product_ingredients"
      },
      "fromId": { "fields": ["product_id"] },
      "toId": { "fields": ["ingredient_id"] }
    }
  ]
}

Once the schema is loaded, you connect a Gremlin client and run traversals. PuppyGraph compiles them to SQL and runs them against your data source.

Where It Gets Interesting

The real use cases are queries that are technically possible in SQL but painful to write and even more painful to optimize.

Finding all products that share at least three ingredients with a given product. Detecting circular supplier dependencies in a supply chain. Tracing pollution propagation through a sensor network based on wind direction and proximity.

These are multi-hop traversal problems. In SQL you are writing self-joins or recursive CTEs that database optimizers were not built for. In Gremlin you express the traversal directly and PuppyGraph figures out the execution plan.

I have been building connectors that extend PuppyGraph to work with data sources it does not natively support, starting with LanceDB. That is a longer story for the next post.