Auto-generating Data Quality Checks with AI
https://www.xkcd.com/1831/
This week we are unveiling a simple tool to help you get started with CSV data quality using CsvPath. It is called CsvPath AutoGen.
AutoGen is 100% AI-driven. That comes with both benefits and drawbacks. The benefit is that you get essentially a brainstorming partner that can help you think about data quality in a focused, declarative, and scalable way for free. The downside is that even the best AI today — and we believe we used the best available — is better at offering useful suggestions than delivering clear right answers. In short, our tool is not a substitute for a person, it is just a savvy whiteboard partner.
First, let’s talk results
Before we tell how we used AI in AutoGen, first, the bonafides. AutoGen’s results aren’t perfect. We haven’t tuned them to that level, yet. But they can be surprisingly on-the-mark. A good example came from feeding a New Relic log sample into AutoGen. The sample was representative—nothing special. It looked like this:
AutoGen returned a single page containing four csvpaths. This is what we taught the assistant: use multiple csvpaths to simplify development and maintenance. The four csvpaths had 10 validation rules and one summary print statement at the bottom. It looked good so we tried running it.
Amazingly, we only had to fix two small things. One was a trivial typo. Oddly, the assistant added an ‘i’ at the end of a regular expression, outside the ‘/’ delimiters. AutoGen is quite good at regular expressions, so the typo was a surprise. Nothing easier to fix, though. The other mistake was that the assistant used a ‘>’ operator rather than the lt() function. Another easy to spot trivial slip-up.
Here’s a look at AutoGen’s nicely commented csvpath validation statements:
The result? We ran the generated csvpaths, having made the two minor fixes, and got excellent results. It was very satisfying, to be sure!
And in the interest of completeness, here is the Python snippet that ran the AutoGen csvpaths.
You might reasonably think that that is some ugly output. But as validator output goes, it is pretty reader-friendly. And CsvPath gives you several straightforward ways to make prettier, even report-quality, output. Beyond that, I hope you’re also pondering that the AI decided that printing a summary count of statements by log level at the bottom of the validation report was a good idea. We didn’t ask for that, it just happened.
The important takeaway here is that by adding structure — using CsvPath — and a broadly capable AI — Anthropic’s Claude — we were able to deliver a useful, non-trivial data quality starting point in about 30 minutes.
How did we build AutoGen?
AutoGen uses a well-known AI API from Anthropic generally known as Claude. Claude is a conversational AI. It was built by running a large corpus through a dense, broad and deep similarity scoring-based algorithm that ripples error adjustments backward, adjusting the network till the output matches the expected text. I don’t have any special insight into Anthropic’s tools and methods, but I led teams building RAG AI vertical search engine products using the same principles and similar ML algorithms at a smaller scale eight to ten years prior, so the concepts are familiar.
AutoGen has a big advantage over some other uses of generative AI: it is highly structured. You put in your CSV example. We contextualize it within a large but focused set of technical instructions. That context feeds into a conversational back-and-forth with the AI. When we’re done, the AI API returns a set of CsvPath statements and a high level explaination of the rubrics the AI used in crafting the statements.
The structured nature of the conversation allows us to minimize the scope for the AI to simply make stuff up. As you probably know, AIs are very good at that. We can explain to it how the language works, what strategies make sense, provide examples, and warn the AI about things that are not desirable. Our instructions are evolving. When we started they were the size of a novella—about 40kb. We’re learning where to cut back. Sometimes less is more.
To set up the AutoGen user’s interaction, we do a lot of product management-like iterations. First, we give the AI a persona. We tell them about their DataOps job, the type of company they work for, how they think about data quality, and how important their data governance efforts are. Then we begin a process of conversationally educating them.
We want to elicit both knowledge-use and creative-effort within strict parameters, so we take an approach that may seem odd. Our goal is for the pattern match to be like when you go to Stack Overflow and a really smart developer takes a significant amount of time out of their day helping you. To better get the AI to give that behavior, we model it for the AI. It’s the flip side of “garbage in, garbage out” — enthusiastic and supportive in, enthusiastic and supportive out.
By this time we’re several exchanges into the conversation and we haven’t yet asked for the AI to do work for you. Happily, all of this conversational back-and-forth and education was on our side. We mock up the interaction we would love to have with Claude when time permits. Not having the time to chat, we put words in the AI’s mouth. It accepts our words as its own and continues on from there. This way we make a conversational AI into a more focused Q&A machine.
When you request that AutoGen create some CsvPath statements for you we take your example CSV and any instructions you provide, wrap them in our final instructions and requirements, and send the whole ~40kb package to the API. Finally, we parse it to add links, formatting, and other modifications. And that’s it.
What you get is just grist for the mill — essentially pseudocode to get you started. Think of AutoGen as being like CoPilot for CsvPath.
The Hard Thing About AI
We have had the pleasure of working on CSV systems and AI algorithms and AI systems fed by CSV data. Our takeaway? AI is more glamorous than CSV, for sure. And it is much, much harder to achieve consistent value with AI. But there is a certain similar feeling to them. Both are messy, hard to do well, full of corner cases, and very challenging to operate at speed and scale because they exist in loosely structured environments.
Having made a tool for adding structure to the world of CSV, to create AutoGen we had to gain a similar level of structure for our AI training and evaluation data. As many people have found out, this is the unglamorous and hard side of AI. Our approach is a work in progress.
We look at the problem as having two parts:
Joining static and dynamic information sets
Creating a robust and automated fitness evaluation of results
AutoGen has two dynamic inputs. Your examples and instructions are one. The other is the output of the CsvPath library, as a tool that Claude can execute to see if its CsvPath statements are workable. Tool-use is an exciting part of the puzzle. At the cost of a more conversational (and expensive) interaction, we can allow Claude to request that we run statements through the library for them. We send the library outputs back to Claude so they can refine their statements before we present them to you. To be clear, we’re still experimenting with that approach, but so far it looks amazing.
The fitness function is equally important. If we don’t know what good looks like how can we nudge Claude towards better output? It is the same problem as the back-propagation algorithm that lets Anthropic gradually train Claude to generate realistic answers. Our challenge is simpler, but not conceptually different. To make it quicker and scalable, we built a toolkit for organizing and assembling prompts and contextual data, and numerically assessing the resulting output. Using a raft of numeric indicators, and our rubrics for good CsvPath, we can over time nudge Claude towards a better and better outcome. It’s a WIP—and it’s looking more and more awesome.
AutoGen is waiting for you here
Remember that you are working with an AI that is eager to please but doesn’t have all the answers. AutoGen’s output should give you a good starting point. You will still need to learn CsvPath and apply it in your own automation systems, but you’ll have a much faster start. And if you need more ideas, help getting your first paths working, or want to brainstorm on the bigger picture, reach out. We love answering questions and helping people get started.