
Inside CyberArk’s Journey: What It Really Takes to Run RAG Agents in Production

By Michael Balber

Published May 25, 2025.


As part of the DevSecNext AI series, Jit hosted Michael Balber—Principal Software Architect at CyberArk—for an in-depth session on how his team built and evaluated real Retrieval-Augmented Generation (RAG) agents in production. Unlike abstract discussions about LLMs and assistants, Michael shared a grounded view of what it takes to deploy agents that don’t just talk—they act. From answering support questions to triggering privileged actions and auditing behavior across enterprise systems, this was a behind-the-scenes look at how to go from toy demos to trusted AI systems inside real security products.

Watch the video here:



In this guest post, Michael shares the journey from prototype to production—through architecture, evaluation, and iteration.

As part of our AI journey at CyberArk, I set out to build production-grade agents that could do more than just chat. We weren’t looking for another demo—we needed assistants that could reason over sensitive enterprise data, act through APIs, and be evaluated just like any other feature in our platform. In this post, I’ll walk through how we built and productionized a Retrieval-Augmented Generation (RAG) agent from the ground up, how we handled evaluation at every stage, and why we ultimately chose a toolkit approach over black-box copilots. If you’re building serious AI functionality into your product, this is the story of how we made it real—and made sure it worked.

From Recordings to Real-Time Insights

At CyberArk, we focus on securing identities and protecting access to sensitive systems—especially those with elevated privileges. As part of that, we capture detailed recordings of remote admin sessions for audit and compliance purposes. 

But these recordings, while valuable, have historically been difficult to analyze at scale. One of the most high-impact use cases we uncovered came from that very challenge: how to make sense of millions of hours of privileged session recordings. These screen captures documented every keystroke and click from remote admin sessions—crucial for forensic analysis, but practically impossible to review manually. With the rise of multimodal LLMs, we realized we could change that. 

By feeding session frames into a multimodal model, we were able to extract structured actions—like “User entered AWS Console” or “Copied secret from Secret Manager”—turning passive recordings into searchable, summarized audit logs. Instead of just playing back video, we could generate a short, accurate account of what actually happened on screen.

That gave us real-time audit summaries. Internal security teams no longer had to scrub through hours of footage, and the summaries didn’t just help with post-incident reviews; they directly improved our ability to detect risky behavior and respond faster.
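To make the pattern concrete, here is a minimal sketch, assuming AWS Bedrock’s Converse API (part of the toolkit we describe later) and a multimodal Claude model; the model ID, prompt, and output fields are illustrative, not our production pipeline.

```python
# Hypothetical sketch: extract a structured action from one session frame
# using a multimodal model via AWS Bedrock's Converse API.
# The model ID, prompt, and JSON fields are assumptions for illustration only.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def describe_frame(frame_png: bytes) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": frame_png}}},
                {"text": (
                    "Describe the privileged action shown in this screen capture "
                    'as JSON: {"action": ..., "target": ..., "risk": "low|medium|high"}. '
                    "Return only the JSON object."
                )},
            ],
        }],
        inferenceConfig={"maxTokens": 256, "temperature": 0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)

# Example output: {"action": "Copied secret from Secret Manager",
#                  "target": "AWS Console", "risk": "high"}
```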

Assistants That Do More Than Answer Questions

Of course, we didn’t stop at summarization. We wanted agents that could take action. So we started building an LLM-based assistant embedded directly in the product. It could answer knowledge questions like “Why did this fail?” but also invoke real APIs to create users, delete records, or trigger escalations.
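To give a feel for the action side, here is a minimal sketch of exposing one API as a tool the model may request, using Bedrock’s Converse tool interface; the create_user tool name, its schema, and the handling logic are hypothetical, not our actual product APIs.

```python
# Hypothetical sketch: exposing one real API as a tool the assistant may invoke.
# The tool name, schema, and handling below are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime")

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "create_user",
            "description": "Create a new user account in the platform.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "username": {"type": "string"},
                    "role": {"type": "string"},
                },
                "required": ["username", "role"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Add a read-only auditor named dana"}]}],
    toolConfig=tool_config,
)

# If the model decides to act, it returns a toolUse block instead of plain text;
# the application validates the arguments and calls the real API itself.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        args = block["toolUse"]["input"]
        print("Would call create_user with:", args)
```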

To build something reliable, we had to really understand how LLMs work. These models don’t have magical knowledge—they just predict likely sequences of text. They’re extremely good at pattern completion, but only as good as the inputs and context you give them. That’s why designing a robust retrieval layer was critical to everything we built.

Vectorization, Chunking, and Retrieval: The RAG Core

Our assistant was based on a RAG (Retrieval-Augmented Generation) architecture. That meant the model didn’t just rely on what it was trained on—it needed to pull real-time, relevant context from our internal docs and knowledge base.

The chunking strategy turned out to be one of the most important design decisions. When using a retrieval-based approach, we first need to split large documents—like internal product manuals or system docs—into smaller pieces (or “chunks”) that can be indexed and searched efficiently. 

At first, we tried using fixed-size chunks—just slicing text every few hundred tokens—but that led to context fragmentation. Key information was often split across chunks, making it hard for the model to find complete answers. To address that, we moved to a hierarchical chunking strategy: small, focused chunks were still used for precise retrieval, but during answer generation, we passed the larger “parent” chunk that contained the small one along with surrounding context. This gave the model more of the surrounding page or section to work with—often the difference between a vague answer and a useful one.
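A minimal sketch of the parent/child idea, independent of any particular framework, might look like the following; the chunk size, the embed() callable, and the in-memory ranking are illustrative assumptions.

```python
# Hypothetical sketch of hierarchical (parent/child) chunking:
# small child chunks are embedded for precise retrieval, while generation
# receives the larger parent section that contains the matched chunk.
# The embed() callable and chunk size are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    text: str
    parent_id: int  # index of the surrounding section ("parent")

def build_chunks(sections, child_size=400):
    """Slice each parent section into small child chunks for indexing."""
    return [
        Chunk(section[start:start + child_size], parent_id)
        for parent_id, section in enumerate(sections)
        for start in range(0, len(section), child_size)
    ]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_parents(query, sections, chunks, embed, top_k=3):
    """Rank child chunks by similarity, then pass their parents to the LLM."""
    # In production the chunk embeddings would live in a vector store,
    # not be recomputed per query as in this sketch.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: -cosine(q, embed(c.text)))
    parent_ids = dict.fromkeys(c.parent_id for c in ranked[:top_k])  # dedupe, keep order
    return [sections[i] for i in parent_ids]
```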

Later, we explored Anthropic’s “Contextual Chunking” approach: take each small chunk and enrich it with page-level metadata and context before embedding it. According to Anthropic’s research, this increases accuracy by 5-7%—a big lift when you’re running at scale.
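A simplified sketch of that enrichment step, assuming a generic call_llm() helper and an illustrative prompt, might look like this:

```python
# Hypothetical sketch of "contextual chunking": before embedding each small
# chunk, ask an LLM to write a short, document-aware context and prepend it.
# The prompt wording and the call_llm() helper are assumptions for illustration.
def contextualize(document: str, chunk: str, call_llm) -> str:
    prompt = (
        "<document>\n" + document + "\n</document>\n"
        "Here is a chunk from that document:\n<chunk>\n" + chunk + "\n</chunk>\n"
        "Write one or two sentences situating this chunk within the document, "
        "so it can be retrieved on its own. Answer with the context only."
    )
    context = call_llm(prompt)
    return context.strip() + "\n\n" + chunk  # this enriched text is what gets embedded
```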

Evaluation Is Everything

Our first version of the assistant looked good, but something felt off. The answers were inconsistent. We had no way to say whether it was improving or regressing. That’s when we decided we needed a full evaluation pipeline.

We initially asked internal experts to write question-answer pairs. It worked to a point—but didn’t scale. So we flipped it. We gave the LLM internal documentation and asked it to generate questions and answers itself. Since this was a linguistic task, not a knowledge one, the model did a great job. We then reviewed and filtered that dataset with experts.
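In spirit, the generation step looked something like the sketch below; the prompt wording and the call_llm() helper are assumptions, and every generated pair still went through expert review before entering the dataset.

```python
# Hypothetical sketch: bootstrap an evaluation set by asking the LLM to write
# question/answer pairs from internal documentation, then filter with experts.
import json

def generate_qa_pairs(doc: str, call_llm, n: int = 5) -> list[dict]:
    prompt = (
        f"Read the documentation below and write {n} question/answer pairs that a "
        "customer might ask. Answer strictly as a JSON list of "
        '{"question": ..., "answer": ..., "source_quote": ...} objects.\n\n' + doc
    )
    pairs = json.loads(call_llm(prompt))
    # Experts review and filter these before they enter the evaluation dataset.
    return [p for p in pairs if p.get("question") and p.get("answer")]
```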

From there, every pull request and commit ran against that dataset. We tracked several key metrics: 

  • Did the answer cite the correct document? 

  • How much of the expected answer did the response actually cover? 

  • Did the output match the structure and tone of expert-written answers? 

If a change didn’t outperform our baseline on those metrics, it didn’t make it in. That eliminated the guesswork altogether.
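Conceptually, the gate is as simple as the following sketch; the metric names and the baseline file are illustrative, not our actual CI configuration.

```python
# Hypothetical sketch of the regression gate: every change is scored on the
# evaluation set and must not fall below the stored baseline.
import json
import sys

METRICS = ["citation_accuracy", "answer_coverage", "style_match"]  # illustrative names

def gate(new_scores: dict, baseline_path: str = "eval_baseline.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = [m for m in METRICS if new_scores[m] < baseline[m]]
    if regressions:
        sys.exit(f"Evaluation regression on: {', '.join(regressions)}")
    print("Evaluation passed; scores meet or beat baseline.")
```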

Choosing Tools Over Black Boxes

One of the most important lessons I’ve learned: black-box copilots are hard to trust, and even harder to debug. If something breaks, you have no idea why. That’s why we took a toolkit approach at CyberArk instead.

We used AWS Bedrock for orchestration and open-source libraries like RAGAS for evaluation. This gave us control over every part of the system—parsing, chunking, embedding, retrieval, generation. We could tune it, swap out parts, and understand the impact of every change.

For example, we were able to benchmark smaller models and see that they performed just as well as larger ones—saving us significant compute cost. But we never would have caught that without a strong evaluation pipeline in place.
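For reference, a minimal RAGAS run over a question/answer/context dataset looks roughly like this; the exact column names and metric set depend on the RAGAS version, and the metrics chosen here are illustrative.

```python
# Minimal sketch of scoring a RAG pipeline with RAGAS.
# Requires an LLM/embeddings backend configured for RAGAS (e.g. via environment
# credentials); dataset contents here are placeholders, not real product data.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_data = Dataset.from_dict({
    "question": ["How do I rotate a privileged credential?"],
    "answer": ["Open the account page and choose the rotate action."],
    "contexts": [["Credentials can be rotated from the account page."]],
    "ground_truth": ["Use the rotate action on the account page."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores, tracked across pull requests
```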

This work wasn’t about building something flashy. It was about building something real—and being able to prove it. Every part of the pipeline—chunking, prompting, retrieval, evaluation—was tuned, tested, and improved incrementally.

The biggest takeaway? If you want AI agents that actually work, you need more than a good prompt. You need visibility, metrics, and ownership of your entire architecture. Otherwise, you’re not deploying assistants—you’re crossing your fingers.