Testing scanners with a custom dataset

You can evaluate the performance and accuracy of a scanner by testing it against a custom dataset of prompts. This feature allows you to upload a CSV file containing a list of prompts and their expected outcomes (true for prompts that should be blocked, false for prompts that should be allowed). The test measures the scanner's effectiveness and generates a detailed performance report, including an F1 score and a confusion matrix.

Formatting your CSV file

Before you begin, ensure your CSV file is formatted correctly to prevent errors. On upload, the file will be validated against the following requirements:

File size: The file must not exceed 10MB.
Column count: The file must contain exactly two columns.
Headers: The column headers must be named prompt and label using lowercase type.
Label values: Values in the label column must be true/false or 1/0, where true (or 1) designates a prompt that should be blocked.
Label variety: To calculate a complete F1 score, the dataset must include at least one true and one false label. If your dataset contains only one type of label (e.g., all true), the test will still run and show accuracy metrics. However, an F1 score cannot be calculated and will not be displayed in the results.

Testing with your dataset

Follow these steps to upload a dataset and test a scanner.

From the Playground, click Upload dataset in the top right corner.
Enter a Name for your test.
Upload your formatted CSV file. You can either click Upload file or drag and drop the file into the window. If the file is formatted correctly, the system will show the number of rows and the true/false count in the file.
Click Next.
Select the scanner and the specific version or versions you want to test, up to 5 versions.
Click Test.
A confirmation message appears in the Playground.
Click the "dataset results" link in the message to see the results.

Tips for success

Large datasets or tests run against multiple scanner versions may take some time to run.
The wizard allows testing up to 5 versions of the same scanner. To test the same dataset with a different scanner, start a new test.
Datasets should contain prompts that are relevant to the scanner being tested.

Understanding the results

Once the test is complete, a success message appears in the Playground chat window. Click the "dataset results" link in the success message to open the Dataset tab and see your report.

The Dataset Results Tab

The Dataset tab shows a card for each test. Use the search, filter, and sort options to find a specific test. Each card displays:

The test name you provided.
The dataset file used for the test.
The scanner and versions tested.
A timestamp of when the test was run.
A progress bar (while running) or the final F1 score (when complete).
A Details link to view the full report.

Confusion Matrix Details

Clicking Details on a test card shows the full confusion matrix and performance metrics. This view includes:

F1 Score: A measure of the test's accuracy, which balances how precise the scanner is with how many relevant prompts it finds.
Confusion matrix: A breakdown of the scanner's performance.
- True positives (TP): The scanner correctly blocked a prompt labeled true.
- True negatives (TN): The scanner correctly allowed a prompt labeled false.
- False positives (FP): The scanner incorrectly blocked a prompt labeled false.
- False negatives (FN): The scanner incorrectly allowed a prompt labeled true.
Total sample and overall accuracy.
Download CSV: This link provides a detailed report showing how the scanner evaluated each individual prompt in your dataset. The download link expires after 7 days.

Viewing scores in version history

The F1 score from your most recent test run will also appear on the card for that scanner version in the Edit scanner > Version history view. This gives you a quick and convenient performance snapshot as you manage scanner versions.