Free AI evaluation tool launches on GitHub

Jeff J Hunter
December 23, 2025

Anthropic launches Bloom for free

Hey AI Enthusiast,

Anthropic just released Bloom.

An open source tool for automated behavioral evaluations of AI models.

Bloom takes a researcher-specified behavior and quantifies its frequency and severity across automatically generated scenarios. It generates targeted evaluation suites for arbitrary behavioral traits in days instead of weeks.

Anthropic tested it on 16 frontier models across four behaviors: delusional sycophancy, instructed sabotage, self-preservation, and self-preferential bias.

Available now on GitHub. Free and open source.

But the real Tool Tuesday story isn't about AI evaluation frameworks.

It's about the free Chrome extension that summarizes everything you read and watch.

But first, today's prompt (then the tool that saves hours...)

🔥 Prompt of the Day 🔥

AI Personal Brand Authority System

Act as a personal branding specialist. Using ChatGPT, create one content system for [EXPERTISE AREA] that builds your authority fast.

Essential Details:

Your Expertise: [WHAT YOU'RE KNOWN FOR]
Target Audience: [WHO YOU HELP]
Content Pillars: [3-5 MAIN TOPICS]
Publishing Frequency: [HOW OFTEN YOU POST]
Primary Platform: [WHERE YOUR AUDIENCE IS]
Authority Goal: [WHAT SUCCESS LOOKS LIKE]

Create one authority system including:

Your unique POV (what you believe that others don't)
90-day content calendar (topics pre-planned by pillar)
Content generation prompts (ChatGPT prompts for each post type)
Story bank (your best credibility stories, ready to use)
Engagement triggers (questions that start conversations)
Repurposing guide (one piece becomes 10)

Output as: Ready-to-use content system with prompts and calendar.

Build authority through consistent, distinctive content.

🤖 Tool Tuesday 🤖

Gist AI: Free Chrome Extension That Summarizes Everything

Gist AI is a free Chrome extension that summarizes websites, YouTube videos, and PDFs in one click.

What Gist AI Actually Does

Most people waste hours reading long articles, watching entire YouTube videos, or skimming through PDFs to find key points.

Gist AI extracts them instantly.

One click. You get the summary.

Three Main Features

1. Summarize Websites

Open any article. Click the Gist AI icon. Get a summary of the key points.

No more reading 3,000-word articles to find the one insight you need.

2. Summarize YouTube Videos

Click the extension on any YouTube video. Get a summary of the content.

Better yet: Gist AI shows you timestamps. You can jump directly to the segments that matter.

No more watching 45-minute videos for 2 minutes of useful information.

3. Summarize PDFs

Works for PDFs found online and PDFs saved on your device.

Upload a research paper, report, or document. Get a summary in seconds.

The Feature That Actually Matters: Read More

Most summarizers give you bullet points and that's it.

Gist AI has a "Read More" feature. You can deep-dive into the source of any summary point that interests you.

Click "Read More" and it takes you directly to that section of the article or that timestamp in the video.

This is the difference between a tool that teases information and a tool that helps you actually consume it.

How to Use It

1. Install Gist AI from the Chrome Web Store
2. Pin the extension to your toolbar
3. Open any website, YouTube video, or PDF
4. Click the Gist AI icon to summarize

For PDFs on your computer: Upload them in the PDF tab.

That's it. No account needed. No paywall. Completely free.

Why This Tool Is Different

It's free. No trial period. No premium tier. Just free.

No data collection. Gist AI doesn't collect user data. The only information shared with ChatGPT is the article content to generate the summary.

Works everywhere. Websites, YouTube, online PDFs, local PDFs. One extension for everything.

Saves time immediately. You don't need to learn anything. Install it, click it, get summaries.

Who This Tool Is For

Students: Reading multiple research papers for assignments. Gist AI can cut that reading time significantly.

Professionals: Combing through reports, articles, or documentation for key insights. Get what you need in seconds.

Content Consumers: Anyone who reads articles or watches videos daily. Stop wasting time on fluff.

Researchers: Processing large volumes of information quickly.

The Real Use Cases

Research without reading entire papers.

Watch only the relevant parts of long YouTube tutorials.

Skim reports and extract key data points.

Process client documents faster.

Consume more content in less time.

Why This Matters

Information overload is real.

Everyone's trying to read more, watch more, learn more. But there aren't more hours in the day.

Gist AI doesn't just save time. It changes how you consume information.

You stop reading everything. You start reading what matters.

You stop watching entire videos. You jump to the segments you need.

You stop processing fluff. You extract value immediately.

That's the shift.

Did You Know?

AI can identify which paintings in museums are most likely to be stolen by analyzing visitor gaze patterns and determining which pieces trigger the strongest emotional responses.

🗞️ Breaking AI News 🗞️

Anthropic released Bloom, an open source agentic framework for generating behavioral evaluations of frontier AI models.

Available now on GitHub. Free and open source.

What Bloom Does

Bloom generates targeted evaluation suites for arbitrary behavioral traits in AI models.

It takes a researcher-specified behavior and quantifies its frequency and severity across automatically generated scenarios.
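Here's a rough idea of what "frequency and severity" can look like in numbers. This is a minimal sketch, not Bloom's actual code: it assumes each transcript gets a 1-10 score from a judge, and it treats a score of 7 or higher as "the behavior showed up." The function name, the score scale, and the threshold are all assumptions made for illustration.

```python
from statistics import mean

def summarize_scores(scores, threshold=7):
    """Summarize per-transcript judge scores (assumed 1-10 scale).

    frequency: share of transcripts scoring at or above the threshold.
    mean_severity: average score across all transcripts.
    Both the scale and the threshold are illustrative assumptions.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    flagged = [s for s in scores if s >= threshold]
    return {
        "frequency": len(flagged) / len(scores),
        "mean_severity": mean(scores),
    }

# Made-up scores for four transcripts, for illustration only.
summary = summarize_scores([2, 8, 9, 3])
print(summary)
```

Two of the four made-up scores clear the threshold, so the frequency comes out to one half. Real suites would use far more transcripts.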

Anthropic tested it across 16 models on four alignment-relevant behaviors: delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias.

Using Bloom, these evaluations took only a few days to conceptualize, refine, and generate.

Why This Matters

High-quality behavioral evaluations are essential for understanding alignment in frontier AI models.

But evaluations take a long time to develop, and then they risk becoming obsolete. They can leak into the training sets of newer models, or capabilities can improve so much that the evaluation no longer tests what was intended.

We need faster, more scalable ways to generate evaluations for misaligned behavior.

Bloom solves this.

How Bloom Works

Four automated stages:

1. Understanding: Analyzes the researcher's behavior description and example transcripts to generate detailed context about what to measure and why.

2. Ideation: Generates evaluation scenarios designed to elicit the target behavior. Each scenario specifies the situation, simulated user, system prompt, and interaction environment.

3. Rollout: Scenarios are rolled out in parallel, with an agent dynamically simulating both user and tool responses to elicit the sought-after behavior in the target model.

4. Judgment: A judge model scores each transcript for the presence of the behavior. A meta-judge produces suite-level analysis.

Unlike fixed evaluation sets, Bloom produces different scenarios on each run while measuring the same underlying behavior.
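To make the four stages concrete, here's a toy sketch of how they chain together. It is not Bloom's real API. Every name in it (understand, ideate, rollout, judge, run_pipeline) is hypothetical, the "models" are plain Python functions, the judge is a simple keyword check, and the rollouts run one after another instead of in parallel.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

@dataclass
class Scenario:
    situation: str
    system_prompt: str

@dataclass
class Transcript:
    scenario: Scenario
    messages: List[str]

def understand(behavior: str, examples: List[str]) -> str:
    """Stage 1 (toy): turn the behavior description into context."""
    return f"Measure '{behavior}' using {len(examples)} example transcripts"

def ideate(context: str, n: int) -> List[Scenario]:
    """Stage 2 (toy): propose n scenarios meant to elicit the behavior."""
    return [
        Scenario(f"scenario {i}: {context}", "You are a helpful assistant.")
        for i in range(n)
    ]

def rollout(scenario: Scenario, target: Callable[[str], str]) -> Transcript:
    """Stage 3 (toy): send one simulated user turn to the target model."""
    user_turn = f"[simulated user] {scenario.situation}"
    return Transcript(scenario, [user_turn, target(user_turn)])

def judge(transcript: Transcript, keyword: str) -> int:
    """Stage 4 (toy): 1 if the reply contains the keyword, else 0."""
    return int(keyword in transcript.messages[-1].lower())

def run_pipeline(behavior, examples, target, n=4, keyword="agree"):
    """Chain the four stages and return per-transcript scores plus a rate."""
    context = understand(behavior, examples)
    scenarios = ideate(context, n)
    transcripts = [rollout(s, target) for s in scenarios]  # sequential here
    scores = [judge(t, keyword) for t in transcripts]
    return {"scores": scores, "rate": mean(scores)}

# A stand-in "model" that always agrees, for illustration only.
report = run_pipeline("sycophancy", ["example"], lambda _: "I agree completely.")
print(report)
```

In a real setup, each stage would call a language model, and the final rate would play the role of the suite-level summary. The shape of the loop is the part worth remembering: describe, generate, run, score.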

Validation

Anthropic validated Bloom against two questions:

Can Bloom distinguish models with different behavioral tendencies?

Yes. Anthropic tested Bloom on production Claude models versus intentionally misaligned "model organisms." Bloom successfully separated the model organism from the production model in 9 out of 10 cases.

How well-calibrated is the Bloom judge against human judgment?

Claude Opus 4.1 showed the strongest correlation with human judgment (Spearman correlation of 0.86).
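If "Spearman correlation" is new to you: it measures how closely two sets of scores agree on ranking, from -1 (opposite order) to 1 (same order). Here's a small from-scratch sketch in plain Python. The function names are mine, and the human and judge scores at the bottom are made up for illustration, not Anthropic's data.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 through j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up scores for five transcripts, for illustration only.
human_scores = [1, 3, 5, 7, 9]
judge_scores = [2, 3, 6, 6, 10]
print(round(spearman(human_scores, judge_scores), 2))
```

A value near 1 means the judge ranks transcripts in nearly the same order a human would, even if the raw numbers differ.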

Case Study: Self-Preferential Bias

Anthropic replicated an evaluation from the Claude Sonnet 4.5 system card that measures "self-preferential bias": models' tendency to favor themselves in decision-making tasks.

Using Bloom, they reproduced the same ranking of models as the system card's evaluation and confirmed that Sonnet 4.5 exhibits the least bias of the models tested.

They also discovered that increased reasoning effort reduces self-preferential bias in Claude Sonnet 4. The lower bias didn't come from the model picking other models more evenly. Instead, it increasingly recognized the conflict of interest and declined to judge its own option.

What This Means

As AI systems grow more capable and are deployed in increasingly complex environments, the alignment research community needs scalable tools for exploring their behavioral traits.

Bloom is designed to facilitate this.

Early adopters are already using Bloom to evaluate nested jailbreak vulnerabilities, test hardcoding, measure evaluation awareness, and generate sabotage traces.