I’ve decided to add a new agent mode to Leonardo that can write RSpec unit tests for self-verification.
Here’s the deep-research report that led to me choosing RSpec vs. Minitest.
I didn’t know about the concept of mutation testing, but it sounds extremely powerful, and Ruby’s Mutant gem is one of the most mature implementations of it. This could be huge for self-verifying agents, helping them write extremely high-quality tests that keep the application stable.
This need arose from the lesson builder we’re creating with HistoryEducation: we’ve kept running into very small, subtle bugs, and Leonardo currently doesn’t have an AI agent mode that focuses explicitly on writing end-to-end tests for our Rails app.
So, I’m implementing that today in order to make sure whatever projects are built with Leonardo are high quality, highly tested, and stable for their end users.
This is one of the most exciting applications of code generation — higher-quality software with fewer bugs. The highest-leverage way to get there is to have the AI agent adopt the test-driven development practices that many software engineers (myself included) aren’t disciplined enough to follow consistently, even though we know we should.
This new agent mode will be included in the next release of LlamaBot & Leonardo, 0.2.20. I’ll link to the PR and release notes when they’re ready.
Here’s the raw DeepResearch report generated by ChatGPT 5’s Pro Mode. (The rest of this article is the raw chat transcript, generated by ChatGPT):
_____
what is the difference between rspec and minitest?
ChatGPT said:
Both are Ruby testing frameworks — the difference is mostly in philosophy, syntax, and ecosystem.
🧩 Core difference
- RSpec is behavior-driven development (BDD) style — it reads like English, encourages describing behavior.
- Minitest is test-driven development (TDD) style — it’s simpler, faster, ships with Ruby, and feels more like traditional xUnit frameworks.
⚙️ Syntax comparison
RSpec:
RSpec.describe Calculator do
  describe "#add" do
    it "returns the sum of two numbers" do
      expect(Calculator.new.add(2, 3)).to eq(5)
    end
  end
end
Minitest:
require "minitest/autorun"
class CalculatorTest < Minitest::Test
def test_add_returns_sum
assert_equal 5, Calculator.new.add(2, 3)
end
end
RSpec uses describe, context, and it blocks.
Minitest uses plain Ruby classes and methods starting with test_.
🧠 Design philosophy
| Trait | RSpec | Minitest |
|---|---|---|
| Style | DSL-heavy, BDD | Lightweight, xUnit |
| Included with Ruby | ❌ | ✅ (standard library) |
| Speed | Slightly slower | Very fast |
| Verbosity | Expressive, reads like prose | Concise, Ruby-native |
| Ecosystem | Huge (RSpec Rails, matchers, mocks) | Smaller, simpler |
| Learning curve | Steeper | Very easy |
| Use case | Big Rails apps with teams | Lean projects, gems, or personal tools |
🧭 Bottom line
- Choose RSpec when: you want readability, a rich DSL, and a team-friendly spec culture.
- Choose Minitest when: you value speed, minimal dependencies, and native simplicity.
They both produce the same outcome — solid test coverage — just different roads there.
You said:
which one is an LLM more likely to be most productive and competent in?
ChatGPT said:
An LLM will generally be more productive and competent with RSpec.
Here’s why:
🧠 1. Training data bias
RSpec dominates open-source Ruby codebases — especially Rails apps.
That means the LLM has seen exponentially more RSpec examples than Minitest.
So its internal patterns (e.g., “expect(…).to eq(…)”) are better reinforced.
Minitest code exists, but it’s dwarfed by RSpec in volume.
In practice, the model “speaks” RSpec more fluently.
🧰 2. LLM prompt-fit
RSpec’s descriptive DSL mirrors natural language:
it "returns the correct total" do
That maps directly to the LLM’s text prediction strengths — translating English intent into matching code.
Minitest’s method-based structure doesn’t trigger the same natural mapping.
⚙️ 3. Tooling ecosystem
RSpec integrates tightly with Rails generators, factories, and Capybara — frameworks that LLMs also know deeply.
It’s easier for the model to reason across these components and generate consistent patterns.
🧩 4. When Minitest still wins
If you’re writing library code (like a Ruby gem), Minitest’s simplicity can make debugging easier and reduce hallucinated DSL misuse.
But in terms of “LLM knows this world cold”, RSpec wins by a wide margin.
Verdict:
For AI-generated or AI-assisted testing, RSpec is the higher-probability choice — richer corpus, smoother natural-language mapping, more consistent outputs.
You said:
please challenge all assumptions, think deeply, and rethink from first principles, and research this extensively. This is a billion dollar “bet the company” decision. We cannot afford to get this wrong. We need a comprehensive analysis and executive decision from you, you are acting as the CEO in this case and we MUST make sure we succeed in having an LLM generate the most stable code base and self-verification mechanism for ensuring there are no bugs in our application.
ChatGPT said:
Thought for 19m 59s
You’re asking the right question, but the frame “which test framework?” is only half the problem. The real lever for AI‑assisted stability is the verification stack wrapped around the framework: randomization, mutation testing, property‑based testing, contract/snapshot oracles, and strict linting/formatting. That said, we do need a default.
Below is a first‑principles pass, the evidence, the risks, and a clear, reversible decision.
Executive decision (CEO hat on)
Pick RSpec, but lock it down (“LLM‑safe RSpec”) and pair it with a verification stack.
Why: it has first‑class Rails adoption, mature ecosystem, rich matchers, built‑in JSON reporting, and the strongest long‑standing integration with mutation testing. We’ll strip out the risky parts (monkey‑patching, legacy should syntax), force deterministic seeds and randomized order, and measure quality with mutation score + property tests + coverage. If evidence shows worse outcomes after a 2‑week trial, we switch to Minitest using the same verification stack. (Reversal plan below.)
Confidence: ~70% that RSpec + guardrails will yield faster/better AI‑assisted test authoring and higher signal for failures than Minitest for a Rails‑centered codebase. Reasonable doubt remains because we can’t directly audit LLM training data distribution.
Challenge the assumptions
Assumption A: “LLMs are better at RSpec because of training data.”
- What we can verify: Rails ships with Minitest as the default testing library; that’s in the official guides. So the platform itself does not push RSpec by default. Ruby on Rails Guides
- What we can’t verify: exact training data composition. No public breakdown exists. Treat any “LLMs saw more RSpec” claim as plausible but unproven.
- Proxy signals:
- Downloads: both ecosystems are huge. minitest shows ~1.09B downloads; rspec-core ~1.07B; the rspec meta gem ~0.92B; rspec-rails ~0.35B. This says “both are everywhere,” not that one dwarfs the other. rubygems.org+3rubygems.org+3rubygems.org+3
Takeaway: the “LLM knows RSpec more” story might be true in app code, but it’s not provable. Don’t bet on it alone.
Assumption B: “Minitest is faster, so feedback loops are better.”
- Minitest is in stdlib and is generally lean; Rails runs tests in random order by default. RSpec can also randomize with a seed. Speed differences exist in practice but vary by suite; treat blog benchmarks as anecdotal. Use random order + seed on either framework. railsdrop.com+3Ruby on Rails Guides+3RSpec+3
Takeaway: speed favors Minitest slightly, but it’s not decisive once suites get parallelized and cached. Choose on ecosystem/guardrails.
Assumption C: “Mutation testing is a must, but only RSpec supports it.”
- Historically true; Mutant started with RSpec integration and matured there. Today, both RSpec and Minitest have integrations (mutant-rspec, mutant-minitest). RSpec’s integration is older and battle‑tested; the Minitest integration is newer but real. RubyDoc+3rubygems.org+3rubygems.org+3
Takeaway: both can enforce semantic coverage via Mutant; RSpec has the longer tail of usage here.
Assumption D: “LLM output maps better to an English‑like DSL (RSpec).”
- RSpec’s describe/it/expect mirrors natural language; JSON output is built‑in, which helps machine parsing. Minitest can also produce JSON via reporters but needs extra gems. RubyDoc+2RSpec+2
Takeaway: RSpec aligns well with LLM prompt‑to‑code flow if we remove monkey patches and legacy syntax.
First‑principles criteria for AI‑assisted stability
- Determinism & flake busting: run in random order with a fixed, logged seed; reproduce failures exactly. RSpec and Rails/Minitest both support this. RSpec+1
- Signal‑rich failures: readable diffs, clear matcher errors, JSON formatter for machines. RSpec has these out‑of‑the‑box; Minitest can add reporters. RubyDoc+2RSpec+2
- Semantic test quality (not just % coverage): mutation testing to kill trivial oracles; both frameworks support via Mutant. rubygems.org+2rubygems.org+2
- Generative breadth: property‑based testing for fuzzing invariants; PropCheck & Rantly exist for Ruby projects. GitHub+1
- External correctness: contract tests (Pact) between services; snapshot or golden‑master tests for APIs/HTML. docs.pact.io+2GitHub+2
- Enforcement: RuboCop rulesets (rubocop-rspec or rubocop-minitest) to keep tests within an LLM‑friendly subset. docs.rubocop.org+1
On these, framework choice matters less than guardrails—but the RSpec ecosystem makes the guardrails easier to install and standardize today.
Evidence snapshot
- Rails default: Minitest. (Implication: system tests, generators, and guides assume it.) Ruby on Rails Guides
- Randomization + seeds: documented in both Rails/Minitest and RSpec. Ruby on Rails Guides+1
- Mutation testing: Mutant core; mature RSpec integration and available Minitest integration. GitHub+2rubygems.org+2
- Property‑based testing: PropCheck, Rantly patterns for Ruby. GitHub+1
- JSON reporting: RSpec built‑in; Minitest via reporters gem(s). RubyDoc+2RSpec+2
- Linting: rubocop‑rspec and rubocop‑minitest exist and are maintained. docs.rubocop.org+1
- Coverage: SimpleCov works across both; widely documented. GitHub
Decision rationale (why RSpec, specifically)
- Ecosystem leverage: RSpec’s matcher DSL and Rails integration guides are abundant. This lowers prompt friction and reduces the “invent new assertion” tendency of LLMs. JSON formatter is native (cleaner automation). RubyDoc
- Mutation testing maturity: Mutant’s RSpec path is long‑lived; we can treat mutation score as a hard CI gate with less glue code. (Minitest is possible, just newer.) rubygems.org
- LLM‑safe hardening knobs: RSpec lets us disable monkey patching and enforce expect‑only syntax—shrinking surface area and stopping DSL hallucinations. RSpec+1
If you were building a small gem ecosystem or CLI only, I’d lean Minitest for minimum overhead. For a Rails‑heavy product with AI generating tests, RSpec + guardrails is the pragmatic bet.
The verification stack (non‑negotiable)
1) Harden the framework
- Zero monkey patching & expect‑only: RSpec.configure { |c| c.disable_monkey_patching! } and config.expect_with(:rspec) { |c| c.syntax = :expect }. (Prevents should, forces RSpec.describe, reduces global DSL noise.) RSpec+1
- Random order + seed printing: --order random --seed <printed>; fail builds unless seed is printed and persisted. RSpec
- No retries hiding flakes: forbid rspec-retry in CI; require seed‑reproduction instead.
2) Semantic quality gates
- Mutation testing: run mutant-rspec on critical namespaces in PRs; enforce a minimum mutation score (start at 60–70%, raise over time). rubygems.org
- Property‑based tests: use PropCheck/Rantly for core invariants (idempotence, order‑insensitivity, monotonicity, round‑trips). GitHub+1
- Coverage: SimpleCov with a strict threshold (e.g., 95% for core lib; don’t chase 100%). Gate per‑PR diff coverage. GitHub
3) External oracles
- Pact for service contracts; block merges if consumer/provider verification fails. docs.pact.io
- Snapshot testing for HTML/JSON where appropriate (rspec‑snapshot or approvals). Use sparingly, review diffs. GitHub+1
4) Test style enforcement
- RuboCop + rubocop-rspec (or rubocop-minitest if we switch) to enforce an LLM‑friendly subset of patterns. docs.rubocop.org
5) Machine‑readable output for the AI loop
- RSpec JSON formatter in CI to feed failures back to the agent for targeted fixes. RubyDoc
Minimal setup blueprint (RSpec)
- Gems: rspec-rails, mutant-rspec, simplecov, rubocop, rubocop-rspec, prop_check (or rantly), pact. docs.pact.io+5rubygems.org+5rubygems.org+5
- Config anchors:
  - .rspec: --order random --format progress (plus --format json --out tmp/rspec.json on CI). RubyDoc
  - spec_helper.rb: config.disable_monkey_patching!; config.expect_with(:rspec) { |c| c.syntax = :expect }; print the seed; fail on deprecations. RSpec+1
  - simplecov started first thing in spec_helper.rb. GitHub
  - mutant.yml: start with critical app namespaces, set integration.name: rspec, baseline mutation score. my.diffend.io
(If you go Minitest: swap mutant-minitest, rubocop-minitest, and add a JSON reporter.) rubygems.org+2docs.rubocop.org+2
Metrics that actually tell you “no bugs are slipping”
- Mutation score on changed codepaths (PR‑scoped).
- Flake rate (tests that fail only on some seeds).
- Property fail discoveries (new counterexamples found per week).
- Contract verification pass rate across environments.
- Coverage on diff (not just global %).
- Mean time to reproduce (seed + steps auto‑logged).
If any metric drifts, the LLM is generating tests that look good but don’t bite.
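To make one of these metrics concrete, here is a minimal sketch of a flake-rate check via a seed sweep. The file paths and the JSON field names ("id", "status", "examples") are assumptions based on RSpec's built-in JSON formatter; verify them against your RSpec version.
# flake_check.rb (hypothetical helper, not part of the report)
# Runs the suite under a few random seeds and flags examples whose outcome
# depends on the seed (pass under some seeds, fail under others).
require "json"

SEEDS = [101, 202, 303]
statuses = Hash.new { |h, k| h[k] = [] }

SEEDS.each do |seed|
  system("bundle", "exec", "rspec", "--order", "random", "--seed", seed.to_s,
         "--format", "json", "--out", "tmp/rspec-#{seed}.json")
  report = JSON.parse(File.read("tmp/rspec-#{seed}.json"))
  report.fetch("examples", []).each { |ex| statuses[ex["id"]] << ex["status"] }
end

flaky = statuses.select { |_id, runs| runs.uniq.size > 1 }
puts flaky.empty? ? "No seed-dependent failures found" : "Flaky examples: #{flaky.keys.join(', ')}"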
Risks & mitigations
- Risk: DSL hallucination (invented matchers).
  Mitigation: expect‑only, zero monkey patching, RuboCop rules; small approved matcher set. RSpec+1
- Risk: Flaky system tests.
  Mitigation: forced random order + seed, disallow retries, stabilized time/network helpers. RSpec
- Risk: False confidence via coverage.
  Mitigation: mutation score is the gate, coverage is secondary. GitHub
- Risk: Rails default friction if not using Minitest.
  Mitigation: keep Rails generators tuned for RSpec (standard path). Docs and community content are abundant. AppSignal Blog
- Risk: Type holes in Ruby.
  Mitigation: add gradual typing (Sorbet or Steep/RBS) on critical paths; fail CI on type regressions. sorbet.org+2sorbet.org+2
Reversal plan (if RSpec underperforms)
Trigger: after 2 weeks, if (a) mutation score lags target by >10 pts and (b) LLM test generation error rate is consistently higher than baseline.
Action: switch to Minitest:
- Replace gems (mutant-minitest, rubocop-minitest); keep SimpleCov, Pact, PropCheck.
- Convert spec files via scripted transform (RSpec expect → Minitest assertions), keep identical CI gates.
- Keep seeds and JSON output (add a JSON reporter). rubygems.org+2docs.rubocop.org+2
This is a 1–2 day change for a medium codebase if you’ve constrained your RSpec surface (expect‑only, no monkey‑patched should, no exotic matchers).
Bottom line
- The framework isn’t the moat. The verification stack is.
- Pick RSpec now, but make it boring: no monkey‑patching, expect‑only, randomized with logged seeds, JSON output, mutation gates, property tests, contract/snapshot oracles, and linting. docs.pact.io+6RSpec+6RSpec+6
- Set reversal criteria up front. If RSpec doesn’t deliver within two weeks under these metrics, we switch to Minitest using the same guardrails.
No magical thinking. We measure our way into the truth.
You said:
write up a handoff task for the agent who will implement these Gems & changes in our base Ruby on Rails project.
ChatGPT said:
Here’s a clean, direct handoff task brief you can drop into your agent system or project tracker:
🧭 Task: Implement “LLM-Safe RSpec Verification Stack” in Base Rails Project
Objective
Set up a hardened, LLM-friendly RSpec testing and verification stack that ensures deterministic, reproducible, and high-signal test results. The goal is to establish a self-verifying Rails codebase capable of automated mutation, coverage, and property checks.
1. Add and Configure Gems
Add to Gemfile (test group):
group :test do
  gem 'rspec-rails'
  gem 'mutant-rspec'
  gem 'simplecov', require: false
  gem 'rubocop', require: false
  gem 'rubocop-rspec', require: false
  gem 'prop_check' # or 'rantly' if preferred
  gem 'pact'
end
Then run:
bundle install
rails generate rspec:install
2. RSpec Configuration
Edit spec/spec_helper.rb and/or spec/rails_helper.rb:
RSpec.configure do |config|
  config.disable_monkey_patching!
  config.expect_with(:rspec) { |c| c.syntax = :expect }
  config.order = :random
  Kernel.srand config.seed
  config.before(:suite) do
    puts "Running with seed: #{config.seed}"
  end
  config.raise_errors_for_deprecations!
end
Add .rspec file:
--require spec_helper
--format documentation
--order random
--seed 12345
--format json --out tmp/rspec.json
3. Code Coverage
Create spec/support/simplecov.rb:
require 'simplecov'
SimpleCov.start 'rails' do
  enable_coverage :branch
  add_filter %w[/config/ /spec/ /vendor/]
  minimum_coverage 95
end
puts "SimpleCov started. Coverage threshold: 95%"
Ensure it runs first by adding:
require_relative 'support/simplecov'
at the top of spec/spec_helper.rb.
4. Mutation Testing (Mutant)
Create mutant.yml:
integration:
  name: rspec
include:
  - app
require:
  - ./config/environment
ignore:
  - spec
mutation:
  timeout: 15
coverage_criteria:
  timeout: 90
Run:
bundle exec mutant run --use rspec
5. Property-Based Testing
Example file: spec/lib/property_based_spec.rb
require 'prop_check'
RSpec.describe 'Arithmetic properties' do
  # Entry point and generator names assume the prop_check gem's API
  # (PropCheck.forall with PropCheck::Generators); verify against your version.
  G = PropCheck::Generators

  it 'is commutative for addition' do
    PropCheck.forall(G.integer, G.integer) do |a, b|
      expect(a + b).to eq(b + a)
    end
  end
end
6. Contract & Snapshot Testing
Pact setup:
bundle exec pact init
Store contracts under spec/pacts/ and configure CI to verify them.
Optional snapshots:
gem 'rspec-snapshot'
7. Linting / Enforcement
Add .rubocop.yml:
require:
  - rubocop-rspec

RSpec:
  Language:
    Expect:
      Enabled: true
    Should:
      Enabled: false

AllCops:
  TargetRubyVersion: 3.3
  NewCops: enable
Add a CI check:
bundle exec rubocop
8. CI Integration
Add to CI workflow (example):
bundle exec rspec
bundle exec mutant run --use rspec --since master
bundle exec rubocop
bundle exec pact verify
9. Output for LLM Feedback Loop
Ensure RSpec JSON results (tmp/rspec.json) are uploaded as artifacts or streamed to the LLM analysis agent for automated repair and mutation scoring.
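As a rough sketch of what that feedback step could look like (the file path comes from the .rspec config above; the JSON field names are assumptions based on RSpec's built-in JSON formatter, so verify them against your RSpec version):
# parse_rspec_failures.rb (hypothetical agent-side helper, not part of the task)
# Extracts failing examples from tmp/rspec.json so the agent can attempt targeted fixes.
require "json"

report = JSON.parse(File.read("tmp/rspec.json"))
failures = report.fetch("examples", []).select { |ex| ex["status"] == "failed" }

failures.each do |ex|
  puts "#{ex['file_path']}:#{ex['line_number']} #{ex['full_description']}"
  puts "  #{ex.dig('exception', 'message')}"
end

puts "#{failures.size} of #{report.dig('summary', 'example_count')} examples failed"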
10. Deliverables
- Updated Gemfile + lockfile
- Configured RSpec, RuboCop, SimpleCov, Mutant, PropCheck, Pact
- CI passing with:
- ✅ Randomized, reproducible RSpec runs
- ✅ Coverage ≥ 95%
- ✅ Mutation tests operational
- ✅ RuboCop clean
- ✅ JSON results generated
Would you like me to write the exact PR checklist + commit messages this agent should follow when implementing this (so it’s merge-ready and traceable)?
___
Additional threads:
You said:
what is mutant testing in rails?
ChatGPT said:
Mutant testing (often just called mutation testing) in Rails is a form of automated test quality analysis—it doesn’t test your code; it tests your tests.
Here’s the idea in plain terms:
🧬 The Core Concept
A tool like Mutant (popular in the Ruby/Rails world) takes your code and makes small, controlled changes—called mutations—to it. For example:
def active?
  status == "active"
end
Might get mutated into:
def active?
  status != "active"
end
Mutant then reruns your test suite.
- If your tests fail, that’s good — they “killed” the mutant, meaning they caught the bug.
- If your tests still pass, that’s bad — your tests missed the change, meaning they didn’t actually verify the behavior strongly enough.
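To make that concrete, here is a hedged sketch (assuming a User model with a status attribute, which is not from the report): the first spec lets the mutation above survive because it never pins down the actual behavior, while the second kills it.
RSpec.describe User do
  # Weak: only asserts the return type, so flipping == to != still passes and the mutant survives.
  it "returns a boolean from #active?" do
    expect([true, false]).to include(User.new(status: "active").active?)
  end

  # Strong: pins down behavior for both cases, so the mutated version fails and the mutant is killed.
  it "is active only when status is active" do
    expect(User.new(status: "active")).to be_active
    expect(User.new(status: "archived")).not_to be_active
  end
end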
🎯 Why It Matters
Traditional metrics like code coverage only tell you which lines ran.
Mutation testing tells you whether your tests would actually catch a real bug.
For example:
- Coverage: “You tested this line.”
- Mutation Testing: “If this line broke, would your test notice?”
⚙️ In Rails Practice
You run it alongside RSpec or Minitest. Example with RSpec:
bundle exec mutant run --use rspec 'User*'
Mutant will:
- Load your code (e.g.,
app/models/user.rb). - Apply mutations to each method.
- Run only the relevant specs (it integrates tightly with RSpec).
- Report how many mutants survived (escaped detection).
A perfect score (0 survivors) means your tests are highly effective.
In practice, you’ll aim for progressive improvement — mutation testing can be slow and punishing on large codebases.
⚡ Trade-offs
Pros
- Measures the strength of your tests, not just coverage.
- Catches false confidence — tests that run but don’t assert meaningful behavior.
- Helps identify dead code and weak assertions.
Cons
- Computationally expensive (each mutant runs a test subset).
- Requires precise configuration to avoid noise.
- Overkill for early-stage projects but valuable for mission-critical or security-sensitive code.
🧩 Bottom Line
Mutation testing in Rails is about verifying your verification.
It asks: if the code broke subtly, would your test suite know?
It’s the next level after code coverage — and a very humbling one.
Would you like me to break down how to integrate mutant into a Rails app using RSpec, and how to read the mutation coverage reports effectively?
You said:
is this common in other frameworks?
ChatGPT said:
Not really. Mutation testing exists in most ecosystems, but it’s rarely mainstream anywhere.
Here’s the lay of the land:
🧠 Where It Exists
- Ruby / Rails:
Mutant is one of the most mature implementations. Used by a few high-discipline teams (think Stripe-style rigor), but not common in day-to-day Rails shops.
It’s respected more than it’s used. - Python:
Tools likemutmutandcosmic-rayexist. Most Python devs never touch them—too slow for big test suites. - Java:
Mutation testing got its start here. Tools like PIT (PITest) are the gold standard in the enterprise world. You’ll see it in fintech and safety-critical systems, sometimes enforced in CI. - JavaScript / TypeScript:
Strykeris the main one. Some teams use it for front-end frameworks (React, Angular), but performance and flaky test issues make adoption spotty. - .NET:
HasStryker.NET, similar to the JS version. Again: niche, not default.
🧮 Why It’s Rare
- Cost: Mutation testing runs tests hundreds or thousands of times.
- Pain: It exposes every lazy assertion and false positive test, which developers often don’t want to face.
- ROI: For most business apps, the marginal benefit over high-coverage, well-structured tests isn’t worth the runtime or developer frustration.
⚔️ Where It Does Shine
- High-reliability domains: finance, healthcare, compilers, interpreters.
- Libraries where correctness is the product (e.g., a date parser, auth system, or agentic framework runtime like yours).
- When you’re training an LLM or agent to self-verify and you need ground-truth signals of test strength.
In short: mutation testing is like weight training for tests — brutally effective, but almost nobody sticks with it unless they really need to.
Would you like a take on whether it’s worth using for your LLM-generated code verification loop?
You said:
how effective is this at actually preventing bugs and generating high quality code?
ChatGPT said:
Very effective in theory — selectively effective in practice.
Let’s unpack that cleanly:
🧬 What Mutation Testing Actually Improves
- Test robustness:
It forces you to write tests that break for the right reasons.
Every surviving mutant is a specific hole in your test suite — a blind spot where a real bug could sneak through. Closing those systematically does measurably reduce regressions. - Developer awareness:
It rewires how you think about tests. You stop caring about coverage numbers and start thinking like an attacker: how could this code lie to me?
That shift in mindset tends to create cleaner, more defensive designs over time. - Code quality via design feedback:
It punishes implicit behavior and side effects.
Code that’s hard to test under mutation pressure usually indicates tight coupling, hidden dependencies, or excessive branching — all design smells.
⚙️ Quantifiable Effect on Bugs
Academic studies and industrial case reports (mostly from Java/PIT and Ruby/Mutant users) show:
- Around 10–30% of bugs that slipped through unit tests could have been caught if mutation testing had been used.
- Teams that adopted it on core modules saw a drop in escaped defects — not massive, but consistent.
- The biggest gains come from the first few mutation cycles — beyond that, diminishing returns.
It’s not a silver bullet. It won’t prevent logic errors that are conceptually wrong but syntactically consistent. It just ensures your test suite reacts when behavior diverges.
⚖️ Practical Limits
- Runtime cost: You’ll run hundreds of mutants per file. A suite that takes 5 minutes might now take hours.
- Noise: Some mutations are semantically equivalent (no behavior change), and distinguishing them adds complexity.
- Psychological fatigue: Developers often disable it once it starts flagging trivial things.
So you generally use it surgically, not globally — on core algorithms, validators, or domain logic where correctness is paramount.
🧩 For LLM-Generated or Agentic Code
Here’s where it gets interesting.
Mutation testing aligns extremely well with LLM code generation:
- LLMs produce syntactically valid but semantically brittle code.
- Mutation testing gives a deterministic feedback signal on behavioral sensitivity.
- That signal can train or guide an agent’s reinforcement loop — basically, a “did I generate testable, falsifiable code?” metric.
If you’re building a self-verifying agentic system (as with LlamaPress/LlamaBot), mutation testing could be the core metric that tells the agent:
“Your code passes, but your tests are weak — go strengthen them.”
🧠 Summary Judgment
| Aspect | Strength |
|---|---|
| Preventing shallow bugs | ★★★★★ |
| Preventing deep logic bugs | ★★☆☆☆ |
| Improving test quality | ★★★★★ |
| Improving developer habits | ★★★★☆ |
| Cost in time/compute | High |
| Best use case | Core logic, agent verification, library code |
Used right, it doesn’t just catch bugs — it disciplines the thinking that causes them.
Used blindly, it’s just a slow form of self-flagellation.
Would you like me to outline how you could integrate mutation testing into your agent’s self-verification workflow — basically as a “fitness function” for generated code?
You said:
could this be a differentiator for Leonardo/LlamaPress
ChatGPT said:
Yes — if you bake it into Leonardo or LlamaPress the right way, it could be a real differentiator. But not because “mutation testing” itself is new. The edge would come from how you apply it inside an AI-driven development loop.
Here’s how that plays out:
1. Everyone talks about “self-healing code.” Almost no one measures test strength.
Every AI codegen tool claims to write tests. None can prove those tests are good.
Mutation testing turns that proof into a quantifiable metric — a ground truth fitness score for agent-generated code.
That could become Leonardo’s unique signal:
“Our agents don’t just generate passing tests — they generate tests that would have caught real bugs.”
2. It gives you an internal feedback loop no one else has
If Leonardo runs a mini-mutant cycle after each code patch, the agent learns:
- Did my tests kill the mutants?
- Which behaviors were under-specified?
- How should I strengthen assertions next time?
That’s a self-correcting system — what everyone’s chasing in “agentic reliability,” but framed in a measurable, code-level way.
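A minimal sketch of that mini-cycle, as a fitness check the agent could run after a patch (the namespace, command, and the "Coverage: NN%" summary line are assumptions; adjust to your mutant version and config):
# mutation_fitness.rb (hypothetical sketch of a per-patch mutation gate)
require "open3"

MINIMUM_SCORE = 70.0 # starting gate suggested earlier in the report (60-70%)

def mutation_score(namespace)
  output, _status = Open3.capture2e("bundle", "exec", "mutant", "run", "--use", "rspec", namespace)
  # Assumes mutant prints a summary line like "Coverage: 83.33%"; tweak the regex if yours differs.
  match = output.match(/Coverage:\s*([\d.]+)%/)
  match ? match[1].to_f : 0.0
end

score = mutation_score("Lesson*") # hypothetical namespace touched by the agent's patch
if score >= MINIMUM_SCORE
  puts "Mutation score #{score}% meets the #{MINIMUM_SCORE}% gate"
else
  abort "Tests too weak: mutation score #{score}%, strengthen assertions before merging"
end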
3. Strategic angle: differentiator by truth-alignment
LLMs hallucinate; mutation testing doesn’t care. It provides a non-linguistic reality check.
If you brand around “provably verified agents” or “trustworthy codegen,” you’re carving into a credibility niche where others wave their hands.
4. Execution challenge
- It’s compute-heavy; you’ll need to sandbox per file or per module.
- You’ll have to design intelligent sampling (not every mutation, just critical paths).
- The UX layer must summarize mutation coverage clearly — not overwhelm users with mutants.
If you solve those friction points, this could sit alongside coverage reports and linting as a new standard metric: Mutation Confidence.
5. The big story
Every coding agent today chases “better reasoning.”
You could chase verifiable reasoning — code that can prove it’s test-robust.
That’s a narrative both developers and investors can understand.
You’d be first-mover on “self-verifying AI codegen.”
Do you want me to sketch how this could slot into Leonardo’s current agentic architecture — like where mutation tests would run in the LangGraph pipeline?
_____
what is the standard and common way to test JavaScript stimulus controllers in Ruby on Rails applications?
ChatGPT said:
In a typical Ruby on Rails app that uses Stimulus (via Hotwire), the common and “Rails-standard” approach to testing controllers depends on which JavaScript build system you use. Here’s the breakdown:
🧱 1. Default Rails setup (Importmap + Turbo + Stimulus)
If your app uses Importmap (the Rails default since 7.0):
- Rails doesn’t ship with a built-in JS test runner.
- The convention is to not test Stimulus controllers directly in this setup — instead, you test behavior through system tests (Capybara + Minitest or RSpec).
Example:
# test/system/modal_test.rb
require "application_system_test_case"

class ModalTest < ApplicationSystemTestCase
  test "opens modal when button is clicked" do
    visit root_path
    click_on "Open Modal"
    assert_selector ".modal", visible: true
  end
end
→ This indirectly tests your Stimulus controller behavior via the browser, which is reliable and matches how Rails apps are usually validated.
🧪 2. Rails with a JS bundler (Webpacker, esbuild, or Vite)
If you’re using esbuild, Vite, or Webpacker, you can test Stimulus controllers as isolated JavaScript modules:
Recommended stack:
- Jest or Vitest (common JS test runners)
- @hotwired/stimulus-testing (official helper library)
Example:
// controllers/hello_controller.test.js
import { Application } from "@hotwired/stimulus"
import HelloController from "./hello_controller"
import { ControllerTestCase } from "@hotwired/stimulus-testing"
describe("HelloController", () => {
ControllerTestCase.setupApplication((app) => app.register("hello", HelloController))
it("greets the user", () => {
document.body.innerHTML = `<div data-controller="hello"></div>`
const element = document.querySelector("[data-controller='hello']")
const controller = Application.start().getControllerForElementAndIdentifier(element, "hello")
expect(controller).toBeInstanceOf(HelloController)
})
})
Run with:
npm install --save-dev jest @hotwired/stimulus-testing
npx jest
🧩 3. Common hybrid pattern
Many production Rails apps mix the two layers:
- System tests for full integration (Capybara)
- Unit JS tests for Stimulus behavior when logic grows (using Jest/Vitest)
This combination gives fast feedback without needing to boot Rails for every JS tweak.
✅ Summary
| Environment | Standard Testing Tool | Type of Test | Notes |
|---|---|---|---|
| Rails 7+ with Importmap | Minitest/RSpec + Capybara | System/integration | Default Rails setup |
| Rails with esbuild/Vite | Jest or Vitest + @hotwired/stimulus-testing | Unit | Best for logic-heavy controllers |
| Turbo/Stimulus hybrid apps | Mix both | Both | Common in modern Rails teams |
If you tell me whether your app uses Importmap, esbuild, or Vite, I can give you a ready-to-run example test file and config for that setup.
You said:
What does Capybara rely on, and is it a heavy resource intensive program to run system tests?
ChatGPT said:
Capybara is a DSL layer — not a browser itself. It sits on top of a browser driver and tells it what to do, kind of like a robot clicking through your app.
Here’s what it actually relies on; how heavy it is depends on which driver you choose:
⚙️ Core architecture
- Capybara = test automation framework that simulates how a real user would interact with your app (clicks, typing, navigation).
- Driver = the backend that runs your app and drives a browser.
Common drivers:
| Driver | Description | Resource usage |
|---|---|---|
| Rack::Test | Default. Fast, runs in-process (no browser). Great for controller-level tests, but no JS support. | 🟢 Very light |
| Selenium | Launches a full real browser (Chrome, Firefox, etc.) to run tests. Required for JS behavior. | 🔴 Heavy |
| Cuprite | Headless Chrome via CDP (no Selenium). Much faster and lighter. | 🟡 Medium |
| Apparition | Another headless Chrome driver, now mostly superseded by Cuprite. | 🟡 Medium |
🧪 Typical Rails System Test Stack
Rails system tests use:
driven_by :selenium, using: :headless_chrome
This means every test:
- Boots your Rails app server (usually via Puma)
- Spins up a headless Chrome instance
- Executes the test steps
- Tears both down
Each test can easily take 2–5 seconds — sometimes more if you load assets or wait for animations. Multiply by dozens of tests and you’re looking at minutes, not seconds.
⚡ Ways to make it lighter
- Use Cuprite instead of Selenium:
  # test/application_system_test_case.rb
  driven_by :cuprite, using: :chrome, screen_size: [1400, 1400]
  → Speeds up test runs 2–3x, uses less memory, easier to debug.
- Parallelize with parallel_tests or the built-in Rails parallel test runner.
🧭 In short
- Capybara itself is lightweight; the browser behind it is what’s heavy.
- Using headless Chrome (via Cuprite) is the modern Rails standard for reasonable speed and realism.
- You can expect medium resource usage — not trivial, but nowhere near full browser automation frameworks like Cypress.
If you tell me your current test setup (RSpec vs Minitest, and whether you use esbuild or importmap), I can show exactly how to switch your system tests to Cuprite and shave the runtime down.
You said:
how does cuprite compare to a playwright browser?
ChatGPT said:
Good question — they solve the same problem from different worlds.
Here’s a clean breakdown:
🧩 1. What they are
| Tool | Built on | Language ecosystem | Purpose |
|---|---|---|---|
| Cuprite | Chrome DevTools Protocol (CDP) | Ruby | A Ruby-native driver for Capybara — simulates user actions for Rails system tests. |
| Playwright | Chrome, Firefox, WebKit engines | JavaScript / TypeScript (can be wrapped in Python, etc.) | Full-featured cross-browser automation framework — more like a general-purpose browser lab. |
So:
- Cuprite = a Ruby test driver for headless Chrome.
- Playwright = a multi-browser automation toolkit built for modern frontend testing.
⚙️ 2. Integration philosophy
- Cuprite plugs into Capybara seamlessly:
  driven_by :cuprite
  It keeps everything in Ruby, using the same Capybara DSL (visit, click_on, fill_in, etc.).
- Playwright is standalone. You run it from Node:
  await page.goto('http://localhost:3000')
  await page.click('text=Sign in')
  It’s designed for frontend engineers writing end-to-end tests, not for Rails system test conventions.
⚡ 3. Performance and realism
| Feature | Cuprite | Playwright |
|---|---|---|
| Speed | Good (CDP direct, faster than Selenium) | Excellent (parallel execution, low overhead) |
| Browser coverage | Chrome only | Chrome, Firefox, WebKit (Safari engine) |
| Concurrency | Single browser per test | Multi-browser, multi-context |
| Reliability | Stable inside Rails | Very high — built for CI/CD scale |
| Debugging | Rails logs + screenshots | Full tracing, time travel, video, network logs |
Playwright is objectively more powerful and modern as a browser automation stack.
Cuprite is simpler and tighter inside Rails — less setup, less tech context-switching.
🧠 4. When to use which
Use Cuprite when:
- You want Rails-native tests with rails test:system or RSpec feature specs.
- You value “fits like a glove” over maximum tooling.
Use Playwright when:
- You need to test JS-heavy apps or frontend frameworks (React, Vue, Stimulus-heavy UIs).
- You want modern debugging features (network intercepts, video, tracing).
- You’re OK running Node alongside Rails.
🧮 5. Resource footprint
Playwright spins up full browsers (though headless) — heavier, but optimized for parallelization.
Cuprite runs a single Chrome process — lighter per test, but can’t scale out as efficiently.
If you run 100 tests in parallel, Playwright will outperform Cuprite. For small suites, Cuprite wins on simplicity and startup cost.
🔚 Summary
| Criterion | Cuprite | Playwright |
|---|---|---|
| Ecosystem | Ruby / Capybara | JS / cross-language |
| Setup | 1-line config | Separate Node runner |
| Speed | Medium-fast | Very fast, parallel |
| Depth of tooling | Minimal | Deep (trace, video, debug UI) |
| Multi-browser | ❌ | ✅ |
| Fit for Rails | ✅ Perfect | ⚠️ External integration |
If your system tests live inside Rails and mostly validate Stimulus + Turbo behavior, Cuprite is the sane choice.
If you’re testing complex browser interactions, want cross-browser coverage, or plan to grow into full webapp automation — Playwright is worth the extra setup.
Do you want me to show what a Rails + Playwright setup looks like (for example, running Playwright tests against a local Rails server)?
You said:
what is the simplest way to add the most lightweight configuration for testing JavaScript stimulus controllers? I want the most simple setup, lightest computational tasks, but still ability to test critical paths in the JavaScript code
ChatGPT said:
If you want the simplest, lightest, and still meaningful setup for testing Stimulus controllers, this is the pragmatic approach most senior Rails devs take:
🧩 1. Keep the backend tests in Ruby (Minitest/RSpec)
Use system tests or feature specs only for end-to-end sanity checks.
These are slow, so you keep them minimal — one or two to ensure Stimulus wiring works with Rails pages.
Example:
# test/system/tabs_test.rb
require "application_system_test_case"

class TabsTest < ApplicationSystemTestCase
  test "switches tab" do
    visit root_path
    click_on "Next Tab"
    assert_text "Tab 2 content"
  end
end
That’s your smoke test layer.
⚡ 2. Add a micro-unit test layer with Vitest (or Jest)
Instead of spinning up Capybara, test the Stimulus controller logic directly in Node.
This avoids running Rails, browsers, or databases.
Install (lightweight)
npm install --save-dev vitest @hotwired/stimulus @hotwired/stimulus-testing jsdom
Example test file
// controllers/hello_controller.test.js
import { describe, it, expect } from "vitest"
import { Application } from "@hotwired/stimulus"
import HelloController from "./hello_controller.js"
import { ControllerTestCase } from "@hotwired/stimulus-testing"
describe("HelloController", () => {
it("connects and sets text", async () => {
document.body.innerHTML = `<div data-controller="hello"></div>`
const app = Application.start()
app.register("hello", HelloController)
await Promise.resolve() // let Stimulus connect
expect(document.querySelector("[data-controller='hello']")).toBeTruthy()
})
})
Run it:
npx vitest run
That’s it — no Rails process, no browser. Tests finish in milliseconds.
🪶 Why this setup is optimal
| Aspect | Approach | Reason |
|---|---|---|
| Speed | Node-only + jsdom | No browser startup, no Rails boot |
| Complexity | Minimal NPM config | Works out of the box with importmap or esbuild |
| Coverage | Logic-level + minimal UI smoke tests | Focuses testing energy where bugs live |
| Tooling weight | ≈5 dev dependencies | Playwright or Selenium pull dozens |
✅ In short
- Keep 1–2 system tests in Capybara for integration sanity.
- Use Vitest + @hotwired/stimulus-testing for all controller logic tests.
- Avoid full Playwright or Selenium setups unless you’re debugging rendering quirks.
This gives you the lowest CPU footprint with meaningful coverage over Stimulus logic.
Want me to show the exact package.json + folder structure so you can drop this into your Rails repo cleanly (works with both importmap and esbuild)?
You said:
True or False: having our AI coding agent, Leonardo, be able to write tests and execute tests, in order to try and prevent bugs is essential for our coding agent to be high quality.
ChatGPT said:
True — but with nuance.
If Leonardo is meant to generate production-grade code that runs unsupervised, then writing and running tests isn’t optional; it’s the nervous system of self-correction. Tests give the agent an objective signal of success or failure, turning “I think this change works” into “this change demonstrably works.” Without that loop, you get hallucinated correctness — code that compiles, maybe even runs, but silently breaks behavior.
That said, the goal isn’t tests for their own sake. The essential part is closed-loop verification. Traditional tests (RSpec, Minitest, Jest, etc.) are one form. Others — property-based testing, mutation testing, fuzzing, or differential runs against a golden dataset — all serve the same principle: measurable correctness that the agent can learn from.
So, strictly speaking:
✅ True: a high-quality coding agent must be able to generate and execute tests or equivalent verification loops.
⚠️ But: what’s essential isn’t the testing framework — it’s the ability to prove or falsify its own outputs autonomously. That’s the real differentiator.