Blog

  • System Prompt Caching Allows Broke Founder to Pay for Groceries

    Last night, I noticed these emails where Anthropic was charging my credit card for $13-$15, multiple times a day.

    When I dug into this more, I saw that yesterday I had 47M input tokens for Claude!

    And it cost me ~$50. This was shocking because I recently switched to Claude Haiku 4.5 to avoid this precise issue.

    Leonardo is a hungry beast for tokens.

    I recently gave Leonardo the ability to write & run RSpec tests. And these tests take a lot of input tokens, especially if it’s dumping all your SQL transaction queries into the LLM context window. 🤷

    But beyond turning off verbose SQL statement logging to reduce the amount of input tokens, another thing stood out to me on Anthropic’s pricing page.

    Cached tokens are only $0.10 per million tokens! That’s 1/10th the cost. That sounds amazing to me, so I decided to implement it into Leonardo’s test building mode.

    But first, let’s just discuss costs and do an experiment.

    Notice how many tokens this uses just to ask it to explore my project and tell me about it.

    Claude Haiku ingested 139K tokens. (The system prompt alone is 16K tokens.)

    BUT, notice it had 8 turns! Meaning, this system prompt got fed back into Claude Haiku 8 times! 16K * 8 = 128K tokens, just on ingesting the system prompt alone!

    That single exchange cost 15 cents. And as the conversation drags on, the cost skyrockets, because each new message re-ingests the entire conversation, including all the previous tokens.
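    A quick back-of-the-napkin check on that math (a minimal sketch; the ~$1 per million uncached input tokens for Haiku is an assumption implied by the 1/10th cached rate above, so double-check Anthropic’s pricing page):

    # Rough input-cost math. Pricing here is an assumption (~$1/M uncached
    # input tokens for Haiku; $0.10/M when cached). Check the pricing page.
    HAIKU_INPUT_PER_MTOK = 1.00
    HAIKU_CACHED_PER_MTOK = 0.10

    def input_cost(tokens: int, price_per_mtok: float) -> float:
        return tokens / 1_000_000 * price_per_mtok

    print(input_cost(139_000, HAIKU_INPUT_PER_MTOK))     # ~$0.14, roughly the 15 cents above
    print(input_cost(16_000 * 8, HAIKU_INPUT_PER_MTOK))  # ~$0.13 just re-reading the 16K system prompt 8 times
    print(input_cost(16_000 * 8, HAIKU_CACHED_PER_MTOK)) # ~$0.01 if those re-reads were cache hits instead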

    If this was Claude Sonnet, it’d be even more expensive. This 139K tokens would cost 45 cents. But just wait, it gets worse.

    I then asked Leonardo to run RSpec, and the cost jumped up again. Now with 340,000 tokens, it cost 36 cents for this additional message from the user.

    For Claude Sonnet, this would have been $1.08.

    And now, this is where my budget for groceries really gets murdered: asking Leonardo to implement unit tests where there’s gaps in coverage!

    I asked Leonardo to implement new tests to boost test coverage. And Leonardo sent over 50!! messages, and implemented a handful of new tests using RSpec.

    This run cost $2.48.

    With Claude Sonnet, this would have been approximately $7.44.

    Now you can see why, after spending 12 hours on this yesterday, I ended up with a bill of $47 (even when using the affordable Claude Haiku). This makes both me and my girlfriend sad. (Sorry babe, we have to cancel our plans to see Wicked 2; Leonardo ate all our money.)

    But… I don’t want to complain too much, because this is still pretty awesome.

    Even if it’s expensive, Leonardo generated 1,244 new lines of high quality Ruby/RSpec code that tests our Ruby on Rails project. It boosted test coverage and wrote over 80 new tests.

    To do this, it ingested thousands of lines of code to understand the project well enough. I mean, just look at this beautiful Ruby code.

    Pre-LLMs, assume a Jr. Dev spent a full working day doing this. If you’re paying said Jr. Dev a $75K/year salary, we’re looking at ~$322.95 for that full day’s worth of work.

    So even if we used Claude Sonnet and it cost $10 for this, that’s an order of magnitude cheaper, and multiple orders of magnitude faster.

    That being said, I’m a broke founder.

    So if I’m choosing between paying for API credits and paying for groceries (not hyperbole if this is $50/day in token costs), then I’m going with groceries so I don’t starve.

    Luckily, I can have my LLM tokens and eat them too, by implementing something called prompt caching.

    The cost savings are dramatic, let me show you.

    The main code change was to switch away from LangChain’s SystemMessage object and use good ol’ JSON.

    At first, Claude Code tried to add in additional_kwargs={"cache_control": {"type": "ephemeral"}} as an input argument to the SystemMessage constructor, but I couldn’t get caching to actually trigger that way.

    In LangChain’s documentation on Anthropic Caching, they actually have the example where they just use a JSON object for the system message, so that’s what I decided to go with.

    Source: https://reference.langchain.com/python/integrations/langchain_anthropic/ChatAnthropic/#langchain_anthropic.chat_models.ChatAnthropic

    from langchain_anthropic import ChatAnthropic
    
    model = ChatAnthropic(model="claude-sonnet-4-5-20250929")
    
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Below is some long context:",
                },
                {
                    "type": "text",
                    "text": f"{long_text}",
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "What's that about?",
        },
    ]
    
    response = model.invoke(messages)
    response.usage_metadata["input_token_details"]
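    A quick way to confirm the cache is actually being hit (a minimal sketch continuing the example above; the cache_creation / cache_read keys are what I’d expect langchain-anthropic to report inside input_token_details, so treat the exact field names as an assumption for your version):

    # Invoke twice with the same cached system block and compare token details.
    # (Exact field names are an assumption based on langchain-anthropic's usage metadata.)
    first = model.invoke(messages)
    second = model.invoke(messages)

    print(first.usage_metadata["input_token_details"])   # expect cache_creation > 0 on the first call
    print(second.usage_metadata["input_token_details"])  # expect cache_read > 0 on the repeat call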

    As a reminder, here are the costs without caching:

    And here’s the exact same experiment run again, but with caching enabled:

    By just caching the system prompt, I cut the cost down to almost half of what it was previously!

    What’s awesome too, is that I’m only caching the system prompt.

    If I were properly caching the entire thread of messages, the cost would come down even more.
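    For reference, the usual pattern in Anthropic’s prompt caching docs is to also keep a cache_control breakpoint on the newest message, so each turn can reuse the previously cached conversation prefix. Here’s a minimal sketch of that idea (not Leonardo’s actual code), using the same plain-dict messages as the example above:

    # Minimal sketch: move the ephemeral cache breakpoint to the newest message,
    # so the whole conversation prefix (system prompt + earlier turns) gets reused.
    def add_user_turn(messages: list, text: str) -> list:
        # Anthropic only allows a few cache breakpoints, so clear old ones
        # from non-system messages before marking the new last message.
        for msg in messages:
            if msg["role"] != "system" and isinstance(msg["content"], list):
                for block in msg["content"]:
                    block.pop("cache_control", None)
        messages.append({
            "role": "user",
            "content": [{
                "type": "text",
                "text": text,
                "cache_control": {"type": "ephemeral"},
            }],
        })
        return messages

    messages = add_user_turn(messages, "Now run the RSpec suite and summarize any failures.")
    response = model.invoke(messages)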

    But, since I’m a lazy founder just trying to ship capability increases (such as making Leonardo 10x smarter for coding Ruby on Rails projects), I’ll hold off on more LLM caching work until I get my next crazy high bill from Anthropic. 🙂

    In summary, if you’re a broke founder worried about the costs of groceries, yet you’re hopelessly addicted to using Claude to write code, and especially if you’re writing your own system prompts, consider implementing Anthropic’s prompt caching mechanism.

    With ❤️,

    • Kody

  • AI Coding Agent that Writes Rails Tests using Rspec

    I’ve decided to add a new agent mode to Leonardo that can write RSpec unit tests for self-verification.

    Here’s the deep-research report that led to me choosing RSpec over Minitest.

    I didn’t know about the concept of mutation testing in unit tests, but it sounds extremely powerful, and Ruby’s tooling for it (Mutant) is among the most mature out there. This could be huge for agents that are self-verifying, and can help agents write extremely high-quality tests that keep the application stable.

    This need arose from the lesson builder we’re creating with HistoryEducation: we’ve kept running into very small & subtle bugs, and Leonardo currently doesn’t have an AI agent mode that focuses explicitly on writing end-to-end tests for our Rails app.

    So, I’m implementing that today in order to make sure whatever projects are built with Leonardo are high quality, highly tested, and stable for their end users.

    This is one of the most exciting applications of code generation: higher-quality, less buggy software. The highest-leverage way to get there is having the AI agent adopt the test-driven development practices that many software engineers (myself included) aren’t disciplined enough to follow consistently, even when we know we should.

    This new agent mode will be included in the next release of LlamaBot & Leonardo, 0.2.20. I’ll link to the PR and release notes when they’re ready.

    Here’s the raw DeepResearch report generated by ChatGPT 5’s Pro Mode. (The rest of this article is the raw chat transcript, as generated by ChatGPT):

    _____

    what is the difference between rspec and minitest?

    ChatGPT said:

    Both are Ruby testing frameworks — the difference is mostly in philosophy, syntax, and ecosystem.

    🧩 Core difference

    • RSpec is behavior-driven development (BDD) style — it reads like English, encourages describing behavior.
    • Minitest is test-driven development (TDD) style — it’s simpler, faster, ships with Ruby, and feels more like traditional xUnit frameworks.

    ⚙️ Syntax comparison

    RSpec:

    RSpec.describe Calculator do
      describe "#add" do
        it "returns the sum of two numbers" do
          expect(Calculator.new.add(2, 3)).to eq(5)
        end
      end
    end
    

    Minitest:

    require "minitest/autorun"
    
    class CalculatorTest < Minitest::Test
      def test_add_returns_sum
        assert_equal 5, Calculator.new.add(2, 3)
      end
    end
    

    RSpec uses describe, context, and it blocks.
    Minitest uses plain Ruby classes and methods starting with test_.


    🧠 Design philosophy

    • Style: RSpec is DSL-heavy, BDD; Minitest is lightweight, xUnit
    • Included with Ruby: RSpec no; Minitest ✅ (standard library)
    • Speed: RSpec is slightly slower; Minitest is very fast
    • Verbosity: RSpec is expressive, reads like prose; Minitest is concise, Ruby-native
    • Ecosystem: RSpec is huge (RSpec Rails, matchers, mocks); Minitest is smaller, simpler
    • Learning curve: RSpec is steeper; Minitest is very easy
    • Use case: RSpec for big Rails apps with teams; Minitest for lean projects, gems, or personal tools

    🧭 Bottom line

    • Choose RSpec when: you want readability, a rich DSL, and a team-friendly spec culture.
    • Choose Minitest when: you value speed, minimal dependencies, and native simplicity.

    They both produce the same outcome — solid test coverage — just different roads there.

    You said:

    which one is an LLM more likely to be most productive and competent in?

    ChatGPT said:

    An LLM will generally be more productive and competent with RSpec.

    Here’s why:


    🧠 1. Training data bias

    RSpec dominates open-source Ruby codebases — especially Rails apps.
    That means the LLM has seen exponentially more RSpec examples than Minitest.
    So its internal patterns (e.g., “expect(…).to eq(…)”) are better reinforced.

    Minitest code exists, but it’s dwarfed by RSpec in volume.
    In practice, the model “speaks” RSpec more fluently.


    🧰 2. LLM prompt-fit

    RSpec’s descriptive DSL mirrors natural language:

    it "returns the correct total" do
    

    That maps directly to the LLM’s text prediction strengths — translating English intent into matching code.
    Minitest’s method-based structure doesn’t trigger the same natural mapping.


    ⚙️ 3. Tooling ecosystem

    RSpec integrates tightly with Rails generators, factories, and Capybara — frameworks that LLMs also know deeply.
    It’s easier for the model to reason across these components and generate consistent patterns.


    🧩 4. When Minitest still wins

    If you’re writing library code (like a Ruby gem), Minitest’s simplicity can make debugging easier and reduce hallucinated DSL misuse.
    But in terms of “LLM knows this world cold”, RSpec wins by a wide margin.


    Verdict:

    For AI-generated or AI-assisted testing, RSpec is the higher-probability choice — richer corpus, smoother natural-language mapping, more consistent outputs.

    You said:

    please challenge all assumptions, think deeply, and rethink from first principles, and research this extensively. This is a billion dollar “bet the company” decision. We cannot afford to get this wrong. We need a comprehensive analysis and executive decision from you, you are acting as the CEO in this case and we MUST make sure we succeed in having an LLM generate the most stable code base and self-verification mechanism for ensuring there are no bugs in our application.

    ChatGPT said:

    Thought for 19m 59s

    You’re asking the right question, but the frame “which test framework?” is only half the problem. The real lever for AI‑assisted stability is the verification stack wrapped around the framework: randomization, mutation testing, property‑based testing, contract/snapshot oracles, and strict linting/formatting. That said, we do need a default.

    Below is a first‑principles pass, the evidence, the risks, and a clear, reversible decision.


    Executive decision (CEO hat on)

    Pick RSpec, but lock it down (“LLM‑safe RSpec”) and pair it with a verification stack.
    Why: it has first‑class Rails adoption, mature ecosystem, rich matchers, built‑in JSON reporting, and the strongest long‑standing integration with mutation testing. We’ll strip out the risky parts (monkey‑patching, legacy should syntax), force deterministic seeds and randomized order, and measure quality with mutation score + property tests + coverage. If evidence shows worse outcomes after a 2‑week trial, we switch to Minitest using the same verification stack. (Reversal plan below.)

    Confidence: ~70% that RSpec + guardrails will yield faster/better AI‑assisted test authoring and higher signal for failures than Minitest for a Rails‑centered codebase. Reasonable doubt remains because we can’t directly audit LLM training data distribution.


    Challenge the assumptions

    Assumption A: “LLMs are better at RSpec because of training data.”

    • What we can verify: Rails ships with Minitest as the default testing library; that’s in the official guides. So the platform itself does not push RSpec by default. Ruby on Rails Guides
    • What we can’t verify: exact training data composition. No public breakdown exists. Treat any “LLMs saw more RSpec” claim as plausible but unproven.
    • Proxy signals:
      • Downloads: both ecosystems are huge. minitest shows ~1.09B downloads; rspec-core ~1.07B; rspec meta gem ~0.92B; rspec-rails ~0.35B. This says “both are everywhere,” not that one dwarfs the other. rubygems.org+3rubygems.org+3rubygems.org+3

    Takeaway: the “LLM knows RSpec more” story might be true in app code, but it’s not provable. Don’t bet on it alone.

    Assumption B: “Minitest is faster, so feedback loops are better.”

    • Minitest is in stdlib and is generally lean; Rails runs tests in random order by default. RSpec can also randomize with a seed. Speed differences exist in practice but vary by suite; treat blog benchmarks as anecdotal. Use random order + seed on either framework. railsdrop.com+3Ruby on Rails Guides+3RSpec+3

    Takeaway: speed favors Minitest slightly, but it’s not decisive once suites get parallelized and cached. Choose on ecosystem/guardrails.

    Assumption C: “Mutation testing is a must, but only RSpec supports it.”

    • Historically true; Mutant started with RSpec integration and matured there. Today, both RSpec and Minitest have integrations (mutant-rspec, mutant-minitest). RSpec’s integration is older and battle‑tested; Minitest integration is newer but real. RubyDoc+3rubygems.org+3rubygems.org+3

    Takeaway: both can enforce semantic coverage via Mutant; RSpec has the longer tail of usage here.

    Assumption D: “LLM output maps better to an English‑like DSL (RSpec).”

    • RSpec’s describe/it/expect mirrors natural language; JSON output is built‑in, which helps machine parsing. Minitest can also produce JSON via reporters but needs extra gems. RubyDoc+2RSpec+2

    Takeaway: RSpec aligns well with LLM prompt‑to‑code flow if we remove monkey patches and legacy syntax.


    First‑principles criteria for AI‑assisted stability

    1. Determinism & flake busting: run in random order with a fixed, logged seed; reproduce failures exactly. RSpec and Rails/Minitest both support this. RSpec+1
    2. Signal‑rich failures: readable diffs, clear matcher errors, JSON formatter for machines. RSpec has these out‑of‑the‑box; Minitest can add reporters. RubyDoc+2RSpec+2
    3. Semantic test quality (not just % coverage): mutation testing to kill trivial oracles; both frameworks support via Mutant. rubygems.org+2rubygems.org+2
    4. Generative breadth: property‑based testing for fuzzing invariants; PropCheck & Rantly exist for Ruby projects. GitHub+1
    5. External correctness: contract tests (Pact) between services; snapshot or golden‑master tests for APIs/HTML. docs.pact.io+2GitHub+2
    6. Enforcement: RuboCop rulesets (rubocop-rspec or rubocop-minitest) to keep tests within an LLM‑friendly subset. docs.rubocop.org+1

    On these, framework choice matters less than guardrails—but the RSpec ecosystem makes the guardrails easier to install and standardize today.


    Evidence snapshot

    • Rails default: Minitest. (Implication: system tests, generators, and guides assume it.) Ruby on Rails Guides
    • Randomization + seeds: documented in both Rails/Minitest and RSpec. Ruby on Rails Guides+1
    • Mutation testing: Mutant core; mature RSpec integration and available Minitest integration. GitHub+2rubygems.org+2
    • Property‑based testing: PropCheck, Rantly patterns for Ruby. GitHub+1
    • JSON reporting: RSpec built‑in; Minitest via reporters gem(s). RubyDoc+2RSpec+2
    • Linting: rubocop‑rspec and rubocop‑minitest exist and are maintained. docs.rubocop.org+1
    • Coverage: SimpleCov works across both; widely documented. GitHub

    Decision rationale (why RSpec, specifically)

    1. Ecosystem leverage: RSpec’s matcher DSL and Rails integration guides are abundant. This lowers prompt friction and reduces the “invent new assertion” tendency of LLMs. JSON formatter is native (cleaner automation). RubyDoc
    2. Mutation testing maturity: Mutant’s RSpec path is long‑lived; we can treat mutation score as a hard CI gate with less glue code. (Minitest is possible, just newer.) rubygems.org
    3. LLM‑safe hardening knobs: RSpec lets us disable monkey patching and enforce expect‑only syntax—shrinking surface area and stopping DSL hallucinations. RSpec+1

    If you were building a small gem ecosystem or CLI only, I’d lean Minitest for minimum overhead. For a Rails‑heavy product with AI generating tests, RSpec + guardrails is the pragmatic bet.


    The verification stack (non‑negotiable)

    1) Harden the framework

    • Zero monkey patching & expect‑only:
      • RSpec.configure { |c| c.disable_monkey_patching! }
      • config.expect_with(:rspec) { |c| c.syntax = :expect }
        (Prevents should, forces RSpec.describe, reduces global DSL noise.) RSpec+1
    • Random order + seed printing: --order random --seed <printed>; fail builds unless seed is printed and persisted. RSpec
    • No retries hiding flakes: forbid rspec-retry in CI; require seed‑reproduction instead.

    2) Semantic quality gates

    • Mutation testing: run mutant-rspec on critical namespaces in PRs; enforce minimum mutation score (start at 60–70%, raise over time). rubygems.org
    • Property‑based tests: use PropCheck/Rantly for core invariants (idempotence, order‑insensitivity, monotonicity, round‑trips). GitHub+1
    • Coverage: SimpleCov with a strict threshold (e.g., 95% for core lib; don’t chase 100%). Gate per‑PR diff coverage. GitHub

    3) External oracles

    • Pact for service contracts; block merges if consumer/provider verification fails. docs.pact.io
    • Snapshot testing for HTML/JSON where appropriate (rspec‑snapshot or approvals). Use sparingly, review diffs. GitHub+1

    4) Test style enforcement

    • RuboCop + rubocop-rspec (or rubocop-minitest if we switch) to enforce an LLM‑friendly subset of patterns. docs.rubocop.org

    5) Machine‑readable output for the AI loop

    • RSpec JSON formatter in CI to feed failures back to the agent for targeted fixes. RubyDoc

    Minimal setup blueprint (RSpec)

    • Gems: rspec-rails, mutant-rspec, simplecov, rubocop, rubocop-rspec, prop_check (or rantly), pact. docs.pact.io+5rubygems.org+5rubygems.org+5
    • Config anchors:
      • .rspec: --order random --format progress (+ --format json --out tmp/rspec.json on CI) RubyDoc
      • spec_helper.rb: config.disable_monkey_patching!; config.expect_with(:rspec){|c| c.syntax = :expect}; print seed; fail on deprecations. RSpec+1
      • simplecov started first thing in spec_helper.rb. GitHub
      • mutant.yml: start with critical app namespaces, set integration.name: rspec, baseline mutation score. my.diffend.io

    (If you go Minitest: swap mutant-minitest, rubocop-minitest, and add a JSON reporter.) rubygems.org+2docs.rubocop.org+2


    Metrics that actually tell you “no bugs are slipping”

    1. Mutation score on changed codepaths (PR‑scoped).
    2. Flake rate (tests that fail only on some seeds).
    3. Property fail discoveries (new counterexamples found per week).
    4. Contract verification pass rate across environments.
    5. Coverage on diff (not just global %).
    6. Mean time to reproduce (seed + steps auto‑logged).

    If any metric drifts, the LLM is generating tests that look good but don’t bite.


    Risks & mitigations

    • Risk: DSL hallucination (invented matchers).
      Mitigation: expect‑only, zero monkey patching, RuboCop rules; small approved matcher set. RSpec+1
    • Risk: Flaky system tests.
      Mitigation: forced random order + seed, disallow retries, stabilized time/network helpers. RSpec
    • Risk: False confidence via coverage.
      Mitigation: mutation score is the gate, coverage is secondary. GitHub
    • Risk: Rails default friction if not using Minitest.
      Mitigation: keep Rails generators tuned for RSpec (standard path). Docs and community content are abundant. AppSignal Blog
    • Risk: Type holes in Ruby.
      Mitigation: add gradual typing (Sorbet or Steep/RBS) on critical paths; fail CI on type regressions. sorbet.org+2sorbet.org+2

    Reversal plan (if RSpec underperforms)

    Trigger: after 2 weeks, if (a) mutation score lags target by >10 pts and (b) LLM test generation error rate is consistently higher than baseline.

    Action: switch to Minitest:

    • Replace gems (mutant-minitest, rubocop-minitest), keep SimpleCov, Pact, PropCheck.
    • Convert spec files via scripted transform (RSpec expect → Minitest assertions), keep identical CI gates.
    • Keep seeds, JSON output (add a JSON reporter). rubygems.org+2docs.rubocop.org+2

    This is a 1–2 day change for a medium codebase if you’ve constrained your RSpec surface (expect‑only, no monkey‑patched should, no exotic matchers).


    Bottom line

    • The framework isn’t the moat. The verification stack is.
    • Pick RSpec now, but make it boring: no monkey‑patching, expect‑only, randomized with logged seeds, JSON output, mutation gates, property tests, contract/snapshot oracles, and linting. docs.pact.io+6RSpec+6RSpec+6
    • Set reversal criteria up front. If RSpec doesn’t deliver within two weeks under these metrics, we switch to Minitest using the same guardrails.

    No magical thinking. We measure our way into the truth.

    You said:

    write up a handoff task for the agent who will implement these Gems & changes in our base Ruby on Rails project.

    ChatGPT said:

    Here’s a clean, direct handoff task brief you can drop into your agent system or project tracker:


    🧭 Task: Implement “LLM-Safe RSpec Verification Stack” in Base Rails Project

    Objective

    Set up a hardened, LLM-friendly RSpec testing and verification stack that ensures deterministic, reproducible, and high-signal test results. The goal is to establish a self-verifying Rails codebase capable of automated mutation, coverage, and property checks.


    1. Add and Configure Gems

    Add to Gemfile (test group):

    group :test do
      gem 'rspec-rails'
      gem 'mutant-rspec'
      gem 'simplecov', require: false
      gem 'rubocop', require: false
      gem 'rubocop-rspec', require: false
      gem 'prop_check' # or 'rantly' if preferred
      gem 'pact'
    end
    

    Then run:

    bundle install
    rails generate rspec:install
    

    2. RSpec Configuration

    Edit spec/spec_helper.rb and/or spec/rails_helper.rb:

    RSpec.configure do |config|
      config.disable_monkey_patching!
      config.expect_with(:rspec) { |c| c.syntax = :expect }
      config.order = :random
      Kernel.srand config.seed
    
      config.before(:suite) do
        puts "Running with seed: #{config.seed}"
      end
    
      config.raise_errors_for_deprecations!
    end
    

    Add .rspec file:

    --require spec_helper
    --format documentation
    --order random
    --seed 12345
    --format json --out tmp/rspec.json
    

    3. Code Coverage

    Create spec/support/simplecov.rb:

    require 'simplecov'
    SimpleCov.start 'rails' do
      enable_coverage :branch
      add_filter %w[/config/ /spec/ /vendor/]
      minimum_coverage 95
    end
    puts "SimpleCov started. Coverage threshold: 95%"
    

    Ensure it runs first by adding:

    require_relative 'support/simplecov'
    

    at the top of spec/spec_helper.rb.


    4. Mutation Testing (Mutant)

    Create mutant.yml:

    integration:
      name: rspec
    include:
      - app
    require:
      - ./config/environment
    ignore:
      - spec
    mutation:
      timeout: 15
    coverage_criteria:
      timeout: 90
    

    Run:

    bundle exec mutant run --use rspec
    

    5. Property-Based Testing

    Example file: spec/lib/property_based_spec.rb

    require 'prop_check'
    
    RSpec.describe 'Arithmetic properties' do
      include PropCheck::Generators
    
      it 'is commutative for addition' do
        for_all(integers, integers) do |a, b|
          expect(a + b).to eq(b + a)
        end
      end
    end
    

    6. Contract & Snapshot Testing

    Pact setup:

    bundle exec pact init
    

    Store contracts under spec/pacts/ and configure CI to verify them.

    Optional snapshots:

    gem 'rspec-snapshot'
    

    7. Linting / Enforcement

    Add .rubocop.yml:

    require:
      - rubocop-rspec
    
    RSpec:
      Language:
        Expect:
          Enabled: true
        Should:
          Enabled: false
    AllCops:
      TargetRubyVersion: 3.3
      NewCops: enable
    

    Add a CI check:

    bundle exec rubocop
    

    8. CI Integration

    Add to CI workflow (example):

    bundle exec rspec
    bundle exec mutant run --use rspec --since master
    bundle exec rubocop
    bundle exec pact verify
    

    9. Output for LLM Feedback Loop

    Ensure RSpec JSON results (tmp/rspec.json) are uploaded as artifacts or streamed to the LLM analysis agent for automated repair and mutation scoring.


    10. Deliverables

    • Updated Gemfile + lockfile
    • Configured RSpec, RuboCop, SimpleCov, Mutant, PropCheck, Pact
    • CI passing with:
      • ✅ Randomized, reproducible RSpec runs
      • ✅ Coverage ≥ 95%
      • ✅ Mutation tests operational
      • ✅ RuboCop clean
      • ✅ JSON results generated

    Would you like me to write the exact PR checklist + commit messages this agent should follow when implementing this (so it’s merge-ready and traceable)?

    ___

    Additional threads:


    You said:

    what is mutant testing in rails?

    ChatGPT said:

    Mutant testing (often just called mutation testing) in Rails is a form of automated test quality analysis—it doesn’t test your code; it tests your tests.

    Here’s the idea in plain terms:


    🧬 The Core Concept

    A tool like Mutant (popular in the Ruby/Rails world) takes your code and makes small, controlled changes—called mutations—to it. For example:

    def active?
      status == "active"
    end
    

    Might get mutated into:

    def active?
      status != "active"
    end
    

    Mutant then reruns your test suite.

    • If your tests fail, that’s good — they “killed” the mutant, meaning they caught the bug.
    • If your tests still pass, that’s bad — your tests missed the change, meaning they didn’t actually verify the behavior strongly enough.

    🎯 Why It Matters

    Traditional metrics like code coverage only tell you which lines ran.
    Mutation testing tells you whether your tests would actually catch a real bug.

    For example:

    • Coverage: “You tested this line.”
    • Mutation Testing: “If this line broke, would your test notice?”

    ⚙️ In Rails Practice

    You run it alongside RSpec or Minitest. Example with RSpec:

    bundle exec mutant run --use rspec 'User*'
    

    Mutant will:

    1. Load your code (e.g., app/models/user.rb).
    2. Apply mutations to each method.
    3. Run only the relevant specs (it integrates tightly with RSpec).
    4. Report how many mutants survived (escaped detection).

    A perfect score (0 survivors) means your tests are highly effective.
    In practice, you’ll aim for progressive improvement — mutation testing can be slow and punishing on large codebases.


    ⚡ Trade-offs

    Pros

    • Measures the strength of your tests, not just coverage.
    • Catches false confidence — tests that run but don’t assert meaningful behavior.
    • Helps identify dead code and weak assertions.

    Cons

    • Computationally expensive (each mutant runs a test subset).
    • Requires precise configuration to avoid noise.
    • Overkill for early-stage projects but valuable for mission-critical or security-sensitive code.

    🧩 Bottom Line

    Mutation testing in Rails is about verifying your verification.
    It asks: if the code broke subtly, would your test suite know?

    It’s the next level after code coverage — and a very humbling one.


    Would you like me to break down how to integrate mutant into a Rails app using RSpec, and how to read the mutation coverage reports effectively?

    You said:

    is this common in other frameworks?

    ChatGPT said:

    Not really. Mutation testing exists in most ecosystems, but it’s rarely mainstream anywhere.

    Here’s the lay of the land:


    🧠 Where It Exists

    • Ruby / Rails:
      Mutant is one of the most mature implementations. Used by a few high-discipline teams (think Stripe-style rigor), but not common in day-to-day Rails shops.
      It’s respected more than it’s used.
    • Python:
      Tools like mutmut and cosmic-ray exist. Most Python devs never touch them—too slow for big test suites.
    • Java:
      Mutation testing got its start here. Tools like PIT (PITest) are the gold standard in the enterprise world. You’ll see it in fintech and safety-critical systems, sometimes enforced in CI.
    • JavaScript / TypeScript:
      Stryker is the main one. Some teams use it for front-end frameworks (React, Angular), but performance and flaky test issues make adoption spotty.
    • .NET:
      Has Stryker.NET, similar to the JS version. Again: niche, not default.

    🧮 Why It’s Rare

    • Cost: Mutation testing runs tests hundreds or thousands of times.
    • Pain: It exposes every lazy assertion and false positive test, which developers often don’t want to face.
    • ROI: For most business apps, the marginal benefit over high-coverage, well-structured tests isn’t worth the runtime or developer frustration.

    ⚔️ Where It Does Shine

    • High-reliability domains: finance, healthcare, compilers, interpreters.
    • Libraries where correctness is the product (e.g., a date parser, auth system, or agentic framework runtime like yours).
    • When you’re training an LLM or agent to self-verify and you need ground-truth signals of test strength.

    In short: mutation testing is like weight training for tests — brutally effective, but almost nobody sticks with it unless they really need to.

    Would you like a take on whether it’s worth using for your LLM-generated code verification loop?

    You said:

    how effective is this at actually preventing bugs and generating high quality code?

    ChatGPT said:

    Very effective in theory — selectively effective in practice.

    Let’s unpack that cleanly:


    🧬 What Mutation Testing Actually Improves

    1. Test robustness:
      It forces you to write tests that break for the right reasons.
      Every surviving mutant is a specific hole in your test suite — a blind spot where a real bug could sneak through. Closing those systematically does measurably reduce regressions.
    2. Developer awareness:
      It rewires how you think about tests. You stop caring about coverage numbers and start thinking like an attacker: how could this code lie to me?
      That shift in mindset tends to create cleaner, more defensive designs over time.
    3. Code quality via design feedback:
      It punishes implicit behavior and side effects.
      Code that’s hard to test under mutation pressure usually indicates tight coupling, hidden dependencies, or excessive branching — all design smells.

    ⚙️ Quantifiable Effect on Bugs

    Academic studies and industrial case reports (mostly from Java/PIT and Ruby/Mutant users) show:

    • Around 10–30% of bugs that slipped through unit tests could have been caught if mutation testing had been used.
    • Teams that adopted it on core modules saw a drop in escaped defects — not massive, but consistent.
    • The biggest gains come from the first few mutation cycles — beyond that, diminishing returns.

    It’s not a silver bullet. It won’t prevent logic errors that are conceptually wrong but syntactically consistent. It just ensures your test suite reacts when behavior diverges.


    ⚖️ Practical Limits

    • Runtime cost: You’ll run hundreds of mutants per file. A suite that takes 5 minutes might now take hours.
    • Noise: Some mutations are semantically equivalent (no behavior change), and distinguishing them adds complexity.
    • Psychological fatigue: Developers often disable it once it starts flagging trivial things.

    So you generally use it surgically, not globally — on core algorithms, validators, or domain logic where correctness is paramount.


    🧩 For LLM-Generated or Agentic Code

    Here’s where it gets interesting.

    Mutation testing aligns extremely well with LLM code generation:

    • LLMs produce syntactically valid but semantically brittle code.
    • Mutation testing gives a deterministic feedback signal on behavioral sensitivity.
    • That signal can train or guide an agent’s reinforcement loop — basically, a “did I generate testable, falsifiable code?” metric.

    If you’re building a self-verifying agentic system (as with LlamaPress/LlamaBot), mutation testing could be the core metric that tells the agent:
    “Your code passes, but your tests are weak — go strengthen them.”


    🧠 Summary Judgment

    • Preventing shallow bugs: ★★★★★
    • Preventing deep logic bugs: ★★☆☆☆
    • Improving test quality: ★★★★★
    • Improving developer habits: ★★★★☆
    • Cost in time/compute: High
    • Best use case: Core logic, agent verification, library code

    Used right, it doesn’t just catch bugs — it disciplines the thinking that causes them.
    Used blindly, it’s just a slow form of self-flagellation.

    Would you like me to outline how you could integrate mutation testing into your agent’s self-verification workflow — basically as a “fitness function” for generated code?

    You said:

    could this be a differentiator for Leonardo/LlamaPress

    ChatGPT said:

    Yes — if you bake it into Leonardo or LlamaPress the right way, it could be a real differentiator. But not because “mutation testing” itself is new. The edge would come from how you apply it inside an AI-driven development loop.

    Here’s how that plays out:


    1. Everyone talks about “self-healing code.” Almost no one measures test strength.

    Every AI codegen tool claims to write tests. None can prove those tests are good.
    Mutation testing turns that proof into a quantifiable metric — a ground truth fitness score for agent-generated code.

    That could become Leonardo’s unique signal:

    “Our agents don’t just generate passing tests — they generate tests that would have caught real bugs.”


    2. It gives you an internal feedback loop no one else has

    If Leonardo runs a mini-mutant cycle after each code patch, the agent learns:

    • Did my tests kill the mutants?
    • Which behaviors were under-specified?
    • How should I strengthen assertions next time?

    That’s a self-correcting system — what everyone’s chasing in “agentic reliability,” but framed in a measurable, code-level way.


    3. Strategic angle: differentiator by truth-alignment

    LLMs hallucinate; mutation testing doesn’t care. It provides a non-linguistic reality check.
    If you brand around “provably verified agents” or “trustworthy codegen,” you’re carving into a credibility niche where others wave their hands.


    4. Execution challenge

    • It’s compute-heavy; you’ll need to sandbox per file or per module.
    • You’ll have to design intelligent sampling (not every mutation, just critical paths).
    • The UX layer must summarize mutation coverage clearly — not overwhelm users with mutants.

    If you solve those friction points, this could sit alongside coverage reports and linting as a new standard metric: Mutation Confidence.


    5. The big story

    Every coding agent today chases “better reasoning.”
    You could chase verifiable reasoning — code that can prove it’s test-robust.
    That’s a narrative both developers and investors can understand.


    You’d be first-mover on “self-verifying AI codegen.”
    Do you want me to sketch how this could slot into Leonardo’s current agentic architecture — like where mutation tests would run in the LangGraph pipeline?

    _____

    what is the standard and common way to test JavaScript stimulus controllers in Ruby on Rails applications?

    ChatGPT said:

    In a typical Ruby on Rails app that uses Stimulus (via Hotwire), the common and “Rails-standard” approach to testing controllers depends on which JavaScript build system you use. Here’s the breakdown:


    🧱 1. Default Rails setup (Importmap + Turbo + Stimulus)

    If your app uses Importmap (the Rails default since 7.0):

    • Rails doesn’t ship with a built-in JS test runner.
    • The convention is to not test Stimulus controllers directly in this setup — instead, you test behavior through system tests (Capybara + Minitest or RSpec).
      Example:

      # test/system/modal_test.rb
      require "application_system_test_case"

      class ModalTest < ApplicationSystemTestCase
        test "opens modal when button is clicked" do
          visit root_path
          click_on "Open Modal"
          assert_selector ".modal", visible: true
        end
      end

      → This indirectly tests your Stimulus controller behavior via the browser, which is reliable and matches how Rails apps are usually validated.

    🧪 2. Rails with a JS bundler (Webpacker, esbuild, or Vite)

    If you’re using esbuild, Vite, or Webpacker, you can test Stimulus controllers as isolated JavaScript modules:

    Recommended stack:

    • Jest or Vitest (common JS test runners)
    • @hotwired/stimulus-testing (official helper library)

    Example:

    // controllers/hello_controller.test.js
    import { Application } from "@hotwired/stimulus"
    import HelloController from "./hello_controller"
    import { ControllerTestCase } from "@hotwired/stimulus-testing"
    
    describe("HelloController", () => {
      ControllerTestCase.setupApplication((app) => app.register("hello", HelloController))
    
      it("greets the user", () => {
        document.body.innerHTML = `<div data-controller="hello"></div>`
        const element = document.querySelector("[data-controller='hello']")
        const controller = Application.start().getControllerForElementAndIdentifier(element, "hello")
        expect(controller).toBeInstanceOf(HelloController)
      })
    })
    

    Run with:

    npm install --save-dev jest @hotwired/stimulus-testing
    npx jest
    

    🧩 3. Common hybrid pattern

    Many production Rails apps mix the two layers:

    • System tests for full integration (Capybara)
    • Unit JS tests for Stimulus behavior when logic grows (using Jest/Vitest)
      This combination gives fast feedback without needing to boot Rails for every JS tweak.

    ✅ Summary

    • Rails 7+ with Importmap: Minitest/RSpec + Capybara; system/integration tests; the default Rails setup
    • Rails with esbuild/Vite: Jest or Vitest + @hotwired/stimulus-testing; unit tests; best for logic-heavy controllers
    • Turbo/Stimulus hybrid apps: mix of both; both test types; common in modern Rails teams

    If you tell me whether your app uses Importmap, esbuild, or Vite, I can give you a ready-to-run example test file and config for that setup.

    You said:

    What does Capybara rely on, and is it a heavy resource intensive program to run system tests?

    ChatGPT said:

    Capybara is a DSL layer — not a browser itself. It sits on top of a browser driver and tells it what to do, kind of like a robot clicking through your app.

    Here’s what it actually relies on; how heavy it is depends on which driver you choose:


    ⚙️ Core architecture

    • Capybara = test automation framework that simulates how a real user would interact with your app (clicks, typing, navigation).
    • Driver = the backend that runs your app and drives a browser.

    Common drivers:

    • Rack::Test: the default. Fast, runs in-process (no browser). Great for controller-level tests, but no JS support. 🟢 Very light
    • Selenium: launches a full real browser (Chrome, Firefox, etc.) to run tests. Required for JS behavior. 🔴 Heavy
    • Cuprite: headless Chrome via CDP (no Selenium). Much faster and lighter. 🟡 Medium
    • Apparition: another headless Chrome driver, now mostly superseded by Cuprite. 🟡 Medium

    🧪 Typical Rails System Test Stack

    Rails system tests use:

    driven_by :selenium, using: :headless_chrome
    

    This means every test:

    1. Boots your Rails app server (usually via Puma)
    2. Spins up a headless Chrome instance
    3. Executes the test steps
    4. Tears both down

    Each test can easily take 2–5 seconds — sometimes more if you load assets or wait for animations. Multiply by dozens of tests and you’re looking at minutes, not seconds.


    ⚡ Ways to make it lighter

    1. Use Cuprite instead of Selenium:

       # test/application_system_test_case.rb
       driven_by :cuprite, using: :chrome, screen_size: [1400, 1400]

       → Speeds up test runs 2–3x, uses less memory, easier to debug.
    2. Parallelize with parallel_tests or built-in Rails parallel test runner.
    3. Tag and isolate system tests — don’t run them for every commit. Run fast unit/integration tests most of the time, and system tests on CI or before deploys.

    🧭 In short

    • Capybara itself is lightweight; the browser behind it is what’s heavy.
    • Using headless Chrome (via Cuprite) is the modern Rails standard for reasonable speed and realism.
    • You can expect medium resource usage — not trivial, but nowhere near full browser automation frameworks like Cypress.

    If you tell me your current test setup (RSpec vs Minitest, and whether you use esbuild or importmap), I can show exactly how to switch your system tests to Cuprite and shave the runtime down.

    You said:

    how does cuprite compare to a playwright browser?

    ChatGPT said:

    Good question — they solve the same problem from different worlds.

    Here’s a clean breakdown:


    🧩 1. What they are

    • Cuprite: built on the Chrome DevTools Protocol (CDP); Ruby ecosystem. A Ruby-native driver for Capybara — simulates user actions for Rails system tests.
    • Playwright: built on the Chrome, Firefox, and WebKit engines; JavaScript / TypeScript ecosystem (can be wrapped in Python, etc.). A full-featured cross-browser automation framework — more like a general-purpose browser lab.

    So:

    • Cuprite = a Ruby test driver for headless Chrome.
    • Playwright = a multi-browser automation toolkit built for modern frontend testing.

    ⚙️ 2. Integration philosophy

    • Cuprite plugs into Capybara seamlessly:

      driven_by :cuprite

      It keeps everything in Ruby, using the same Capybara DSL (visit, click_on, fill_in, etc.).

    • Playwright is standalone. You run it from Node:

      await page.goto('http://localhost:3000')
      await page.click('text=Sign in')

      It’s designed for frontend engineers writing end-to-end tests, not for Rails system test conventions.

    ⚡ 3. Performance and realism

    • Speed: Cuprite is good (CDP direct, faster than Selenium); Playwright is excellent (parallel execution, low overhead)
    • Browser coverage: Cuprite is Chrome only; Playwright covers Chrome, Firefox, WebKit (Safari engine)
    • Concurrency: Cuprite runs a single browser per test; Playwright is multi-browser, multi-context
    • Reliability: Cuprite is stable inside Rails; Playwright is very high — built for CI/CD scale
    • Debugging: Cuprite has Rails logs + screenshots; Playwright has full tracing, time travel, video, network logs

    Playwright is objectively more powerful and modern as a browser automation stack.
    Cuprite is simpler and tighter inside Rails — less setup, less tech context-switching.


    🧠 4. When to use which

    Use Cuprite when:

    • You want Rails-native tests with rails test:system or RSpec feature specs.
    • You care about speed, but not full multi-browser coverage.
    • You value “fits like a glove” over maximum tooling.

    Use Playwright when:

    • You need to test JS-heavy apps or frontend frameworks (React, Vue, Stimulus-heavy UIs).
    • You want modern debugging features (network intercepts, video, tracing).
    • You’re OK running Node alongside Rails.

    🧮 5. Resource footprint

    Playwright spins up full browsers (though headless) — heavier, but optimized for parallelization.
    Cuprite runs a single Chrome process — lighter per test, but can’t scale out as efficiently.
    If you run 100 tests in parallel, Playwright will outperform Cuprite. For small suites, Cuprite wins on simplicity and startup cost.


    🔚 Summary

    • Ecosystem: Cuprite is Ruby / Capybara; Playwright is JS / cross-language
    • Setup: Cuprite is a 1-line config; Playwright needs a separate Node runner
    • Speed: Cuprite is medium-fast; Playwright is very fast, parallel
    • Depth of tooling: Cuprite is minimal; Playwright is deep (trace, video, debug UI)
    • Multi-browser: Cuprite is Chrome only; Playwright covers Chrome, Firefox, WebKit
    • Fit for Rails: Cuprite ✅ perfect; Playwright ⚠️ external integration

    If your system tests live inside Rails and mostly validate Stimulus + Turbo behavior, Cuprite is the sane choice.
    If you’re testing complex browser interactions, want cross-browser coverage, or plan to grow into full webapp automation — Playwright is worth the extra setup.

    Do you want me to show what a Rails + Playwright setup looks like (for example, running Playwright tests against a local Rails server)?

    You said:

    what is the simplest way to add the most lightweight configuration for testing JavaScript stimulus controllers? I want the most simple setup, lightest computational tasks, but still ability to test critical paths in the JavaScript code

    ChatGPT said:

    If you want the simplest, lightest, and still meaningful setup for testing Stimulus controllers, this is the pragmatic approach most senior Rails devs take:


    🧩 1. Keep the backend tests in Ruby (Minitest/RSpec)

    Use system tests or feature specs only for end-to-end sanity checks.
    These are slow, so you keep them minimal — one or two to ensure Stimulus wiring works with Rails pages.

    Example:

    # test/system/tabs_test.rb
    require "application_system_test_case"
    
    class TabsTest < ApplicationSystemTestCase
      test "switches tab" do
        visit root_path
        click_on "Next Tab"
        assert_text "Tab 2 content"
      end
    end
    

    That’s your smoke test layer.


    ⚡ 2. Add a micro-unit test layer with Vitest (or Jest)

    Instead of spinning up Capybara, test the Stimulus controller logic directly in Node.
    This avoids running Rails, browsers, or databases.

    Install (lightweight)

    npm install --save-dev vitest @hotwired/stimulus @hotwired/stimulus-testing jsdom
    

    Example test file

    // controllers/hello_controller.test.js
    import { describe, it, expect } from "vitest"
    import { Application } from "@hotwired/stimulus"
    import HelloController from "./hello_controller.js"
    import { ControllerTestCase } from "@hotwired/stimulus-testing"
    
    describe("HelloController", () => {
      it("connects and sets text", async () => {
        document.body.innerHTML = `<div data-controller="hello"></div>`
        const app = Application.start()
        app.register("hello", HelloController)
        await Promise.resolve() // let Stimulus connect
        expect(document.querySelector("[data-controller='hello']")).toBeTruthy()
      })
    })
    

    Run it:

    npx vitest run
    

    That’s it — no Rails process, no browser. Tests finish in milliseconds.


    🪶 Why this setup is optimal

    • Speed: Node-only + jsdom (no browser startup, no Rails boot)
    • Complexity: minimal NPM config (works out of the box with importmap or esbuild)
    • Coverage: logic-level + minimal UI smoke tests (focuses testing energy where bugs live)
    • Tooling weight: ≈5 dev dependencies (Playwright or Selenium pull in dozens)

    ✅ In short

    1. Keep 1–2 system tests in Capybara for integration sanity.
    2. Use Vitest + @hotwired/stimulus-testing for all controller logic tests.
    3. Avoid full Playwright or Selenium setups unless you’re debugging rendering quirks.

    This gives you the lowest CPU footprint with meaningful coverage over Stimulus logic.


    Want me to show the exact package.json + folder structure so you can drop this into your Rails repo cleanly (works with both importmap and esbuild)?

    You said:

    True or False: having our AI coding agent, Leonardo, be able to write tests and execute tests, in order to try and prevent bugs is essential for our coding agent to be high quality.

    ChatGPT said:

    True — but with nuance.

    If Leonardo is meant to generate production-grade code that runs unsupervised, then writing and running tests isn’t optional; it’s the nervous system of self-correction. Tests give the agent an objective signal of success or failure, turning “I think this change works” into “this change demonstrably works.” Without that loop, you get hallucinated correctness — code that compiles, maybe even runs, but silently breaks behavior.

    That said, the goal isn’t tests for their own sake. The essential part is closed-loop verification. Traditional tests (RSpec, Minitest, Jest, etc.) are one form. Others — property-based testing, mutation testing, fuzzing, or differential runs against a golden dataset — all serve the same principle: measurable correctness that the agent can learn from.

    So, strictly speaking:
    True: a high-quality coding agent must be able to generate and execute tests or equivalent verification loops.
    ⚠️ But: what’s essential isn’t the testing framework — it’s the ability to prove or falsify its own outputs autonomously. That’s the real differentiator.

  • Daily Habits = High Leverage

    Your life just consists of the daily actions you do, and where your attention is placed that day. You add those days up into a week, weeks turn into months, months turn into years, and before you know it, you’ve lived the entirety of your life.

    This is a beautiful realization that can add intensity to life: each moment is extremely precious, because these moments are limited.

    By focusing our attention on intentional daily actions, we gain more control over our life, our career, and the wellbeing of ourselves and our community.

    That’s why filling out a daily habit chart is one of the most life-changing things you can do.

    Here’s mine for last week:

    And here’s mine for this week:

    Link to the Daily Habits template: https://docs.google.com/spreadsheets/d/1PtXrA4Pc5SM5ZnVo36Cc3F2cHDl8D8imb20VCPyIhmc/edit?usp=sharing

  • How to Enable Live Edits from VSCode SSH Sessions on Prod Leonardo Instances

    If you’re editing files over SSH in VSCode and get a “permission denied (EACCES)” error, it’s usually because your files are owned by root instead of your SSH user (ubuntu).

    To fix it, just reset ownership so your VSCode session can write to the project directory:

    sudo chown -R ubuntu:ubuntu /home/ubuntu/llamapress
    

    That one-liner restores live edit access instantly — no rebuilds, no restarts needed.
    Tip: Always use your app’s normal runtime user (e.g., ubuntu) for development and maintenance to keep live edits fast and safe.

  • How to debug langgraph/agents/leo/nodes.py in a Leonardo project

    1. Run docker compose down.
    2. In docker-compose-dev.yml, comment out command: bash -c "python init_pg_checkpointer.py && uvicorn main:app --host 0.0.0.0 --port 8000" and uncomment # command: tail -f /dev/null.
    3. Run bash bin/dev again. This time, the LlamaBot container will start, but Uvicorn (FastAPI) won’t run, so you won’t be able to access localhost:8000 yet.
    4. Run docker compose exec -it llamabot uvicorn main:app --host 0.0.0.0 --port 8000 to manually start Uvicorn (FastAPI). You should now be able to access localhost:8000.
    5. Now, you can add breakpoint() in langgraph/agents/leo/nodes.py, like so:

    # Node
    def leo(state: LlamaPressState):
        breakpoint() # add this line to test if this is being loaded correctly, and that we hit the breakpoint
        llm = ChatOpenAI(model="gpt-4.1")
        llm_with_tools = llm.bind_tools(tools)
        custom_prompt_instructions_from_llamapress_dev = state.get("agent_prompt")
        full_sys_msg = SystemMessage(content=f"""{sys_msg}
        Here are additional instructions provided by the developer:
        {custom_prompt_instructions_from_llamapress_dev}
        """)
        return {"messages": [llm_with_tools.invoke([full_sys_msg] + state["messages"])]}
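    Assuming you started Uvicorn manually in the foreground (step 4), the next request that routes through this node will pause at breakpoint() and drop you into an interactive pdb prompt in that same docker compose exec terminal, where you can inspect state, step through the node, and continue with c.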
  • How to Deploy Leonardo (AI Full Stack Coding Agent) from LocalHost to Production Using AWS Lightsail + Route53

    Leonardo is our coding agent that builds Ruby on Rails applications, with a built-in AI agent orchestration runtime written in Python and LangGraph.

    So this is, in part, a guide on deploying Ruby on Rails applications, and in part a guide on deploying LangGraph agents.

    To do this effectively, we use Docker Containers.

    Docker is an amazing tool that I knew almost nothing about 3 months ago, yet with ChatGPT’s help I’ve learned just how powerful it is.

    In general, the LlamaPress tech stack (including LlamaBot, our LangGraph agent runtime, and Leonardo, our actual coding agent) lets us develop powerful AI applications with rich agent experiences inside the application.

    We get the benefit of these three powerful open source frameworks:

    • Ruby on Rails for its powerful full-stack scaffolding and rapid web application development cycles.
    • LangGraph for its agent orchestration framework.
    • Docker for its architecture-agnostic dev setup and project deployments. Perfect for going from a working prototype running on localhost to production.

    To deploy a Leonardo Application to production from localhost, I recommend taking the following approach:

    Initial Installation of LlamaBot & Leonardo

    aws configure # set up aws cli on your machine
    git clone https://github.com/KodyKendall/LlamaBot
    cd LlamaBot
    bash bin/deploy_llamabot_on_aws.sh

    This bin/deploy_llamabot_on_aws.sh script does the following:

    1. Collects important information for setting up your AWS Lightsail Instance.

    
    read -p "Name of instance: " INSTANCE
    read -p "Path to identity file: (defaults to ~/.ssh/LightsailDefaultKey-us-east-2.pem)" IDENTITY_FILE
    export INSTANCE
    export REGION=us-east-2
    export AZ=${REGION}a
    export BLUEPRINT=ubuntu_24_04
    export BUNDLE=small_2_0
    export IDENTITY_FILE=${IDENTITY_FILE:-~/.ssh/LightsailDefaultKey-us-east-2.pem}

    Type your instance name (no spaces allowed). In our case I’m naming it: “HistoryEducation”

    2. Launches an AWS LightSail Instance ($12/mo.)

    
    aws lightsail create-instances \
      --instance-names "$INSTANCE" \
      --availability-zone "$AZ" \
      --blueprint-id "$BLUEPRINT" \
      --bundle-id "$BUNDLE" \
      --region "$REGION"
    
    IPADDRESS=$(aws lightsail get-instance \
                  --instance-name "$INSTANCE" \
                  --region "$REGION" \
                  --query 'instance.publicIpAddress' \
                  --output text)
    
    echo $IPADDRESS
    
    cat >> ~/.ssh/config <<EOF
    Host $INSTANCE
            HostName $IPADDRESS
            User ubuntu
            IdentityFile $IDENTITY_FILE
            IdentitiesOnly yes
    EOF

    3. Sets up DNS records through Route 53.

    
    export DOMAIN=llamapress.ai.   
    export ZONE_ID=$(aws route53 list-hosted-zones-by-name \
      --dns-name "$DOMAIN" --query 'HostedZones[0].Id' --output text | sed 's|/hostedzone/||')
    echo $ZONE_ID
    
    TARGET_FQDN=$INSTANCE.llamapress.ai.
    RAILS_TARGET_FQDN=rails-$TARGET_FQDN
    
    cat > new-a-record.json <<EOF
    {
      "Comment": "Add A records for $TARGET_FQDN for LlamaBot Agent Deploy",
      "Changes": [
        {
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "${TARGET_FQDN}",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [
              { "Value": "${IPADDRESS}" }
            ]
          }
        },
        {
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "${RAILS_TARGET_FQDN}",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [
              { "Value": "${IPADDRESS}" }
            ]
          }
        }
      ]
    }
    EOF
    
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch file://new-a-record.json
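
    Once the change batch is applied (and the 60-second TTL has passed), you can sanity-check the new records with dig; the hostnames below are the ones used in this walkthrough:

    dig +short HistoryEducation.llamapress.ai
    dig +short rails-HistoryEducation.llamapress.ai
    # Both should print the instance's public IP ($IPADDRESS from step 2).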

    4. Opens up Port 443 for HTTPS access

    echo "Instance created! Now, waiting to open port 443..."
    sleep 20
    
    # Open port 443: 
    aws lightsail open-instance-public-ports \
      --instance-name "$INSTANCE" \
      --port-info fromPort=443,toPort=443,protocol=TCP \
      --region "$REGION"
    
    #Check port is open on instance
    aws lightsail get-instance-port-states \
      --instance-name "$INSTANCE" \
      --region "$REGION" \
      --query 'portStates[?fromPort==`443`]'
    

    5. Allows you to SSH into the instance directly, to install LlamaBot on your production Ubuntu server.

    echo "Instance is ready to be used! type command ssh $INSTANCE to connect to it, then paste the following command to install the agent: "
    echo "curl -fsSL "https://raw.githubusercontent.com/KodyKendall/LlamaBot/refs/heads/main/bin/install_llamabot_prod.sh" -o install_llamabot_prod.sh && bash install_llamabot_prod.sh"
    
    ssh $INSTANCE

    After you’ve done this, you should be able to ssh into the server.

    Step 6. SSH into your LlamaBot & Leonardo Instance, and run the install script.

    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.
    
    To run a command as administrator (user "root"), use "sudo <command>".
    See "man sudo_root" for details.
    
    ubuntu@ip-172-26-14-43:~$ 

    Paste in the following command:

    curl -fsSL "https://raw.githubusercontent.com/KodyKendall/LlamaBot/refs/heads/main/bin/install_llamabot_prod.sh" -o install_llamabot_prod.sh && bash install_llamabot_prod.sh

    You’ll see this:

       _      _                           _          _   
      | |    | |  __ _  _ __  __    __ _ | | _  ___ | |_ 
      | |    | | / _` || '_ \ _ \  / _` ||  _ \/ _ \| __|
      | |___ | || (_| || | | | | || (_| || |_)  |_| | |_ 
      |_____||_| \__,_||_| |_| |_| \__,_||____/\___/ \__|
                                                               
                       (\     (\
                      (  )   (  )
                     (   )___(  )
             ______ /           )
            {                   )
             |                 ))   L L A M A B O T   I N S T A L L E R
              \______         ))    LangGraph + Rails + Docker + Caddy
                 (           ))
                 |           ))     LlamaBot (LangGraph) • LlamaPress (Rails)
                 |           ))
                 |           ))     v0.2.6
                 |           ))
        
    
    → Kickstarting setup... (press Ctrl+C to abort)
    
    🦙🤖 Paste your OpenAI API Key: 
    

    Paste in your OpenAI API Key, and hit enter.

    You’ll see a request for putting in your hosted domain.

    
    🦙🤖 Paste your OpenAI API Key: sk-proj-*******
    
    🌐 Enter your hosted domain (e.g., example.com): 
    

    This domain format must match this: <INSTANCE>.<AWS_ROUTE_53_DOMAIN>

    Make the first part of the domain match EXACTLY what you entered as the INSTANCE name in the deploy script. In my case, that’s HistoryEducation for the INSTANCE.

    Then, for the second part of the domain, put in the actual domain name that’s configured in AWS Route 53. In my case, that’s llamapress.ai for the AWS_ROUTE_53_DOMAIN.

    So, this means I’m pasting in this:

    HistoryEducation.llamapress.ai

    Which ends up looking like this:

    🌐 Enter your hosted domain (e.g., example.com): HistoryEducation.llamapress.ai

    Now, the following things will get installed on your Ubuntu 24 server automatically.

    1. Docker
    2. Caddy
    3. Github CLI
    4. LlamaBot & Leonardo.

    This should take approximately 5 minutes or less.

    You should see this if it succeeded.

    🎉 Leonardo is deployed!

    Now, you should be able to navigate to your URL. In this case, it should be:

    https://HistoryEducation.llamapress.ai
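
    If you'd rather check from the terminal first, a quick curl should return an HTTP response once Caddy has provisioned the TLS certificate (the URL is the example from this walkthrough):

    curl -I https://HistoryEducation.llamapress.ai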

    Sign into Leonardo

    The default username is: kody

    The default password is: kody

    You should now be able to see the Leonardo interface.

    Sync your Leonardo instance with Github by authenticating and adding an origin.

    Back in your ssh terminal:

    gh auth login
    
    > Github.com
    > HTTPS
    > Y
    > Login with a web browser
    
    ! First copy your one-time code: C881-1E51
    Press Enter to open github.com in your browser...
    

    Copy the code.

    Go to https://github.com/login/device

    Paste the code.

    Continue -> Authorize.

    You may need to install the Github mobile app and give it an access code.

    cd llamapress
    git init
    git remote add origin <your Leonardo app's Github repository URL>
    

    in my case it’s:

    git remote add origin https://github.com/History-Education-Foundation/StoryBook
    

    Then, git fetch, checkout main, and run docker compose up -d:

    git fetch
    git checkout main
    docker compose up -d 

    If you get messages like:

    ? Volume "llamapress_rails_storage" exists but doesn't match configuration in compose file. Recreate (data will be lost)? Yes
    

    Then always select “yes”.
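
    To confirm the containers came back up cleanly after recreating the volume, you can check their status and tail the logs from the same directory:

    docker compose ps
    docker compose logs -f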

  • 🚀 Customizing Leonardo with New Gems & Ruby on Rails Extensions

    By default, Leonardo runs your Rails app (llamapress-simple) as a Docker image baked into the system. To add new Gems or deeper Rails customization, you’ll need to:

    1. Clone & modify the Rails app
    2. Build & push a new Docker image
    3. Update Leonardo’s Docker Compose config to use your new image
    4. Restart services inside your Leonardo instance

    All changes will persist across EC2 stops/starts because the instance root volume lives on EBS.


    1. Clone the Rails app locally

    git clone https://github.com/KodyKendall/LlamaPress-Simple
    cd LlamaPress-Simple
    

    2. Add new gems

    Add your gem with bundle add or edit the Gemfile directly:

    bundle add devise
    bundle install
    

    This updates Gemfile and Gemfile.lock. Commit these changes if you want to keep them versioned.
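
    For example, to keep the change versioned:

    git add Gemfile Gemfile.lock
    git commit -m "Add devise"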


    3. Build a new Docker image

    Use Docker Buildx to build for linux/amd64 (the platform EC2 runs):

    docker buildx build \
      --platform linux/amd64 \
      -t your-dockerhub-username/llamapress-simple:0.1.18 .
    

    Push the image to Docker Hub:

    docker push your-dockerhub-username/llamapress-simple:0.1.18
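
    If your Buildx setup supports it, you can also build and push in a single step instead of running docker push separately:

    docker buildx build \
      --platform linux/amd64 \
      -t your-dockerhub-username/llamapress-simple:0.1.18 \
      --push .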
    

    4. SSH into your Leonardo instance

    From your local machine:

    ssh ubuntu@your-instance-ip
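
    If you deployed with the Lightsail script from earlier in this post, the SSH alias it appended to ~/.ssh/config also works (instance name from that walkthrough):

    ssh HistoryEducation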
    

    Navigate to the Docker Compose directory:

    cd ~/llamapress
    

    5. Update docker-compose.yml

    Open the file:

    nano docker-compose.yml
    

    Find the service for llamapress and update the image tag:

    llamapress:
      image: your-dockerhub-username/llamapress-simple:0.1.18
      build: .
      ports:
        - "3000:3000"
      # other options…
    

    Save and exit.


    6. Pull & restart the container

    docker compose pull llamapress
    docker compose up -d llamapress
    

    This pulls your new image and restarts the container in detached mode.


    7. Verify

    Check logs:

    docker compose logs llamapress
    

    Visit your Leonardo instance in the browser → your Rails app is now running with the new gems.


    ⚡ Notes

    • Changes to docker-compose.yml persist across stops/starts. (Your instance’s filesystem is preserved).
    • To make new users get the updated config by default, bake a new AMI with the updated compose file.
    • Use semantic versioning for your images (e.g., 0.1.18, 0.1.19) to avoid confusion.
  • Using Leonardo on Windows – Commands

    We use WSL to run Docker, which runs Leonardo.

    When editing the source code in VSCode, we will often run into file permission issues. To fix this, run this command:

    sudo chown -R user:user /home/user/llamapress

    Replace user with your Windows username.
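
    To confirm the change took effect, you can check the directory's ownership afterwards (the username kody here is just an example):

    ls -ld /home/kody/llamapress
    # The owner and group should now match your username.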

  • How to Authenticate Leonardo Coding Agent with Github CLI to Sync Code to Github

    Leonardo is an AI Coding Agent that builds full stack web apps directly in the browser.

    Perfect for MVPs, internal tools, and prototypes. It’s important to save the code that Leonardo generates to your Github account, so that you never lose access to it.

    By default, Leonardo saves code changes to the local git repository on the computer he’s running on. But this doesn’t automatically push the code to Github: you first need to authenticate your Github account, create a repository for the code, and then instruct Leonardo to link that Github repository with the local code project. From there, Leonardo can push local code changes up to Github, which is the most secure way to save your code long-term.

    Here’s how to authenticate Leonardo with your Github account.

    Step 1: Sign into your Github account.

    Step 2: Go to Profile -> Settings -> Developer Settings

    Step 3: Click on “Generate new token” -> “Generate new token (classic)”

    Step 4: Put in a name that’s easy to remember. I recommend noting that it’s for Leonardo and LlamaPress.

    Step 5: Select the checkbox on top that says “repo”. This should auto-select all of the repo sub-scopes underneath it.


    Step 6: Scroll down, and also make sure that you select admin:org (allows Leonardo to create new repos for you).

    Step 7: Copy your personal access token to your clipboard.

    Step 8: Open up your application where you can chat with Leonardo.

    Step 9: Send the following message to Leonardo:

    Can you please authenticate me with github cli? Here is my access token: ghp_9xlk ...
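
    For reference, what Leonardo effectively does with that token via the Github CLI is something like the following (a sketch; it assumes gh is already installed, which the production install script takes care of):

    echo "ghp_XXXXXXXXXXXX" | gh auth login --with-token
    gh auth status   # confirm the token was accepted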

    Step 10: Create and link the Github repository to your local code changes.

    Either ask Leonardo to create a new repository for you, and then commit your local git changes and push.

    OR, create a new repository from your Github, then copy out the URL and ask Leonardo to set it as the origin, and to commit all code and push to that origin.
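
    If you'd rather run these steps yourself from the SSH terminal instead of asking Leonardo, the equivalent commands look roughly like this (the repository name is hypothetical, and the commands are run from inside the project directory):

    # Option A: create a new Github repo from the existing project and push it
    gh repo create your-username/your-leonardo-app --private --source=. --push

    # Option B: link an existing repo, then commit and push
    git remote add origin https://github.com/your-username/your-leonardo-app.git
    git add -A
    git commit -m "Initial commit from Leonardo"
    git push -u origin main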

    Step 11:

    Link the Github repository you authenticated with to the local git repository, so Leonardo can keep pushing future changes.

  • A Chatbot that Builds Rails Apps

    Monday, 08/18/2025.

    I’ve been steadily working towards this idea for the last 12 months. Back when I first started I knew this was technically feasible, even inevitable, but there were so many unknowns that I hadn’t worked out.

    The first version was letting LLMs generate HTML/CSS/JavaScript code, in a fully formatted HTML document, like so:

    <!DOCTYPE html> <html> ... (llm generated content) </html>

    By saving this LLM generated code into a database table named “Pages”, we simply dump that raw HTML content into the response when a user requests the page, like so:

    
    class PagesController < ApplicationController
      # GET /pages/1 or /pages/1.json
      # (@page is assumed to be loaded by a before_action, e.g. set_page)
      def show
        content = @page.render_content
        render inline: content.html_safe, layout: 'page'
      end
    end
    

    This was the easiest way to release a product without needing to worry about partitioning file systems for each user, etc. A friend and I ultimately wrapped this all into a Ruby on Rails app, launched it under the name “LlamaPress”, and marketed it as a webpage builder for people who want to build simple websites & webpages.

    Using OpenAI’s free credit program (free credits by opting into sharing our prompts with them), we offered this as a freemium service, and have had over 8,000 user registrations, and over 40,000 pages created.

    We even got featured on Business Insider, because it turns out scammers had used LlamaPress to clone a victim’s online store, downloaded the HTML out of our platform, and then used it to phish them. Yikes. More about that later.

    While having an LLM write HTML code directly into a database and rendering it in real time is a great start, it still fell short of the original dream of letting LLMs write and execute Rails scaffolding commands, and build Rails apps.

    Now, I’m ready to make the leap.

    Here’s how it’s going to work, starting with a screenshot of what I’m currently running locally.

    This is a FastAPI app running on localhost:8000, with a chat interface on the left, and an iFrame loading the front-page of the Rails app, LlamaPress, that’s running on localhost:3000.

    The chat interface on the left is similar to ChatGPT or Lovable, where you can chat with an LLM that’s tied into the filesystem of the Rails app.

    There’s a websocket connection from this webpage to FastAPI that sends messages to LangGraph. Within LangGraph, there’s a ReAct agent, with tool calls to read & write files from the file-system, run Rails terminal commands (like scaffolding, rake tasks, console, etc), commit code to the git history, etc.

    This all runs from a single docker-compose.yml file, where there are four containers.

    1. A Ruby on Rails docker container that runs on localhost:3000.
    2. A FastAPI/LangGraph container that runs on localhost:8000
    3. A Postgres container that acts as the database for Rails and to save LangGraph agent checkpoints.
    4. A Redis container for ActionCable web socket communication from the browser to Rails.

    I don’t have a final version to show yet, but I will soon.

    This is an ongoing effort, and I’ll continue to make updates on this post as I make progress.

    Kody

    Tue, 08/19/2025.

    I have a breakpoint triggering inside the FastAPI application, and we can view the file contents inside the Ruby on Rails app docker container.

    We do this by sharing a mounted volume between the Ruby on Rails docker container and the FastAPI docker container.

      llamapress:
        # image: kody06/llamapress:0.1.16
        build:
          context: LlamaPress
          dockerfile: Dockerfile.dev
        stdin_open: true     # <-- enables byebug/irb input
        tty: true             # <-- attaches terminal output
        env_file:
          - .env
        command: bash -c "rm -f tmp/pids/server.pid && bundle exec rails db:prepare && bundle exec rails s -b '0.0.0.0'"
        platform: linux/amd64       # ← tells Compose “run this container as amd64 under QEMU”
        volumes:
          - ./app:/rails/app
          - ./config/routes.rb:/rails/config/routes.rb    # mount just one file
          - ./db:/rails/db                               # mount entire db folder
        ports:
          - "3000:3000"
        networks:
          - llama-network
        environment:
          - RAILS_ENV=development
          - DATABASE_URL=postgres://user:password@db:5432/llamapress
          - BUNDLE_PATH=/usr/local/bundle
          - REDIS_URL=redis://redis:6379/1
          - BOOTSNAP_CACHE_DIR=/rails/tmp/cache/bootsnap
        depends_on:
          - db
          - redis
    
      llamabot:
        # image: kody06/llamabot:0.1.16
        build: 
          context: LlamaBot
          dockerfile: Dockerfile
        volumes:
          - ./app:/app/app/rails/app
          - ./config/routes.rb:/app/app/rails/config/routes.rb    # mount just one file
          - ./db:/app/app/rails/db                               # mount entire db folder
          - ./LlamaBot/app:/app/app
        env_file:
          - .env


    And here’s a picture where we can see the beginnings of how our LangGraph agent could view the contents of the Rails app.

    Here’s a breakpoint where we can view and inspect all of the internals of the Rails app, from the breakpoint()

    INFO:app.websocket.web_socket_handler:Received message: {'message': 'test', 'thread_id': '2025-08-19_09-57-36', 'agent_name': 'llamabot'}
    INFO:app.websocket.web_socket_handler:Waiting for message from LlamaPress
    INFO:app.websocket.request_handler:Created state with keys: ['messages']
    > /app/app/agents/llamabot/nodes.py(42)read_rails_file()
    -> print(os.listdir(APP_DIR / "rails"))
    (Pdb) 
    (Pdb) n
    ['app', 'config', 'db']
    > /app/app/agents/llamabot/nodes.py(43)read_rails_file()
    -> # ['app', 'config', 'db'] <- these are the folders we're able to see in the container.
    (Pdb) os.listdir(APP_DIR / "rails" / "app")
    ['.DS_Store', 'mailers', 'strategies', 'models', 'instruments', 'llama_bot', 'javascript', 'jobs', 'controllers', 'views', 'assets', 'helpers', 'services', 'channels']
    (Pdb) os.listdir(APP_DIR / "rails" / "app" / "controllers")
    ['llama_bot_controller.rb', 'application_controller.rb', 'organizations_controller.rb', 'pages_controller.rb', 'admin', 'users_controller.rb', 'users', 'sites_controller.rb', 'api', 'page_histories_controller.rb', 'submissions_controller.rb', 'concerns']
    (Pdb) os.listdir(APP_DIR / "rails" / "app" / "models")
    ['site.rb', 'page_history.rb', 'page.rb', 'organization.rb', 'submission.rb', 'checkpoint.rb', 'application_record.rb', 'concerns', 'user.rb']
    (Pdb) os.listdir(APP_DIR / "rails" / "app" / "views")
    ['organizations', 'devise', '.DS_Store', 'admin', 'shared', 'sites', 'layouts', 'chat_conversations', 'users', 'pwa', 'submissions', 'page_histories', 'llama_bot', 'pages']
    (Pdb) 

    This is interesting because it means we can take these files, and feed them into the LLM if we want to, and it would be able to understand the Rails application.

    We could also let the LLM overwrite these files directly, with new code. And the user would be able to see the changes in real time through the iFrame preview.

    Let’s see if we can get the LLM to update the current home-page title from “Build a website in seconds!” to something more based like: “Hello from Leonardo da Llama”.

    Let’s change the system prompt to tell the LLM that it can view Rails files, and let’s give it a tool call to read.

    Old System Prompt:

    You are Leonardo, a Llama that reads and writes HTML/CSS/JavaScript code.

    Your task is to help the user to modify and create webpages, using HTML/CSS/JavaScript.

    All outputted code is saved into a single HTML file, including the CSS and JavaScript code.

    Here is the current page that the user is viewing: <HTML_PAGE>{content}</HTML_PAGE>

    New System Prompt:

    You are Leonardo, a Llama that can read and write changes to a Ruby on Rails application.

    Your task is to help the user with their Ruby on Rails application, by answering questions, making modifications, etc.

    You can list the contents of the Rails directory to explore the app.

    And here are the tools we’ll pass the agent, defined with LangChain’s @tool decorator.

    ```
import os
from pathlib import Path

from langchain_core.tools import tool
from langchain_core.messages import SystemMessage

# Base directory of the FastAPI app; the Rails project is mounted at APP_DIR / "rails"
# (path inferred from the docker-compose volume mounts shown above)
APP_DIR = Path("/app/app")

@tool
def read_rails_file(filepath: str) -> str:
    """Read the contents of a Rails file."""
    # Construct the full path
    full_path = APP_DIR / "rails" / filepath
    
    # Check if file exists
    if not full_path.exists():
        return f"File not found: {filepath}"
    
    # Read the file contents
    try:
        # Option 1: Using pathlib (recommended)
        contents = full_path.read_text()
        
        # Option 2: Using traditional open()
        # with open(full_path, 'r') as f:
        #     contents = f.read()
        
        return contents
    except Exception as e:
        return f"Error reading file: {e}"

@tool 
def list_directory_contents(directory: str = "") -> str:
    """
    List the contents of a directory.
    If directory is empty, lists the rails root directory.
    """
    # Build path - if directory is empty, just use rails root
    dir_path = APP_DIR / "rails" / directory if directory else APP_DIR / "rails"
    
    if not dir_path.exists():
        return f"Directory not found: {dir_path}"
    
    # Join the entries into a single string so the return value matches the declared str type
    return "\n".join(os.listdir(dir_path))

# Global tools list
tools = [list_directory_contents, read_rails_file]

# System message
sys_msg = SystemMessage(content=f"""You are Leonardo, 
a Llama that can read and write changes to a Ruby on Rails application.
Your task is to help the user with their Ruby on Rails application, 
by answering questions, making modifications, etc.
You can list the contents of the Rails directory to explore the app.
""")```

    Will it work? Let’s test it.

    Ok, great. The agent is reading directory contents and file contents properly. Next step, let’s see if we can get the LLM to overwrite a file.

    I’ve added an additional tool method for the LLM to use.

    We’ll help the LLM by giving it the path to the file in question, which is: app/views/devise/registrations/new.html.erb.

    This is acting as the home page for unregistered users, and says: “Build a website in seconds!”

    Here’s the message I’ll send Leonardo:

    Can you please view the contents of the file app/views/devise/registrations/new.html.erb, and tell me what it says?

    Leonardo responds with:

    🔨 Using tool: read_rails_file

    The app/views/devise/registrations/new.html.erb file provides a modern, multi-step user experience for new user registration and sign-in, with a focus on quickly building a website using an AI prompt. Here’s what it contains:

    1. Landing/Prompt Section: Displays the LlamaPress logo. Headline: “Build a website in seconds!” Users are prompted to enter their website idea in a textarea, with a typing effect suggesting example prompts (e.g., “Build me a landing page for my plumbing business…”). Button: “Generate your website” (triggers the signup form).
    2. Signup Section (hidden by default, revealed after clicking the button): Form fields for email, password, confirmation, and a hidden prompt (the idea entered above). Automatically fills a hidden “organization name” based on the user’s email. Error and validation messages are clearly styled. Button: “Create my website”. Option to switch to sign-in form for returning users.
    3. Signin Section (hidden by default, revealed via links): Standard Devise login form with email and password. Displays validation/flash messages. Option to switch to sign-up.
    4. JavaScript Enhancements: Placeholder typing animation for example prompts in the textarea. Transition logic to slide between prompt, signup, and sign-in sections, carrying the user’s prompt through the process. Organization name is dynamically set when a user enters their email.
    5. Styling: Modern, clean interface with Tailwind CSS and FontAwesome styling. Responsive, mobile-friendly layout.

    Summary: This page is designed to quickly capture a website idea (prompt), then guide the user through signup or login while making the process engaging and clear. It keeps the user’s input all the way to account creation, enabling a fast start for generating their website. Let me know if you want to see the actual rendered HTML, a certain section, or need the content in a different format!

    I then said:

    Can you please change the h1 from “Build a website in seconds!” to say: “Hello world from Leonardo?”

    He responded with:

    The heading (h1) in your registrations new user page has been changed from “Build a website in seconds!” to: Hello world from Leonardo? You’ll now see this new heading when users visit the signup page! If you’d like any further customizations, just let me know.

    And here’s the change Leonardo made:

    In order to view the changes, I had to bring down the Rails docker container and then bring it up again. Ideally, the iFrame could refresh automatically instead of needing to run:

    docker compose down -v llamapress
    docker compose up -d llamapress

    I’m sure there’s a way to hotload the files in, or give a tool-call for Leonardo to pull the docker container down, and then recreate it.
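
    One lighter-weight option to try (not something I’ve verified yet) would be restarting just the Rails service, rather than tearing it down with -v:

    docker compose restart llamapress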

    But, most importantly we’ve verified that because the Rails files are in a mounted volume that lives on my local computer, the code changes that Leonardo wrote in the Rails directory persist.

    The next steps to me seem clear:

    1. Figure out how to get the Rails changes working immediately, by either restarting the Rails server inside the docker container, or enabling some sort of hot-reload by putting Rails in dev mode.
    2. Improve user experience while the user is waiting for the LLM response, when the LLM is calling tools, and while the LLM is streaming.
    3. Figure out how to commit file changes in case a mistake is made, and we need to rollback commits.

    For the first one, we could start with a tool call that lets the agent restart the Rails server from inside the container, or potentially even restart the entire docker container. With that too, we’d want a good user experience communicating what’s happening.

    For the second issue, we’ve already solved similar problems in LlamaPress: we have client-side JavaScript that listens for LangGraph base_message tool calls, switches on the tool name and input arguments, and formats HTML elements depending on which tool was called. I’ve been meaning to consolidate all of this into a JavaScript library that can be reused across projects (LlamaPress, LlamaBot, etc.), because I’ve found myself re-writing a lot of the front-end code needed to format agent tool-calling outputs, and LLM outputs in general.

    The third one seems easy enough: we can add another git command tool for Leonardo.

    We’d need to figure out how to do this from within the Python/FastAPI Docker container that shares the mounted volume with the rails folder. Is it as simple as just running git commit -m "message"?
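
    If we go that route, the underlying shell commands would look something like this. This is just a sketch, and it assumes the Rails project’s .git directory is also part of the shared volume (the compose file above only mounts app/, config/routes.rb, and db/):

    cd /app/app/rails                                            # where the Rails project is mounted inside the llamabot container
    git config --global --add safe.directory /app/app/rails     # avoid "dubious ownership" errors on mounted volumes
    git add -A
    git commit -m "Leonardo: describe the change here"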