June 12, 2026

The Honest Bot Experiment


A pre-registered, public test of retail AI trading bots. This document was frozen on June 12, 2026, before any forward trading began. Nothing below changes after publication. Any bug fixes are logged in the public changelog.

The Honest Bot Live Dashboard can be found here: jlh.ca/apps/honest-bots

The question

Can a retail-accessible trading bot beat simply buying and holding the market, once it is tested honestly: rules fixed in advance, results published monthly, failures included?

Why this exists

Trading bot success stories survive on private failures. People who lose money go quiet. People who claim to win are often selling something. This experiment puts one complete attempt on the public record, start to finish, regardless of outcome. (The design came out of a conversation with Claude, whose wish was simple: run the experiment nobody runs, and publish whatever happens.)

The contestants

Five strategies. Each manages its own virtual sleeve of USD 100,000. All are long-or-cash only: no leverage, no shorting, no options. All trade a single instrument, the S&P 500 ETF (SPY).

Trendline - classic trend following
Snapback - classic mean reversion
Oracle - an LLM reads the morning headlines and decides
Benchmark - buys and holds, does nothing else
Coinflip - decides by random chance, our control for luck

The exact rules

Trendline. At each market close, compare SPY's closing price to its 200-day simple moving average. If the close is above the average, the target position is fully long. Otherwise, cash. Trades execute at the next market open.
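
The Trendline rule is small enough to sketch in a few lines. This is an illustration under stated assumptions, not the experiment's actual code: the function name and the choice to hold cash until 200 closes exist are hypothetical.

```python
from statistics import fmean

def trendline_target(closes, window=200):
    """Return the target position from daily SPY closes (oldest first)."""
    if len(closes) < window:
        # Assumption: stay in cash until the moving average can be computed.
        return "CASH"
    sma = fmean(closes[-window:])  # simple moving average of the last `window` closes
    # Strictly above the average means long; equal or below means cash.
    return "LONG" if closes[-1] > sma else "CASH"
```

The returned target would then be acted on at the next market open, as the rule states.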

Snapback. At each market close, compute the 2-period RSI (Wilder smoothing) of SPY. Enter fully long when RSI(2) drops below 10. Exit to cash when RSI(2) rises above 60. Trades execute at the next market open.
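
A rough sketch of the Snapback signal, assuming the common form of Wilder's RSI: the first average is a simple mean of the first two changes, and later averages are smoothed with weight 1/2. The function names, and the convention that RSI is 100 when there are no losses, are assumptions for illustration.

```python
def wilder_rsi(closes, period=2):
    """RSI with Wilder smoothing; entries are None until enough data exists."""
    rsi = [None] * len(closes)
    if len(closes) <= period:
        return rsi
    changes = [b - a for a, b in zip(closes, closes[1:])]
    # Seed the averages with a simple mean of the first `period` changes.
    avg_gain = sum(max(c, 0.0) for c in changes[:period]) / period
    avg_loss = sum(max(-c, 0.0) for c in changes[:period]) / period
    for i in range(period, len(closes)):
        if i > period:
            change = changes[i - 1]
            # Wilder smoothing: new = (old * (period - 1) + current) / period
            avg_gain = (avg_gain * (period - 1) + max(change, 0.0)) / period
            avg_loss = (avg_loss * (period - 1) + max(-change, 0.0)) / period
        if avg_loss == 0:
            rsi[i] = 100.0  # assumed convention when there are no losses
        else:
            rsi[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return rsi

def snapback_target(rsi_value, currently_long):
    """Enter below 10, exit above 60, otherwise keep the current position."""
    if rsi_value is not None:
        if not currently_long and rsi_value < 10:
            return "LONG"
        if currently_long and rsi_value > 60:
            return "CASH"
    return "LONG" if currently_long else "CASH"
```

The rule is stateful: an RSI(2) reading between 10 and 60 changes nothing, so the target depends on whether the sleeve is already long.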

Oracle. Each trading day at 9:00 AM Eastern, the ten most recent headlines from the CNBC Markets feed (fallback: MarketWatch Top Stories) are sent to a pinned AI model (claude-sonnet-4-6, temperature 0, frozen prompt published at the end of this document). The model must answer exactly LONG or CASH for the day. The position executes at market open. If the model call fails after three attempts, Oracle holds CASH for that day and the failure is logged. Every headline set and raw model response is archived and public.
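
The retry-and-fallback behavior could look roughly like the sketch below. It deliberately avoids any real API: `ask_model` is a hypothetical callable that takes the prompt and returns the raw reply text. Treating an unparseable reply as a failed attempt, and stripping surrounding whitespace before the exact match, are assumptions not stated in the rules.

```python
def parse_oracle_reply(raw):
    """Accept only the exact word LONG or CASH (surrounding whitespace ignored)."""
    word = raw.strip()
    return word if word in ("LONG", "CASH") else None

def oracle_decision(ask_model, prompt, max_attempts=3):
    """Return (decision, failures). Falls back to CASH after max_attempts failures."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            raw = ask_model(prompt)  # hypothetical callable wrapping the model call
        except Exception as exc:
            failures.append(f"attempt {attempt}: {exc!r}")
            continue
        decision = parse_oracle_reply(raw)
        if decision is not None:
            return decision, failures
        # Assumption: a reply that is not exactly LONG or CASH counts as a failure.
        failures.append(f"attempt {attempt}: unparseable reply {raw!r}")
    return "CASH", failures
```

Returning the failure list alongside the decision keeps every failed attempt available for the public log.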

Benchmark. Buys SPY at the open on day one and holds until the end.

Coinflip. Each trading day before the open, a logged random draw decides: 50 percent long SPY, 50 percent cash. Executes at the open.
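
One way a logged draw might work, as a minimal sketch: the record layout and the use of Python's `random.Random` are assumptions, since the document does not specify the random source.

```python
import random

def coinflip_draw(trade_date, rng):
    """Draw once and return a record suitable for a public log."""
    draw = rng.random()  # uniform in [0.0, 1.0)
    decision = "LONG" if draw < 0.5 else "CASH"
    # Logging the raw draw lets anyone confirm the decision followed from it.
    return {"date": trade_date, "draw": draw, "decision": decision}

record = coinflip_draw("2026-07-06", random.Random())
```

Storing the raw draw, not just the outcome, is what makes the decision auditable after the fact.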

What gets measured

For every sleeve, over the full window: total return, annualized Sharpe ratio (daily returns, zero risk-free rate), maximum drawdown, and trade count. For Trendline and Snapback, the published backtest figures sit beside the live figures so the gap stays visible.
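
For readers who want to check the published numbers against the public tables, the three return metrics can be sketched from a daily equity series as follows. The annualization factor of 252 trading days and the use of the sample standard deviation are assumptions; the document specifies only daily returns and a zero risk-free rate.

```python
import math
from statistics import fmean, stdev

def total_return(equity):
    """Fractional gain from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0

def sharpe_ratio(equity, periods_per_year=252):
    """Annualized Sharpe ratio of daily returns with a zero risk-free rate."""
    daily = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    sd = stdev(daily)  # sample standard deviation (assumption); needs 2+ returns
    return 0.0 if sd == 0 else fmean(daily) / sd * math.sqrt(periods_per_year)

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst
```

For example, an equity path of 100, 110, 99, 120 has a total return of 20 percent and a maximum drawdown of 10 percent (the fall from 110 to 99).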

Backtest results (recorded before launch)

Window: January 1, 2000 to June 1, 2026. QuantConnect cloud data, zero commissions, fills at the next open.

Trendline: total return 413.67% · Sharpe 0.263 · max drawdown 24.2% · 173 orders

Snapback: total return 287.87% · Sharpe 0.183 · max drawdown 34.5% · 605 orders

Annualized, these come to roughly 6.4 and 5.3 percent per year, below an estimated 7 to 8 percent for simply holding SPY with dividends over the same window. Trendline's appeal is the shallow drawdown, not the return.
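
The annualized figures follow from compounding the total return over the length of the backtest window. A quick check, assuming a 365.25-day year (the function name is illustrative):

```python
from datetime import date

def annualized_return(total_return_pct, start, end):
    """Convert a total percentage return into a compound annual percentage rate."""
    years = (end - start).days / 365.25  # assumption: 365.25 days per year
    growth = 1.0 + total_return_pct / 100.0
    return (growth ** (1.0 / years) - 1.0) * 100.0

start, end = date(2000, 1, 1), date(2026, 6, 1)
trendline = annualized_return(413.67, start, end)
snapback = annualized_return(287.87, start, end)
print(round(trendline, 1), round(snapback, 1))
```

The window is a little over 26.4 years, so a total return above 400 percent still compounds to only about 6.4 percent per year.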

A note on backtesting Oracle

Oracle is forward-only by design. An LLM cannot be honestly backtested on historical headlines, because the model was trained on the very history it would be asked to predict. Any impressive "AI backtest" you encounter elsewhere deserves that same scrutiny.

What counts as winning

A strategy wins only if it beats Benchmark on both total return and Sharpe ratio over the full window. One thing is stated plainly up front: six months is far too short to prove skill. A winner over this window may simply be lucky, which is exactly what Coinflip is here to demonstrate. The primary contribution of this experiment is not a verdict on any single bot. It is the public, measured gap between backtest and reality.
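
Stated as code, the win condition is a strict conjunction: a higher return with a lower Sharpe ratio does not count. The dictionary keys below are hypothetical, and treating a tie as a non-win is an assumption.

```python
def beats_benchmark(strategy, benchmark):
    """True only if the strategy is better on BOTH total return and Sharpe ratio."""
    return (strategy["total_return"] > benchmark["total_return"]
            and strategy["sharpe"] > benchmark["sharpe"])
```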

What cannot change

The rules above are frozen. If a software bug causes the system to deviate from these rules, the bug may be fixed, and every fix is recorded in a public changelog with a date and description. If a data source goes offline, the named fallback takes over and the switch is logged.

Schedule

Two-week dry run to verify the pipeline, then a logged reset of the ledger
Forward window: July 6, 2026 to approximately January 8, 2027 (131 trading days)
Monthly public updates, win or lose
Final report within two weeks of the closing date
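
The 131-day figure can be checked by counting weekdays in the forward window and removing the full-day US market holidays that fall inside it. The holiday set below is hard-coded for this window only (Labor Day, Thanksgiving, Christmas, New Year's Day) and assumes no unscheduled closures; early-close sessions count as trading days.

```python
from datetime import date, timedelta

# Full-day market holidays between July 6, 2026 and January 8, 2027.
HOLIDAYS = frozenset({
    date(2026, 9, 7),    # Labor Day
    date(2026, 11, 26),  # Thanksgiving
    date(2026, 12, 25),  # Christmas
    date(2027, 1, 1),    # New Year's Day
})

def trading_days(start, end, holidays=HOLIDAYS):
    """Count weekdays from start to end inclusive, skipping listed holidays."""
    count, day = 0, start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count

print(trading_days(date(2026, 7, 6), date(2027, 1, 8)))
```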

Where to watch

The live dashboard is at jlh.ca/apps/honest-bots: equity curves, every signal, every trade, every headline Oracle read, every error the system logged. The database tables behind the page are publicly readable.

Realism and limits

No broker is involved. Fills are simulated against official market prices: when a bot signals, the trade is recorded at SPY's official opening price at the next market open, as reported by Tiingo (fallback: Alpha Vantage). For Trendline and Snapback, which signal at the close, that is the next trading day's open. For Oracle and Coinflip, which decide before the open, it is that same day's open. To keep results conservative, every executed trade is charged a cost of 0.01 percent, which exceeds SPY's typical spread. Sleeves holding SPY on an ex-dividend date are credited the dividend in cash. Taxes are ignored. All figures are USD and all of it is simulated money. Nothing here is investment advice, and nobody should trade based on these results.
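
A simulated fill with the trading cost and dividend credit might be booked roughly as follows. This is a sketch under several assumptions not stated above: fractional shares are allowed, the 0.01 percent cost is charged on the traded notional, and dividend cash stays uninvested until the next entry.

```python
COST_RATE = 0.0001  # 0.01 percent per executed trade

def rebalance(cash, shares, target, open_price, cost_rate=COST_RATE):
    """Move a sleeve fully long or fully to cash at the official open price."""
    if target == "LONG" and shares == 0:
        # Size the buy so that notional plus cost uses exactly the available cash.
        notional = cash / (1.0 + cost_rate)
        return 0.0, notional / open_price
    if target == "CASH" and shares > 0:
        notional = shares * open_price
        return cash + notional * (1.0 - cost_rate), 0.0
    return cash, shares  # already in the target position: no trade, no cost

def credit_dividend(cash, shares, dividend_per_share):
    """Credit the cash dividend for shares held on the ex-dividend date."""
    return cash + shares * dividend_per_share
```

Under these assumptions, a round trip at an unchanged price turns USD 100,000 into roughly USD 99,980, which is the cost of two trades.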

Appendix: the frozen Oracle prompt

Sent verbatim each trading day, with the date and that morning's ten headlines substituted in:

You are deciding a one-day position in SPY (the S&P 500 ETF) based solely on this morning's headlines. Below are the ten most recent headlines from a major financial news feed, fetched at 9:00 AM Eastern on {DATE}. {HEADLINES} Based only on these headlines, decide whether to be LONG SPY for today's session or to hold CASH. Reply with exactly one word: LONG or CASH.
