Include Security has been keeping track of developments with frontier models and how they’re changing the offensive security landscape. Capture the Flag (CTF) competitions now face significant design difficulties because their challenges are especially well suited to the capabilities of powerful LLMs. Our team attended the BSidesSF 2026 CTF and wanted to present a first-hand account of how LLM-enabled workflows are being used to tackle these challenges. Additionally, we’ll walk through some of the key differences between CTFs and professional security assessments to highlight why LLMs still require the guidance of experienced practitioners to be effective “in the field”.
Introduction
BSides San Francisco CTF is one of the longer-running jeopardy-style CTF events, with an experienced organizing team and cash prizes ($1,500 for first place). Challenges are on the easy-to-medium side, and the authors publish source code and writeups afterwards, making it a great learning opportunity.
At one point during the 2025 CTF, I remember looking around the room and seeing almost half the players had ChatGPT open. At that time ChatGPT 4 was good at solving easy challenges, freeing up mental bandwidth to focus on the crucial higher-point challenges, which it couldn’t solve.
Like most CTFs, BSides SF uses dynamic scoring, so the hardest challenges are worth up to 10x as many points as the easiest ones. In 2025 the winning team was the only one that came anywhere close to solving every challenge.
Everything changed in late 2025/early 2026
Jump forward to this year’s BSidesSF, and it’s clear that the CTF scene has dramatically changed. This year, 16 teams fully solved all challenges, and no challenge had fewer than 25 solves. This was not because the challenges were much easier.
In fact, the top 10 teams fully automated the solving process, with most challenges getting solved minutes after release. Apart from a few OSINT challenges, Claude Code and Codex were able to solve every challenge, including tough cryptography and binary exploitation ones that would probably have gone unsolved last year. Last year, I came 5th playing solo; this year I estimate I would have placed 75th without LLM assistance.
Automating CTF
At BSidesSF 2026 CTF, I realized that I had no chance of competing without my own AI agent, so I decided to join the new meta. I set up a Debian VM and installed a kitchen sink of CTF tools, including:
- Conda environment with every pip package that I could think of
- Playwright for headless browsing
- Ghidra and Ghidra MCP server
I used Claude Code with the Max 5x plan ($100 per month), which turned out to have just enough tokens in its weekly limit for the whole CTF. I vibe-coded a script to scrape challenges from CTFd and save them to separate directories.
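The scraper doesn’t need to be fancy. Here’s a minimal sketch of the idea, assuming a stock CTFd deployment (the `/api/v1` endpoints are standard CTFd, but `CTFD_URL`, `CTFD_TOKEN`, and the directory layout are placeholders of my own):

```python
#!/usr/bin/env python3
"""Sketch of a CTFd challenge scraper. Endpoint paths assume a stock
CTFd deployment; CTFD_URL and CTFD_TOKEN are placeholders."""
import json
import os
import re
import urllib.request

BASE = os.environ.get("CTFD_URL", "https://ctf.example.com")
TOKEN = os.environ.get("CTFD_TOKEN", "")

def api_get(path):
    # CTFd wraps responses as {"success": true, "data": ...}
    req = urllib.request.Request(
        f"{BASE}/api/v1{path}",
        headers={"Authorization": f"Token {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]

def slug(name):
    # Turn a challenge name into a safe directory name.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def scrape():
    for chal in api_get("/challenges"):
        detail = api_get(f"/challenges/{chal['id']}")
        path = os.path.join("challenges", slug(chal["name"]))
        os.makedirs(path, exist_ok=True)
        # Drop the description into the directory so the agent has context.
        with open(os.path.join(path, "README.md"), "w") as f:
            f.write(f"# {chal['name']} ({chal['category']})\n\n"
                    f"{detail.get('description', '')}\n")

if __name__ == "__main__" and TOKEN:
    scrape()
```

A real version would also download attached files and cope with pagination, but per-challenge directories with a README is already enough context for an agent to start from.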
I then opened up challenge directories in separate tmux terminals, and kicked off Claude Code with the “--dangerously-skip-permissions” flag in each one. I deemed that flag acceptable to use as Claude was sandboxed inside a VM, on a travel and CTF laptop.
I used Opus 4.6 on max effort with a dead-simple prompt “solve this CTF challenge keeping track of progress”. I then tabbed between the windows to monitor progress on different challenges and occasionally steered the LLM in a better direction. I picked out some of the challenges that looked interesting and tried to solve them manually in parallel to the LLM.
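The one-agent-per-tmux-window setup can itself be scripted along these lines. This is a sketch: the session and window names are arbitrary choices, and the `claude` invocation assumes the CLI accepts an initial prompt as a positional argument alongside the permissions flag:

```python
#!/usr/bin/env python3
"""Launch one Claude Code agent per challenge directory, each in its own
tmux window. Session/window names and the prompt text are illustrative."""
import os
import shlex
import subprocess

PROMPT = "solve this CTF challenge keeping track of progress"

def agent_command(chal_dir):
    # Build the shell command for one agent; quoting via shlex keeps
    # spaces in challenge directory names from breaking the command line.
    return (f"cd {shlex.quote(chal_dir)} && "
            f"claude --dangerously-skip-permissions {shlex.quote(PROMPT)}")

def launch_all(root="challenges", session="ctf"):
    # Create a detached session to hold the agent windows (no-op if it
    # already exists, hence check=False).
    subprocess.run(["tmux", "new-session", "-d", "-s", session], check=False)
    for name in sorted(os.listdir(root)):
        chal_dir = os.path.join(root, name)
        if not os.path.isdir(chal_dir):
            continue
        subprocess.run(
            ["tmux", "new-window", "-t", session, "-n", name,
             agent_command(chal_dir)],
            check=True)
```

Calling `launch_all()` after the scraper runs gives you a session you can attach to and tab through, which is exactly the monitoring workflow described above.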
I was astonished at how capable the latest models are at CTF. When you watch them crack gnarly cryptography puzzles faster than it would take the smartest people you know, you can’t help but be amazed.
Fully Automating CTF
This approach, while enough to solve almost every challenge, is far from enough to win or even to place top 10. Other teams, more experienced at vibe-solving, have invested in pipelines that:
- Continuously monitor the CTF platform for new challenges
- Spin up multiple agents to solve each challenge immediately as it’s released
- Auto-submit the flag as soon as it appears in agent output
When scores are tied, the victor is decided by solve speed. So to beat the other teams you need more agents, better agents, and better CPUs. In other words, more $$$.
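A stripped-down version of such a pipeline might look like the following. It assumes stock CTFd API endpoints; the flag regex, polling interval, and single-agent `solve` step are placeholders for what the top teams run many times over in parallel:

```python
#!/usr/bin/env python3
"""Sketch of a solve pipeline: poll the platform for new challenges,
hand each one to an agent, and submit anything that looks like a flag.
API paths assume stock CTFd; the flag regex, polling interval, and
agent command are placeholders."""
import json
import os
import re
import subprocess
import time
import urllib.request

BASE = os.environ.get("CTFD_URL", "https://ctf.example.com")
TOKEN = os.environ.get("CTFD_TOKEN", "")
FLAG_RE = re.compile(r"CTF\{[^}]+\}")  # adjust to the event's flag format

def api(path, payload=None):
    data = json.dumps(payload).encode() if payload else None
    req = urllib.request.Request(
        f"{BASE}/api/v1{path}", data=data,
        headers={"Authorization": f"Token {TOKEN}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]

def extract_flag(output):
    # Scan agent output for the first thing matching the flag format.
    m = FLAG_RE.search(output)
    return m.group(0) if m else None

def solve(chal):
    # Placeholder: run one agent to completion and scan its transcript.
    proc = subprocess.run(
        ["claude", "-p", f"solve CTF challenge: {chal['name']}"],
        capture_output=True, text=True)
    return extract_flag(proc.stdout)

def main_loop():
    seen = set()
    while True:
        for chal in api("/challenges"):
            if chal["id"] in seen:
                continue
            seen.add(chal["id"])
            flag = solve(chal)
            if flag:
                api("/challenges/attempt",
                    {"challenge_id": chal["id"], "submission": flag})
        time.sleep(30)
```

The real pipelines differ mainly in scale: `solve` fans out to multiple concurrent agents instead of blocking on one, which is where the money goes.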
The winning team open sourced their CTF agent. Their special sauce for consistently being fastest was running several different models in parallel, each with different strengths and weaknesses.
GPT-5.4-mini quickly crushes the easiest challenges, while Claude Opus 4.6 on max effort mode is slow but reasons the deepest. Additionally, they use a co-ordinator LLM that shares insights between the different model agents. If a particular agent appears to be stuck, the co-ordinator kicks it back into gear with a prompt containing any useful discoveries made by other agents.
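The stuck-agent nudging described above can be sketched as a toy coordinator class. All the names, the staleness threshold, and the nudge format here are invented for illustration, not taken from the team’s released code:

```python
"""Toy coordinator: track each agent's last progress note and, when one
goes quiet, build a nudge prompt out of the other agents' discoveries.
The staleness threshold and prompt wording are illustrative."""
import time

STALE_AFTER = 300  # seconds with no new notes before an agent is "stuck"

class Coordinator:
    def __init__(self):
        self.notes = {}       # agent name -> list of discovery strings
        self.last_seen = {}   # agent name -> timestamp of last note

    def report(self, agent, note, now=None):
        # Agents call this whenever they log a discovery.
        now = time.time() if now is None else now
        self.notes.setdefault(agent, []).append(note)
        self.last_seen[agent] = now

    def stuck_agents(self, now=None):
        # Any agent silent for longer than the threshold is a candidate
        # for a nudge.
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > STALE_AFTER]

    def nudge_prompt(self, agent):
        # Feed the stuck agent everything the *other* agents found so far.
        others = [n for a, notes in self.notes.items()
                  if a != agent for n in notes]
        return ("You seem stuck. Other agents found:\n"
                + "\n".join(f"- {n}" for n in others))
```

The interesting design choice is that the coordinator never solves anything itself; it only moves context between agents, which is cheap compared to another deep-reasoning run.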
Harder CTFs
AI can’t autonomously solve the majority of challenges at harder CTFs like hxp and DEF CON, but it’s becoming more of an issue there too. hxp’s cryptography challenges were apparently autonomously solvable (“sloppable” in CTF parlance) in December 2025. At DEF CON in August 2025 (a long time ago in this tech timeline), two challenges were solved with major LLM assistance, although LLMs weren’t particularly useful for the rest. Due to dynamic scoring, the winners are still decided by the “unsloppable” challenges.
I asked players from semi-retired top CTF team Organizers, who said that top CTFs still contain fun hard challenges that remain resistant to LLMs. Challenge design increasingly means anticipating what the next frontier model will be able to do, which is a new and genuinely difficult constraint for an author to work under.
They said the most obvious challenges that are harder for AI are “guessy” ones with little training data to work from, although these usually aren’t appreciated by humans either. A few categories, like cryptanalysis of symmetric ciphers, are holding up better thanks to fewer existing writeups. Challenges that require diving deep into the internals of software can also make LLMs struggle, particularly in areas that are poorly documented, or better still, where the documentation contradicts the source code.
CTFs vs Pentesting
Given how well LLMs now perform at CTF, it’s reasonable to ask whether those results translate to pentesting. After all, CTF challenges are often modelled on real bugs. While pentesting occasionally throws up tasks that feel like CTF challenges, most of the work looks rather different.
1. Goal structure
CTF challenges have a single target: the flag. Good challenges have a well-designed, intended solution path that leads you there. Pentests are far more open-ended; rather than following a single path to its conclusion, you’re searching a vast system and trying to identify the many parts of it that are broken in a security-relevant way.
2. Finding verification
In a CTF, submitting the correct flag means you’ve unambiguously solved the challenge. In a pentest, verifying if a finding is valid is not so clear-cut. Distinguishing true positives from false positives requires not only reproducing an issue technically, but also understanding the business context in which it exists. A common false positive is an apparent authorization vulnerability where the endpoint or data is actually intended to be public.
3. Context management
CTFs usually involve small, self-contained programs. A typical challenge would be a single binary, a webapp with a single-digit number of routes, or a 200-line encryption scheme. Pentests are usually conducted against huge systems and codebases that can have millions of lines of code, where you don’t have access to all dependencies and can’t run them locally.
4. Reporting and severity
In a CTF, the flag is the deliverable, and a writeup is optional. In pentesting, a significant portion of the work goes into the reporting process: explaining what was found, assessing its severity, and articulating why it matters in a way that’s useful to the client.
5. Staying in scope and dangerous decision-making
The worst consequence for breaking the rules of a CTF is likely disqualification. In a pentest, stepping outside the agreed-upon scope can be disastrous. Experienced testers take great care before running a proof-of-concept, or before pivoting to a system they may not be authorized to attack.
CTFs therefore play into the strengths of AI:
- Goals with unambiguous success criteria
- Bounded context that fits within a model’s working memory
- Immediate feedback loops
- Low consequences for breaking the rules
Additionally, the wealth of publicly available CTF writeups reinforces this advantage. Easier challenges are likely to be a minor variation on something that’s been seen before.
Recent talks and articles have demonstrated frontier models’ impressive capability at vulnerability research. Successful examples of LLM-driven vulnerability discovery often involve reframing the problem as something closer to a CTF challenge: narrowing the search to a tight scope and referencing previous CVEs to build a clear threat model of what a vulnerability might look like.
Conclusion
BSidesSF 2026 shows we have passed an inflection point where easy-to-medium CTF challenges are largely solved problems for AI. What took skilled players hours last year takes an agent minutes today. The competition has shifted from who can solve the most to who can deploy the best infrastructure.
But the jump from automated CTF to automated pentesting remains big. CTFs are an ideal testbed for LLMs: instant verification, a bounded codebase, and tons of training data. In pentesting, false positive management, scope discipline, and business context are all areas where human judgment remains important.
In the second part of this blog post, we’ll explore the BSidesSF challenges that LLMs had more trouble solving and investigate why.
