04/12/2026 // AI Research
Three years of AI assisted development: what I actually learned
A timeline of experimenting with AI tools from ChatGPT through agent maxxing. What worked, what was overhyped, and how my thinking changed along the way.
I've been experimenting with AI development tools since late 2022. Three and a half years of testing models, burning through API credits, building projects, and trying to figure out what's real and what's noise. This is a record of how my thinking changed over that time.
Late 2022: I tried ChatGPT within the first couple of weeks of it going public. My frame of reference was Cleverbot and whatever I'd read about chatbots over the years. I thought this was a fully realized version of that.
My first misconception was that it would learn from me. I assumed if I had a conversation with it, the next conversation would be smarter because of what I'd told it. It took some reading to understand that's not how it works. The model is the model. What you get depends on the context window and the training data. There's no persistent learning in the way a human learns. That distinction between context and learning turned out to be one of the more important things I've internalized about working with LLMs.
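The distinction is easy to see in code. Here's a minimal sketch of it; `fake_model` is a stand-in for a real completion API, not any actual library, but the statelessness it illustrates is how these APIs genuinely work: each call sees only the messages you send in that call.

```python
# Sketch of why an LLM chat "remembers" within a session but not across them.
# fake_model is a stand-in for a real completion API: it can only condition
# on the messages passed in this single call. Nothing persists between calls.

def fake_model(messages):
    seen = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(seen)} user message(s) in my context."

# Session one: we tell the model a fact.
session_one = [{"role": "user", "content": "My favorite language is Python."}]
reply_one = fake_model(session_one)

# Session two: a fresh message list. The earlier fact is gone unless we
# resend it ourselves -- the model weights never changed.
session_two = [{"role": "user", "content": "What's my favorite language?"}]
reply_two = fake_model(session_two)

# "Memory" is just the client appending history to every request.
session_two_with_history = session_one + [
    {"role": "assistant", "content": reply_one},
    {"role": "user", "content": "What's my favorite language?"},
]
reply_three = fake_model(session_two_with_history)
```

The model never gets smarter between calls. The client just gets more diligent about resending history.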
I used it for creative stuff at first. Lyrics, poems, short stories. None of it was good. The writing was flat and generic, assembled from the most statistically average version of everything. But I could see the trajectory. If this was the floor, the ceiling was going to be interesting.
Early 2023: GPT-4 came out in March and things got real. I started using GitHub Copilot for autocomplete. The first time it correctly completed a Python function I was writing, it clicked. This was a tool, not a chatbot.
I was deep into learning Python at this point and used ChatGPT as a tutor. It was wrong sometimes but it was fast, and when you're picking up a new language, having something that generates runnable code most of the time is useful.
This is when I started benchmarking. I created a test: can this model one-shot a Wolfenstein 3D style raycaster in Python? A complete, runnable, raycasted 3D map renderer in a single prompt. I ran this benchmark on every new model as it came out. In early 2023, nothing could do it cleanly. The code would be close but never right. It was a better measure of capability than how impressive the conversation felt.
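For context on why this made a good benchmark: the core of a Wolfenstein-style renderer is compact but unforgiving. Here's a minimal sketch of the central step, casting one ray per screen column through a grid map and returning wall distances. This is my illustration of the general technique, not the benchmark code itself, and it uses a fixed-step ray march for brevity where a real raycaster would use exact grid (DDA) steps.

```python
import math

# Minimal core of a Wolfenstein-style raycaster: for each screen column,
# march a ray through a grid map until it hits a wall ('#'), and return
# the perpendicular distance used to size that column's wall slice.

MAP = [
    "#####",
    "#...#",
    "#...#",
    "#####",
]

def cast_ray(px, py, angle):
    """March from (px, py) along `angle` until a wall cell is hit."""
    dx, dy = math.cos(angle), math.sin(angle)
    step = 0.01  # small fixed step; a real raycaster steps exactly to grid lines
    dist = 0.0
    x, y = px, py
    while MAP[int(y)][int(x)] != "#":
        x += dx * step
        y += dy * step
        dist += step
    return dist

def render_distances(px, py, facing, fov=math.pi / 3, columns=10):
    """One wall distance per screen column across the field of view."""
    out = []
    for i in range(columns):
        angle = facing - fov / 2 + fov * i / (columns - 1)
        # Multiply by cos of the angle offset to correct fisheye distortion.
        out.append(cast_ray(px, py, angle) * math.cos(angle - facing))
    return out
```

Getting this far is easy. Getting the map format, the projection math, the input handling, and the drawing loop all correct in one shot, with no chance to run the code, is what models kept fumbling.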
The workflow back then was primitive. Copy code from the chat window, paste it into a file, run it, copy the error back. But it worked.
Mid to late 2023: I was still on GPT, completely unaware of Claude or Anthropic. I tested local models through Ollama and the experience was rough. They'd confidently give wrong information on topics I was trying to learn. When you're using a model as a learning tool, you need some baseline of trust in the answers. Local models in 2023 weren't there.
Auto-GPT and its imitators started appearing: agents that could loop on their own problems. The concept was exciting but the implementations were fragile. They'd spiral into repetitive dead ends without making progress.
I discovered Claude through aider, the CLI coding tool. Aider was the first time I used AI as a proper coding agent rather than a chatbot I was copying from. The experience was a clear step up. But I was running it against the API directly and the costs added up fast. I didn't have a feel yet for how tokens translated to dollars, so I'd run experiments and then find surprises on the bill.
Early 2024: In April I went to a Google conference in Vegas and got my first look at Gemini. I ran my raycaster benchmark on it. It failed. Most models still did.
I was building more complex Python games with aider, then moved to Cursor. The switch was about cost. Cursor was generous with their early pricing and $20 a month went further than raw API bills. I learned React around this time. Next.js became my go-to frontend framework.
August 2024 is when I shifted gears. I went from AI as an assistant that helped me write code faster to something closer to what people now call vibe coding. The agent does the typing, I do the thinking. The ratio changed.
OpenAI released o1 preview in September. I found it unimpressive for the cost. Expensive, slow, and the results didn't feel proportionally better. I shifted more of my attention to Claude and Google.
Reasoning models started one-shotting my Wolfenstein benchmark. Every language, every framework. That was a real signal. A task that no model could handle cleanly eighteen months earlier was now trivial. The progress was real even if the surrounding hype was overblown.
I moved heavily into Three.js for game development. The games started having actual architecture. Not just scripts that ran, but systems with components, state management, and rendering pipelines. I was building things that felt like they could be real products.
This was also the generative AI golden era across all modalities. Video, image, music, code, everything stepped up by a large margin in a short window. My experimentation became massive and my spend became unsustainable. I tried everything: Replit, Bolt.new, Windsurf. I kept Cursor. I dropped aider and direct API usage in favor of subscriptions. This is when I started thinking in compute per dollar. Where can I get the most inference for my money? Where can I get it free?
Early 2025: DeepSeek R1 dropped in January. I tested it and found it overhyped for my use cases. But it did something useful: it exposed cracks in the narrative. A competitive model from a smaller lab built for a fraction of the assumed cost. It didn't work well for my needs, but it made me question whether AI progress required the level of investment the major labs were pouring in. The gap between datacenter spending and actual revenue was getting hard to ignore.
I cut services and reduced my spend. Kept building. More complex web applications, more complex Three.js game systems. I was relying on premade 3D assets and Mixamo for character animations. AI generated 3D models weren't usable yet. AI generated pixel art and sprite sheets were still bad. The tools were good at code and text but couldn't reliably produce the visual assets I needed for game development.
Late 2025 is when things changed for my workflow. I'd been stuck on Cursor because all the AI-native editors were VS Code forks. VS Code is fine, but I prefer JetBrains IDEs and have for years. Claude Code solved that. It runs as a CLI tool, so I could use whatever editor I wanted and still have an agent in the terminal. That was the unlock that actually changed how I work day to day.
Around the same time I started doing NFT projects as a hobby. Met good people, had fun with the community side of it. The 3D games got more complex. AI generated models had finally reached the point where I could generate one, rig it with Mixamo, and get it into a Three.js scene. That pipeline didn't exist a year earlier.
I built some experimental API projects that taught me expensive lessons about billing. Some of them were cool. Some were beautiful failures. All of them taught me something about either the technology or the economics.
This is when I started feeling that basic CRUD applications were fully commoditized. If someone could build a functional version of a data entry app in a weekend and market it better, the product wasn't defensible on features anymore. That realization was uncomfortable. I'd spent a decade of my career building exactly that kind of software. The architecture still matters, the reliability still matters, but the barrier to a functional first version had collapsed.
I also developed a more refined sense of model personality. Different models are good at different things. Some feel genuinely intelligent. Some are creative but unreliable. Some overthink every prompt. Learning to match the right model to the right task turned out to be its own skill.
2026 is where I am now. I sit at my desk with four terminal tabs open, each running an AI agent, sometimes on the same project, sometimes on different ones. The workflow is parallel. Give an agent a task, switch tabs, give another agent a different task, check back, review, correct, move forward.
My spend is down to about $150 a month across all services. I use a mix of models depending on the task. The gap between frontier models and open source alternatives has closed significantly. Chinese models are competitive with Western ones for the first time.
I'm not convinced by full agent orchestration platforms. I want agents I control in terminals I own, not a platform managing the loop for me. I think open source models combined with on-device inference may be where the long-term value settles.
I think IP protection is going to be one of the bigger conversations in this space. Sending proprietary code to third-party APIs is a real concern for a lot of companies. Local models running on local hardware are getting good enough to be a practical alternative, and that's worth watching.
Looking at all of this together, a few things stand out. The progress is real but the hype consistently runs ahead of it. The tools are most useful when someone with experience is steering them. AI replaces typing, not judgment. The economics of AI services are still shaky and something will have to adjust. And the workforce conversation is being had by the wrong people. Companies cutting teams in half because they think AI fills the gap are going to find out they have more work than ever and fewer people to think through it.
I worry about the generation coming up behind me. Junior developer roles are disappearing because companies assume AI can fill that gap. It can't. AI can't teach someone to think about systems. It can't give a new developer the instinct that comes from years of watching things break in production. The ladder is getting pulled up and the people who got on it early will be fine. The people who didn't may not get the chance.
In the meantime I'm doing what I've always done. Building software that has to work, learning the tools that help me do it better, and staying skeptical of anything that sounds too good to be true. The signal is real. The hype is not. Figuring out which is which is the job now.