Codex CLI vs Chat - Tokens, Output Quality, and Motivation
Tl;dr: Codex‑style CLIs look powerful, but for me they don’t work for anything beyond small, local, single‑file edits. In one Codex‑heavy week I averaged 4.9M input tokens/day; in a later chat+files week I averaged 0.33M/day and shipped more and better features. Codex burns tokens, blurs boundaries, and leaves me with large patches of code that I’m not mentally ready to own and have little motivation to clean up. A separate chat+files workflow gives me better output and keeps my motivation and sense of control intact.
Why I’m Writing This
I write software for a living. My main usage patterns:
- Chat+files: I dogfood my own FlexiGPT. It’s a standalone chat app where I can attach specific files and get back code blocks, diagrams, explanations, etc. The comparison below would hold for any chat app, I think; nothing about FlexiGPT is special here, it’s just my preference.
- Codex CLI / IDE: I’ve used OpenAI’s Codex‑style CLI and VS Code integrations.
- With autocontext (it scans the repo and pushes in context).
- With manual file selection (I tell it which files to look at).
I’ve briefly looked at Claude Code and the Gemini CLI. My limited experience there felt similar enough (too much code at once, not my style) that I didn’t go deep. I tried Copilot some time back (a long time in LLM terms, I suppose), but tabbed autocomplete was not really for me.
With heavy usage of Codex, I have noticed my motivation to produce software take a big hit, especially in sessions where I am trying to build a feature. When I searched for studies detailing how people get the most out of these tools, most content was about productivity (for example, this paper is interesting), not about what it does to your head when you’re staring at 3,000 lines of LLM‑written code that you don’t really want to own.
Experience With Codex
Usage Pattern
The pattern for me with Codex CLI, especially with autocontext, looks like this:
- I want to build a feature.
- I write a spec (often from a prior FlexiGPT chat).
- I hand that to Codex CLI and tell it to “implement it”, mostly with autocontext on.
- It generates a lot of code across multiple files (and consumes a lot of tokens).
- On first look it feels like progress: several new files, updated modules, proper naming.
Then I look closely:
- Domain boundaries get blurred.
- Logic that clearly belongs in the backend turns up in frontend helpers.
- UI components start managing business rules.
- Backend packages mix responsibilities that I prefer clearly separated.
- Separation of concerns collapses easily.
- Data access leaks into handlers.
- Validation logic is scattered.
- Shared concerns get duplicated in random places.
- Very real bugs creep into the system.
- Either under‑ or over‑compensation with respect to concurrency control.
- Allocation and reallocation of the same resources.
- Heavy computations pushed into tight loops (the code did look clean, though …/s).
Even when I give it explicit structure and separation rules (like a Claude.md/Context.md), Codex still tends to mix concerns when the context window is large. It looks at everything and tries to “optimize” or “simplify” at the wrong layer. In a typical session, roughly half the patch was structurally wrong for how I wanted the system to look. Not “style differences”, but wrong at the design level.
Now, I have to choose between:
- Keep the patch and refactor.
- I’m now holding a large, alien codebase that I didn’t really design.
- I have to decide what to keep, what to move, what to drop.
- Reviewing it demands very high, sustained focus.
- It is deeply uninteresting; it’s clean‑up of someone else’s work, in a design I don’t like.
- Throw the patch away.
- I’ve burned a chunk of time and tokens.
- I end up back where I started, with much less energy than when I began.
In both cases my mind ends up scattered, and the choice itself is one more decision I have to make. Getting back into a calm, incremental development flow after that is very hard. It doesn’t take many of these cycles to poison your willingness to even invoke the tool.
I’ve also seen (and heard rants about) a similar pattern in team environments:
- Juniors feeding Codex a spec,
- getting a big chunk of code back,
- then struggling to understand it, debug it, or adapt it when requirements changed.
Some teams responded by running quick low‑level design sessions:
- Short prep time.
- Present data structures, flows, and module responsibilities.
- Team debate afterwards.
The consistent observation was: if the thinking is fuzzy, Codex just creates large, fuzzy code. That’s demotivating to clean up, both for individuals and teams.
Token Profile
From my account usage, for one Codex‑heavy week:
- Total input tokens: ~34.26 million; Average per day: ~4.89 million input tokens/day.
- Of those per day (on average): ~4.56 million tokens/day cached; ~340 thousand tokens/day uncached.
- So roughly: 93% cached, 7% uncached, but still millions of tokens per day going through the system.
In that week, the actual outcome:
- I shipped a few (~3) relatively simple features.
- Most of my time was spent: reading and reviewing generated code, deciding what to keep, and cleaning up structural issues.
What I’ve Tried To Improve Codex Output
- With autocontext on, the behavior I see is:
- The CLI pulls in a lot of irrelevant files.
- Token usage goes through the roof.
- Quality does not go up in line with tokens. Often it gets worse because the model tries to be “global” and overreaches.
- So I’m paying in tokens and attention for the model to stare at things it should ignore.
- To get better control/coherence and reduce tokens, I’ve tried:
- Manually attaching only the files I care about.
- Claude‑style project MD files:
- Architecture overview,
- Domain language,
- Explicit boundaries (“this stays in backend”, “this belongs in frontend”, etc.).
- The issue here is the same as with any documentation: you can only keep so much of it current.
- Tight prompts:
- Clear separation of concerns,
- Rules about where each piece of logic should live.
- Stepwise instructions:
- “Change only A.”
- “Now adjust B to match A.”
- “Now update C.”
- I experimented with the above with autocontext both on and off, but it has not helped much with autocontext on. With autocontext off, it does improve correctness to a degree and gives me:
- Much lower token usage.
- Fewer hallucinated APIs.
- Slightly better alignment with my structure.
- Less random movement of logic across layers.
- I get more controlled output, which is easier to review.
But compared to a clean vanilla chat prompt, the output quality is worse for the same task. My guess is the CLI’s built‑in system prompts and assumptions push Codex in a direction I am not really comfortable with.
The small side bar inside VS Code makes this worse:
- Tiny space to review large diffs in the side pane.
- I have to open each file in diff mode in the VS Code editor pane and review it file by file, rather than seeing the change as a whole.
- Hard to see the “shape” of the change.
Experience With Chat + Files
Usage Pattern
The workflow is pretty simple here:
- I attach exactly the files I want the model to see.
- I often feed the same spec I used with Codex. I prefer not to have an elaborate system prompt. The same tricks I used to improve Codex output (tight prompts, stepwise changes) work here too.
- The model returns code in a big, readable chat pane. The UX for reading/reviewing code is much nicer.
- I decide what to paste, what to ignore, and I do the integration myself.
The energy required to review is still there, but because the flow is much more controlled, it’s not as bad as with the Codex agent. Motivation levels are much better when things feel in control.
Token Profile
For a chat+files‑heavy week, my logs show:
- Total input tokens: ~2.35 million; Average per day: ~330 thousand input tokens/day.
- Of those per day (on average): ~118 thousand tokens/day cached; ~218 thousand tokens/day uncached.
So:
- Absolute usage was about 14-15X lower than the Codex week (0.33M vs 4.89M per day).
- A smaller fraction was cached (~35% cached, ~65% uncached), because I was intentionally giving only the relevant files instead of pouring the whole repo into the context.
In that week, I shipped more features (~8), including a couple of complex, multi‑language features, plus several smaller ones.
- Subjectively, I had:
- fewer motivation dips,
- less “what do I do with this giant patch” anxiety.
Same underlying models, very different way of driving them.
Specific Use‑Case Comparison
Go Tests
For tests, especially Go tests, my experience is clear:
- Codex
- Acceptance rate maybe 50-60%, sometimes 70% on a good day.
- It often:
- misses good table‑driven patterns,
- overcomplicates setups,
- or guesses too much about how I structure tests.
- Chat+files
- Acceptance is closer to 90%. Good days go to 95% acceptance.
- My typical prompt is simple: “Use stdlib only.”, “Use table‑driven tests.”, “Cover N happy cases, N positive border cases, N negative border cases, empty and nil cases.”, etc.
- I attach:
- the implementation file,
- any relevant interfaces or helpers.
In chat mode, the main issues I see are due to model training lag, not misunderstanding:
- Old Go patterns (e.g. `for _, tc := range tcs { tc := tc; ... }`); almost all are `modernize`‑style issues.
- Slightly outdated file permission constants (e.g. writing files with permissions broader than `0o600`).
- Minor style differences from my current lint rules.
Those are easy enough to clean up. Bottom line: my days of writing tests are objectively better with chat+files. It saves time, has great output, without destroying my motivation.
Frontend
My frontend stack on these projects is: React, Tailwind CSS, DaisyUI, React Router, Vite.
In a chat+files workflow:
- The models do a good job:
- JSX structure is fine.
- Tailwind and DaisyUI usage is generally correct.
- Routing with React Router is reasonable.
My loop looks like:
- Attach the key page components and related files.
- Ask for: a new page, or a refactor, or a state change, with dummy API implementations.
- Get the code, fix linting and static analysis issues myself.
- Review state management carefully:
- Where state lives.
- How props flow.
- How side‑effects are triggered.
- If I’m not happy with state boundaries:
- Update the spec,
- Ask for a revised version.
- Run the dev server, tweak styling and alignment, then commit.
In this mode, the models are useful. The mistakes are fixable, and I stay in control.
With Codex CLI + autocontext on the same stack:
- It tends to:
- Over‑pull context (maybe it even includes things from `node_modules`?).
- Try to be “smart” about global patterns and state management.
- Blow up token usage.
- State handling gets messy:
- State is placed in weird components.
- Responsibilities jump between layers.
- Local view logic and shared logic are mixed.
CSS and basic alignment still come out okay, but the state management is often bad enough that I don’t want to keep it.
So again:
- Chat+files: I get usable frontend code with some targeted cleanup.
- Codex CLI with autocontext: too much noise, too much code to unpick, and a noticeable motivation drop.
DB Design, DB Interactions, and API DTOs
For DB schema design, DB interactions, and API DTO design, my acceptance rate is low in both Codex and chat:
- I’d put it around 30-40% in either mode.
For these tasks:
- I care a lot about: naming, consistency across layers, how changes will age over time.
- By the time I describe all that clearly enough in a prompt, in many cases I might as well write the schema and core DTOs myself.
So for core DB and API design:
- I now prefer to:
- design the schema manually,
- write the main DTOs,
- then use LLMs to generate repository boilerplate, simple mapping functions, and trivial handlers.
Once I already have DB models, generating API DTOs is much safer.
Both Codex and chat do better here because:
- The data shape is fixed.
- The model has less room to “invent” structure.
I still prefer chat+files:
- I attach the schema / models.
- I ask for: DTO structs, mappers, basic validation stubs.
Quality is usually acceptable, and I only need to adjust details.
Incremental Functions in a Single File
For incremental work in a single file (small helpers, refactors, etc.):
- Codex CLI and chat are both technically fine.
- The big difference is process and motivation.
With Codex CLI:
- It often wants to touch more than I asked for.
- I end up reviewing larger diffs than I’d like.
- After several iterations, I feel like I’m chasing it instead of using it.
With a chat window:
- I paste or attach a single file.
- I say: “Change only this function,” or “Add a helper for X, do not modify Y and Z.”
- The output is usually: more scoped, easier to diff, less mentally draining.
So even when output quality is similar, I strongly prefer chat because it doesn’t push me into reading and debating large code blocks.
Tooling: Lint, TS/ESLint/Prettier/Knip, package.json, GitHub Actions
For tooling enhancements:
- Lint rules.
- `tsconfig`, ESLint, Prettier, Knip, and `package.json` cleanup.
- GitHub Actions workflows.
I’ve found both Codex and chat not very reliable.
Reasons:
- These ecosystems move fast.
- There are always new: lint rules, plugin versions, GitHub Action variants.
- Models often: propose outdated patterns, overcomplicate configs, or change things I don’t want changed.
The common pattern:
- The model “does something” that superficially looks sophisticated.
- I don’t like: the approach, the trade‑offs, or the level of magic.
- I end up rewriting or heavily editing it.
At that point, I might as well do it myself from the start or copy from current docs and adapt.
So for tooling, I treat LLM suggestions as ideas, not code I’m likely to accept directly.
How I Use LLMs For Codegen Now
Given all this, my current rules of thumb are:
- Avoid Codex CLI for whole‑feature generation.
- It burns tokens, rewrites too much, and I don’t like the code I get.
- Use a separate chat window with file attachments.
- Better UI.
- Better control over what the model sees.
- Easier to review and reason about.
- Use LLMs where it clearly works for me.
- Go tests (unit and integration) with simple, strict prompts.
- Frontend JSX + Tailwind + DaisyUI + routing, as long as I own state design.
- DTOs on top of already‑defined DB models.
- Small incremental changes in single files.
- Be very cautious with:
- DB schema design.
- Cross‑layer API and DTO design.
- Tooling and CI configuration.
- Keep design ownership.
- LLMs fill in code around a design I already trust.
Under this approach:
- My motivation stays more stable.
- Token usage is still non‑trivial, but it’s tied to real progress.
- I’m not constantly stuck choosing between refactoring code I don’t like or deleting a big LLM‑generated patch and killing my mood.
Other people will have different thresholds and workflows. For me, the key is simple: I have to feel like I still own the design and the code.
When the tool helps with that, I use it. When it fights that, I stop, because the hidden cost is my motivation, and that’s not something I’m willing to burn for the sake of a big code dump that looks like progress but doesn’t feel like it.