Notion × Amplify × Vercel partnership

2026 AI Engineering Survey

Confident in the gains, uneasy about the foundation. How more than 1,000 engineers, founders, and product leaders are actually building with AI in 2026.

Productivity is a settled question; durability is an open one.

The strongest signal in this data isn’t a single number — it’s a tension. Engineers are sold on the speed and the satisfaction. They’re far less sure the foundation under all that velocity will hold.

88%

say they’re much more productive with AI

76%

report higher personal job satisfaction

59%

expect AI‑generated code to create major long‑term liabilities

50%

reject that AI software engineering is “a solved problem”

The picture is of a workforce that has wholeheartedly adopted the tools and is now quietly worried about the bill that comes due: in skills, in review load, and in brittle systems. Six tensions trace that fault line across models, agents, evaluation, roles, and beliefs.

Against the grain.

The engineers who hand agents the most autonomy aren’t the most careless; they’re the most rigorous. The quarter who allow full autonomy run formal evaluations at a higher rate (75% vs 66%) and lean on more evaluation methods to do it. Autonomy here is earned through measurement, not granted by inattention.

Grains image - Two-part bar chart of AI's effects. Most positive effects: more experimentation 45%, ship more / faster 31%, freed for architecture 19%. Most negative effects: skill erosion 29%, higher review burden 28%, brittle / incident-prone code 22%. What's the most positive + negative downstream effect of AI-accelerated development for your org? · n=798

A builder population.

Read everything here through one lens: this is a particular, fast-moving slice of the field. It’s the crowd already living in the deep end. 1,053 people responded, and they skew heavily toward founders and engineers at small, AI-native companies:

Roles image - Bar chart ranking respondents by role share: Founder / CTO 23%, AI Engineer 19%, Fullstack Engineer 12%, Eng Manager / Tech Lead 9%, Product Manager 9%. Which best describes your current role? · n=1,053

Roughly 68% are at companies of 100 people or fewer. These are seasoned software engineers, relatively new to AI (a median of eight years writing code, three in AI/ML). The gap is broad: even among those with 10+ years in software, 51% have three years or less in AI. The one place it disappears is the leading edge — for the newest engineers, learning to build and learning to build with AI are the same thing.

Where conviction meets caution.

Six tensions surface again and again. Each time, a confident majority leans into what AI makes possible while a sizable share keeps one eye on the catch — and more often than not, it’s the same person, holding both at once.

Adoption image - Six opposing-pair bars contrasting attitudes. Productivity vs. liability: 88% much more productive vs. 59% expect major liabilities. Quality vs. cost: 67% pick models on accuracy vs. 75% cost is a tension. Evals rigor vs. feel: 35% run formal tests vs. 61% rely on "vibe checks." Agents power vs. trust: 25% grant full autonomy vs. 75% keep a human in the loop. Roles blur vs. core: 81% role is blurring vs. 39% trust-work is unchanged. Expertise paradox: 72% feel on top of AI vs. 29% fear skill erosion. Left = adoption/optimism · Right = caution/unease · n varies 762–1,053

The gains aren’t in dispute; what they rest on is.

88% feel markedly more productive and 76% report more job satisfaction. There’s no real debate about the gains. The doubt is structural: 59% expect today’s AI-generated code to create major long-term liabilities, and a heavier review burden (28%) and more brittle, incident-prone systems (22%) rank among the costs they name most.

They select for accuracy, then get reined in by the bill.

Accuracy is the top model-selection criterion (67%), comfortably ahead of cost (53%), yet in practice roughly three in four say cost reins in how ambitiously they use AI. The model they’d choose and the model they can afford to run aren’t always the same.

Sophisticated stacks, still judged by feel.

These teams run real production systems (retrieval, orchestration, monitoring), but the most common way they check outputs is manual review and “vibe checks” (61%), ahead of any formal method. Evaluation is also the layer of the stack most often named as the hardest to get right (20%). The tooling has outrun the measurement.

Agents get the keys, not the wheel.

Among the teams that use agents, 65% grant write access but keep a human in the loop; only 25% hand over full autonomy. The holdup isn’t whether agents are capable; it’s whether they can be trusted to run unwatched. They still hallucinate (61%) and lose the thread mid-task (47%), so they don’t get to drive fully unsupervised.

The boundaries dissolve; the human core holds.

81% say their role is blurring into product, design, and marketing, yet the judgment at the center of the job is exactly what AI has left least touched. The edges are melting; the core isn’t. The democratization has a ceiling, too: only about a third of non-developers ship anything customer-facing, and just 17% do so regularly, so the production surface stays gated by engineers.

Feeling on top of it, fearing it dulls them.

72% feel they have the sources they need to keep up with AI. They don’t feel left behind. Yet what they fear most is that the same tools are quietly eroding their skills: skill erosion is the negative effect they name most often (29%). They feel on top of the field, and they worry about exactly that.

The rest of the story is how these tensions play out: in the models teams pick, the agents they deploy, the way they check their work, and where AI lands in the pipeline.

Models, agents, evaluation, and impact.

Models & stacks

Closed models are still the ground everyone builds on: 94% use proprietary models like GPT, Claude, or Gemini, with open weights a real but secondary presence (37% as-is, 17% fine-tuned). When choosing a model, accuracy (67%) outranks cost (53%), and agentic / tool-calling capability (53%) now ties cost as the second-ranked criterion.

Models image - Bar chart ranking the top 6 criteria for choosing a model: accuracy / quality 67%, agentic capability 53%, cost 53%, privacy / data 33%, ecosystem / developer experience 20%, reliability 20%.

What are your top considerations when choosing a model for production? select up to 3 · n=1,004

More than 80% are actively multi-model: routing by task type (44%), running several and comparing (26%), or trying a cheaper model first and escalating (11%). Yet roughly three in four say cost reins in how ambitiously they use AI.

Zoom out from individual models to the broader toolset, and an early pull toward consolidation shows up: a slim majority (56%) report standardizing on fewer, better-vetted tools, even as a sizable minority (37%) deliberately keep the stack flexible. A wave of standardization may be forming.

Routing image - Stacked bar chart of model-routing strategies, with 81% running more than one model: route by task 44%, compare 26%, cost tier 11%, single model 13%, other 6%.

For a given task, how do you decide which model to use? · n=1,001
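
Two of the strategies in the chart above, routing by task type and trying a cheaper model first before escalating, can be sketched in a few lines. This is a minimal illustration, not how any respondent implements it: the `Model` class, the confidence score each model returns, and the stub models are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical model handle: a name, a rough per-call cost, and a callable
# returning (answer, confidence). A real client would wrap a provider SDK.
@dataclass
class Model:
    name: str
    cost_per_call: float
    call: Callable[[str], tuple[str, float]]


def route_by_task(task_type: str, routes: dict[str, Model], default: Model) -> Model:
    """Route by task type: pick the model registered for this kind of work."""
    return routes.get(task_type, default)


def cascade(prompt: str, tiers: list[Model], min_confidence: float = 0.8) -> tuple[str, str, float]:
    """Cost tier: try the cheapest model first, escalate while confidence is low.

    Returns (answer, name of the model that produced it, total cost spent).
    """
    if not tiers:
        raise ValueError("need at least one model tier")
    spent = 0.0
    answer, used = "", ""
    for model in sorted(tiers, key=lambda m: m.cost_per_call):
        answer, confidence = model.call(prompt)
        spent += model.cost_per_call
        used = model.name
        if confidence >= min_confidence:
            break  # good enough; stop before paying for a larger model
    return answer, used, spent


# Stub models so the sketch runs without any API access (assumed behavior:
# the small model is only confident on short prompts).
small = Model("small", 0.001, lambda p: ("short answer", 0.9 if len(p) < 40 else 0.4))
large = Model("large", 0.030, lambda p: ("careful answer", 0.95))
```

Note the trade-off the cascade makes: an escalated request pays for both calls, so it only saves money when most traffic stops at the cheap tier.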

The model was the bottleneck. It no longer is — more often than not, it’s the tools it has access to.
Founder / CTO

Text is the substrate nearly everyone builds on: 96% are actively working with it. From there the frontier falls off fast — image 62%, audio 44%, video 25%. The more telling figure is the backlog behind each: audio and video carry the highest latent demand of any modality, with roughly a third of teams planning to add what they don’t yet use. Multimodal, for now, is a near-future bet more than a present-day default.

Modalities image - Stacked bar chart of AI modality adoption by status (working well, early traction, planned, no plans). Share working well: text 96%, image 62%, audio 44%, video 25%.

Which AI modalities are you actively building with at work? · n ≈ 970–1,000

Agents

Agents are now near-universal: 95% of teams run them internally. Among those teams the story is motion: full autonomy has nearly doubled in a year, from 13% to roughly 25% of agent users. Yet the dominant posture still holds: let it act, but check its work, with write access typically paired with a human in the loop. Controls are layered on top (human approvals 73%, access controls 60%, planning steps 48%, retrieval 47%, memory 47%), and almost nobody (3%) leans on prompting alone.

Autonomy image - Horizontal scale labeled "less autonomy" to "more autonomy" showing how much autonomy respondents grant AI: 10% read-only, 65% write with a human in the loop, 25% full autonomy.

What level of tool permissions do your agents have? · n=930
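
The three permission levels on that scale map naturally onto a gate in front of each tool call. The sketch below is one hedged way to express it; `Permission`, `run_tool`, and the `approve` callback are hypothetical names invented for illustration, and `approve` stands in for whatever review step a team actually uses.

```python
from enum import Enum
from typing import Callable


class Permission(Enum):
    READ_ONLY = "read-only"
    WRITE_WITH_APPROVAL = "write with a human in the loop"
    FULL_AUTONOMY = "full autonomy"


def run_tool(
    tool: Callable[[], str],
    is_write: bool,
    permission: Permission,
    approve: Callable[[str], bool],
    description: str,
) -> str:
    """Run a tool call under one of the three permission levels.

    `approve` is a placeholder for a human review step (a chat prompt,
    a ticket, a CLI confirmation); it is not a real library API.
    """
    if not is_write:
        return tool()  # reads are allowed at every level
    if permission is Permission.READ_ONLY:
        return "blocked: agent is read-only"
    if permission is Permission.WRITE_WITH_APPROVAL and not approve(description):
        return "blocked: reviewer declined"
    return tool()  # approved write, or full autonomy
```

Under `FULL_AUTONOMY` the `approve` callback is never consulted, which is why the remaining controls (access scoping, evaluation) carry more weight at that level.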

The failure modes explain the caution: agents still hallucinate (61%) and lose the thread mid-task (47%), and both are cited well ahead of poor reasoning (37%).

Failures image - Bar chart ranking the top AI failure modes: hallucinations 61%, loses context mid-task 47%, poor reasoning (logic) 37%, lacks the right tools 30%, needs heavy supervision 27%.

The leash comes off once the instructions are in.

The 25% who grant agents full autonomy aren’t the reckless ones; they’re the more rigorous ones. They use formal eval methods at a higher rate than the human-in-the-loop majority (75% vs 66%), run more eval methods each (3.1 vs 2.7), and are likelier to be shipping customer-facing features (59% vs 54%). Autonomy here is earned through measurement, not granted by carelessness; the evaluation and autonomy tensions are really one.

[I used to believe] that bigger context windows would solve context loss. They don’t: the bottleneck was never raw tokens, it was structure. Compression and retrieval discipline beat window size every time.
AI Engineer

Evaluation & monitoring

Even this advanced crowd still grades by gut: manual review and “vibe checks” lead, ahead of any formal method, and one in nine has no formal eval process at all. Tellingly, evaluation is also the #1 named stack challenge (20%), ahead of orchestration and inference cost. The working toolkit for shaping behavior is prompting (80%), tool-use tuning (59%), and RAG (49%); fine-tuning remains a minority practice (26%). The meter-watching doesn’t stop at deploy: cost is the second thing teams track in production (48%), right behind whether it actually works (56%). The bill stays on the mind long after ship.

Evaluation image - Bar chart of methods used to evaluate AI output: manual / vibe check 61%, human preference 45%, llm-as-judge 41%, unit / integration tests 36%, golden datasets 26%, production metrics 26%, no formal eval 11%.

How are you evaluating AI outputs today? · n=893
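
For teams moving from vibe checks toward a formal method, a golden dataset is one of the simpler places to start. The sketch below is an assumption-laden illustration rather than a recommended harness: the dataset entries, `toy_system`, and the exact-match scoring rule are all made up for the example, and real evaluations usually need looser matching or a judge model.

```python
from typing import Callable

# Hypothetical golden dataset: (input, expected output) pairs a team has
# reviewed by hand. These entries are invented for illustration.
GOLDEN = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't fail."""
    return " ".join(text.lower().split())


def run_golden_eval(system: Callable[[str], str], golden: list[tuple[str, str]]) -> dict:
    """Score a system against a golden dataset by normalized exact match."""
    failures = []
    for prompt, expected in golden:
        actual = system(prompt)
        if normalize(actual) != normalize(expected):
            failures.append({"input": prompt, "expected": expected, "actual": actual})
    total = len(golden)
    passed = total - len(failures)
    return {"pass_rate": passed / total if total else 0.0, "failures": failures}


def toy_system(prompt: str) -> str:
    """Stand-in for a model call; deliberately gets one answer wrong."""
    answers = {"2 + 2": "4", "capital of France": " paris ", "3 * 3": "6"}
    return answers.get(prompt, "")


report = run_golden_eval(toy_system, GOLDEN)
```

Keeping the failing cases in the report, not just the pass rate, is what makes a run like this usable as a regression check before a prompt or model change ships.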

Where AI lands in the pipeline.

AI hits hardest where the work starts — the blank page and the plan — but its reach runs deep, all the way through to CI/CD and the backend.

None of this is felt evenly. Split the sample by how close people sit to the code, and the central tension comes into sharper focus.

Pipeline image - Flow diagram tracing AI's impact across development stages: planning 48%, dev / IDE 57%, app UI 40%, backend 28%, ci / cd 32%, infra 18%.

Where in your development and deployment pipeline has AI had the biggest impact? · select up to 3 · n=831

The mirror image is just as telling. Asked where AI has changed their work least, engineers name the human and judgment-heavy core: hiring, onboarding, and knowledge transfer (39%), architecture and system design (29%), and security review (27%) — the mentoring and system-level judgment that doesn’t reduce to typing.

Untouched image - Bar chart ranking the top 5 tasks AI leaves largely untouched: hiring / onboarding 39%, architecture / design 29%, security review 27%, debugging hard bugs 20%, writing specs 17%.

What parts of your job has AI had the least impact on? · select up to 3 · n=810 · separately, 28% said no area was untouched: AI changed nearly everything

Today’s convictions, tomorrow’s bets.

Two questions on the same scale: where the field stands today, and where it’s heading in five years. The present, they’re sure about. The future, less so — their forecasts thin into a row of increasingly hedged bets.

Where they stand today

Worry isn’t flat across experience. The newest engineers are the most sanguine: the most likely to call it “solved” (36%) and the least likely to fear long-term liabilities (54%). That flips fast: a little time in the field, and the liability worry climbs to 62% before leveling off. A little experience is enough to sober people up; more doesn’t add much.

That’s where they stand today. Asked to look five years out, the same group gets bolder.

Beliefs image - Stacked bar chart of agreement with statements about AI's impact on work. "Much more productive than before" 88% agree / 5% disagree / 7% unsure; "AI created more job satisfaction" 76% / 12% / 12%; "have the sources to keep up" 72% / 13% / 15%; "AI code → major liabilities" 59% / 19% / 23%; "productivity → more hiring" 38% / 29% / 34%; "SWE is 'a solved problem'" 32% / 50% / 18%.

Rate each statement · n≈765

[I used to believe] AI-written code should be reviewed every line; now it’s more about reviewing the high-level concepts and a few key parameters.
AI Engineer

The five-year view

They’re bullish on AI-driven research and on AGI-as-announced, ambivalent about whether today’s labs and closed models keep their lead, and genuinely split on architecture. They’re more productive today, and expect the ground to keep moving under them.

Outlook image - Stacked bar chart of agreement with 5-year predictions. "AI generates novel research" 71% agree / 12% disagree / 16% unsure; "a lab declares AGI" 67% / 14% / 19%; "agents buy without approval" 56% / 27% / 17%; "today's labs still lead" 52% / 17% / 30%; "most SOTA is closed-source" 47% / 19% / 34%; "most SOTA is non-transformer" 40% / 9% / 34%; "more AI compute runs in space" 36% / 38% / 27%.

In 5 years… · n≈770

What they used to believe, and no longer do.

If they expect the ground to keep shifting, they’ve already felt it move. Asked what they once believed about AI engineering and no longer do, their answers land on the tensions this report has traced. So we’ll let them have the last word.

On replacement

“[I thought] that it would replace me, like everybody thought — but now I see it more like a tool to make my life easier.”

— Fullstack Engineer

On discipline

“It’s not magic — there’s a way of thinking that cannot be compromised.”

— Founder

On what it amplifies

“I thought it would make people smarter. Usually it makes their weaknesses louder — the output is just multiplied.”

— Data Scientist

On the system, not the prompt

“I used to think great prompts were enough. Now I know robust AI products live or die on evaluation, data, and orchestration — not just model choice.”

— Fullstack Engineer

On the nature of the work

“That it’s not software engineering — but it turns out it’s just a specialised form of it.”

— AI Engineer

On the new advantage

“I used to believe that existing software developers are advantaged, but now I realize the main advantages are creativity, agency, and ambition.”

— AI Engineer

The discipline is the product.

If there’s one through-line, it’s a line from the section above: there’s a way of thinking that can’t be compromised.

For builders

The leverage is real, but durability is the discipline that protects it. Treat evaluation as a first-class layer of the stack rather than an afterthought. The teams shipping agents with the most confidence are the ones measuring them most rigorously.

For engineering leaders

The role is converging, and the work AI has changed least is the human-trust work: hiring, architecture, security review. That is where to concentrate senior judgment. And watch skill erosion: the same people reporting the biggest productivity gains name it their top worry.

For tool-makers

The bottleneck users actually feel is reliability and memory, not raw capability: hallucinations and lost context top the failure list. Evaluation is the single hardest, most-wanted part of the stack; whoever makes rigorous eval easy earns the trust that unlocks autonomy.

Methodology

Insights for this report were derived from survey responses collected from 1,053 respondents between May 16 and June 15, 2026.

Multi-select questions report the share of respondents choosing each option, so the shares for a question can sum to more than 100%.

Single-select questions report the share of those who answered, and denominators vary by question (~1,053 at the top, ~770–890 mid-survey, ~640 on open text).

Year-over-year figures compare with the 2025 edition, which was fielded to a different partner audience, so treat those deltas as directional.
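
To make the multi-select arithmetic concrete, here is a small sketch of how per-option shares can be computed so that they sum past 100%. The function name and the five sample responses are invented for illustration; they are not survey data, and the report's actual tabulation code is not shown here.

```python
def option_shares(responses: list[set[str]]) -> dict[str, float]:
    """Percent of answering respondents who picked each option."""
    # Respondents who skipped the question drop out of the denominator.
    answered = [picks for picks in responses if picks]
    n = len(answered)
    counts: dict[str, int] = {}
    for picks in answered:
        for option in picks:
            counts[option] = counts.get(option, 0) + 1
    return {option: 100 * count / n for option, count in counts.items()} if n else {}


# Made-up responses to a "select up to 3" question.
responses = [
    {"accuracy", "cost"},
    {"accuracy", "agentic"},
    {"cost"},
    set(),  # skipped the question
    {"accuracy", "cost", "agentic"},
]
shares = option_shares(responses)
```

Here four people answered, so three picks for an option is a 75% share, and the shares across options add up to 200%.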