GitHub Copilot Used to Be Great. Here's What Actually Changed.

JOURNAL · DEVELOPER CULTURE · 2026.07

Copilot used to be genuinely good.

Model cycling, context amnesia, and the Ryz Labs study that put a number on what developers already suspected.

By Emcy · Founder, CodeCulture · 7 min read

Confidently wrong: the Copilot failure mode developers started naming in 2024.

The GitHub Community thread titled "Is Copilot slowly getting worse?" has been running since 2023. It has not closed. New replies appear regularly, from developers at different levels, across different stacks, describing the same experience: something that felt genuinely useful two years ago now feels unreliable in ways that are hard to pin down but easy to recognize. This is an attempt to pin them down.

The frustration is not that Copilot is bad. It is that Copilot was noticeably better at a specific point in time, and the model cycling that has happened since then has not recovered that quality. Understanding why requires looking at the history, not just the current state.

GitHub Copilot Getting Worse: The Original Codex API and Where It Went

When GitHub Copilot launched in 2021 using OpenAI's Codex API, the developer reaction was genuine surprise. The model had been trained specifically on code, and it showed. Completions were contextually relevant, idiomatic, and often finished a function the way the developer would have written it themselves. That early version set the expectations the product has been trying to meet ever since.

The transition away from Codex toward GPT-4 series models introduced the first round of regression complaints. GPT-4 is a more capable general model, but "more capable" does not always translate to "better at tab-completion in a specific language." Copilot's value had always been narrow and precise: suggest the next syntactically correct, contextually appropriate code unit. GPT-4's broader training introduced different tradeoffs.

Each subsequent model change brought similar dynamics. A newer, more powerful general model would replace the previous one. Some capabilities improved. Others regressed. From a developer-experience perspective, the net result was not a straight upgrade. It was a lateral move: different failure modes replacing old ones, not fewer failure modes overall.

What the Ryz Labs Study Actually Found About Copilot Accuracy

Ryz Labs conducted a six-month study on GitHub Copilot's accuracy in real-world codebases. Their finding: Copilot was accurate approximately 50% of the time on projects exceeding 10,000 lines of code. On smaller codebases, performance was notably better. The accuracy drop appears to be related to context limitations: Copilot's ability to reason about distant code relationships degrades as the codebase grows.

50% accuracy sounds like a number you could work with. The problem is how that 50% distributes. A completion that is obviously wrong is easy to catch and discard. A completion that looks correct, compiles cleanly, and fails at runtime under specific conditions is a different category of problem. Ryz Labs did not publish a breakdown of completion types, but experienced Copilot users know the distribution from practice: the dangerous completions are the ones that pass a quick visual review.
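To make the distribution point concrete, here is a rough back-of-the-envelope sketch. Ryz Labs did not publish a breakdown, so `plausible_share` (the fraction of wrong completions that would pass a quick visual review) is a hypothetical parameter, not a measured value, and the function name is ours.

```python
def plausible_wrong_count(suggestions: int, accuracy: float, plausible_share: float) -> float:
    """Expected number of wrong suggestions that still look right on a quick read.

    accuracy        -- fraction of suggestions that are correct (0.5 in the study's large-codebase figure)
    plausible_share -- HYPOTHETICAL fraction of the wrong suggestions that pass visual review
    """
    wrong = suggestions * (1.0 - accuracy)
    return wrong * plausible_share


if __name__ == "__main__":
    # Try a few assumed shares to see how much the split matters at a fixed 50% accuracy.
    for share in (0.1, 0.3, 0.5):
        print(f"share={share}: {plausible_wrong_count(200, 0.5, share):.0f} of 200 look fine but are wrong")
```

The headline accuracy stays the same in every row. What changes is how many wrong completions reach review looking trustworthy, which is the number that determines how much checking a developer has to do.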

The study methodology involved real project codebases rather than synthetic benchmarks, which makes it more applicable to developer experience than most academic AI coding evaluations. The results align with what the GitHub Community thread has been describing in qualitative terms for over a year.

Ryz Labs found Copilot accuracy at approximately 50% on projects over 10,000 lines. The failure mode that matters is not the obvious wrong answer. It is the answer that looks correct.

The Specific Failure Modes That Appeared After Model Cycling

Three failure modes come up repeatedly in developer reports, each more concerning than the last.

Suggesting npm Packages That Do Not Exist

This is the failure mode that most clearly illustrates the danger of confident-but-wrong output. Copilot has been documented suggesting npm package names that do not exist on the public registry. A developer who does not verify that the package exists may install the wrong one or, in a more dangerous scenario, install a package that a malicious actor has published to claim the nonexistent name. That is a security problem, not just a quality problem.

The package hallucination issue is not unique to Copilot. It is a general problem with large language models applied to code generation. But Copilot's inline integration means developers encounter it in the flow of writing code, where the cognitive mode is building rather than verifying.
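One way to put a verification step back into the flow is a small script that asks the public npm registry whether a name exists before anything is installed. The sketch below is a minimal version, assuming the registry answers an unknown package name with HTTP 404 at `https://registry.npmjs.org/<name>`. The helper names (`registry_url`, `registry_status`, `missing_packages`) are ours, not part of any tool.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List

REGISTRY = "https://registry.npmjs.org/"


def registry_url(name: str) -> str:
    """Build the registry metadata URL; scoped names need the slash percent-encoded."""
    return REGISTRY + urllib.parse.quote(name, safe="@")


def registry_status(name: str) -> int:
    """Return the HTTP status code the registry gives for this package name."""
    try:
        with urllib.request.urlopen(registry_url(name), timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we care about here.
        return err.code


def missing_packages(
    names: Iterable[str],
    status_of: Callable[[str], int] = registry_status,
) -> List[str]:
    """Return the names the registry reports as not found (HTTP 404)."""
    return [name for name in names if status_of(name) == 404]
```

Calling `missing_packages(["react", "some-suggested-name"])` returns only the names that do not resolve. A name that does resolve is not automatically safe: a squatted package exists by definition, so publisher, age, and download history still deserve a look.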

Deprecated Libraries and APIs

Copilot's training data has a cutoff. When it suggests a library or API pattern that has since been deprecated, it is not making an error in the traditional sense. It is accurately reflecting what was correct at training time. But a developer using the suggestion in a new project is introducing technical debt on day one. This failure mode is particularly common in fast-moving ecosystems like React, where patterns from 2022 training data are often actively discouraged in 2026.
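A cheap guard against this is a deny list that flags known-deprecated patterns before a suggestion gets committed. The following is a minimal sketch, not a replacement for a real linter. The two React entries are examples we picked for illustration, and a team would maintain its own list.

```python
import re
from typing import List, Tuple

# Hand-maintained, illustrative deny list: regex -> note shown to the developer.
DEPRECATED_PATTERNS = {
    r"\bReactDOM\.render\(": "ReactDOM.render is deprecated as of React 18; use createRoot",
    r"\bcomponentWillMount\b": "componentWillMount is a legacy lifecycle method",
}


def flag_deprecated(source: str) -> List[Tuple[int, str]]:
    """Return (line number, note) for every line that matches a deny-list pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, note in DEPRECATED_PATTERNS.items():
            if re.search(pattern, line):
                findings.append((lineno, note))
    return findings
```

Run against an accepted completion or a staged diff, this turns "the model learned it from old code" into a visible warning at the moment the suggestion lands, rather than a discovery during a later upgrade.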

Context Amnesia on Large Files

Developers working on files with thousands of lines report that Copilot's suggestions often fail to account for code defined earlier in the same file. A function signature defined at line 200 may be ignored when Copilot is generating code at line 800. This is a context window limitation, not a model quality issue per se, but the effect from the developer's perspective is the same: a suggestion that would have been correct if the model had seen the full context.
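The line 200 versus line 800 example can be modeled with a deliberately simplified sketch. It assumes the model sees only a fixed number of lines ending at the cursor. Copilot's real context selection is more involved and not fully public, so `window_lines` and both function names are hypothetical.

```python
def visible_range(cursor_line: int, window_lines: int) -> range:
    """Lines a model would see if its context were just the last `window_lines` lines up to the cursor."""
    start = max(1, cursor_line - window_lines + 1)
    return range(start, cursor_line + 1)


def in_context(definition_line: int, cursor_line: int, window_lines: int) -> bool:
    """True if a definition at `definition_line` falls inside the assumed window."""
    return definition_line in visible_range(cursor_line, window_lines)


if __name__ == "__main__":
    # A signature defined at line 200, with the cursor at line 800.
    print(in_context(200, 800, window_lines=500))  # window covers lines 301-800
    print(in_context(200, 800, window_lines=700))  # window covers lines 101-800
```

Under this toy model, the practical workarounds developers already use follow directly: keep files smaller, or keep the relevant definitions close to (or open alongside) the code being written.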

GitHub Copilot Getting Worse: The "Confidently Wrong" Problem

The phrase "confidently wrong" has become the community shorthand for Copilot's most frustrating failure mode. It refers to completions that are syntactically valid, stylistically appropriate, and semantically incorrect in ways that survive a quick read. The model does not express uncertainty. It does not add a comment suggesting the developer verify the logic. It presents the wrong answer with the same formatting and visual confidence as the right one.

This is a specific kind of danger in a tool designed for flow state productivity. Cursor, Claude Code, and similar tools have started addressing this by providing explanation layers alongside completions. When a model says "here is what I am doing and why," the developer has a checkpoint to evaluate. When the model just writes the code and moves on, evaluation requires the developer to slow down and think critically, which is the exact cognitive mode the tool is supposed to reduce.

The GitHub Community discussions on Copilot are a useful real-time temperature check on this. The volume and consistency of "confidently wrong" reports have not decreased as models have been updated. That is the data point worth watching.

Why Model Upgrades Do Not Always Feel Like Upgrades

The assumption embedded in the "Copilot is getting worse" framing is that the trajectory is downward. The more accurate description is that the trajectory is inconsistent. Each model upgrade changes the distribution of failure modes rather than reducing them uniformly. A capability that improves with a new model often comes with a different capability that regresses.

For a developer who relied on a specific Copilot behavior, a model upgrade that improves general performance but breaks that specific behavior is experienced as a downgrade. Their workflow is disrupted. The fact that aggregate benchmarks improved does not change the lived experience of the regression.

GitHub has not published a changelog of which underlying models power Copilot at any given time, which makes it difficult for developers to correlate their experience with specific model transitions. That opacity contributes to the "is this getting worse?" uncertainty. When you cannot identify what changed, you cannot evaluate whether the change is permanent or temporary.

The Asked ChatGPT shirt is the correct response to a Copilot suggestion you are not sure about. Not acceptance. Not rejection. A second opinion from a different model, applied critically, is a reasonable workflow when your primary tool's accuracy is uncertain.

Frequently Asked Questions

Is GitHub Copilot actually getting worse or is it a perception issue?

Both explanations have partial validity. Ryz Labs found accuracy falling to approximately 50% on projects over 10,000 lines, with notably better performance on smaller codebases. Model cycling has introduced new failure modes alongside improvements. But some of the "getting worse" perception also reflects developers' increasing baseline expectations. As alternative tools have improved, the comparison point has shifted, making Copilot's existing limitations more noticeable than they were in 2021.

What caused GitHub Copilot quality regressions in 2024 and 2025?

The primary driver appears to be model cycling. GitHub has transitioned Copilot through multiple underlying models since the original Codex API deployment. Each transition has brought different failure modes. The original Codex model was trained specifically on code, which produced narrow but precise completions. Later general-purpose models improved some capabilities while introducing regressions in contextual accuracy and completion relevance for complex, multi-file projects.

Does GitHub Copilot work well on large codebases?

According to the Ryz Labs six-month study, Copilot's accuracy drops to approximately 50% on projects exceeding 10,000 lines of code. On smaller, well-structured codebases, performance is significantly better. The degradation appears related to context limitations: Copilot's ability to account for code defined in distant files, or earlier in the same large file, decreases as the project grows beyond what its context window can effectively process.

Can Copilot suggest npm packages that do not exist?

Yes. This has been documented by multiple developers and is a known failure mode of large language models applied to code generation. Copilot can suggest package names that do not exist on the npm registry, with the same visual confidence as a valid suggestion. The safest practice is to verify any new package name in the npm registry or your package manager before accepting a Copilot suggestion for a dependency you have not used before.

Should developers stop using GitHub Copilot in 2025?

Not necessarily. On small-to-medium codebases, Copilot's tab-completion remains a meaningful productivity tool. The recommendation from the developer community is to use it with appropriate verification habits: check package names, review deprecated API suggestions, and never accept a completion that introduces a new dependency without confirming that dependency exists and is actively maintained.
