
Measuring the impact of GitHub Copilot

Ryan Salva // VP of Product // GitHub

Copilot helps developers write better code, faster, and with more joy. We’re continuously learning—alongside our customers and partners—how to measure that impact. For example, both GitHub and outside researchers have observed positive impact from Copilot in controlled experiments and field studies.

Many enterprises quite reasonably ask, “How do I know Copilot is conferring these benefits for my team?” To answer that question, this guide will walk you through a framework for evaluating impact across four stages. 


In this guide, you will learn:

  • How to evaluate Copilot’s impact across four stages, from initial adoption to sustained efficiency.

  • How to use developer surveys and telemetry data as leading indicators of Copilot impact on system-level measures.

  • How to plan and measure for system-level improvements that may result from GitHub Copilot.


GitHub Copilot adoption stages

The first two stages—Evaluation and Adoption—focus on leading indicators as close to the coding activity as possible. They rely on a combination of self-reported data from developers and existing telemetry from GitHub. Employing this strategy makes it possible to both (a) reliably predict future impact, and (b) perform an evaluation with relatively little additional investment in observability.

It’s hard, if not impossible, to objectively measure developer productivity. We’ve found it’s often best to just ask them how they spend their time, what their blockers are, and what tools are helpful. The overwhelming majority of developers who use GitHub Copilot say it’s a valuable tool they use at least once a week, which is certainly enough to justify our investment.

Mark Côté // Director of Engineering, Developer Infrastructure // Shopify

As we enter the third and fourth stages—Optimization and Sustained Efficiency—we ask that engineering leaders integrate their specific organizational goals into the evaluation criteria. During these stages, focus should shift toward increased efficiency (e.g. via measures of cost/effort, time-to-market, and risk) and intentionality (e.g. via measures of value delivered).

At all stages, we recommend measuring at the organization or team level, ideally along the boundaries of a workgroup committing code to the same service or application. This enables teams to evaluate impact for developers performing similar work.

Now, let’s walk through each of the four stages, exploring the specific goals, methodology, relevant metrics, and criteria for moving on to the next stage of measurement.

Evaluation

Goal: Build a technical and business case to adopt/reject GitHub Copilot at scale.

Methodology: Assess leading indicators of impact (close to the coding activity) through developer surveys and user engagement measures.

Relevant metrics (see glossary): Satisfaction using Copilot, Benefits of using Copilot, Challenges of using Copilot, Enablement provided, Average daily active users per month - completions, Average daily active users - chat, Average daily active users - total, Suggestions delivered - completions, Number of acceptances - completions, Lines of code accepted - completions, Total acceptance rate - completions 

Exit criteria: >40% of active trial participants would be “very disappointed” to lose Copilot access (see survey).

During evaluation, most organizations simply want to determine: Is GitHub Copilot worth a scaled deployment?

Ultimately, most teams just want to ship good software quickly and safely. But things like time-to-market, cost-of-delivery, security posture, and talent retention are lagging indicators that take time to evaluate. What’s more, these lagging indicators can also be influenced by outside factors—e.g. staffing changes, business priorities, technical challenges, etc. 

Thus, we recommend considering leading indicators before adding new metrics to your ROI and engineering scorecards. Those leading indicators should include a combination of both self-reported data from engineering surveys and existing telemetry from GitHub. 

Tools and resources

  • Developer surveys: There’s ample research showing that developers are domain experts capable of assessing help versus harm within their own toolchain. We trust developers to critically assess GitHub Copilot’s impact on their own work. Fortunately, surveys are relatively easy to conduct, so they also serve as a quick and reliable harbinger of future impact. The questions in our GitHub Copilot Developer Survey provide insights into how Copilot has helped, what its shortcomings are, and what it is used for.

  • User engagement data: In addition to survey responses, the Copilot Metrics API and Copilot User Management API provide measures related to how your developers are using Copilot. Measures such as Average Daily Active Users and Total Acceptance Rate can help identify where further enablement of your team is required. 
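Total Acceptance Rate can be aggregated from the daily summaries the API returns. The sketch below assumes a payload shaped like the daily usage summary (field names are illustrative; verify them against the current Copilot Metrics API schema before relying on them):

```python
def total_acceptance_rate(days):
    """Aggregate acceptance rate across a list of daily usage summaries."""
    suggested = sum(d["total_suggestions_count"] for d in days)
    accepted = sum(d["total_acceptances_count"] for d in days)
    return accepted / suggested if suggested else 0.0

# Illustrative daily summaries (check the API docs for the exact field names)
days = [
    {"total_suggestions_count": 1200, "total_acceptances_count": 360},
    {"total_suggestions_count": 800, "total_acceptances_count": 280},
]
rate = total_acceptance_rate(days)  # 640 / 2000 = 0.32
```

Aggregating across the reporting window, rather than averaging per-day rates, weights each day by its suggestion volume and avoids skew from low-activity days.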

Adoption

Goal: Targeted teams are enabled and actively using GitHub Copilot.

Methodology: Continue focusing on leading indicators and enablement indicators. There may be early signs of broader system impacts.

Relevant metrics (see glossary): All measures from the Evaluation stage, plus New seats added in billing cycle, Dormant users, Total completed pull requests, Pull requests per developer, Time to merge, PR lead time, Average code review time in hours, Pull request merge rate, Total successful builds, Change failure rate, CI success rate, Open security vulnerabilities

Exit criteria: >80% of committed licenses are assigned and active with neutral-or-better impact to system-level metrics.

As you expand use of Copilot across your organization, in addition to observing the measures used during evaluation, consider using the Copilot Billing and Plans Dashboard to understand where further enablement may be required. 
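The “assigned and active” share behind the Adoption exit criterion can be estimated from seat data. The sketch below assumes seat objects with an ISO-8601 `last_activity_at` field, loosely modeled on the Copilot User Management API seat listing (confirm the exact schema in the API documentation; the dormancy window is a configurable assumption):

```python
from datetime import datetime, timedelta, timezone

def seat_utilization(seats, committed_licenses, now, dormant_after_days=30):
    """Share of committed licenses that are assigned and recently active.

    Seats with no activity, or whose last activity falls outside the
    dormancy window, count as dormant rather than active.
    """
    cutoff = now - timedelta(days=dormant_after_days)
    active = sum(
        1 for s in seats
        if s["last_activity_at"]
        and datetime.fromisoformat(s["last_activity_at"]) >= cutoff
    )
    return active / committed_licenses if committed_licenses else 0.0

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
seats = [
    {"last_activity_at": "2024-06-28T09:00:00+00:00"},
    {"last_activity_at": "2024-04-01T09:00:00+00:00"},  # dormant
    {"last_activity_at": None},                          # never used
    {"last_activity_at": "2024-06-20T09:00:00+00:00"},
]
util = seat_utilization(seats, committed_licenses=5, now=now)  # 2 / 5 = 0.4
```

A utilization of 0.4 against an exit criterion of 0.8 would signal that further enablement, or reassignment of unused seats, is needed before moving to the Optimization stage.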


Insight: Microsoft research finds that it can take 11 weeks for users to fully realize the satisfaction and productivity gains of using AI tools.

In the 2022-2023 Microsoft Copilot Impact randomized controlled trial (RCT), the treatment group (developers with access to Copilot) experienced a raw adoption increase of 26.4% within two weeks when they received enablement reminder emails.


Tools and resources

The following tools and resources support enablement during the Adoption stage and help engineers trialing Copilot during the Evaluation stage.

If you’re not seeing high levels of product fit in your survey responses and API measures, we recommend using follow-up interviews to understand any challenges engineers may be experiencing and to provide additional enablement.

Running your own developer surveys will help you learn about and improve your engineering efforts, but first you should consider your motivation. For example, we have three primary goals: evaluate overall developer tooling satisfaction, prioritize which pain points to address, and chart how well we've done addressing them. These goals guide the questions we ask.

Similarly, it's best to involve a variety of stakeholders in developing your questions. We work with data scientists from Shopify's talent research and insights team, directors and VPs from different departments, and various subject matter experts to create and refine our questions.

Survey fatigue can be a real problem, so we try to keep our surveys succinct. Our big, twice-annual developer surveys take about 15 to 20 minutes to complete, and even then we only send each one to half the company at a time, so no one has to take it twice in a row. We also try to keep surveys relatively simple, often relying on the "5-point Likert" method, where participants rate a statement on a scale of 1-5.

Mark Côté // Director of Engineering, Developer Infrastructure // Shopify

Most organizations are continually seeking to improve their developers’ productivity and effectiveness. You may observe improvements to measures that are already part of your team’s engineering scorecard (e.g. PR lead time, story points, change failure rate), but they shouldn’t be the focus of this stage. Instead, focus on managing change as your team integrates a new tool into their daily practice—in this case, GitHub Copilot.

If you see a regression in any of your engineering scorecard measures during your Copilot deployment, pause to examine the root cause. If Copilot is a contributing factor, adjust your enablement or deployment strategy as necessary. Ensure your developers have received proper training and have access to resources when they need help.

Once you’ve fully integrated Copilot into your team’s daily engineering practice, then you can turn your attention toward realizing system-level improvements specific to your organization’s goals (e.g. reducing time-to-market, cost-of-delivery, risk).

Optimization

Goal: Positive impacts on organization-specific goals.

Methodology: Realize and channel efficiency gains toward positive business outcomes (e.g. faster time-to-market, lower cost-of-delivery).

Relevant metrics (see glossary): All measures from the Adoption and Evaluation stages, plus measures tailored to your organization’s goals.

Exit criteria: >80% of committed licenses are assigned and active with documented positive impacts on the organization’s target system-level measures.

Once your team has adopted GitHub Copilot, you can start to focus on the system-level improvements most important to your organization. Each organization is unique. Some organizations may be meeting their quality targets but want to increase velocity. For others, code smells and technical debt may be a pain point. 

We recommend an overarching three-step process for optimizing and sustaining Copilot impact over the long term:

1. Set your system-level goals

Decide on the system-level targets that are most important to your organization. We recommend a holistic perspective that considers multiple measures and takes a sustainable approach to improvements. Monitor both quality and speed measures, even if one is the main focus, so that you notice any unintended consequences as early as possible. Continue to focus on leading indicators of developer happiness and engagement, as they will underpin system-level improvements. To justify investment and maximize value, some organizations may apply a Business Value Engineering lens to this decision.

As you measure progress towards your system-level goals (see step 3 below), it may be necessary to adjust your goals or provide enablement beyond the implementation of Copilot.


Tip: Your tactical plans should align with the goals you set. Once a critical mass of your team has adopted Copilot (typically, greater than 80%), it's reasonable to expect visible impact on downstream engineering measures – but only if engineering leaders direct surplus capacity toward meaningful goals. If developers don't have adequate direction, the surplus capacity is likely to diffuse across activities and be less visible.


If your organization provides teams with significant autonomy, it may be necessary to record both (a) how much and (b) where surplus capacity is directed. This will maximize your ability to measure impact on system-level measures.

2. Set up or access tools that enable goal measurement

For some organizations, the engineering system extends beyond the GitHub platform. Consider whether you need to leverage additional data sources to enable measurement of your system-level goals. The SPACE framework provides examples of system-level indicators that may be helpful for measuring progress towards your goal. 

3. Monitor measures

Decide on the frequency at which to monitor system-level measures. Often, this may be aligned with your engineering cadence—we recommend checking no more often than weekly or every two weeks. Many teams choose to examine their engineering scorecards during a monthly “Engineering Fundamentals” review meeting. Concurrently, continue to observe the quantitative measures related to Copilot engagement (from the Copilot Metrics API and Billing and Plans Dashboard). We also recommend continuing to run the Copilot survey at least every six months to provide early indicators of any potential concerns related to usage. Alternatively, embed questions relating to GitHub Copilot into your organization’s developer surveys or provide synchronous feedback opportunities to check in on the Copilot experience.
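A simple way to act on a monitoring cadence is to flag when the latest reading of an engagement measure drops meaningfully below its recent baseline. Here is an illustrative sketch (the function, threshold, and data are assumptions, not a GitHub-provided tool):

```python
def flag_regression(history, threshold=0.10):
    """Flag when the latest reading drops more than `threshold` below
    the average of all prior readings."""
    if len(history) < 2:
        return False
    *prior, latest = history
    baseline = sum(prior) / len(prior)
    return baseline > 0 and (baseline - latest) / baseline > threshold

# Hypothetical weekly active-user counts for one team
weekly_active_users = [120, 125, 118, 122, 95]
flag = flag_regression(weekly_active_users)  # ~21% drop vs. baseline -> True
```

A flagged drop is a prompt to investigate, via follow-up interviews or survey questions, rather than a verdict: seasonality, vacations, or team changes can all move these numbers.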


Tip: Many leaders want to quantify a return on investment (ROI) from Copilot. Any impact beyond coding is indirect, and the downstream possibilities are endless. Thus, we must be careful when searching for causality in system-level measures like code coverage, story points delivered, cycle time, etc. It is likely that Copilot will impact various system-level measures, but any number of things (shifting priorities, staffing changes, live site outages, stage of development) can impact system-level measures. When situational dynamics are powerful, Copilot is more likely to be a mitigating factor than a driving force.


Sustained Efficiency

Goal: Continuous evaluation and improvement.

Methodology: As both the team composition and business goals change, adapt and sustain GitHub Copilot’s impact.

Relevant metrics (see glossary): All measures from the Adoption and Evaluation stages, plus measures tailored to your organization’s goals.

Organizations change over time. Employ measures from the Optimization stage as a form of feedback on where changes might be required in your engineering system. Be on the lookout for business factors that may require you to shift priorities, move your targets, and consider alternative system-level measures. Even as you sustain use of Copilot, continue to analyze surveys and quantitative engagement measures to ensure all developers are able to experience Copilot’s impact. Address any identified barriers and provide additional required enablement.

How to learn more

Our approach to measuring Copilot’s impact is informed by the many research studies we have undertaken.

We have consistently observed the need to focus on Copilot enablement, and that system-level impacts can be diverse and are highly dependent on context. This is why we recommend selecting system-level goals only once engineers are experiencing satisfaction and efficiency using Copilot. We also recommend being clear and consistent when communicating to engineering teams about where Copilot productivity gains should be directed.

We are committed to learning with our customers and partners on how we can use AI as part of sustainable, safe, and productive engineering systems. We will continue to share our insights.


Next up, let’s take a look at ways to shape your internal AI policy and governance to promote appropriate and effective use.

Up next: Empower developers with AI policy and governance

Get started with GitHub Copilot