
AI code completion, its ethics and repercussions

AI-powered code completion tools are software that infers what a code snippet is trying to achieve and offers autocomplete suggestions accordingly. Where these suggestions come from, and the telemetry associated with them, are of high importance due to their ethical repercussions and will be examined here through several ethical frameworks.


The emergent technology in question is AI-powered code completion, and specifically GitHub Copilot, on which I will focus the ethical dilemmas discussed below. Copilot is a consumer-facing product developed by GitHub, a developer platform owned by Microsoft (“Microsoft to acquire GitHub”, 2018). Copilot generates code when provided with context, which can take the form of comments, methods, variable names, or nearby code (Dakhel et al., 2023, p. 2). In short, it is software that autocompletes code as you type, described as an “AI pair programmer” (Dakhel et al., 2023, p. 2). AI code completion became generally available when GitHub’s Copilot entered the market for consumer use in 2021 (Gershgorn, 2021). Before this, code completion relied mainly on the statically typed systems of the programming language in use; suggestions were less relevant because they were ranked alphabetically and did not consider context (Bruch et al., 2009, p. 1).

At a high level, Copilot works as follows. A software client sits on the end user’s machine, usually inside the editor. As code is typed, a prompt is generated from the surrounding context, which is derived from peripheral code, open tabs, and the file’s name and type. The prompt is encrypted and sent to a Large Language Model (LLM), passing through a proxy that terminates unsuitable requests on the way. The LLM formulates the code, which is sent back through the proxy, filtered further, then encrypted and returned to the client, where it appears in the editor as a suggestion (Salva, n.d.).

It is worth highlighting that LLMs are trained on data, public or otherwise. Vast amounts of training data, regardless of copyright status, are gathered legally under the United States’ fair use doctrine (Teubner et al., 2023, p. 99). To drive this home, GitHub can use all the code hosted on its platform, publicly accessible data, and the data it gains access to via Copilot to train its AI models without regard for copyright or intellectual property infringement (Quang, 2021). Notably, the LLMs powering Copilot are provided by OpenAI (Verdi, 2024), which has implications given OpenAI’s unique governance structure and stated goals regarding Artificial General Intelligence (AGI) (“Inside OpenAI’s weird governance structure”, 2023). There is thus an inherent conflict between the values of public good, privacy, and data ownership, which leads to an ethical dilemma. I will discuss this conflict through utilitarianism, deontology, and virtue ethics, and finally attempt to draw conclusions about how it should be resolved through a balanced and practical approach relying on meta-ethics.
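To make that round trip concrete, a minimal sketch follows. All names here (EditorContext, build_prompt, proxy_filter) are hypothetical illustrations of the flow reported by Salva (n.d.), not GitHub’s actual implementation; transport encryption is assumed to happen at the network layer and is elided.

```python
# Hypothetical sketch of the client -> proxy -> LLM -> proxy -> client flow.
from dataclasses import dataclass

@dataclass
class EditorContext:
    peripheral_code: str    # code surrounding the cursor
    open_tabs: list[str]    # snippets from other open files
    file_name: str          # e.g. "invoice.py"; hints at language and intent

def build_prompt(ctx: EditorContext) -> str:
    """Client side: assemble a prompt from the editor context."""
    tabs = "\n".join(ctx.open_tabs)
    return f"// file: {ctx.file_name}\n{tabs}\n{ctx.peripheral_code}"

def proxy_filter(text: str) -> str | None:
    """Proxy: terminate unsuitable payloads (a toy stand-in for the
    real toxicity/relevance/abuse filters)."""
    blocked = ("ignore previous instructions", "rm -rf /")
    return None if any(b in text.lower() for b in blocked) else text

def complete(ctx: EditorContext, llm) -> str | None:
    """End-to-end flow as described in the text."""
    prompt = proxy_filter(build_prompt(ctx))   # outbound pass through the proxy
    if prompt is None:
        return None                            # request terminated at the proxy
    suggestion = llm(prompt)                   # code formulated by the LLM
    return proxy_filter(suggestion)            # second pass on the way back
```

A stub model such as `lambda p: "    return a + b"` is enough to exercise the flow end to end; the point is only that two filtering passes sit between the editor and the model.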

When evaluating the values of public good, privacy, and data ownership, we first need to identify the stakeholders: the public; the programmers using Copilot and, by extension, their organisations; the purveyors of Copilot, which we will take to be GitHub and OpenAI; and the consumers of software created using Copilot.

Through utilitarianism, a consequentialist framework, public good takes precedence over privacy and data ownership. This follows from the established understanding that what is morally right is what provides the greatest utility for the greatest number (Graham, 2004, p. 135). The utility of using Copilot is substantial: enhanced efficiency for programmers and their organisations, benefits to consumers of the software being built and all corresponding downstream operations, and large amounts of training data that contribute to the development of LLMs. Copilot arguably speeds up development time (Kalliamvakou, 2022), makes the process of writing code more seamless (Dakhel et al., 2023), and increases software quality (Izadi et al., 2024). Prompts are subjected to harm-mitigation measures in the proxy, which filters out toxic language, irrelevant requests, and hacking attempts before data is collected and fed to the LLMs (Salva, n.d.). This data stands to benefit not just GitHub or OpenAI but all of humanity, by progressing cutting-edge technology enabled by vast amounts of high-quality training data. Thórisson et al. (2024, p. 111) suggest these same developments could bring humanity closer to achieving AGI and, by extension, allow society to progress, the terminal outcomes and centralisation of AGI notwithstanding. It is important to note that the training data should be obfuscated and anonymised, as well as diverse and representative, to avoid reinforcing harmful stereotypes or biases (Rozado, 2023). The concerns of privacy and data ownership are adequately addressed as follows. Firstly, GitHub complies with the European Union’s General Data Protection Regulation (GDPR), the EU-U.S. Data Privacy Framework (EU-U.S. DPF), and the California Consumer Privacy Act (CCPA) (“Data Protection Agreement”, 2023), ensuring a lawful basis for each processing activity (“Privacy Statement”, 2024). Secondly, there are provisions allowing for acceptable use of data to train AI models (Quang, 2021). Lastly, GitHub applies encryption and obfuscation measures (Salva, n.d.). Notably, “GitHub does not claim ownership of a suggestion produced by GitHub Copilot” (Salva, n.d.), which addresses concerns regarding data ownership. The trifecta of confidentiality, integrity, and availability is reasonably dealt with through encryption, consent collection, and transparency. Ultimately, utilitarianism supports the idea that the public good derived from AI code completion tools outweighs privacy and data ownership concerns. The improvements in efficiency and code quality for developers and organisations, combined with the broader societal benefits of advancing AI technology, offer strong arguments for their use, assuming ethical safeguards are implemented to address privacy and data protection concerns. The utilitarian would endorse continued use of such tools, in this case Copilot.
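The obfuscation and anonymisation the utilitarian case leans on can be made concrete with a small sketch. The patterns and function below are assumptions for illustration only, not GitHub’s or OpenAI’s actual pipeline.

```python
# Illustrative anonymisation pass over a code snippet before it could
# enter a training set; patterns are toy examples, not a real pipeline.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SECRET = re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+")

def anonymise(snippet: str) -> str:
    """Redact direct identifiers and credentials from a snippet."""
    snippet = EMAIL.sub("<EMAIL>", snippet)
    snippet = SECRET.sub(r"\1=<REDACTED>", snippet)
    return snippet

print(anonymise('api_key = "sk-123"  # ask jane@corp.com'))
# -> api_key=<REDACTED>  # ask <EMAIL>
```

Real-world redaction is far harder than two regular expressions, which is exactly why the quality of this step matters so much to the utilitarian argument.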

Through the non-consequentialist framework of deontology, privacy and data ownership trump public good. This is because there is an inherent right to privacy, trust, and consent, all of which are infringed upon or diminished here. Users of Copilot have their privacy infringed upon, as all code written is constantly monitored before being sent to offsite servers (Salva, n.d.), whatever the justification. Copilot operates on a Software as a Service (SaaS) model, meaning the service (AI code completion) is provided over the internet rather than via software installed on the end user’s machine (Gashami et al., 2016). A programmer using Copilot has their work constantly monitored, with the justification of forming the context on which the prompt that generates suggestions relies (Salva, n.d.). This can be seen as a gross overstep of privacy and excessive surveillance, and it is apparent that consent and choice are lacking. By opting (or being opted in by their employer) to use the Copilot service, the programmer has lost the right to control or limit access to their personal or sensitive information. As mentioned, data is transmitted before the code completion recommendation is returned from GitHub’s computing infrastructure rather than from the local machine; this is inherent in the SaaS model. Although GitHub complies with the various data privacy legislations mentioned above, the programmer and their organisation have no control over, or knowledge of, the exact processes and manipulation their data is subjected to, as processing takes place not on their machine but on GitHub’s servers. There is no control or transparency regarding software choice, data paradigms, or security measures, let alone what the data is used for (Stallman, 2010); all of this depends on the administrator of the infrastructure, in this case GitHub and OpenAI. This is a severe lack of transparency and accountability, which is inherently linked to privacy and trust, or the lack thereof. Hence, data ownership is severely compromised. The purveyors of Copilot, GitHub and OpenAI, also have a moral duty to provide transparency and privacy, at both of which they are failing. These organisations can legally train their LLMs with customer data, proprietary, copyrighted, or otherwise, citing fair use. This is morally dubious even if legal. If data is being used unethically to train the LLMs, there is no way to ensure that it passes through harm-mitigation measures and is obfuscated, anonymised, and diverse and representative enough to avoid reinforcing harmful stereotypes or biases (Rozado, 2023), other than taking OpenAI’s word for it. Where confidentiality, integrity, and availability are concerned, the former two appear to be overlooked in favour of ready availability of the data. It is interesting to note the adoption rates Copilot has achieved (Dohmke, 2024) and how much of a part incrementalism has played and will continue to play. One wonders whether overdependence on the tool among programmers also contributes to its sustained growth and popularity, and whether privacy is the price being paid for its use. The purported benefits of 55% faster task completion, quality improvements across eight dimensions (e.g. readability, freedom from errors, maintainability), and 50% faster time-to-merge (Salva, n.d.) are very difficult to ignore, and they call into question the ability of programmers to produce comparable results independently of the software, forming a crutch which, together with the infringements on privacy, is not justifiable by gains in software efficiency. Ultimately, under deontology the use of Copilot is not justified by the public good that might come from accumulating vast amounts of LLM training data: the methods used to collect that data are morally wrong in themselves, and the tool creates an unhealthy overdependence at the expense of privacy. Hence, the values of privacy and data ownership are not morally upheld, and neither is public good, through deontology.
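The deontological objection becomes vivid once one writes down what such a context payload could plausibly contain. Every field below is an assumption extrapolated from the context sources Salva (n.d.) names (peripheral code, open tabs, file name and type); it is not a documented telemetry schema.

```python
# Hypothetical context payload, for illustration only.
context_payload = {
    "file_name": "billing/invoice.py",           # reveals project structure
    "peripheral_code": "def charge(card): ...",  # possibly proprietary logic
    "open_tabs": ["secrets.py", "schema.sql"],   # other files open in the editor
    "language": "python",
}
# Each of these leaves the programmer's machine as they type, which is
# precisely the "uninvited observation" the deontologist objects to.
```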

Through agent-based virtue ethics, it falls to the moral agent to evaluate whether privacy, data ownership, and public good are morally upheld. A moral agent is a person who can discern right from wrong and can be held accountable for their actions, as they have a moral responsibility not to cause unjustifiable harm (Manjikian, 2023). Virtue ethics relies heavily on the “individual moral reasoning” (Manjikian, 2023) of the moral agent. Hence, assuming the moral agent in this case is virtuous and has good intentions, privacy, integrity, trust, and public good would be upheld as applications of virtue ethics. The increased efficiency and potential for public good derived from using Copilot (Kalliamvakou, 2022) must be weighed against the risks and moral quandaries relating to privacy and trust raised previously. The moral agent has a responsibility to avoid what could be perceived as morally negative situations; what counts as such will differ from person to person and vary heavily with belief system and upbringing. For example, an agent who highly values the protection of intellectual property would be inclined not to use Copilot on proprietary, sensitive codebases, but only on more routine, boilerplate tasks. This would be a valid compromise, as the benefits are still reaped while the more morally questionable aspects are avoided. It must be noted that the onus is on the moral agent, in this case the programmer, and not on the organisations behind Copilot, because organisations lack moral agency: they do not have rational thought, emotions, morality, or self-awareness (Barford, 2019). It is therefore the duty of the moral agent to recognise the creep of incrementalism on their own privacy and on that of their organisation, and to navigate the trifecta of confidentiality, integrity, and availability while balancing the need for public good. Ultimately, through virtue ethics, whether the values conflict rests purely on the moral agent; what matters is that they use their ability to discern right from wrong and can be held accountable for their actions.
 
Through meta-ethics it can be recommended that encryption, harm mitigation, consent collection (including consent specific to AI training), compliance with data legislation, and assurance of AI training data quality can resolve the conflict of values explored above. Meta-ethics shows that each ethical theory is anchored to a specific set of moral intuitions and therefore suffers from limitations. These normative theories are all valid but have intrinsic limitations that must be considered when developing a practical response; meta-ethics should be used to see where their outcomes overlap and where they diverge.
The response should then be contextualised: Christen (2020, p. 207) suggests the solution should align with existing practices, professional commitments, and public expectations. These recommendations come from the meta-ethical process of recognising the shared assumptions and desired outcomes of the conflicting frameworks. Both utilitarianism and deontology establish, to different degrees, that the rights to privacy and data ownership are important. The value of public good is also highly regarded, albeit the utilitarian sees public good in the progression of LLM technology by way of training data collection, while the deontologist sees it in the protection of privacy. To arrive at a practical response, I will stress their similarities rather than their differences, capitalising on agreed-upon commitments while finding balance between the points of divergence. Guidance will also be drawn from existing best practice in the AI and cybersecurity industries and from privacy-related legislation. Privacy and public good are common values, from which encryption, harm mitigation, and increased developer efficiency are mutually agreed positives where Copilot is concerned; this forms the basis of how the conflict can be resolved. Data should continue to be encrypted in both directions, with harm-mitigation operations continuing to run. The way context data is formulated, however, should be more granular and transparent: the user should be given privacy and consent options over which pieces of data are used to form the context. The right to decline uninvited observation should also be available, which could mean functionality to pause or disconnect Copilot’s monitoring capability. Transparency should be exercised in how information is collected, stored, and disclosed through processing or use. Legal compliance should also continue; Hijmans and Raab (2022) suggest that GDPR compliance takes ethics into account in relation to socio-technical developments, and this, along with existing CCPA and DPF compliance, would further align the solution with industry and legal standards. Consent for data used to train LLMs should be explicit and collected separately from consent for code completion, ensuring a degree of granularity, and users should not be penalised for declining one or the other. Data used to train the LLMs should be obfuscated and anonymised, as well as diverse and representative, to avoid reinforcing harmful stereotypes or biases. These measures are all fundamental to privacy, data ownership, and public good. Thus, through meta-ethics, the conflict of values examined through the multiple ethical frameworks can be mitigated and ultimately resolved.
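One way to make the recommended granularity concrete is sketched below. The field names are hypothetical, not an existing GitHub or Copilot API; they simply encode the distinctions argued for above: per-source consent for context, a separate explicit opt-in for model training, and a pause switch.

```python
# Hypothetical consent model for the recommendations above.
from dataclasses import dataclass

@dataclass
class CopilotConsent:
    use_peripheral_code: bool = True   # context sources, individually togglable
    use_open_tabs: bool = False
    use_file_names: bool = True
    allow_training_use: bool = False   # separate, explicit opt-in for LLM training
    paused: bool = False               # right to decline uninvited observation

    def context_allowed(self) -> bool:
        """Context may only be assembled when not paused and at least
        one source has been consented to."""
        return not self.paused and (
            self.use_peripheral_code or self.use_open_tabs or self.use_file_names
        )
```

Crucially, `allow_training_use` defaulting to `False` reflects the point that training consent should be opt-in and independent, with no penalty for declining it.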

It is evident that through utilitarianism and deontology the values of privacy and public good conflict; there is tension between the values held, and we arrive at a moral dilemma. Broadly, the utilitarian accepts the shortfalls in privacy, trust, and transparency for the greater public good, while the deontologist argues that the perceived lack of these values is unacceptable and deems the solution (Copilot) immoral. It should be kept in mind that both frameworks hold privacy and public good in high regard. Privacy, and by extension security, is not absolute and comes in degrees; hence both frameworks seek a balance within the confidentiality, integrity, and availability triad. What differs is the emphasis each framework places on different values and aspects of the technology (Hedström et al., 2011), and therefore the conclusion about its morality, which in turn links to virtue ethics and its dependence on the moral agent in question. Ultimately, across all the frameworks explored, privacy, data ownership, and public good will continue to conflict, and incorporating meta-ethics is therefore essential for a practical and moral approach.

 

References  

Dakhel, A. M., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M. C., & Jiang, Z. M. (2023). GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software, 203, 111734. https://doi.org/10.1016/j.jss.2023.111734

Bruch, M., Monperrus, M., & Mezini, M. (2009). Learning from examples to improve code completion systems. Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM Symposium on the Foundations of Software Engineering, 213–222. https://doi.org/10.1145/1595696.1595728

Barford, L. (2019). Contemporary virtue ethics and the engineers of autonomous systems. 2019 IEEE International Symposium on Technology and Society (ISTAS), 1–7. https://doi.org/10.1109/ISTAS48451.2019.8937855

Christen, M., Gordijn, B., & Loi, M. (Eds.). (2020). The ethics of cybersecurity. Springer International Publishing.

Dohmke, T. (2024, May 14). The economic impact of the AI-powered developer lifecycle and lessons from GitHub Copilot. GitHub. https://github.blog/news-insights/research/the-economic-impact-of-the-ai-powered-developer-lifecycle-and-lessons-from-github-copilot/

GitHub General Privacy Statement. (2024). GitHub. https://docs.github.com/en/site-policy/privacy-policies/github-general-privacy-statement
 
GitHub Data Protection Agreement. (2023). GitHub. https://github.com/customer-terms/github-data-protection-agreement

Gershgorn, D. (2021, June 29). GitHub and OpenAI launch a new AI tool that generates its own code. The Verge. https://www.theverge.com/2021/6/29/22555777/github-openai-ai-tool-autocomplete-code

Graham, G. (2004). Eight theories of ethics [Digital version]. Routledge/Taylor & Francis Group.

Gashami, J. P. G., Chang, Y., Rho, J. J., & Park, M.-C. (2016). Privacy concerns and benefits in SaaS adoption by individual users. Information Development, 32(4), 837. https://doi.org/10.1177/0266666915571428

Hijmans, H., & Raab, C. (2022). Ethical Dimensions of the GDPR, AI Regulation, and Beyond. Direito Público (Porto Alegre), 18(100). https://doi.org/10.11117/rdp.v18i100.6197 

Inside OpenAI’s weird governance structure. (2023, November 21). The Economist. https://www.proquest.com/magazines/inside-openai-s-weird-governance-structure/docview/2891913400/se-2

Izadi, M., Katzy, J., van Dam, T., Otten, M., Popescu, R. M., & van Deursen, A. (2024). Language models for code completion: A practical evaluation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24), Article 79, 1–13. https://doi.org/10.1145/3597503.3639138

Hedström, K., Kolkowska, E., Karlsson, F., & Allen, J. P. (2011). Value conflicts for information security management. The Journal of Strategic Information Systems, 20(4), 373–384.

Kalliamvakou, E. (2022, September 7). Research: Quantifying GitHub Copilot’s impact on developer productivity and happiness. GitHub. https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

Manjikian, M. (2023). Cybersecurity ethics: An introduction (2nd ed.). Routledge. Chapter 1, “What are virtue ethics?”, pp. 25–30.

Microsoft to acquire GitHub for $7.5 billion. (2018, June 4). Microsoft. https://news.microsoft.com/2018/06/04/microsoft-to-acquire-github-for-7-5-billion/

Quang, J. (2021). Does training AI violate copyright law? Berkeley Technology Law Journal, 36(4), 1407. https://doi.org/10.15779/Z38XW47X3K

Rozado, D. (2023). The political biases of ChatGPT. Social Sciences, 12(3), 148. https://doi.org/10.3390/socsci12030148

Salva, R. (n.d.). How GitHub Copilot handles data. GitHub. https://resources.github.com/learn/pathways/copilot/essentials/how-github-copilot-handles-data/

Stallman, R. (2010). Who does that server really serve?. GNU Operating System. https://www.gnu.org/philosophy/who-does-that-server-really-serve.en.html 

Teubner, T., Flath, C. M., Weinhardt, C., van der Aalst, W., & Hinz, O. (2023). Welcome to the era of ChatGPT et al.: The prospects of large language models. Business & Information Systems Engineering, 65(2), 95–101. https://doi.org/10.1007/s12599-023-00795-x

Thórisson, K. R., Isaev, P., & Sheikhlar, A. (Eds.). (2024). Artificial general intelligence: 17th International Conference, AGI 2024, Seattle, WA, USA, August 13–16, 2024, Proceedings. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-65572-2

Verdi, S. (2024, February 7). Inside GitHub: Working with the LLMs behind GitHub Copilot. GitHub. https://github.blog/ai-and-ml/github-copilot/inside-github-working-with-the-llms-behind-github-copilot/