Recently, there has been news that attackers are infiltrating Stack Overflow with ‘helpful’ coding advice that spreads malware. As Bleeping Computer reported,
Cybercriminals are abusing Stack Overflow in an interesting approach to spreading malware—answering users’ questions by promoting a malicious PyPi package that installs Windows information-stealing malware.
For the layperson, this is how it works…
Code Libraries
Nowadays, software developers rarely write code from scratch. They reuse code written by other developers, who reuse code from others, who in turn reuse code from others still. Reusable code is packaged and published in public code repositories like PyPI, the Python Package Index. A lot of this reusable code ends up in production software, apps and other corporate internal systems.
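To make this concrete, a typical Python project declares its third-party dependencies in a file like the hypothetical one below, and each listed package can in turn pull in further packages of its own (the names and versions here are illustrative only):

```
# requirements.txt (illustrative example)
requests==2.31.0    # itself depends on urllib3, certifi, charset-normalizer, idna
pandas==2.1.0       # itself depends on numpy, python-dateutil, pytz
```

A single `pip install -r requirements.txt` fetches all of these, plus their dependencies, from PyPI. This is the chain of trust that supply-chain attackers target.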
The key point to understand is this: many of these code repositories are public. Anyone can contribute code to them. So what is happening is that attackers are contributing malicious code to these public repositories. Software developers are constantly under the pump, so they do not have time to inspect all the third-party code they reuse. As a result, there have been reports of malicious code being discovered within the supply chains of software and apps.
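One common trick is "typosquatting": publishing a malicious package whose name is one letter off from a popular one. The sketch below, which is purely illustrative (the list of popular packages and the similarity threshold are my own assumptions, not a real vetting tool), shows how such lookalike names can be flagged mechanically:

```python
# Sketch: flag package names that closely resemble popular packages.
# The POPULAR list and the 0.85 threshold are assumptions for
# illustration only; real registries use more sophisticated checks.
import difflib
from typing import Optional

POPULAR = ["requests", "numpy", "pandas", "django", "flask"]

def possible_typosquat(name: str, threshold: float = 0.85) -> Optional[str]:
    """Return the popular package that `name` resembles, if any."""
    for known in POPULAR:
        if name == known:
            return None  # exact match: this is the real package
        ratio = difflib.SequenceMatcher(None, name, known).ratio()
        if ratio >= threshold:
            return known  # suspiciously close to a well-known name
    return None

print(possible_typosquat("requestss"))  # → requests (one letter off)
print(possible_typosquat("numpy"))      # → None (the genuine package)
```

A developer in a hurry typing `pip install requestss` would get whatever the attacker uploaded under that name, which is exactly why these lookalike names are worth flagging.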
Stack Overflow
Stack Overflow is a Q&A forum for software developers. It is a fantastic resource for them.
What is happening is that attackers posing as 'helpful' forum users are seeding Stack Overflow with example code that is ultimately malicious. These attackers cannot post code that directly performs malicious actions; it would be too obvious. Instead, they provide code that imports malicious packages from code repositories. Since most people will not take the time to inspect the code inside those packages, the malicious code slips in undetected.
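A cheap defence is to list what a copied snippet would import before ever running it, so an unfamiliar package name stands out. The sketch below parses the snippet with Python's standard `ast` module, which reads the code without executing it; the embedded snippet and the lookalike package name "requestss" are hypothetical examples:

```python
# Sketch: audit code copied from a forum answer by listing every
# top-level module it imports, without running it. The snippet and
# the package name "requestss" are hypothetical examples.
import ast

def imported_modules(source: str) -> set:
    """Return the top-level module names imported by `source`."""
    tree = ast.parse(source)  # parses only; never executes the code
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

snippet = """
import os
import requestss  # one letter off from 'requests'
from requestss.auth import HTTPBasicAuth
"""
print(sorted(imported_modules(snippet)))  # → ['os', 'requestss']
```

Anything in that list you do not recognise deserves a look on PyPI before you install it.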
AI learning
Now, this is where it gets interesting.
The reason why ChatGPT and Microsoft Copilot are clever enough to write code is that they have been trained on coding examples obtained from somewhere. Recently, as this news article reported, OpenAI and Stack Overflow announced a partnership under which the former's Large Language Models (LLMs) get to train on the latter's huge archive of Q&A coding content.
AI writing malicious code
So, if ChatGPT and Copilot train on tainted Stack Overflow content, then it follows that they will end up writing malicious software code.
There is every temptation for time-poor software developers to copy and paste code written by AI. As my friend wrote on LinkedIn,
I’d guarantee malware will end up in production apps this way. There are far too many folks out there, under time pressure to fix and produce, for this not to happen.
Cut and paste code kicks a lot of devs and admins in the face. As a result, you’ll see more compromises and exploits, more data breaches and leaks, and more panic & scramble by vendors to issue patches and updates.
And the code itself doesn’t even have to be malicious by itself, but it can trigger existing flaws and zero-days, reverse existing patches, and make subtle config changes when disguised as an update or a feature.
This will succeed on the fast pace and pressure to “get things done” and get apps to the market or to fix issues – because the very market does not allow the luxury of time to do things right from the start. “If we don’t, someone else will” and that greed and need to be first, fastest, and wealthiest does the rest.
And while we panic and scramble, inadvertently releasing compromised code, we’re distracted – and criminals sit back and wait.
Plus the wildcard of those complacent and time-poor admins and techs that just never seem to get around to maintaining systems because their “leadership” won’t give them resources, time, or even listen.
We will see how OpenAI deals with this new situation. Will they try to solve this looming problem at the root? Or will they take a piecemeal, reactive approach to the problem?
We shall see.