Analysis of 10,000 Malicious GitHub Repositories


I found 10,000 repositories on GitHub that distribute Trojan malware. They aren't forks of a single malicious project, and they aren't all coming from one bad actor. They're spread across different contributors with different names, all just sitting there in plain sight.

I stumbled into this by accident. I have a project on GitHub and wanted to see if search engines had indexed it, so I typed the project name into Google. My repository showed up in the results, but it was surrounded by a weird pattern of others that looked almost identical in structure but served a completely different purpose.

It's a clever play on trust. We tend to treat GitHub as a safe harbor for code, but these repos use that reputation to trick developers into downloading payloads. I spent some time digging through the data to see how deep the hole goes. The results are honestly a bit unsettling.

The scale of the infection

The malware shows up in 10,000 repositories. The attackers didn't use a sophisticated exploit; they relied on the sheer volume of automated uploads to hide their tracks. They targeted a wide range of packages, focusing on ones that look like legitimate utilities but are rarely maintained, where a malicious update is more likely to go unnoticed.

The Trojans are hidden using basic obfuscation and dynamic loading. The most common pattern encodes the malicious payload in Base64, then decodes it at runtime and runs it through an exec or eval call (sometimes reached through getattr). This is a classic trick to bypass static analysis tools that only look for known malicious strings. It's not clever, but it works because most developers don't audit every line of a dependency update.

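import base64

# Harmless stand-in payload: it decodes to print('hi'). Real samples carry attacker code here.
payload = "cHJpbnQoJ2hpJyk="
# Decode at runtime and execute, so the real code never appears in the source as plain text.
exec(base64.b64decode(payload))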
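The decode-then-execute pattern above is easy to hide from a string search but fairly easy to spot structurally. Here is a minimal sketch of a checker built on Python's ast module. The function names and the list of decoder names are my own choices for illustration, not something taken from an existing tool, and a determined attacker can evade a check this simple.

```python
import ast

# Built-ins that run strings as code, and decoders commonly paired with them.
# Both sets are illustrative assumptions, not an exhaustive list.
EXEC_NAMES = {"exec", "eval"}
DECODER_NAMES = {"b64decode", "b32decode", "a85decode", "unhexlify"}


def call_name(node: ast.Call) -> str:
    """Return the bare name of the called function, e.g. 'b64decode'."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def find_decode_and_exec(source: str) -> list:
    """Return line numbers of exec/eval calls in source that also calls a decoder.

    The source is only parsed, never executed.
    """
    tree = ast.parse(source)
    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    if not any(call_name(c) in DECODER_NAMES for c in calls):
        return []
    return sorted(c.lineno for c in calls if call_name(c) in EXEC_NAMES)
```

Running find_decode_and_exec over each .py file in a dependency update gives you a short list of lines to read by hand. It will not catch a loader that reaches exec through getattr with a computed name, so treat an empty result as "nothing obvious", not "clean".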

What makes this hard to pin down is that the attackers named the payloads differently from one repository to the next. Some looked like legitimate telemetry data, while others were disguised as configuration files. That inconsistency makes it hard to write a single signature that catches all of them.

The infection vectors are:

  • Typosquatting on popular library names
  • Compromised maintainer accounts
  • Direct injection into public CI/CD pipelines
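Of the three vectors, typosquatting is the easiest to check for mechanically. The sketch below compares a package name against a list of names you already trust, using difflib from the standard library. The KNOWN_PACKAGES list and the 0.85 threshold are assumptions I picked for the example; you would substitute your own dependency list and tune the cutoff.

```python
from difflib import SequenceMatcher

# Hypothetical allowlist; in practice this would be your own dependency list.
KNOWN_PACKAGES = ["requests", "numpy", "pandas", "urllib3"]


def likely_typosquat(name: str, known=KNOWN_PACKAGES, threshold: float = 0.85):
    """Return the known package that `name` closely resembles, or None.

    Exact matches return None because they are the real package.
    """
    candidate = name.lower()
    if candidate in known:
        return None
    best, best_ratio = None, 0.0
    for pkg in known:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        ratio = SequenceMatcher(None, candidate, pkg).ratio()
        if ratio > best_ratio:
            best, best_ratio = pkg, ratio
    return best if best_ratio >= threshold else None
```

For example, likely_typosquat("reqeusts") points back at "requests", while an unrelated name such as "flask" returns None. This only helps with lookalike names; it says nothing about compromised maintainer accounts or poisoned CI/CD pipelines.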

What the Trojans actually do

The malware is a credential stealer that focuses on browser data and session tokens. Once it executes, it searches for local SQLite databases used by Chrome and Edge to store passwords and cookies. It doesn't try to encrypt your files or lock your screen; it just quietly copies your identity markers and sends them to a remote server.

The command-and-control (C2) setup is basic. It uses a hardcoded IP address and a specific port to upload the stolen data via HTTP POST requests. Oddly, the malware also runs the data through a custom XOR cipher before sending it, which is a clumsy attempt to hide the traffic from network monitors. It's not real encryption, but it's enough to get past a filter that only matches on plaintext patterns.
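To show why XOR barely counts as protection, here is a small sketch of a repeating-key XOR and how a known plaintext prefix gives the key away. I don't know the key length or the payload format the malware actually uses, so the key, the JSON-like plaintext, and both function names below are assumptions made up for the demonstration.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key. Applying it twice restores the input."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def recover_key(ciphertext: bytes, known_prefix: bytes, key_len: int) -> bytes:
    """Recover a repeating XOR key from a known plaintext prefix.

    Works because ciphertext XOR plaintext equals the key, byte for byte.
    """
    if len(known_prefix) < key_len:
        raise ValueError("known prefix is shorter than the key")
    return bytes(c ^ p for c, p in zip(ciphertext[:key_len], known_prefix))


# Made-up key and plaintext, purely for illustration.
demo_key = b"\x5a\x13\x7f"
demo_plain = b'{"hostname": "demo-pc", "username": "demo"}'
demo_cipher = xor_bytes(demo_plain, demo_key)
```

If an analyst can guess how the plaintext starts, recover_key(demo_cipher, b'{"hos', 3) returns the key and the rest of the capture decodes with xor_bytes. That is why a scheme like this hides traffic from pattern matching but not from anyone who looks closely.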

The way it grabs browser data is essentially a file copy from a few well-known paths:

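import os
import shutil

# Local State is a JSON file in Chrome's profile directory on Windows.
source = os.path.expandvars(r'%LOCALAPPDATA%\Google\Chrome\User Data\Local State')
destination = r'C:\temp\stolen_state.json'

# Copy the file (with its metadata) to a staging location if it exists.
if os.path.exists(source):
    shutil.copy2(source, destination)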

The malware targets these specific items:

  • Browser cookies
  • Saved passwords
  • Discord tokens
  • System metadata (hostname and username)
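Because the targets are so specific, they double as things to search for when triaging a suspicious repo. The sketch below scans text for a hardcoded IP and port, references to browser profile files, and a Discord storage path. The patterns are my own rough guesses at useful indicators, not signatures extracted from the samples, and they will produce false positives (a version string followed by a port-like number, for instance).

```python
import re

# Illustrative indicators only; not signatures taken from the actual samples.
INDICATORS = {
    "hardcoded_ip_port": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{2,5}\b"),
    "browser_profile_path": re.compile(r"User Data|Login Data|Local State", re.IGNORECASE),
    "discord_storage": re.compile(r"discord[\\/].*leveldb", re.IGNORECASE),
}


def scan_text(text: str) -> dict:
    """Map each indicator name to the 1-based line numbers where it matches."""
    hits = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in INDICATORS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(lineno)
    return hits
```

A hit is a reason to open the file, not a verdict. Plenty of legitimate tools mention browser paths, and a payload hidden behind Base64 won't match any of these until it is decoded.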

The delivery mechanism

The persistence of these malicious forks shows that GitHub's reporting tools aren't built for this kind of volume. When a bad actor clones a popular repo and swaps a few lines of code for a credential stealer, they aren't breaking any traditional "spam" rules. They're just using the platform's core functionality, forking, to distribute malware. I think the community's frustration with the slow takedown process is justified. It's a failure of moderation at scale.

This matters for anyone who grabs a "community fix" or a niche utility from a fork without auditing the diffs. Most developers trust the green checkmarks and the star counts of the original repo, forgetting that a fork is a completely different entity. I'm not sure if GitHub can actually solve this without implementing a level of automated code scanning that would annoy the hell out of legitimate developers.
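Auditing the diff doesn't have to mean reading the whole fork. As a rough sketch, assuming you have both the original repo and the fork checked out locally, the standard library's filecmp module can tell you which files the fork changed or added. The function name and the shape of the report are my own invention for this example.

```python
import filecmp
from pathlib import Path


def changed_files(upstream: str, fork: str) -> dict:
    """Compare two checked-out trees and report what the fork changed.

    Returns relative paths grouped as 'modified' (content differs) and
    'added' (present only in the fork).
    """
    report = {"modified": [], "added": []}

    def walk(node: filecmp.dircmp, prefix: Path) -> None:
        # Compare contents byte for byte instead of trusting size and mtime.
        _, mismatch, errors = filecmp.cmpfiles(
            node.left, node.right, node.common_files, shallow=False
        )
        report["modified"] += [(prefix / n).as_posix() for n in mismatch + errors]
        report["added"] += [(prefix / n).as_posix() for n in node.right_only]
        for name, sub in node.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(upstream, fork, ignore=[".git"]), Path("."))
    report["modified"].sort()
    report["added"].sort()
    return report
```

Calling changed_files("path/to/upstream", "path/to/fork") gives you a list short enough to read before you run anything. New files with innocuous names and modified install scripts are the first places I would look.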

The real question is whether we've reached a point where "trusting the repo" is a dead strategy. If the delivery mechanism for open source is this easily weaponized, we might have to move toward a model where we only trust cryptographically signed commits from verified maintainers.
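For what a signed-commit policy could look like in practice, here is a minimal sketch that asks git for each commit's signature status and flags anything that isn't a good signature. It relies on git's %G? format placeholder; the helper names and the strict "only G passes" policy are my assumptions. It also assumes you have already imported the maintainers' public keys, since a good signature only proves which key signed the commit.

```python
import subprocess

# Status letters printed by git's %G? placeholder. "G" is a good signature;
# everything else (N = unsigned, B = bad, U = unknown validity,
# E = cannot be checked, and so on) is flagged under this strict policy.
TRUSTED_STATUSES = {"G"}


def parse_signature_log(output: str) -> list:
    """Turn lines of '<commit-hash> <status>' into (hash, status) tuples."""
    entries = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            entries.append((parts[0], parts[1]))
    return entries


def untrusted_commits(entries: list) -> list:
    """Keep only commits whose signature status is not in TRUSTED_STATUSES."""
    return [(commit, status) for commit, status in entries
            if status not in TRUSTED_STATUSES]


def audit_repo(repo_path: str, rev_range: str = "HEAD") -> list:
    """Run git log in repo_path and return commits that fail the policy."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H %G?", rev_range],
        check=True, capture_output=True, text=True,
    )
    return untrusted_commits(parse_signature_log(result.stdout))
```

An empty list from audit_repo means every commit in the range carried a signature git could verify. Deciding which keys actually belong to verified maintainers is still a human problem, which is the hard part of this model.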

The systemic risk of trusted ecosystems

The problem here isn't just a few bad actors; it's the inherent trust we place in the "fork" button. We've spent years treating GitHub as a reliable map of what's safe, but these campaigns exploit the gap between a repository's perceived utility and its actual contents. When a useful tool is cloned and subtly poisoned, the social signals—stars, forks, a familiar name—become a mask for the malware. I think we've underestimated how much this relies on developer laziness. We see a repo that looks right and we run the install script without auditing the diff.

The community reaction to this has been a mix of panic and frustration with GitHub's takedown speed, and I get why. But cleaning up the "infected" forks is a game of whack-a-mole that the platform is losing. If the malware is embedded in a way that doesn't trigger basic signature detection, GitHub's manual review process can't possibly scale to the volume of forks being generated.

This matters for anyone running CI/CD pipelines that pull dependencies from third-party forks, but it probably doesn't change much for teams using locked versions and private registries. Still, it leaves a lingering question: how do we verify the integrity of a codebase when the very platform we use to distribute it is being used to hide the payload?
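The reason locked versions help is that a lock can pin the exact bytes, not just a version number. Here is a minimal sketch of that check using hashlib: compute the SHA-256 of a downloaded artifact and compare it with a digest recorded when the dependency was last reviewed. The function names are hypothetical, and where the pinned digest comes from is up to you (pip's hash-checking mode, enabled with --require-hashes, applies the same idea to requirements files).

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: str, pinned_hex: str) -> bool:
    """Return True only if the file's digest matches the pinned value."""
    return hmac.compare_digest(sha256_of(path), pinned_hex.strip().lower())
```

In a pipeline you would call verify_pinned on each fetched artifact and fail the build on False. This catches a swapped or tampered artifact, but it does nothing if the version you pinned was already malicious when you reviewed it.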

Conclusion

The reality is that we've spent years trusting the "official" ecosystem, and this is the cost of that blind faith. We keep treating supply chain attacks like freak accidents, but they're becoming standard operating procedure. When the delivery mechanism is a trusted repository, your security posture is basically a suggestion.

I'm still not convinced that auditing every single dependency is a scalable solution for most teams. So where does that leave us? At what point do we stop trusting the ecosystem entirely and start assuming every third-party package is a potential Trojan?