GitGuardian raises $12 million to find sensitive data hidden in online code

GitGuardian, a cybersecurity platform that helps companies detect sensitive data hidden in public and private code repositories, has raised $12 million in a series A round of funding led by London-based Balderton Capital, with participation from GitHub cofounder Scott Chacon and Docker co-creator Solomon Hykes.

Founded out of Paris in 2017, GitGuardian scans all GitHub public activity in real time to identify private data such as database login credentials, API keys, and cryptographic keys. The company works with more than 200 API providers, spanning payment systems, cloud services, messaging apps, crypto wallets, and more, to ensure that private information that does leak out into the public domain is identified swiftly and the company is notified. The French startup said that it has sent out more than 400,000 alerts since its inception.

Secret sauce

The type of private data that GitGuardian is looking to protect is known in the industry as “secrets,” and includes anything, such as passwords and API tokens, that unauthorized third parties can use to access a system (e.g. a cloud service or database).

Behind the scenes, GitGuardian links GitHub-registered developers with their companies, and scans content covering 2.5 million code commits each day in an effort to find usernames and passwords, database connection strings, keys, SSL certificates, and more. To power this, the company said that it uses “sophisticated pattern matching” and machine learning techniques, with its algorithm constantly learning through a “feedback loop” in which developers report how accurate each alert is. In effect, GitGuardian’s clients help improve the technology by telling it whether an alert was valid or not.
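To give a rough sense of what pattern matching for secrets can look like, here is a minimal Python sketch. It is not GitGuardian's detection engine, whose details the company has not published; the pattern names, the `PATTERNS` table, and the `scan` function are all hypothetical, and the regular expressions only cover a few well-known secret formats.

```python
import re

# Illustrative detectors only (an assumption, not GitGuardian's rules).
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM header that precedes a private key.
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    # A quoted value of 8+ characters assigned to something named "password".
    "password_assignment": re.compile(
        r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan(text):
    """Return (line_number, pattern_name) for each suspected secret in text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


if __name__ == "__main__":
    sample = (
        'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
        'print("hello")\n'
        'password = "hunter2hunter2"\n'
    )
    for line_no, name in scan(sample):
        print(f"line {line_no}: possible {name}")
```

A scanner this simple produces false positives, such as placeholder keys in documentation, which is presumably where the developer feedback described above becomes useful.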

Although monitoring public GitHub repositories is a major part of GitGuardian’s offering, it also works to identify sensitive information that is inadvertently disseminated through internal systems, including private code repositories and messaging apps. Indeed, even companies that are careful to keep their code under lock and key can come unstuck if too many people inside an organization have access to it: the more people that have access to “secrets,” the more avenues there are for this data to become compromised. This is what is commonly referred to as “secret sprawl.”

“Secrets that are made too widely accessible in an organization is a huge issue for security professionals,” GitGuardian cofounder and CEO Jérémy Thomas told VentureBeat. “In the case of source code, if there are secrets in it, it takes only one developer account to be compromised, for all the secrets they had access to to be compromised as well.”

Above: GitGuardian dashboard

Breaches

Back in 2017 Uber announced a major data breach that exposed the personal data of millions of riders and drivers. Uber later confessed that it wasn’t using multifactor authentication on its GitHub account, meaning that anyone who encountered the login credentials could access its private repositories unhindered — and it was through the GitHub repository that the intruders managed to find the access keys for Uber’s AWS datastore where its user data was kept.

In a Federal Trade Commission (FTC) filing from 2018, Uber revealed how the intruders managed to gain access to the private GitHub repository in the first place — it was down to Uber granting its engineers access to the private repositories via their own personal GitHub accounts, which had weak security. The filing noted:

Uber granted its engineers access to Uber’s GitHub repositories through engineers’ individual GitHub accounts, which engineers generally accessed through personal email addresses. Uber did not have a policy prohibiting engineers from reusing credentials, and did not require engineers to enable multi-factor authentication when accessing Uber’s GitHub repositories. The intruders who committed the 2016 breach said that they accessed Uber’s GitHub page using passwords that were previously exposed in other large data breaches, whereupon they discovered the AWS access key they used to access and download files from Uber’s Amazon S3 Datastore.

As a result, the intruders accessed sixteen files that contained unencrypted personal data including nearly 26 million names and email addresses, 22 million names and mobile phone numbers, and 607,000 names and driver’s license numbers.

Poor password hygiene aside, Uber’s AWS access key should probably not have been anywhere near a GitHub repository — private or otherwise — in the first place. And this helps to highlight what is at stake for companies. Compromising customer data and losing their trust is a major issue for sure, but poor security can also lead to regulatory and legal tussles.

“Hardcoding secrets in source code or other private site that are not specifically meant for secret storage breaks various compliance rules and industry standards and best practices,” Thomas noted.
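One common alternative to hardcoding is to load secrets at runtime from the environment or a dedicated secrets manager. The Python sketch below illustrates the environment-variable approach; the `get_secret` helper and `MissingSecretError` class are hypothetical names introduced for this example, and the article does not describe how any particular company stores its keys.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def get_secret(name, env=None):
    """Read a secret from the environment instead of from source code.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        # AWS_ACCESS_KEY_ID is the conventional variable name read by AWS tooling.
        access_key = get_secret("AWS_ACCESS_KEY_ID")
        print("access key loaded; value intentionally not printed")
    except MissingSecretError as err:
        print(f"configuration error: {err}")
```

With this arrangement the key never appears in the repository, so a compromised developer account exposes the code but not the credential itself.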

Uber, which initially covered up its gargantuan leak, was widely viewed to have violated numerous data security and data breach reporting laws, and it eventually settled the case by paying a $148 million fine. That is the type of scenario that GitGuardian said it can help avert: it claims it can detect a secret leaking into a code repository and alert the developer and security team within four seconds.

“Currently, every company with software development activities is concerned about secrets spreading within the organisation, and in the worst case, to the public space,” Thomas said. “As a company with so much sensitive information at hand, we have built a culture of unconditional secrecy at our core.”

GitGuardian said that it has already helped more than 100 of the Fortune 500 companies, government organizations, and thousands of individual developers. And with another $12 million in the bank, it said that it plans to expand its customer base in the U.S., where 75% of its current clients are based.

Some 40 million developers use GitHub, and with more than 100 million repositories, the Microsoft-owned code collaboration platform is fertile ground for any company looking to train algorithms due to the vast amount of data available. A few months back, Swiss startup DeepCode raised $4 million for a system that learns from GitHub project data to give developers automated code reviews. GitGuardian is adopting a similar philosophy in terms of how it’s using GitHub to train algorithms at scale so companies can further automate their cybersecurity setup.

“Rather than encumber technology organisations with limiting compliance procedures, GitGuardian allows the modern enterprise to develop code quickly and how it wants to, but with automated visibility and protection over how data, credentials and other sensitive information is used, moved and shared,” noted Balderton Capital partner Suranga Chandratillake.
