Every day, millions of office workers click the "Upload your file or image" button on ChatGPT, Claude, or Gemini, believing they're simply saving time. But behind that user-friendly chat interface lies a massive data collection machine, where every PDF file, every piece of code, every meeting recording can become a permanent part of a global AI infrastructure – beyond the control of the company that created it.
The "Shadow AI" problem
In cybersecurity circles, this is known as "Shadow AI." This concept accurately describes a growing phenomenon: employees are using free, personal AI tools to handle company tasks, completely outside the oversight and approval of IT and security departments. No permission is required, no declaration is needed, just a Gmail account and a few seconds of registration.
The issue isn't about employees being malicious; on the contrary, in most cases, it stems from excessive dedication to their work. A recent report by Cyberhaven, a data security company that analyzed the AI usage behavior of millions of knowledge workers, revealed the astonishing prevalence of this phenomenon. Specifically, the frequency of AI use in the workplace has increased more than 60-fold in two years, spreading fastest in manufacturing and retail – sectors that traditionally have less awareness of AI data security.
The act of "donating" data takes many different forms depending on the industry. Finance professionals quietly paste revenue figures, cash flow statements, and business plans into chat boxes so that AI can help them write reports faster.
Programmers copy and paste entire blocks of source code containing API keys or core algorithms, simply to have AI find bugs or optimize performance. HR and operations departments upload audio and video recordings of internal meetings, even payrolls, for seemingly harmless purposes: summarizing content and analyzing employee performance.
According to Cyberhaven's analysis, based on real-world data from millions of interactions, the sensitive data most frequently fed into AI tools is source code, followed by research and development (R&D) documents, and then business and marketing data. Significantly, this isn't down to the carelessness of a few individuals – the research shows that, on average, employees across the enterprise feed sensitive data into AI tools every few days.
No case study illustrates this problem better than the incident at Samsung in April 2023. Less than 20 days after the semiconductor giant allowed employees to use ChatGPT, three serious data leaks occurred in quick succession. One engineer pasted source code from an internal equipment measurement system into ChatGPT to try to fix a bug. Another engineer submitted code used to identify faulty components, asking the AI to optimize it. The third case was even more alarming: an employee recorded an entire internal meeting, transcribed it, and then used ChatGPT to summarize it into meeting minutes.
The immediate consequence was that Samsung had to implement emergency measures, including limiting each input to ChatGPT to just 1024 bytes – a stopgap measure rather than a permanent solution.
Then, a few weeks later, the corporation issued a comprehensive ban on the internal use of generative AI tools, warning employees that violations could lead to disciplinary action, including dismissal. This is living proof that the line between "personal benefit" and "collective risk" in the age of generative AI can be just a press of the Enter key.
What's actually happening behind the "Upload file" button?
To understand why a seemingly simple action like uploading files to AI is so dangerous, we need to dissect the technical mechanisms hidden behind that user-friendly chat interface.
With most commercial AI models in their free versions, user input data isn't simply processed and then forgotten. It can enter what's called the "retraining loop" – a process in which AI development companies reuse conversations and uploaded files, labeling them and incorporating them into training datasets for subsequent model versions. In other words, the code you paste in today could, quite literally, become part of the AI model's "memory" within months.
This mechanism came up publicly in media coverage of the Samsung case: reports noted that because ChatGPT could use input data to train its models, Samsung's proprietary information could in principle surface for other users of the same platform.
This gives rise to an even more subtle and frightening risk – the "reverse AI data poisoning" scenario. Imagine your competitor, weeks after you inadvertently uploaded a product launch plan, asks a seemingly random question to the same AI tool – and receives suggestions containing pieces of strategic information that they are unaware you have "donated" for free. The AI model doesn't "leak" data in the sense of hackers stealing it; it simply synthesizes what it has learned – and you are one of its teachers.
This is why numerous large financial institutions, such as JPMorgan, have admitted they couldn't even determine how many employees were using ChatGPT or what they were using it for, because traditional data loss prevention (DLP) tools, designed to monitor email attachments or shared drives, are completely "blind" to the act of directly copying and pasting into a web browser.
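To make the blind spot concrete, here is a minimal sketch of the kind of check that could run on text before it is pasted into a chat box. It is an illustration under stated assumptions, not a description of any real DLP product: the function names and the three patterns are hypothetical, and a production rule set would be far broader.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a complete rule set).
SECRET_PATTERNS = {
    # Shape of an AWS access key ID: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Header line of a PEM-encoded private key.
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # A quoted value of 8+ characters assigned to a name containing a secret-like word.
    "assigned_secret": re.compile(
        r"(?i)(api[_-]?key|secret|token|password)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of the patterns that match somewhere in the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def is_safe_to_paste(text: str) -> bool:
    """True when none of the illustrative patterns match."""
    return not find_secrets(text)
```

A check like this only catches credentials with a recognizable shape. It says nothing about a pricing formula or a product roadmap, which is why pattern matching alone cannot close the gap described above.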
The problem becomes even clearer when looking at the "terms of service trap"—the crucial difference between the free and enterprise versions of AI. In the free packages for individuals, the terms of use often allow the provider to use conversational content to improve the model, unless the user manually disables this option in the settings—a step that most employees never consider.
Conversely, Enterprise packages or enterprise APIs typically come with clear contractual commitments: customer data cannot be used to train models, there are limited storage periods, and legally binding data processing agreements are included. The gap between these two levels is precisely where most enterprise data breaches occur – not because the technology is insecure, but because users inadvertently choose the wrong "door."
The reality is that the line between personal and business use is much thinner than many people think: analysis from Cyberhaven shows that the majority of ChatGPT access in the workplace still comes from personal accounts not under corporate control, and this percentage is significantly higher with other AI platforms. In other words, even if your company has signed an Enterprise contract with an AI provider, that doesn't mean all employees are using the right "safe door."
The legal matrix and the invisible "sentence"
While technical risks are intangible, legal risks are becoming increasingly tangible, reflected in very specific figures.
In Europe, the General Data Protection Regulation (GDPR) continues to be the sharpest sword against businesses that mishandle personal data – even when the error comes from a third-party AI tool. Cumulative GDPR fines since 2018 have exceeded €7 billion, with European regulators imposing roughly €1.2 billion in fines in 2025 alone.
Notably, regulators aren't just targeting Big Tech: the Italian data protection authority once fined an AI company developing chatbots €5 million for collecting personal data and user behavior without valid consent and for lacking a mechanism to verify users' age. With the EU AI Act officially tightening enforcement from August 2026, the maximum fine for serious violations could reach €35 million or 7% of global revenue – higher than the traditional GDPR penalty ceiling.
In Vietnam, the corresponding legal framework is Decree 13/2023/ND-CP on the Protection of Personal Data, issued by the Government on April 17, 2023, and effective from July 1, 2023. It comprises 44 articles detailing the collection, storage, processing, and transfer of personal data. This decree categorizes businesses into specific legal roles – Data Controller or Data Processor – and each role comes with its own legal responsibilities in the event of an incident. It is noteworthy that the scope of application of Decree 13 is not limited to Vietnam – it applies to personal data of Vietnamese citizens processed abroad. This means that a Vietnamese employee uploading a file containing customer information to an AI server located in the United States could potentially fall under the jurisdiction of domestic law.
But there's another layer of risk even more dangerous than administrative violations: the loss of intellectual property (IP). When a proprietary algorithm, a pricing formula, or a piece of core code is introduced into a public AI model, the legal boundaries of "who owns what" become incredibly blurred.
If a competitor later launches a product with similar logic, the original company will have almost no grounds to sue – because they voluntarily "disclosed" their trade secrets through terms of service that few people carefully read before clicking "I agree". This is the most bitter paradox of the issue: the law protects trade secrets, but cannot protect a secret that its owner has voluntarily given away.
Self-Hosted and Autonomous AI Trends
Amidst mounting risks, a wave of technology is emerging as a genuine solution: open-source large language models (LLMs). Models like Meta's Llama, Mistral from France, and Alibaba's Qwen are demonstrating that the reasoning power of AI is no longer the exclusive privilege of large, closed cloud operators. The core difference lies in the fact that, with an open-source model, businesses can download the entire AI "brain" and operate it directly on their own infrastructure – meaning data never leaves the company's walls.
The technology that realizes this solution is containerization, with Docker being the most popular name. The operating principle is simple but effective from a security standpoint: data travels from the employee's machine to an internal server running a Docker container that holds the AI model, where it is processed and analyzed. When the deployment is configured this way, not a single byte of data leaves the internal network for the public internet. This is a complete reversal of the traditional SaaS model, where data always has to "travel" to a third-party server before the results come back.
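The "data stays inside" principle can also be enforced on the client side. The sketch below is one hedged way to picture it: before any prompt is sent, the code checks that the target address belongs to a private or loopback range. The function names, the endpoint paths and the `{"prompt": ...}` payload shape are assumptions made for illustration; they do not correspond to any particular model server's API.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal_endpoint(url: str) -> bool:
    """True only when the URL's host is a literal private or loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are rejected in this sketch because checking them would need DNS.
        return False
    return address.is_private or address.is_loopback


def build_request(endpoint: str, prompt: str) -> dict:
    """Prepare a request description, refusing endpoints outside the internal network."""
    if not is_internal_endpoint(endpoint):
        raise ValueError(f"refusing to send data to non-internal endpoint: {endpoint}")
    # Hypothetical payload shape; a real model server defines its own schema.
    return {"url": endpoint, "json": {"prompt": prompt}}
```

In this sketch, `build_request("http://10.0.0.5:8080/generate", "Summarize the Q3 report")` returns a request description, while the same call against a public address raises an error instead of silently sending the data out.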
Another technological piece being implemented by many businesses is the internal Retrieval-Augmented Generation (RAG) architecture. Instead of "cramming" all company documents into the AI training process—a costly and risky undertaking—RAG allows documents to be "shredded" into data segments, stored in a vector database located directly on the company's own infrastructure.
When employees ask questions, the AI "looks up" the most relevant segments on-site within that database, then synthesizes the answer. The original documents are never used to train the model and never leave the internal system. This is considered the optimal compromise: businesses still get a smart, personalized AI experience tailored to their data, without having to "donate" that data to third parties.
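The retrieval step can be sketched in a few lines. This is a toy illustration under loud assumptions: a bag-of-words count stands in for a real embedding model, a plain Python list stands in for the on-premise vector database, and the two sample documents are invented. What it does show faithfully is the flow: split, store, look up, then build a prompt for the locally hosted model.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split a document into segments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k stored segments most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [segment for segment, _ in ranked[:k]]


def build_prompt(question: str, segments: list[str]) -> str:
    """Assemble the text that would be handed to the locally hosted model."""
    context = "\n".join(segments)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Invented sample documents; the in-memory list plays the role of the vector database.
documents = [
    "Refund requests must be approved by the finance team within five working days.",
    "The VPN certificate for remote staff is renewed every twelve months by IT.",
]
store = [(segment, embed(segment)) for doc in documents for segment in chunk(doc)]
```

Asking `retrieve("Who approves refund requests?", store)` pulls back the segment about the finance team, and only that segment, not the whole document set, is placed in the prompt. Nothing in this flow modifies the model itself.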
Of course, the self-hosted path isn't free in terms of effort. Businesses need to invest in server infrastructure and technical personnel to maintain the system, and open-source models often require fine-tuning to achieve accuracy comparable to leading commercial models. But compared to the cost of a data leak – from reputational damage and loss of competitive advantage to potentially millions of dollars in legal fines – the initial investment cost for self-hosted AI infrastructure is becoming a more compelling economic proposition than ever before.
Conclusion & Future Forecast
The core message from this whole story isn't about "banning" employees from using AI – a strategy that will certainly leave businesses behind in the productivity race and push employees back to using personal tools clandestinely, making it even harder to control. The real lesson is to shift from a "prevention" mindset to a "control and create a safe environment" mindset.
For business managers, concrete action should begin with establishing a clear AI Governance policy – the document doesn't need to be lengthy, but it must answer the questions: what types of data are allowed into AI, what are absolutely forbidden, and which AI tools have been officially approved for use. Simultaneously, classifying input data by sensitivity level is crucial – a seemingly simple step, but one that forms the foundation of any sustainable AI security strategy.
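Such a policy can be short enough to fit in a few lines of code. The sketch below is purely illustrative: the four sensitivity levels, the tool names and the ceilings assigned to them are hypothetical, and each company would define its own. It simply encodes the three questions above: which data, which tools, and what is never allowed.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical four-level classification, from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical tool names, each mapped to the highest sensitivity it may receive.
APPROVED_TOOLS = {
    "personal-free-chatbot": Sensitivity.PUBLIC,
    "enterprise-chatbot": Sensitivity.INTERNAL,
    "self-hosted-llm": Sensitivity.CONFIDENTIAL,
}


def is_allowed(tool: str, level: Sensitivity) -> bool:
    """Unapproved tools are always refused; approved ones are allowed up to their ceiling."""
    ceiling = APPROVED_TOOLS.get(tool)
    return ceiling is not None and level <= ceiling
```

Because no tool in this table has a ceiling of RESTRICTED, data at that level is refused everywhere, which is one simple way to express "absolutely forbidden" in a form that software can check.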
For AI developers, the pressure for transparency is growing. Optimizing "Opt-out" options so they are easily accessible and activated—rather than being hidden deep within settings menus—is not just an ethical issue but is becoming a real competitive advantage, as more and more businesses prioritize data security when choosing AI vendors.
The "Upload" button won't disappear. But how we understand it – as a two-way door that opens onto productivity and, potentially, onto the company's secrets – will determine which businesses survive safely in the AI age, and which become the next cautionary case study.
Checklist for managers: 3 questions to ask before allowing employees to upload files to AI.
What would be the consequences if this data fell into the hands of a competitor?
- If the answer is "it could affect our competitive advantage," that's the first sign that we need to stop.
Is the AI tool being used a free personal version or an Enterprise version with a privacy contract?
- These two levels differ completely in terms of data access rights.
If this data accidentally appears in an AI response to another user, will the company be held legally liable?
- This question forces managers to consider risk not only from a technical perspective but also from a legal one.
Glossary of technical terms
Context Window: Simply put, this is the AI's "temporary memory" in a conversation – all the content you've entered, including uploaded files, is stored in this "memory area" for the AI to refer to when responding.
Data Leakage: The situation where sensitive internal data "escapes" from an organization's control – not necessarily due to a hacker attack, but often stemming from employees' everyday use of tools.
On-premise AI: A deployment model in which the AI is installed and operated directly within the company's physical infrastructure (private servers, internal data centers), as opposed to sending data to an external AI company's cloud.
Docker Container: A lightweight software "packaging box" that bundles an entire AI application together with everything it needs to run, making it easy to deploy on an internal server without complex configuration or reliance on external cloud infrastructure.