Governance through Access Control Lists (ACL)

You can enable the access control list (ACL) feature, which applies the access control policy of the data source when a user queries the ingested documents. It enables secure and controlled access to sensitive data. For example, Bob can read and query (through Prompt Lab) the documents that are ingested in watsonx.data only if Bob is a user in S3 (identified by the email ID) and is also a user in the watsonx platform.

The Access Control List (ACL) serves as metadata for the ingested data. It includes mapping information of the user IDs to document IDs, defining the access privileges for each user. When you run a Presto or Milvus query, the access privilege for the requested data is determined by verifying the ACL tables. The filtered query results have information only from the documents that you have permission to view.
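
The filtering step can be pictured with a minimal sketch. It assumes a hypothetical in-memory mapping in place of the real ACL tables, and the names `ACL`, `filter_by_acl`, and `document_id` are illustrative only; the actual check is performed by watsonx.data against the ACL tables and may work differently.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the ACL tables: user ID -> document IDs the user may view.
ACL: Dict[str, Set[str]] = {
    "bob@example.com": {"doc-1", "doc-3"},
    "alice@example.com": {"doc-2"},
}


def filter_by_acl(user_id: str, hits: List[dict], acl: Dict[str, Set[str]]) -> List[dict]:
    """Keep only the query hits whose document ID the user is permitted to view."""
    allowed = acl.get(user_id, set())  # a user with no ACL entry sees nothing
    return [hit for hit in hits if hit["document_id"] in allowed]


# Unfiltered results, as a query might return them before the ACL check.
hits = [
    {"document_id": "doc-1", "text": "quarterly report"},
    {"document_id": "doc-2", "text": "salary review"},
    {"document_id": "doc-3", "text": "design notes"},
]

visible = filter_by_acl("bob@example.com", hits, ACL)
print([hit["document_id"] for hit in visible])
```

In this sketch, Bob sees only the documents that are mapped to his user ID, and a user without any mapping gets an empty result.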

The ACL component is built for scale: it can efficiently handle up to 20 million user-document ID pairs.

Prerequisites

  • At least one Milvus instance in running state
  • At least one Presto engine in running state
  • A default customer storage (associated with the Presto engine intended for gen AI use)
  • An ACL storage and catalog (also associated with the Presto engine that is intended for gen AI use)
  • Ensure that an ACL storage and catalog are provisioned in watsonx.data. In the watsonx.data console, go to the Infrastructure manager page. From Add Components, select a storage type: AWS S3, IBM Cloud Object Storage, MinIO, or IBM Ceph. Select the Designate this bucket as the ACL store check box to designate the storage and catalog as the ACL store. For more information, see Adding storage.

  • After adding the ACL storage, enable it from the Infrastructure manager page.

  • Ensure that you have associated all the Presto engines with the ACL catalog. For more details, see Associating a catalog with an engine.

  • SAML Integration with AWS IAM Identity Center
    To uniquely identify users and groups for Amazon S3 Access Control, it is essential to integrate AWS with a SAML-based Identity Provider (IdP). This integration facilitates the synchronization of user and group information from external identity providers, such as Microsoft Entra ID. For step-by-step instructions on configuring SAML with Microsoft Entra ID, see the official documentation.

  • Permission sets
    Assign permission sets to all users created. Users with read access to S3 can then be appropriately included in the S3 bucket policy. See Manage AWS accounts with permission sets.

ACL storage: The ACLs that are imported from the data source are stored in Iceberg tables, in a storage within the customer’s IBM Cloud Object Storage instance. This storage is created automatically when the customer provisions the gen AI lakehouse.

ACL update and sync: You can configure the unstructured ingestion pipeline to run at a regular interval. Each time the pipeline runs, changes to the access control lists are captured and updated in watsonx.data. The latest ACLs are available immediately after an ingestion run, but they might be out of date between runs. Select the frequency at which the ingestion pipeline runs with this lag in mind.

Implementing S3 storage policy

To define the user principal in the S3 bucket policy, include the SAML unique ID in the principal as follows:

arn:aws:sts::<AccountID>:assumed-role/<RoleName>/<SAML-unique-id>

  • <AccountID>: The AWS account ID.

  • <RoleName>: The name of the IAM role that is assigned to the user. For roles that are configured by using SAML, the name typically begins with AWSReservedSSO (for example, AWSReservedSSO_ReadOnlyAccess).

  • <SAML-unique-id>: The unique identifier associated with the user, often including an email or other user attribute from the Identity Provider (IdP).

Example: arn:aws:sts::239710307211:assumed-role/AWSReservedSSO_ReadOnlyAccess_ec1f0eaaac4c5586/CaliWu@alekh102gmail.onmicrosoft.com
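
As an illustration of the principal format, the following sketch builds and parses such an ARN. The helper names (`build_principal_arn`, `parse_principal_arn`, `AssumedRolePrincipal`) are hypothetical and not part of any AWS or watsonx.data API; the pattern assumes a 12-digit account ID and a role name without slashes.

```python
import re
from typing import NamedTuple


class AssumedRolePrincipal(NamedTuple):
    account_id: str
    role_name: str
    saml_unique_id: str


# arn:aws:sts::<AccountID>:assumed-role/<RoleName>/<SAML-unique-id>
_ARN_PATTERN = re.compile(
    r"^arn:aws:sts::(?P<account_id>\d{12}):assumed-role/"
    r"(?P<role_name>[^/]+)/(?P<saml_unique_id>.+)$"
)


def build_principal_arn(account_id: str, role_name: str, saml_unique_id: str) -> str:
    """Assemble the principal string from its three parts."""
    return f"arn:aws:sts::{account_id}:assumed-role/{role_name}/{saml_unique_id}"


def parse_principal_arn(arn: str) -> AssumedRolePrincipal:
    """Split a principal string into account ID, role name, and SAML unique ID."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not an STS assumed-role principal: {arn!r}")
    return AssumedRolePrincipal(**match.groupdict())


example = (
    "arn:aws:sts::239710307211:assumed-role/"
    "AWSReservedSSO_ReadOnlyAccess_ec1f0eaaac4c5586/"
    "CaliWu@alekh102gmail.onmicrosoft.com"
)
parsed = parse_principal_arn(example)
print(parsed.role_name, parsed.saml_unique_id)
```

Parsing the example and rebuilding it with `build_principal_arn(*parsed)` yields the original string, which is a quick way to check a principal before placing it in a bucket policy.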

Sample S3 storage policy with SAML unique ID

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": {
                "AWS": "arn:aws:sts::239710307211:assumed-role/AWSReservedSSO_ReadOnlyAccess_ec1f0eaaac4c5586/CaliWu@alekh102gmail.onmicrosoft.com"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::acltest234/tzcol.csv"
        }
    ]
}

For more guidance on creating and managing bucket policies, you can use the AWS Policy Generator.
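
If you prefer to generate such a policy programmatically, a minimal sketch might look like the following. The function `build_bucket_policy` is a hypothetical helper that only assembles the JSON document; it does not call AWS, and it covers just the single-statement, single-object shape of the sample above.

```python
import json


def build_bucket_policy(effect: str, principal_arn: str, bucket: str, key: str) -> dict:
    """Return a one-statement bucket policy that allows or denies reading one object."""
    if effect not in ("Allow", "Deny"):
        raise ValueError("effect must be 'Allow' or 'Deny'")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": effect,
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            }
        ],
    }


policy = build_bucket_policy(
    "Deny",
    "arn:aws:sts::239710307211:assumed-role/"
    "AWSReservedSSO_ReadOnlyAccess_ec1f0eaaac4c5586/"
    "CaliWu@alekh102gmail.onmicrosoft.com",
    "acltest234",
    "tzcol.csv",
)
print(json.dumps(policy, indent=4))
```

The printed document has the same structure as the sample policy and can be pasted into the bucket policy editor after review.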

Sample ACL Output for an S3 Object

{
  "path": "/sample bucket/acltest.csv",
  "allow": {
    "users": ["Elliot@abc.onmicrosoft.com"],
    "groups": []
  },
  "deny": {
    "users": ["CaliWu@abc.onmicrosoft.com"],
    "groups": []
  },
  "inheritance": {
    "enabled": false,
    "parent_precedence": "parent"
  },
  "precedence": "deny"
}
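
One plausible reading of this record is sketched below. It assumes that "precedence": "deny" means a deny entry wins when a user appears in both lists, and that a user who appears in neither list has no access; inheritance is ignored because it is disabled in the sample. The function `can_read` is hypothetical and does not reflect the actual watsonx.data evaluation logic.

```python
from typing import Iterable


def _matches(entry: dict, user: str, groups: Iterable[str]) -> bool:
    """True if the user, or one of the user's groups, is listed in the entry."""
    return user in entry.get("users", []) or any(
        group in entry.get("groups", []) for group in groups
    )


def can_read(acl: dict, user: str, groups: Iterable[str] = ()) -> bool:
    """Decide read access from one ACL record (assumed semantics, see lead-in)."""
    groups = list(groups)
    allowed = _matches(acl.get("allow", {}), user, groups)
    denied = _matches(acl.get("deny", {}), user, groups)
    if allowed and denied:
        # Conflict: the record's precedence decides; assumed to default to deny.
        return acl.get("precedence", "deny") != "deny"
    return allowed and not denied


sample_acl = {
    "path": "/sample bucket/acltest.csv",
    "allow": {"users": ["Elliot@abc.onmicrosoft.com"], "groups": []},
    "deny": {"users": ["CaliWu@abc.onmicrosoft.com"], "groups": []},
    "inheritance": {"enabled": False, "parent_precedence": "parent"},
    "precedence": "deny",
}

for user in ("Elliot@abc.onmicrosoft.com", "CaliWu@abc.onmicrosoft.com"):
    print(user, can_read(sample_acl, user))
```

Under these assumptions, Elliot can read the object and CaliWu cannot.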

Supported features in S3 bucket policy

Currently, the following features are supported in the S3 bucket policy:

  • Principal: Only AWS STS assumed roles are supported, and they must be formatted as follows: arn:aws:sts::<AccountID>:assumed-role/<RoleName>/<SAML-unique-id>

  • Actions: The following actions are supported for reading files:

    • s3:GetObject
    • s3:Get*
    • s3:*
  • Resource: The resource can be specified as:

    • A specific bucket
    • A specific file
    • A wildcard (*)
  • Effect: Both Allow and Deny effects are supported.

  • User Scope: Only IAM Identity Center users are supported at this time. Groups are not yet supported.

  • Conditions:

    • Limited condition support is available.
    • Conditions under Allow are always evaluated as false.
    • Conditions under Deny are always evaluated as true.

The ACL output includes the user principal, which is the SAML unique user identifier.
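
The rules above can be sketched as a small filter over policy statements. This is an illustration under stated assumptions, not the product's implementation: the names `statement_counts` and `collect_users` are hypothetical, resource matching is left out for brevity, and the user Dana, the IAM user, and the IP condition are invented for the example.

```python
from typing import Any, Dict, List

READ_ACTIONS = {"s3:GetObject", "s3:Get*", "s3:*"}  # actions treated as read access


def _as_list(value: Any) -> List[Any]:
    return value if isinstance(value, list) else [value]


def statement_counts(statement: Dict[str, Any]) -> bool:
    """Apply the documented rules for actions, effects, and conditions."""
    if not READ_ACTIONS.intersection(_as_list(statement.get("Action", []))):
        return False
    if "Condition" in statement:
        # A condition under Allow evaluates as false, under Deny as true.
        return statement.get("Effect") == "Deny"
    return statement.get("Effect") in ("Allow", "Deny")


def collect_users(policy: Dict[str, Any]) -> Dict[str, List[str]]:
    """Gather SAML unique IDs from STS assumed-role principals, by effect."""
    acl: Dict[str, List[str]] = {"allow": [], "deny": []}
    for statement in policy.get("Statement", []):
        principal = statement.get("Principal")
        if not statement_counts(statement) or not isinstance(principal, dict):
            continue
        for arn in _as_list(principal.get("AWS", [])):
            if arn.startswith("arn:aws:sts::") and ":assumed-role/" in arn:
                acl[statement["Effect"].lower()].append(arn.rsplit("/", 1)[-1])
    return acl


ROLE = "arn:aws:sts::239710307211:assumed-role/AWSReservedSSO_ReadOnlyAccess_ec1f0eaaac4c5586"
RESOURCE = "arn:aws:s3:::acltest234/tzcol.csv"
CONDITION = {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": f"{ROLE}/Elliot@abc.onmicrosoft.com"},
         "Action": "s3:GetObject", "Resource": RESOURCE},
        {"Effect": "Allow", "Principal": {"AWS": f"{ROLE}/Dana@abc.onmicrosoft.com"},
         "Action": "s3:Get*", "Resource": RESOURCE, "Condition": CONDITION},
        {"Effect": "Deny", "Principal": {"AWS": f"{ROLE}/CaliWu@abc.onmicrosoft.com"},
         "Action": "s3:*", "Resource": RESOURCE, "Condition": CONDITION},
        {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::239710307211:user/legacy"},
         "Action": "s3:GetObject", "Resource": RESOURCE},
    ],
}
print(collect_users(policy))
```

In this sketch, the conditional Allow for Dana and the IAM user principal are skipped, while the conditional Deny for CaliWu is applied, which mirrors the condition and principal rules listed above.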

AWS permissions for retrieving S3 ACLs: The AWS credentials (the access key and secret key in the S3 connection) of the user who performs the get_acl operation must include the following AWS permissions to retrieve ACLs:

  • identitystore:ListUsers
  • identitystore:DescribeUser
  • identitystore:ListGroups
  • identitystore:DescribeGroup
  • identitystore:ListGroupMemberships
  • sso:ListInstances
  • sso:ListPermissionSets
  • sso:DescribePermissionSet
  • sso:ListAccountAssignments
  • s3:*
  • s3-object-lambda:*
  • iam:List*
  • iam:Get*
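
The permissions above could be collected into a single IAM policy document, as in the following sketch. The helper `build_acl_reader_policy` is hypothetical, and the "Resource": "*" scope is an assumption made for brevity; narrow it to match your own security requirements before use.

```python
import json

# The permissions listed above, in the same order.
REQUIRED_ACTIONS = [
    "identitystore:ListUsers",
    "identitystore:DescribeUser",
    "identitystore:ListGroups",
    "identitystore:DescribeGroup",
    "identitystore:ListGroupMemberships",
    "sso:ListInstances",
    "sso:ListPermissionSets",
    "sso:DescribePermissionSet",
    "sso:ListAccountAssignments",
    "s3:*",
    "s3-object-lambda:*",
    "iam:List*",
    "iam:Get*",
]


def build_acl_reader_policy(resource: str = "*") -> dict:
    """Return an IAM policy document that grants the listed actions.

    The default resource scope of "*" is an assumption, not a recommendation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(REQUIRED_ACTIONS),
                "Resource": resource,
            }
        ],
    }


acl_reader_policy = build_acl_reader_policy()
print(json.dumps(acl_reader_policy, indent=4))
```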

Watch the quick video for a visual walkthrough: ACL setup in Admin console.

Features and limitations of ACL

  1. The feature works only for documents from Amazon S3, SharePoint, FileNet, and Box data sources.
  2. The feature is disabled by default. To enable it, select Enable access control list for retrieval when you prepare the document library.
  3. Currently, only the email ID of the user is supported as the user identifier. This applies both to the storage (S3), in the S3 bucket policies, and to the watsonx platform.
  4. For the querying experience (Chat in Prompt Lab), the S3 connection user must match the watsonx.data Presto connection user that is used to create the document library.
  5. Only the user ID that is used in the S3 connection has access to the document and can query the ingested data asset. The user who ingests the data from S3 into the document library might not have access to the document.