What is IBM Analytics Engine?
IBM Analytics Engine provides a flexible framework to develop and deploy analytics applications on Hadoop and Spark. It allows you to spin up Hadoop and Spark clusters and manage them throughout their lifecycle.
How is it different from a regular Hadoop cluster?
IBM Analytics Engine is based on an architecture that separates compute and storage. In the traditional Hadoop architecture, a cluster both stored data and executed applications. In IAE, the two are split: clusters are used purely for running applications, while IBM Cloud Object Storage persists the data. The benefits of this architecture include flexibility, simplified operations, better reliability, and cost effectiveness. Read this whitepaper (PDF, 287 KB) to learn more.
How do I get started with IBM Analytics Engine?
IAE is available on the IBM Cloud. Follow this link to learn more about the service and to start using it. We also have tutorials and code samples to get you off to a fast start.
Which distribution is used in IBM Analytics Engine (IAE)?
IBM Analytics Engine is based on the open source Hortonworks Data Platform (HDP). To find the currently supported version, please see this page.
Which HDP components are supported in IAE?
To see the full list of supported components and versions please see this page.
What are the sizes of nodes available in IBM Analytics Engine?
To see the currently supported node sizes, please see this page.
Why is there so little HDFS space in the clusters? What if we want to run a cluster that has a lot of data to be processed at one time?
The clusters in IAE are intended to be used as a compute cluster instead of persistent storage for data. Data should be persisted in the IBM Cloud Object Storage. This provides a more flexible, reliable and cost-effective way to build analytics applications. Please review this whitepaper (PDF, 287 KB) to learn more about this topic. HDFS should be used for intermediate storage during processing. Any final data (or even intermediate data) should be written out to Object Storage before deleting the cluster. If your intermediate storage requirements exceed HDFS available within a node, you can add more nodes to the cluster.
How many IAE clusters can I spin up?
There is no limit to the number of clusters you can spin up.
Is there a free usage tier to try IBM Analytics Engine?
Yes, we provide the Lite plan which can be used free of charge. Apart from this, as a new IBM Cloud user, you are also entitled to USD 200 in credit that can be used against IAE or any service on IBM Cloud.
How does the Lite plan work?
The Lite plan provides 50 node-hours of free IAE usage. One cluster can be provisioned every 30 days. Once the 50 node-hours are exhausted, you may upgrade to a paid plan within 24 hours to continue using the same cluster. If you do not upgrade within 24 hours, the cluster will be deleted and you may provision a new one after the 30-day limit has passed. Actual hours of use depend on the size of your cluster. For instance, a cluster with 1 master and 3 data nodes (4 nodes in total) will run for 12.5 hours on the clock (50 node-hours / 4 nodes), whereas a cluster with 1 master and 1 data node (2 nodes in total) will run for 25 hours on the clock (50 node-hours / 2 nodes). Within an instance, the node-hours cannot be paused; for example, you cannot use 10 node-hours, pause, and come back later to use the remaining 40 node-hours.
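The arithmetic above can be expressed as a short Python sketch. The function name is illustrative only; the 50 node-hour budget and the two cluster shapes come from the answer above.

```python
LITE_PLAN_NODE_HOURS = 50  # free node-hours granted by the Lite plan


def wall_clock_hours(total_nodes: int, node_hours: float = LITE_PLAN_NODE_HOURS) -> float:
    """Return how long a cluster can run before the node-hour budget is used up."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Every node consumes one node-hour per hour of wall-clock time.
    return node_hours / total_nodes


print(wall_clock_hours(4))  # 1 master + 3 data nodes -> 12.5
print(wall_clock_hours(2))  # 1 master + 1 data node  -> 25.0
```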
How does Object Storage work in the IAE Hadoop environment? Is it exactly equivalent to HDFS but we just use a different URL?
IBM Cloud Object Storage implements most of the Hadoop FileSystem interface. For simple read and write operations, applications that use the Hadoop FileSystem API will continue to work when HDFS is replaced by Cloud Object Storage. Both are high-performance storage options that are fully supported by Hadoop.
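As a rough illustration of "the same code, a different URL", the Python sketch below only builds the two kinds of URI. The cos://<bucket>.<service>/<object> form and the bucket, service, and path names are assumptions for illustration and are not stated in this FAQ; check the IAE documentation for the exact scheme supported by your cluster.

```python
def hdfs_uri(path: str) -> str:
    """Build an HDFS URI for an absolute path on the cluster's default file system."""
    return "hdfs:///" + path.lstrip("/")


def cos_uri(bucket: str, service: str, key: str) -> str:
    """Build an Object Storage URI (assumed form: cos://<bucket>.<service>/<object>)."""
    return f"cos://{bucket}.{service}/{key.lstrip('/')}"


# Hypothetical locations of the same data set.
on_hdfs = hdfs_uri("/user/demo/events.csv")
on_cos = cos_uri("demo-bucket", "demoservice", "events.csv")
print(on_hdfs)
print(on_cos)

# With an existing SparkSession named `spark` (not created here), either
# string could be passed to the same reader call, e.g. spark.read.csv(on_cos).
```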
What other components like Object Storage should we consider while designing a solution using IBM Analytics Engine?
In addition to Object Storage, please consider using Compose MySQL, available on IBM Cloud, for persisting Hive metadata. When you delete a cluster, all data and metadata are lost. Persisting Hive metadata in an external relational store like Compose allows you to reuse it after your cluster has been deleted, or to access it from multiple clusters. IAE supports passing the location of the metadata through customization scripts when a cluster is started, so the cluster can point to the right metadata location as soon as it is spun up.
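The sketch below shows, in Python, the standard Hive metastore connection properties that an external MySQL store would typically require. It only builds the property map; how a customization script applies these values in IAE is not described here, and the host, database, and credential values are placeholders.

```python
def hive_metastore_properties(host: str, port: int, database: str,
                              user: str, password: str) -> dict:
    """Return hive-site style properties pointing Hive at an external MySQL metastore."""
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}:{port}/{database}",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


# Placeholder values; substitute the connection details of your own database.
props = hive_metastore_properties("mysql.example.com", 3306, "hive_meta", "hiveuser", "secret")
for key, value in props.items():
    print(f"{key}={value}")
```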
How should I size my cluster?
Sizing a cluster is highly dependent on workloads. Here are some general guidelines: For Spark workloads reading data from the object store, the RAM in the cluster should be at least 50% of the size of the data to be analyzed in any given job. For best results, the recommended sizing for these workloads is RAM equal to 2x the size of the data to be analyzed in any given job. If you expect to have a lot of intermediate data, you should size the number of nodes to provide the right amount of HDFS space in the cluster.
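The RAM guideline can be turned into a quick estimate. This Python sketch assumes the rule applies to total cluster RAM; the 64 GB per-node figure is a hypothetical value chosen for the example, not an IAE node size.

```python
import math


def spark_ram_guideline_gb(data_size_gb: float) -> dict:
    """Apply the guideline: at least 50% of the data size, ideally 2x the data size."""
    if data_size_gb < 0:
        raise ValueError("data size cannot be negative")
    return {"minimum_gb": 0.5 * data_size_gb, "recommended_gb": 2.0 * data_size_gb}


def nodes_for_ram(target_ram_gb: float, ram_per_node_gb: float) -> int:
    """Number of nodes needed to reach the target RAM, rounded up."""
    return math.ceil(target_ram_gb / ram_per_node_gb)


guideline = spark_ram_guideline_gb(200)               # a job that analyzes 200 GB
print(guideline)                                      # minimum 100 GB, recommended 400 GB
print(nodes_for_ram(guideline["minimum_gb"], 64))     # hypothetical 64 GB nodes -> 2
print(nodes_for_ram(guideline["recommended_gb"], 64)) # hypothetical 64 GB nodes -> 7
```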
We are sizing for 4 environments: Production (with HA), DR, Staging (with HA), Dev. How do we design this in IAE?
Each of these will be a separate cluster. If you have multiple developers on your team, consider a separate cluster for each of them if they cannot share the same cluster credentials. For Dev, a cluster with 1 master and 2 compute nodes should generally suffice. For Staging, we recommend 3 compute nodes to test functionality; this gives some additional resources to test at a slightly bigger scale before deploying to production. For DR, clusters are generally implemented in an active-active or active-standby model. In IAE, you do not need to have a cluster running all the time. If the production cluster goes down, a new cluster can be spun up using a DevOps tool chain and designated as the production cluster. You should use the customization scripts to configure it exactly like the previous cluster.
How is user management done on IAE? How do I add more users to my cluster?
All clusters in IAE are single user i.e. every cluster has only one Hadoop user id with which all jobs will be executed. Authentication of users and access control happens through Identity and Access Management (IAM) service of IBM Cloud. Once a user has logged into IBM Cloud, they will be allowed or blocked access to IAE based on the IAM permissions set by the admin. A user can share their cluster’s user id and password if they want other users to access it; please note that in this case the other user will have full access to the cluster.
Sharing a cluster via a Watson Studio project is the recommended approach. In this scenario, an admin sets up the cluster through the IBM Cloud portal and ‘associates’ it with a project in Watson Studio. Once this is done, any users who have been granted access to that project can submit jobs through notebooks or other tools that require a Spark or Hadoop runtime. An additional advantage of this approach is that access to the IAE cluster, or to any data to be analyzed, can be controlled within Watson Studio or Watson Knowledge Catalog.
How is data access control enforced in IAE?
Data access control can be managed via IBM Cloud Object Storage ACLs (access control lists). ACLs in IBM Cloud Object Storage are tied to the Identity and Access Management service of IBM Cloud. An admin can set permissions on an object storage bucket or on files. Once these permissions are set, the object storage credentials are used to determine whether the user has access to a particular data object when it is accessed through IAE.
In addition, all data in object storage can be cataloged using Watson Knowledge Catalog. Governance policies can be defined and enforced using Watson Knowledge Catalog once data is in the data catalog. Watson Studio projects can be used for better management of access control.
Can I run a long-running cluster or job?
Yes, you can run a cluster for as long as required. In this scenario, you should ensure that data is periodically written back to IBM Cloud Object Storage, and you should not use HDFS as a persistent store. This protects against data loss in case of an unexpected cluster failure.
How much time does it take for the cluster to get started?
When using the Spark software pack, a cluster takes about 7-9 min to be started and ready to execute applications. When using the Hadoop and Spark software pack, a cluster takes about 15-20 min to be started and ready to execute applications.
How can I access or interact with my cluster?
There are several interfaces to access the cluster:
- Ambari Console
- REST APIs
- Cloud Foundry CLI
How do I get data into the cluster?
The recommended way to read data into a cluster for processing is from IBM Cloud Object Storage. Upload your data to IBM COS and use the COS, Hadoop or Spark APIs to read it. If your use case requires data to be processed directly in the cluster, you can ingest data in one of the following ways: SFTP, WebHDFS, Spark, Spark-streaming, and Sqoop. Please see the documentation on this topic for more information.
How do I configure my cluster?
A cluster can be configured by using Customization scripts or by directly modifying configuration parameters in Ambari console. Customization scripts are a convenient way to define different sets of configurations, through a script, to spin up different types of clusters or use the same configuration repeatedly for repetitive jobs. You can find more information on customization here.
Is root access allowed in IAE?
No, the user does not have sudo or root privileges to install packages, since this is a defined PaaS environment.
What if I want to install my own Hadoop stack components?
Since IAE is a defined PaaS service, we do not allow adding components that we do not support. Users do not have the ability to install a new Ambari Hadoop stack component, through Ambari or otherwise. Non-server Hadoop ecosystem components may be installed; that is, anything that can be installed and run in the user space is allowed.
What types of third party packages are allowed?
Packages that are available in the CentOS repo can be installed using the packageadmin tool that is available in IAE. Libraries or packages (e.g., for Python or R) that can be installed and run within user space are allowed. The user does not have sudo or root privileges to install or run any packages from non-CentOS repos or RPMs. It is strongly recommended that all customization be performed using customization scripts at cluster startup, to ensure repeatability and consistency for future cluster creations.
How can the cluster be monitored? How can we configure alerts?
Ambari components can be monitored using the built-in Ambari Metrics alerts (in the ‘Hadoop and Spark’ pack). Alerts can be configured on the metrics that Ambari provides out of the box.
How do I scale my cluster?
A cluster can be scaled by adding nodes to it. Nodes can be added through the IBM Cloud UI or through the CLI tool.
Can I scale my cluster while there are jobs running on it?
Yes, a cluster can be scaled by adding nodes to it while jobs are running. Once the new nodes are ready, they will be used to execute further steps of the job.
What does IBM Cloud operations monitor and manage on my cluster?
The IBM Cloud operations team ensures that the service stays up so that users can spin up clusters, submit jobs, and manage the lifecycle of clusters through the interfaces provided. Users can monitor and manage their clusters using the tools available in Ambari or additional services from IBM Cloud.
What type of encryption is supported?
Hadoop transparent data encryption is automatically enabled for the cluster. The cluster comes with a predefined HDFS encryption zone, which is identified by the HDFS path /securedir. Files that are placed in the encryption zone are automatically encrypted. The files are automatically decrypted when they are accessed through various Hadoop client applications, such as HDFS shell commands, WebHDFS APIs, and the Ambari file browser. More information is available in the documentation. All data on Cloud Object Storage is encrypted at rest. Data transfer between Cloud Object Storage and IAE clusters can be done over a private, encrypted endpoint available from Cloud Object Storage. Any data flowing over the public-facing ports (8443, 22 and 9443) is encrypted.
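As a small client-side aid, the following Python sketch checks whether a target HDFS path falls inside the predefined encryption zone before you write to it. The helper name is illustrative; it only inspects the path string and does not query the cluster.

```python
import posixpath

ENCRYPTION_ZONE = "/securedir"  # predefined HDFS encryption zone named above


def in_encryption_zone(hdfs_path: str, zone: str = ENCRYPTION_ZONE) -> bool:
    """Return True if hdfs_path is the zone directory itself or lies beneath it."""
    normalized = posixpath.normpath(hdfs_path)  # resolves "." and ".." segments
    return normalized == zone or normalized.startswith(zone + "/")


print(in_encryption_zone("/securedir/sales/2018.csv"))  # inside the zone
print(in_encryption_zone("/tmp/sales/2018.csv"))        # outside the zone
```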
Which ports are open on the public interface on the cluster?
The ports open on the public interface on the cluster are: 8443 – Knox; 22 – SSH and 9443 – Ambari.
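A minimal Python sketch for checking that these ports are reachable from your network is shown below. The port-to-service mapping comes from the answer above; the function names are illustrative, and you would need to supply your own cluster host name.

```python
import socket

# Public ports and the services behind them, as listed above.
PUBLIC_PORTS = {8443: "Knox", 22: "SSH", 9443: "Ambari"}


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(host: str) -> dict:
    """Probe every public port on the given host and report reachability per service."""
    return {name: is_port_open(host, port) for port, name in PUBLIC_PORTS.items()}
```

Calling check_cluster with your cluster's host name returns a dictionary keyed by service name (Knox, SSH, Ambari) with True or False for each.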
Which other IBM Cloud services can I use with IBM Analytics Engine?
As a part of IBM Cloud, IBM Analytics Engine integrates with other important offerings. For example, IBM Watson Studio can push jobs to IBM Analytics Engine, and data can be written to Cloudant or Db2 Warehouse on Cloud after being processed using Spark.
How will IAE integrate with Watson Studio? Would they both operate on the underlying Object Store or would Watson Studio execute in the Analytics Engine?
IBM Analytics Engine is a first-class citizen in Watson Studio. Projects (or individual notebooks) in Watson Studio can be associated with IBM Analytics Engine via a simple UI. Once you have an IAE cluster running in IBM Cloud, log in to Watson Studio using the same IBM Cloud ID, go to the Project Settings page, and ‘associate’ that IAE instance with a project or notebook in Watson Studio. More details and a tutorial on this are available here.
Once associated, the Watson Studio project or notebook executes any workload on this particular IAE instance. There is no tight coupling to any object store instance. Whichever object store instance is referred to from within a notebook or application is the one that is read while executing applications on IAE. One easy way to reference a particular object store instance is the “insert to code” feature in Watson Studio notebooks.
Our customer needs to use Kafka for ingestion. How can we handle this?
MessageHub, an IBM Cloud service, is based on Apache Kafka. It can be used to land data in the object store, where it can be analyzed with Analytics Engine clusters. MessageHub can also integrate with Spark in the IAE cluster to bring data directly to the cluster.
Can we set ACID properties for Hive in IAE?
Hive is not configured to support concurrency. Users do have the authority to change the configuration of IAE clusters. However, it is the user's responsibility to ensure that the cluster functions correctly after making any such changes.