Data integration with Salesforce CRM using IBM InfoSphere Information Server

This article introduces the IBM® Information Server® Pack for Salesforce.com. It describes the architectural design, demonstrates the pack’s usage, and highlights its benefits in contrast with other products. It also illustrates sample jobs to guide you through a complete deployment.

Jon Deng (jdeng@us.ibm.com), Staff Software Engineer, IBM

Jon Deng is a staff software engineer in Software Group in Boca Raton, Fla. He joined IBM in 2005 as part of the Ascential acquisition. He has been working in the area of enterprise applications and telecommunication systems for eight years. Currently, he is involved with developing a next-generation salesforce adapter for the IBM InfoSphere Information Server. He is also a certified SAP BW professional.



Jeff J. Li (liji@us.ibm.com), Senior Software Engineer, IBM

Jeff Li is a senior software engineer in Software Group in Boca Raton, Fla. He joined IBM in 2005 as part of the Ascential acquisition. He has been working in the area of enterprise applications and telecommunication systems for 15 years. Currently, he is developing complex data integration solutions for IBM InfoSphere Information Server. He is also a certified project management professional.



24 July 2008

Introduction

Salesforce is a leading on-demand customer relationship management (CRM) system. It offers a complete hosted solution for companies to manage their customer contacts, track sales orders, and streamline their sales processes. It also allows customers to tailor their software as a service, thereby gaining a competitive advantage.

IBM Information Server is a unified and comprehensive information integration platform. It profiles, cleanses, and transforms data from heterogeneous data sources to deliver consistent and accurate business data. IBM Information Server is an ideal solution for integrating and synchronizing Salesforce CRM data with the business data stored within back-end enterprise resource planning (ERP) systems and other data analytical applications.

This article illustrates the main features of the new certified IBM Salesforce solution for Information Server. The article also contains examples to demonstrate the daily business integration scenarios. These examples provide step-by-step guides that consultants can use to solve their Salesforce data integration cases.


Product prerequisite and installation

IBM Information Server includes many software products for data integration and analysis tasks. Those products include Information Server DataStage®, Information Server QualityStage™, Information Server Information Analyzer, Information Server Federation Server, and other companion products. Depending on the specific project requirements, users can choose to install a subset of the products in the IBM Information Server.

Figure 1. Software products needed for designing ETL jobs for Salesforce.com

Figure 1 shows the minimum set of IBM Information Server products needed to design ETL jobs for the Salesforce.com CRM.

  • IBM Information Server DataStage, which includes:
    • DataStage Client
    • DataStage Server
    • DataStage Metadata Repository
    • DataStage Domain Server
    These Information Server DataStage components can be installed on separate hosts or together on the same host.
  • Information Server Pack for Salesforce.com (IS Salesforce Pack)
    The IS Salesforce Pack is a companion product of the IBM Information Server. The pack has been developed to connect to Salesforce.com through Web services. The GUIs of the IS Salesforce Pack are installed on the DataStage Client. The runtime part of the pack is installed on the DataStage Server.
  • Apache Axis
    Apache Axis is an external component that is used to manage the transport layer of the IS Salesforce Pack. It contains libraries for the HTTP/HTTPS protocols, as shown in Figure 1a.
    Figure 1a. Apache Axis as external component

Architecture overview

Each software component in Figure 1 plays a different role in designing and executing the ETL jobs for Salesforce.com:

  • The Information Server DataStage Client and IS Salesforce Pack GUI components provide a smart UI to allow users to design ETL jobs, import and export metadata, and set up data operations to be performed on Salesforce.com.
  • The Information Server DataStage Server and IS Salesforce Pack server components enable users to schedule and run the ETL jobs.
  • The Information Server Domain Server manages Information Server user accounts and authorization.
  • The Information Server Metadata Repository is a single repository database for storing and sharing metadata definitions.

The Information Server Salesforce Pack includes three major components:

  • Salesforce metadata browser for browsing and importing the definitions of the Salesforce objects from Salesforce.com to DataStage.
  • Salesforce Load Stage for loading data from non-Salesforce data sources to Salesforce.com. It is a Salesforce certified loading solution.
  • Salesforce Extract Stage for extracting data from Salesforce.com to non-Salesforce data sources. It is a Salesforce certified extraction solution. Both Salesforce load and extract stages depend on the Salesforce.com Web service APIs.

Salesforce metadata browser

The pack has a smart metadata browser. It supports browsing both standard Salesforce objects and custom objects created by Salesforce users. It can import the selected object definitions from Salesforce.com back to IBM Information Server for future use. The metadata browser displays only the objects and fields that are applicable to the user-selected data operations. Figure 2 shows the metadata browser for the query operation. The metadata browser shows only the Salesforce objects on which users can perform query operations.

Figure 2. Metadata browser and query table objects in end-user’s Salesforce account

In Figure 3, the metadata browser displays the Salesforce objects that allow users to update their data or insert new data.

Figure 3. Metadata browser and Upsertable objects in end-user’s Salesforce account

As shown in Figure 2 and Figure 3, different sets of objects, including custom objects, are displayed based on the user-selected operation. This feature spares end users from having to guess which operations are valid for which objects, and vice versa.

The metadata browser shows not only the Salesforce objects but also the field attributes related to those objects, which further simplifies the end user’s task during the DataStage job design phase. In Figure 4, the Opportunity object is selected with the UPSERT operation. Only the fields in the object that qualify for the UPSERT operation are displayed.

Figure 4. Opportunity fields with Upsert operation

In Figure 5, the same Opportunity object is selected with the Query operation. Compared to Figure 4, a different set of fields is shown for the Opportunity object. Only the fields in the object that qualify for the Query operation are visible.

Figure 5. Opportunity fields with Query operation

The metadata browser imports the Salesforce object and field definitions into DataStage. You first select the Salesforce object of interest and its data fields, as shown in Figure 6. Then you click Import to convert the selected Salesforce object definitions into DataStage table definitions. The created table definition is shown in Figure 7. You can also save the table definitions to the DataStage repository for future use.

Figure 6. Importing Salesforce metadata
Figure 7. Saving the metadata definitions in DataStage
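The conversion from Salesforce object metadata to DataStage table definitions amounts to a type mapping. The pack’s actual mapping is not documented in this article; the Python sketch below uses an assumed, simplified type table purely to illustrate the idea:

```python
# A simplified, assumed mapping from Salesforce field types to
# DataStage-style SQL column types; the pack's real table is richer.
SF_TO_DS = {
    "id":       ("Char", 18),
    "string":   ("VarChar", None),   # length taken from the field definition
    "boolean":  ("Bit", 1),
    "date":     ("Date", None),
    "datetime": ("Timestamp", None),
    "double":   ("Decimal", None),
}

def to_table_definition(object_name, fields):
    """Convert imported Salesforce field metadata into column definitions."""
    columns = []
    for name, sf_type, length in fields:
        ds_type, fixed_len = SF_TO_DS[sf_type]
        columns.append((name, ds_type, fixed_len or length))
    return {"table": object_name, "columns": columns}

account = to_table_definition("Account", [
    ("Id", "id", 18),
    ("Name", "string", 255),
    ("CreatedDate", "datetime", None),
])
```

The resulting structure is what would be saved to the DataStage repository as a reusable table definition.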

Extract data from Salesforce

The Salesforce Extract Stage extracts data from Salesforce.com based on the user’s Salesforce.com credentials. The extracted data can be loaded into other data sources, such as Oracle®, DB2, SAP®, or just flat files. The Extract Stage depends on the Salesforce.com Web service API and Salesforce Object Query Language (SOQL).

In this section, a simple ETL job is designed to illustrate the design steps and the extract functions. Figure 8 shows the sample job, which replicates the Salesforce Account object into an Oracle database. You can create the Oracle Account table manually, or have it created automatically by the DataStage Open Database Connectivity (ODBC) connector at job run time.

Figure 8. DataStage job for extracting data from Salesforce.com to Oracle 9g with reject
Figure 9. Table Account in Oracle database

To set up the Salesforce extraction job, you first open the stage GUI and select the operation type. Four operation types are defined: Query, QueryAll, Get Deleted Delta, and Get Updated Delta. Figure 10 shows the selection of the Query operation.

Figure 10. Select the Query operation for extraction
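Behind these choices are four calls in the Salesforce.com Web service API: query() returns active records, queryAll() also returns deleted and archived records, and getDeleted()/getUpdated() return the records changed within a given time window. A rough Python sketch of the dispatch, using a hypothetical connection wrapper (only the four call names mirror the real API):

```python
def run_extract(conn, operation, object_name, soql=None, start=None, end=None):
    """Dispatch the selected operation to the matching Salesforce.com API call.

    `conn` is a hypothetical wrapper around the Web service session.
    """
    if operation == "Query":
        return conn.query(soql)                          # active records only
    if operation == "QueryAll":
        return conn.queryAll(soql)                       # also deleted/archived rows
    if operation == "Get Deleted Delta":
        return conn.getDeleted(object_name, start, end)  # deletions in the window
    if operation == "Get Updated Delta":
        return conn.getUpdated(object_name, start, end)  # updates in the window
    raise ValueError("unknown operation: " + operation)
```

The two delta operations are what make incremental synchronization jobs possible: instead of re-extracting the whole object, a scheduled job can fetch only what changed since its last run.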

After the operation is selected, you can either launch the metadata browser to select the business object that you want to query on, or manually enter the object name and SOQL statement. The metadata browser generates the SOQL statement as a reference. You can use the generated SOQL statement as is, or manually modify it. Figures 11 and 12 show the generated SOQL statements.

Figure 11. Select the object and fields using the metadata browser
Figure 12. SOQL statement is automatically generated
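The generated statement is plain SOQL, which resembles SQL but has no SELECT * form, so every field must be listed explicitly. A minimal sketch of how such a statement can be composed from the selections made in the browser:

```python
def build_soql(object_name, fields, where=None):
    """Compose a SOQL SELECT statement; SOQL requires an explicit field list."""
    statement = "SELECT " + ", ".join(fields) + " FROM " + object_name
    if where:
        statement += " WHERE " + where
    return statement

print(build_soql("Account", ["Id", "Name", "BillingCity"]))
# SELECT Id, Name, BillingCity FROM Account
```

Modifying the generated statement by hand, for example to add a WHERE clause, is how you would restrict the extraction to a subset of records.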

Run the data extraction operation

Figure 13 shows the setup of the ODBC connector to insert the extracted Salesforce data into the Oracle database.

Figure 13. Set up a connection to the Oracle database

Figure 14 shows the job run results.

Figure 14. Schedule and run the extraction job

Load data into Salesforce

This section uses a simple ETL job to illustrate the steps necessary for loading data into Salesforce.com. Figure 15 shows the sample job. The job extracts customer data from an Oracle database table using an ODBC Stage. It passes the extracted data to the Salesforce Load Stage (Salesforce_Contact), which loads the processed data into the contact object in the user’s Salesforce account. Figure 16 shows the sample data in the Oracle database table.

Figure 15. Loading data from the Oracle database to the Salesforce contact
Figure 16. Sample data in an Oracle database table

You need to map the source fields from the Oracle database to the Salesforce contact object. You can do so after you have imported the metadata from both the Oracle database and Salesforce.com.

To set up the Salesforce Load Stage Salesforce_Contact, you choose the Upsert operation, as shown in Figure 17.

Figure 17. Choose the operation

Then, you import the fields from the Salesforce object contact through the metadata browser. Figure 18 shows the imported result.

Figure 18. Imported Salesforce contact fields

Figure 19 shows the setup of the ODBC connector to extract the contact information from the Oracle database.

Figure 19. Extracting data from Oracle

In Figure 20, the data fields from the Oracle database are mapped to the fields in the Salesforce contact object.

Figure 20. Mapping between an Oracle database and Salesforce contact

In Figure 20, the REPID field from the Oracle database is mapped to the Salesforce OwnerId field. REPID is only four characters long, while the Salesforce OwnerId requires a length of 18. This mapping generates errors that cause the Salesforce Load Stage to reject the records. Figure 21 shows that all the records to be loaded into Salesforce are rejected and sent to the reject link.

Figure 21. Loading failed with rejection
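A pre-load length check would flag this class of error before any Web service call is made. A minimal sketch (the 18-character requirement reflects the long form of Salesforce record IDs; the sample value is hypothetical):

```python
def validate_lengths(record, required_lengths):
    """Return the fields whose values do not meet the required length."""
    return [field for field, length in required_lengths.items()
            if len(record.get(field, "")) != length]

# A 4-character REPID value from Oracle mapped into the 18-character OwnerId:
bad = validate_lengths({"OwnerId": "7499"}, {"OwnerId": 18})
print(bad)  # ['OwnerId'] -- this record would be rejected by the load
```

In a real job this kind of check would sit in a Transformer stage upstream of the Salesforce Load Stage, so malformed rows never reach Salesforce.com at all.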

As shown in Figure 22, the output file for the rejection link records the rejection cause for each rejected record. Users can use this file to correct the errors and then reload the records into Salesforce.

Figure 22. Rejection cause
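One way to work with the reject output is to group the rejected records by cause before correcting them. The sketch below assumes a simple comma-delimited reject format with the cause in the last column; the actual file layout depends on how the reject link is configured, and the error text shown is illustrative:

```python
import csv
from collections import defaultdict
from io import StringIO

def group_rejects(reject_file):
    """Group rejected records by rejection cause (assumed to be the last column)."""
    by_cause = defaultdict(list)
    for row in csv.reader(reject_file):
        *record, cause = row
        by_cause[cause].append(record)
    return by_cause

sample = StringIO(
    "7499,Allen,MALFORMED_ID: bad Owner ID value\n"
    "7521,Ward,MALFORMED_ID: bad Owner ID value\n"
)
groups = group_rejects(sample)
print({cause: len(rows) for cause, rows in groups.items()})
# {'MALFORMED_ID: bad Owner ID value': 2}
```

Grouping by cause makes it easy to apply one correction, such as fixing the OwnerId mapping, to every record that failed for the same reason before reloading.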

Parallel loading

Parallel loading into Salesforce enables the end user to speed up the loading process, which runs over Web services calls across the Internet. Typically, loading half a million rows of data into Salesforce through sequential loading over a single connection can take two to four hours. This kind of performance is rarely acceptable in real-world data warehousing. With the built-in DataStage parallel capability, you can set up multiple parallel-load node configuration files, depending on end users' needs, and then select the proper configuration file at run time without changing any job designs. This process not only simplifies the operator's task of designing jobs, but also speeds up the loading process. With multiple nodes, the data is automatically partitioned by the Salesforce Pack and loaded through multiple connections to Salesforce. A typical two-hour loading job can be done in 30 minutes. This kind of performance can improve the customer's data warehousing process and increase productivity.
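The effect of multiple nodes can be pictured as a round-robin split of the input rows across N concurrent connections. A back-of-the-envelope sketch; the 4-node count is an assumption chosen to match the two-hours-to-30-minutes example and presumes near-linear scaling of a network-bound load:

```python
def partition(rows, nodes):
    """Round-robin the input rows across `nodes` parallel connections."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

parts = partition(list(range(500000)), 4)
print([len(p) for p in parts])  # [125000, 125000, 125000, 125000]

# If a sequential load takes 120 minutes and scaling is near-linear:
print(120 / 4)                  # 30.0 (minutes)
```

In practice the speedup flattens once Salesforce.com API throughput or network bandwidth becomes the bottleneck, which is why the node count is tuned through configuration files rather than fixed in the job design.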


Data enrichment with Salesforce

The Quality stages within DataStage can cleanse the data and remove erroneous records. In the sample job shown in Figure 23, a Quality stage removes duplicate records that share the same external ID before the records are updated or inserted into Salesforce.

Figure 23. Remove duplicated records

Figure 24 defines the column used as the key for detecting duplicate records. In this case, the ExternalID__C column is selected.

Figure 24. ExternalID__C as the key defines duplicated records
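The stage's behavior can be sketched as keeping one record per key value (here the first one seen; the DataStage Remove Duplicates stage also lets you retain the last duplicate instead):

```python
def remove_duplicates(records, key="ExternalID__C"):
    """Keep the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

rows = [
    {"ExternalID__C": "A001", "Name": "Acme"},
    {"ExternalID__C": "A001", "Name": "Acme Corp"},  # duplicate key, dropped
    {"ExternalID__C": "A002", "Name": "Globex"},
]
print(len(remove_duplicates(rows)))  # 2
```

Deduplicating on the external ID before an Upsert matters because Salesforce uses that ID to decide between insert and update; sending two rows with the same ID in one batch can produce conflicting results.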

Conclusion

This article has demonstrated how to integrate Salesforce.com CRM with external data sources using IBM Information Server and the Information Server Salesforce Pack. It explained the Salesforce data loading and extracting processes and the data enrichment features in the Quality stage.

IBM Information Server provides leading integration solutions to two other critical issues that many Salesforce customers are facing:

  • Data Quality: The data that builds a data warehouse often comes from various data sources. The structure of the legacy data is often not documented and the data quality is poor. The Information Server Information Analyzer product analyzes your data and determines the data structure and quality. It helps you understand your data. The Information Server QualityStage product standardizes and matches any type of information to create high quality data.
  • Data Volume: There is often a huge amount of data that needs to be processed regularly in a data warehouse environment, and sometimes the data volume grows beyond expectations. This issue needs to be addressed with a scalable ETL architecture. IBM Information Server leverages pipeline and partition technologies to support high data throughput. It can be deployed on symmetric multiprocessing (SMP) and massively parallel processing (MPP) computer systems to achieve maximum scalability. Node configuration is seamless and can be changed through multiple configuration files, even at production time. The pack has been measured to load 1.2 million rows of data in one hour in a multi-node configuration, behind a company’s proxy firewall.

Acknowledgements

We would like to thank Milind Tamaska for his feedback and reviews of this article.


