Introducing InfoSphere Streams 2.0 features, Part 2: Using collections

C++ and SPL collections examples

With the introduction of IBM InfoSphere® Streams 2.0 and the Streams processing language (SPL), developers gained access to two new collection types: set and map. This article describes these three collection types and how to exploit them as part of a native SPL application. You will also learn how to access collections using a primitive C++ Streams operator.

Chris Howard (chris_howard@ie.ibm.com), Big Data Solution Architect, Office of the CTO, IBM

Chris HowardChris Howard has been with IBM since 1998 and is currently a solution architect focused on Big Data solutions (InfoSphere Streams and InfoSphere BigInsights) as part of the CTO's office within IBM Software Group. In his previous role, he managed the EMEA Stream Computing Centre in Dublin and was responsible for client solution development for IBM's stream processing solution: IBM InfoSphere Streams. He is a Chartered Fellow of the British Computer Society.



02 April 2012

Also available in Chinese

Introduction

This article provides an introduction to the three collection types currently available to InfoSphere Streams 2.0 developers. Not only are the set and map composite types available with the Streams 2.0 version, there is added flexibility around the nesting of both primitive and composite types. This article illustrates the power of the new types using a number of samples in both the Streams processing language and C++.

Prerequisites

This article is written for Streams component developers and application programmers who have Streams programming language skills and C++ skills. You can use this article as a reference, or you can examine and run the samples to demonstrate the techniques described. To execute the samples, you should have a general, working knowledge of Streams programming.

Running the samples requires a computer running Red Hat Enterprise Linux® with InfoSphere Streams V2.0 or later installed.

Understanding collections

The Streams processing language (SPL) supports both primitive types (int, string, and so on) and composite types (set and map) that can be used to represent streaming data as part of an SPL application. One group of the composite types, known as collections, can be used to represent complex arrangements of streaming data and is supported with built in SPL functions to access and manipulate the collection contents.

In earlier versions of InfoSphere Streams, the SPADE language supported a limited set of composite types, including matrix and list, that could be used to manage groups of primitive types. Streams 2.0 discontinued the matrix type. The list of supported collections now includes:

  • Set (new in 2.0)
  • Map (new in 2.0)
  • List

Collections have been further extended to represent complex nesting of composite types, including lists of lists, which adds significant flexibility to the data that collections manage.

Figure 1. Streams primitive and composite types
diagram shows primitive types (boolean, enum, numeric, timestamp, string, blob) and composite types (tuple and collection, which includes list, set, and map)

Bounded and unbounded types

As for primitive types such as strings, SPL also implements collection types of bounded length. The size of any bounded length collection needs to be defined using a compile time constant. It is also possible to declare bounded collections containing unbounded types, such as a bounded list of rstring. From a developer perspective, there is very little difference between bounded and unbounded types. However, there are some compile-time optimization advantages when using bounded types.

Note that whether bounded or unbounded, SPL limits any collection to a maximum of 231-1 elements.

The following sections describe the composite types in more detail.


Understanding the list collection

For SPADE users

Developers moving from Streams 1.2 to Streams 2.0 are faced with the removal of the matrix composite type. In many cases, the matrix type was being used to emulate a nested list (list<list< >>). With the introduction of nested composite types, developers now have a far richer and more flexible set of capabilities to express streaming data.

Lists are sequence containers with their contents ordered linearly. Lists allow random, zero-based access to their contents. In simple terms, they can be thought of as dynamic arrays, allowing access to individual list items; but at the same time, they allow for dynamic allocation of new entries with the list being expanded or contracted as needed.

One additional advantage over traditional arrays is that lists in SPL provide for tighter boundary checking, ensuring that requests to access list elements beyond the list boundary are prohibited.

The list collection type has the constructors shown in Listing 1.

Listing 1. List collection type constructors
list<T>		// unbounded
list<T>[n]	// bounded

// Where T is the type of the elements, and n is the size for bounded lists

Figure 2 shows a simple example of a one-dimensional list containing six rstrings.

Figure 2. A list of fruits
diagram shows one dimensional list where fruits (apple,orange,banana,pear,pineapple,apple) are mapped to Elements (0,1,2,3,4,5)

Listing 2 shows how to declare and access that simple list.

Listing 2. Declaring and accessing a simple list
...
mutable rstring myFruit;

// Immutable - must be initialized upon declaration 

list<rstring> fruit = ["apple", "orange", "banana", "pear", "pineapple", "apple"];
myFruit = fruit[1];		// orange
...

Note that lists do not need to be unique (note the 2 apples) with individual elements accessed by their position index.


Understanding the set collection

Sets are unordered containers that enable the retrieval of their contents based on their value (the value of a set element is also its key). Sets store unique items (no duplicates). Once an element has been added to a set, it cannot be modified. However, list elements can be added or removed with the set expanding and contracting as needed.

Listing 3 shows the set collection type constructors.

Listing 3. Set collection type constructors
set<T>		// unbounded
set<T>[n]	// bounded

// Where T is the type of the elements, and n is the size for the bounded set

Figure 3 shows a simple set of 5 rstrings.

Figure 3. A set of vehicles
Image shows simple set of 5 rstrings (car, truck, airplane, boat, bicycle)

Listing 4 shows how to declare an unordered set.

Listing 4. Declaring a simple, unordered set
set <rstring> vehicles = {"car", "truck", "airplane", "boat", "bicycle"};

Understanding the map collection

Maps are implemented as associative containers that store elements in a key-to-value pairing. Like sets, a map collection requires that all keys are unique. Key and value items can be declared as different types with full support for nested types (even keys can be of any type, including other composite types).

Listing 5 shows the map collection type constructors.

Listing 5. Map collection type constructors
map<K,V>		// unbounded
map<K,V>[n]		// bounded

// Where K is the key component, V is the value component, and
n: is the size for the bounded map

Figure 4 shows an example of a map using both an int:rstring and an rstring:int.

Figure 4. A map of days in each month
Image shows mapping of days in each month by using an int32 key

Listing 6 shows how to declare the map of days.

Listing 6. Declaring the map using an int32 key
map <int32, rstring> monthsYear = {1 : "January", 2 : "February", 3 : "March", ...};

Figure 5 shows an example of a map of months.

Figure 5. A map of months
Image shows mapping of months by using an rstring key

Listing 7 shows how to declare the map of months.

Listing 7. Declaring the map using an rstring key
map <rstring, int32> daysMonth = {"January" : 31, "February" : 28, "March" : 31, ...};

Working with collections

This section describes how to works with collections.

Access operators

Once you declare the members contained in a collection, Streams provides a number of simple access operators. Following are the three basic mechanisms to work with collection items:

  • l[i] to access items stored in a list
  • in to check membership of a collection
  • for(T x in s) ... to iterate over the members of a collection. Note that as a for-loop iterates over a collection, that collection becomes immutable for the body of the for-loop.

Note that when accessing lists or maps using the in operator, the operator on the left is the access key. If you want to work with the value rather than the key, you need to use the appropriate SPL collection function (described below) rather than the in operator.

SPL collection functions

In addition to the access operators, the Streams Processing Language (SPL) standard toolkit provides several functions that can be used to manipulate collections as part of SPL expressions. SPL functions are either declared as mutable or immutable (non-mutating). A mutating function attempts to modify the contents of a passed input parameter (a collection, in this case). A non-mutating function returns a new object (a collection) containing the function results. Listing 8 shows an example.

Listing 8. SPL functions
<list T> public T concat(T values1, T values2)
// Returns a new list of values concatenating the contents of values1 and values2

<list T> public void concatM(mutable T values1, T values2)
// Appends values2 to the end of values1

The functions cover a wide range of capabilities covering basic element access, find functions, sorting, and set-specific manipulation. Table 1 gives a detailed view of the functions and their application to the various collection types.

Table 1. Built In SPL collection functions
GroupFunctionDescriptionListSetMap
Adding, joining, removingappendMAppend an item to a listX
clearMClear (empty) a collectionXXX
concat / concatMConcatenate elementsXX
insert / insertMInserts new elementsXXX
remove / removeMRemove elementsXXX
Manipulationreverse / reverseMReverse list of valuesX
makeSequenceMake an element sequenceX
Element accessatAccess multiple elementsX
sliceExtract elementsX
selectFromListSelect elements from 2 listsX
Finding elementsfindFind matching valuesX
hasFind whether a given value exists in a collectionXXX
findFirstFind the first occurrence of a matching value in a listX
findFirstNotOfFind the first occurrence of a non-matching value in a listX
findFirstOfFind the first occurrence of a matching value in a listX
findLastFind the last occurrence of a matching value sequence in a listX
findLastNotOfFind the last occurrence of a non-matching value in a listX
findLastOfFind the last occurrence of a matching value in a listX
findNotOfFind non-matching values in a listX
findOfFind matching values in a listX
SizesizeGet the size of a listX
countDistinctCompute distinct count of a listX
ComparisonlexicographicalCompareCompare two lists in lexicographical orderX
pairwiseCompareCompare based on elementX
pairwiseMaxCompare and fetch the largerX
pairwiseMinCompare and fetch the smallerX
Set manipulationsetDifferenceCompute set differenceX
setIntersectionCompute set intersectionX
setUnionsetUnionX
Element sortingsort / sortMSort valueX
sortIndicesSort values and return indiciesX
ConversiontoSetConvert a list to a setX

Using primitive operators and collections

This section describes how to use primitive operators and collections.

SPL type handling within C++ operators

InfoSphere Streams provides seamless exchange of data between native SPL and primitive operators using its mapping of SPL types to C++ types (normally achieved through extending common C++ classes.) This exchange is true for the standard primitive and more complex (and nested) composite types.

SPL collections under the covers

Table 2 shows the mapping of SPL and C++ types. Notice the C++ base implementation, because this gives some guidance on the behaviors of each SPL type.

Table 2. SPL and C++ type mapping
SPL typeC++ typeC++ base implementationC++ reflective type
list<T>SPL::list<T>std::vector<T>SPL::List
set<T>SPL::set<T>std::tr1::unordered_set<T>SPL::Set
map<T,K>SPL::map<T,K>std::tr1::unordered_map<T,K>SPL::Map
list<T>[N]SPL::blist<T,N>N/ASPL::BList
set<T>[N]SPL::bset<T,N>N/ASPL::BSet
map<T,K>[N]SPL::bmap<T,K,N>N/ASPL::BMap

Example

The example demonstrates the passing of both unbounded and bounded composite types between SPL and a C++ primitive operator. Figure 6 shows the application flow at a high level.

Figure 6. Application flow
Image of Streams application flow for collections, which demonstrates the passing of both unbounded and bounded composite types between SPL and a C++ primitive operator

Listing 9 shows that simple scenario of passing a nested composite type to a Streams C++ primitive operator. The type is declared as a simple, two-dimensional list of integers, which is averaged and returned as a one-dimensional list of float values. For simplicity, the initial list is populated from a file source that reads a tuple of the correct format.

Listing 9. SPL code passing a 2D list to a primitive operator
composite Main {

     // Tuple schema definitions for re-use later
     type
          // Unbounded list of list of int32       
          listSchema2d = tuple <list<list<int32> > inputMatrix>;

          // Bounded list              	
          listSchema1d = tuple <list<float64>[100] outputList>;	
		
     ...

    graph

     // Read a CSV file containing 100x100 matrix of random integers
     stream <listSchema2d> matrixData = FileSource() {
          param 
               format : csv;
               file : "matrix.txt";
     }

     // Pass the matrix data to the Primitive Operator for processing
     stream <listSchema1d> listData = myOp(matrixData) {} 

     // Sink the results of the Primitive Operator to a new file
     () as sinkListData = FileSink(listData) {
          param 
               format : csv;
               file : "list.txt";
     }
     ...
}

Listing 10 describes the next step.

Listing 10. C++ primitive operator to show passing and accessing SPL::list (_cpp.cgt)
...
    
// Tuple processing for non-mutating ports
void MY_OPERATOR::process(Tuple const & tuple, uint32_t port)
{

     // Define the input and output port tuples
     IPort0Type const & tp = static_cast<IPort0Type const&>(tuple);
     OPort0Type outputTuple;

     // Get a reference to the 2d list of list of int32
     SPL::list<SPL::list<SPL::int32> > const & inputMatrix = tp.get_inputMatrix();
     // Could also use IPort0Type::inputMatrix_type const& ....

     // Loop over the input matrix and calculate the average of each row
     for (uint32_t dim1=0, dim1Max = inputMatrix.size(); dim1 < dim1Max; dim1++) {

          SPL::list<SPL::int32> const& row = inputMatrix[dim1];    	     
          // Could also use IPort0Type::inputMatrix_type::value_type const& ...
        
          // Sum the data values in the row
          SPL::float64 sum = 0.0;
          for (uint32_t dim2=0, dim2Max = row.size(); dim2 < dim2Max; dim2++)
               sum += row[dim2];

          // Append the average of the row to the output tuple
          outputTuple.get_outputList().push_back (sum / row.size());
     }

     // Submit output tuple
     submit(outputTuple, 0); 
}
...

Conclusion

This article offered an introduction to using collections with InfoSphere Streams, whether using collections natively within SPL to manage complex data structures or passing to primitive operators.

As always, the best way to find out more is to try it yourself. Armed with these code snippets, build your own SPL code, and start to experiment with the collection types and the in-built functions available to you. For additional information, see Resources for a link to the Streams forum.

Acknowledgments

Many thanks to Mark Mendell for his assistance and review of this article.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Big data and analytics on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics, Information Management
ArticleID=807629
ArticleTitle=Introducing InfoSphere Streams 2.0 features, Part 2: Using collections
publish-date=04022012