Announcing the Availability of Time Series Functionality in Watson Studio Spark Environments


We are excited to announce the availability of Time Series Libraries in Watson Studio Spark Environments starting today (October 8, 2020).

This library, developed by IBM Research, includes a full set of time series functionality that is not available in any competing offering. It joins our IBM Research Assets, Geospatial functionality, Data Skipping, and Parquet Encryption libraries as a fully supported feature of Watson Studio Spark Environments.

The time series library allows users to perform various key operations on time series data, including construction of a collection of time series, imputation functions (like segmentation), transformers, reducers, joins, and machine learning functions (such as forecasting, clustering, and discriminatory sequence mining). The library supports various time series types, including numeric, categorical, and arrays.

Examples of time series data include the following:

  • Stock share prices and trading volumes
  • Clickstream data
  • Electrocardiogram (ECG) data
  • Temperature or seismographic data
  • Network performance measurements
  • Network logs
  • Electricity usage as recorded by a smart meter and reported via an Internet of Things data feed

Key features of the Time Series Libraries in Watson Studio Spark Environments

I. Data model

  • A core data model for univariate and multivariate time series  
  • Time Reference Systems for handling different timestamp representations
  • Support for aperiodic, duplicate, and out-of-order timestamps
  • Spark RDD and DataFrame extensions for time series
  • Numeric and categorical time series
  • Lossless and lossy compression

II. Transformation and segmentation functions

  • Math: Mean, variance, skew, correlations, PAA, SAX, covariance matrix, Graphical Gaussian Model, etc.
  • Statistical tests: Augmented Dickey-Fuller, Ljung-Box, Granger causality
  • Distance metrics: Dynamic Time Warping, Damerau-Levenshtein, Longest Common Subsequence, Jaro-Winkler
  • Time series reconciliation: Hungarian algorithm, Earth mover's distance
  • Change point detection: CUSUM, Bayesian, Gaussian
  • Segmentation: Window, Record-based, Burst-based, Anchor, Regression
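To give a feel for two of the operations listed above, here is a minimal sketch of Piecewise Aggregate Approximation (PAA) and Dynamic Time Warping (DTW). It is plain Python and does not use the time series library's own API; the function names `paa` and `dtw_distance` are hypothetical and chosen only for illustration.

```python
import math


def paa(values, segments):
    """Piecewise Aggregate Approximation: replace the series with the
    mean of each of `segments` roughly equal-width chunks."""
    n = len(values)
    if segments <= 0 or segments > n:
        raise ValueError("segments must be between 1 and len(values)")
    reduced = []
    for i in range(segments):
        start = i * n // segments
        end = (i + 1) * n // segments
        chunk = values[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences,
    using absolute difference as the point-wise cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated 2 can be warped onto a single point, whereas a point-by-point distance would not even be defined for series of different lengths.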

III. Forecasting functions

  • ARIMA
  • Holt-Winters
  • BATS
  • Vector auto-regression
  • Anomaly detection
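The library's own forecasting API is covered in the documentation. As a rough illustration of the idea behind exponential-smoothing forecasters such as Holt-Winters, the sketch below implements Holt's linear-trend method, which is Holt-Winters without the seasonal component. The function name `holt_linear_forecast` and the default smoothing parameters are assumptions made for this example.

```python
def holt_linear_forecast(series, horizon, alpha=0.5, beta=0.5):
    """Forecast `horizon` future points with Holt's linear-trend method.

    alpha smooths the level and beta smooths the trend; both are
    illustrative defaults, not tuned values.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first value and the trend with
    # the first difference.
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Extrapolate the final level along the final trend.
    return [level + step * trend for step in range(1, horizon + 1)]
```

On a perfectly linear series such as `[1, 2, 3, 4, 5]` the method continues the line, so a two-step forecast is approximately `[6, 7]`. A full Holt-Winters model adds a third smoothing equation for the seasonal term.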

IV. Joins

  • A complete suite of temporal joins, including inner, outer, left-outer, right-outer, left-inner, and right-inner
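As a conceptual sketch of what a temporal join does, the example below aligns two time series on their timestamps. It is plain Python rather than the library's API: the `temporal_join` function, its `how` values, and the list-of-tuples representation are assumptions for illustration, and it assumes timestamps are unique within each series. Only the inner and outer variants are shown.

```python
def temporal_join(left, right, how="inner", fill=None):
    """Join two time series given as lists of (timestamp, value) pairs.

    Returns a list of (timestamp, (left_value, right_value)) sorted by
    timestamp. Missing values are replaced with `fill`.
    """
    left_map = dict(left)
    right_map = dict(right)
    if how == "inner":
        keys = left_map.keys() & right_map.keys()   # timestamps in both
    elif how == "left-outer":
        keys = left_map.keys()                      # every left timestamp
    elif how == "right-outer":
        keys = right_map.keys()                     # every right timestamp
    elif how == "outer":
        keys = left_map.keys() | right_map.keys()   # timestamps in either
    else:
        raise ValueError(f"unsupported join type: {how}")
    return [
        (t, (left_map.get(t, fill), right_map.get(t, fill)))
        for t in sorted(keys)
    ]


temperature = [(1, 20.5), (2, 21.0), (3, 21.4)]
humidity = [(2, 40), (3, 42), (4, 45)]
joined = temporal_join(temperature, humidity, how="inner")
```

Here `joined` contains only timestamps 2 and 3, the ones present in both series; switching `how` to `"outer"` keeps all four timestamps and fills the gaps with `fill`.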

V. SQL extensions

VI. Spark machine learning

  • Sequence mining
  • Time series clustering: K-means, K-shape, Motif-based, Cluster drift detection
  • Data connectors for feature engineering that provide Spark DataFrame iterators to TensorFlow and scikit-learn

For the full list of functions and how to get started, please refer to the documentation.

Learn more about data lakes in the IBM Cloud

If you would like to know more about time series use cases on IBM Cloud, please reach out to Kiran Guduguntla or Josh Rosenkranz.
