Find databases with protected health information

Inventory PII, PHI, and other sensitive data within databases and spreadsheets

Identity theft and medical fraud are growing problems. They are so big the U.S. government is spending billions of dollars securing its own computer systems and has written thousands of pages of new regulations that you must follow to help protect your customer and employee data. To comply with new regulations and properly secure data, you will need to find personally identifiable information (PII) and protected health information (PHI) in your databases and documents. Both PHI and PII are conceptually easy to understand but very difficult to track in the thousands of relational data stores, files, and spreadsheets that make up a typical organization's IT environment. This article describes some methods to automatically identify and inventory PII, PHI, and other sensitive data with databases and spreadsheets using Java™ technology and the Apache Ant build tool.

Share:

Shahid N Shah, CEO and Chief Architect, Netspective Communications, LLC

Shahid Shah photoShahid N. Shah is an internationally recognized and influential healthcare IT thought leader who is known as "The Healthcare IT Guy" across the Internet. He is a consultant to various federal agencies on IT matters and winner of Federal Computer Week's coveted "Fed 100" award given to IT experts that have made a big impact in the government. Shahid has developed multiple clinical solutions over his almost 20 year career. He helped design an electronic health record solution for the American Red Cross and deploy it across thousands of sites; he built two web-based EMRs used by hundreds of physicians; and he designed large groupware and collaboration sites used by thousands. As an ex-CTO for a billion-dollar division of CardinalHealth he helped design advanced clinical interfaces for medical devices and hospitals. Shahid also serves as a senior technology strategy advisor to NIH's SBIR/STTR program helping small businesses commercialize their healthcare applications.



19 October 2010

Also available in Portuguese Spanish

PII and PHI data collection and storage is expanding

Identity theft is already a well-known problem in the finance industry, and companies are trying to take steps to counter it as best they can. Medical-related fraud is nascent at this time—currently less than 5 percent of reported breaches occurring within healthcare—but growth in healthcare fraud will continue to accelerate as government stimulus funds expand the use of electronic health records. When breaches for identity and medical theft occur, they do so because most information systems that have value to organizations contain some level of personally identifiable information (PII) or protected health information (PHI). PII and PHI are the most private kinds of data stored about people and, if they are breached or stolen, they cause adverse events like identity theft or medical fraud. For example, the Department of Defense, Department of Veterans' affairs, handlers of PII and PHI (such as an insurance or personal investment company), and employers (such as a hotel chain) have all reported significant losses of PII and PHI data, in some cases up to 25 million records. Protecting the information is not easy but it is vital.

PII is usually defined as any name or number that may be used, alone or in conjunction with any other information, to identify a specific individual. It is usually an individual's name, in combination with one or more of the following: driver's license number, state ID card number, Social Security number, account number, credit card number, or debit card number.

PHI, defined as any information that links a person with their health condition, is a little more complex. There are eighteen official identifiers defined by the Health Insurance Portability and Accountability Act (HIPAA) and they fall into the following groups:

  • Names
  • Dates
  • Numbers
  • Addresses
  • Graphics

If all these groups, or identifiers, are removed or somehow encrypted the information becomes de-identified or anonymized and is no longer considered PHI. The specific list of eighteen types of data defined in government regulations such as HIPAA consists of:

  1. Names
  2. All geographic subdivisions smaller than a state, including: street address, city, county, precinct, ZIP Code, and their equivalent geographical codes, except for the initial 3 digits of a ZIP Code if, according to the current publicly available data from the Bureau of the Census:
    1. The geographic unit formed by combining all ZIP Codes with the same three initial digits contains more than 20,000 people.
    2. The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000. According to the August 2002 Final Modifications to the Privacy Rule and utilizing the 2000 Census data, there are 17 restricted three-digit ZIP Codes that correspond to populations of 20,000 or fewer persons and must be changed to 000 to be de-identified: 036, 059, 063, 102, 203, 556, 692, 790, 821, 823, 830, 831, 878, 879, 884, 890, and 893.
  3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, and date of death. All ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
  4. Telephone number
  5. Fax number
  6. E-mail address
  7. Social Security number
  8. Medical record number
  9. Health plan beneficiary number
  10. Account numbers
  11. Certificate or license number
  12. Vehicle identifiers and serial numbers (including license plates)
  13. Device identifiers and serial numbers
  14. URL address
  15. IP address
  16. Biometric identifiers, such as fingerprints and voiceprints
  17. Full-face photos and any comparable images
  18. Any other unique identifying number, characteristic, or code

PHI appears in different forms: paper or hard copy records, electronic records, and verbal and sign language communications. It is easier to protect paper records and verbal communications because we've been using and protecting these types of records for hundreds of years and generally, we're good at it. Electronic communications are less than sixty years old and most health data is less than thirty years old, so we're only beginning to understand how to protect it. Unlike stolen paper records that are difficult and time consuming to copy and disseminate, stolen electronic records are easy to distribute. For this reason, this article focuses on finding and protecting electronic records.


A guide to protecting PII confidentiality

The National Institute of Standards and Technology (NIST) publication SP 800-122, "Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)" (see Resources) is one of the best free and unbiased resources available to help you understand what privacy data is and how you protect it. Originally created to help federal organizations identify PII and determine the level of protection each case requires, this document is just as applicable to commercial and non-profit organizations. It advises what the appropriate protection and safeguards for PII are and how to handle data breaches. While the focus is on PII, it can support PHI for your own privacy and security plans as well.


To protect privacy data, you need to know where it is

You know you need to protect PII and PHI that reside in your databases. You also know that encryption in transit is a minimum requirement to secure private data. However, what data should be encrypted and protected? Declaring everything as private is not wise because data security costs are high.

Protecting large enterprise databases is straightforward—at least you know where the data is. The difficult task is how to handle the dozens, perhaps hundreds, of Lotus® Notes® documents and Microsoft® Word, Access®, and Excel® files. While one could treat these documents as simple file management problems they are more than that, they are real applications and real databases. Any reasonable analysis of HIPAA privacy concerns related to patient information in a sizable organization could uncover thousands of Microsoft Word, Access, and Excel files with health data needing protection.

Sensitive data discovery

This article presents some data discovery (inventorying) mechanisms to help identify PII and PHI that might reside in any data source that can be connected through JDBC or ODBC. The basic approach is as follows:

  • Look at the table, column, and other schema object names to see if they match known patterns and assume that all matching objects have PII or PHI data.
  • Look at the actual SQL or queries being conducted to scan the SQL Data Query Language (DQL) or results generated.
  • Look at data at rest within the database or files using string or pattern matching for specific words like do not distribute, confidential, \d{3}\-\d{2}\-\d{4} (SSN), and flag them.
  • Look at all content being transferred through firewalls or proxies using similar matching patterns.

How to classify data as sensitive or private

There are many techniques for classifying data, but they either:

  • Match deterministically on specific keywords or patterns in the actual data or the metadata that defines the data.
  • Algorithmically use document clustering with taxonomic or other classification, natural language analysis, Bayesian inference, or statistical analysis.

The algorithmic mechanisms are beyond the scope of this paper, but you'll see that it's not difficult to match on keywords in metadata.


Auto-discovery of PII or PHI columns in JDBC or ODBC data sources

Almost all popular relational databases are directly accessible through JDBC; Microsoft Office documents such as Access or Excel are accessible through JDBC-ODBC bridges. Using either the bridge model or direct JDBC along with the JDBC java.sql.DatabaseMetaData class gives you everything you need to do the quick scanning of database columns, tables, and other schema objects. Wrap that into an Ant task and you'll have a reusable command line tool that you can use on any platform connected to any database.

Defining the filters file

Define a text file that contains a list of different regular expressions to look for. Call it the filters list. You can create a file called filters.txt (see Download for the file) and it can start with the patterns shown in Listing 1. You will want to expand the file so that it has all your specific rules.

Listing 1. filters.txt contains sample regular expressions that identify patterns to locate
^.*(ssn|social|security).*$
^.*name.*$
^.*address.*$
^.*city.*$
^.*state.*$
^.*zip.*$
^.*county.*$
^.*precinct.*$
^.*(email|e-mail|mail).*$

Using the Ant Task

Commercial and open source scanning tools

Several other open source and commercial scanning tools start with similar functionality but add more features. When you look for third-party tools, consider automated discovery (the tools automatically find databases and record sources with PII/PHI), configurable templates (you add your own rules), broad coverage (all files, databases, and network transfers are covered), content scanning, and auditing.

To use the Ant task you will need to create an Ant Project File (see Listing 2). Notice that you'll be using the previously defined filters.txt file. You can have as many <scanner> tags as you need to search through all your different databases, Access and Excel documents, and other files. The full power of Ant is available to use variables, scripting, and so on.

Listing 2. scanner.xml—the Ant Project File that describes the JDBC connections and options
<?xml version="1.0" encoding="utf-8"?>
<project id="PII and PHI Scanner">
    <taskdef name="scanner" classname="PrivacyDataScanner"/>
    <scanner  catalog=""  schemaPattern=""
              filtersFile="filters.txt"
              output="database-one-scanner-output.txt"

              driver="com.ibm.db2.jcc.DB2Driver"
              url="jdbc:db2://HOST:5702/INSTANCE"
              userid="USER_ID"
              password="PA$$WORD">
    </scanner>

    <scanner  scanner  catalog=""  schemaPattern=""
              filtersFile="filters.txt"
              output="database-two-scanner-output.txt"

              driver="com.ibm.db2.jcc.DB2Driver"
              url="jdbc:db2://HOST:5702/INSTANCE2"
              userid="USER_ID2"
              password="PA$$WORD2">
    </scanner>
</project>

If you want to search additional databases or documents you simply add additional <scanner> sections in the Ant project file. I suggest adding all the different database connections into one file or grouping different databases by system or department in separate files. Remember you can scan Microsoft Access or Excel documents by using the JDBC-ODBC bridge.


The Ant task Java source (PrivacyDataScannerTask.java)

You could write the privacy data scanner code as a command line Java program or use various other techniques to make it scriptable. However, if you write the privacy scanner code as an Ant task it becomes easy to inject into the build automation and other scripts, and is more reusable across projects that use Ant (see Listing 3). Because your code is going to be using JDBC, it makes sense to extend the built-in Ant task class org.apache.tools.ant.taskdefs.JDBCTask.

Listing 3. PrivacyDataScannerTask.java, part 1: The import statements
import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.taskdefs.JDBCTask;

import java.io.*;
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

Note that you use Ant to import those classes, and you'll need Ant 1.6 or above in your CLASSPATH (see Listing 4).

Listing 4. PrivacyDataScannerTask.java, part 2: The class header and setters
public class PrivacyDataScannerTask extends JDBCTask {

    // inherits attributes and setters for JDBC connection parameters
    private String catalog;
    private String schemaPattern;
    private File output;
    private File filtersFile;
    private List<String> filters = new ArrayList<String>();

    public void setCatalog(final String catalog) {
        this.catalog = catalog;
    }

    public void setSchemaPattern(final String schemaPattern) {
        this.schemaPattern = schemaPattern;
    }

    public void setOutput(final File output) {
        this.output = output;
    }

    public void setFiltersFile(final File filtersFile) throws IOException {
        this.filtersFile = filtersFile;
        final FileReader fileReader = new FileReader(filtersFile);
        final BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = null;
        while ((line = bufferedReader.readLine()) != null)
            filters.add(line);
        bufferedReader.close();
    }

Your task is extending a built-in Ant Task (JDBCTask) and it automatically inherits all the JDBC connection setting properties. The setCatalog, setSchemaPattern, setOutput, and setFiltersFile are all automatically called by Ant when the project references the XML attributes in the <scanner> tag shown in Listing 5.

Listing 5. PrivacyDataScannerTask.java, part 3: The method that actually does the checking for PII or PHI
    public String isPIIOrPHI(final String text) {
        for (final String regex : filters) {
            if (text.matches(regex)) return regex;
        }

        return null;
    }

The real work of matching table names, column names, and other schema objects is handled by this one method. For now, you are simply checking to see if the text passed in (which is going to be the name of a column or table) and meets our rules. The rules were defined in filter.txt in Listing 1 and read in using the setFiltersFile method in Listing 4.

Of course, if you have a different way to specify the filters, such as get the filters from an HTTP service or retrieve them from a central database, you can modify this method to use your rules source and business rules (see Listing 6).

Listing 6. PrivacyDataScannerTask.java, part 4: The main Ant method that performs the actual scanning.
    public void execute() throws BuildException {
       final Writer analysis = new StringWriter();
       final Connection conn = getConnection();
       try {
          final DatabaseMetaData dbmd = conn.getMetaData();

          analysis.write("** Privacy Analysis Scanner Output ** \n\n");
          analysis.write("driver: " + dbmd.getDriverName() + "\n");
		  analysis.write("driver-version: " + dbmd.getDriverVersion() + "\n");
          analysis.write("product: " + dbmd.getDatabaseProductName() + "\n");
          analysis.write("product-version: " + dbmd.getDatabaseProductVersion() + "\n");

            if (catalog != null && catalog.length() > 0) {
                analysis.write("catalog: " + catalog + "\n");
                if (schemaPattern != null && schemaPattern.length() > 0) {
                    analysis.write("schemaPattern: " + schemaPattern + "\n\n");

                    ResultSet tables = null;
                    try {
                        tables = dbmd.getTables(catalog, schemaPattern, null, 
                                  new String[]{"TABLE"});
                        while (tables.next()) {
                            final String tableName = tables.getString(3);
                            analysis.write("table name=\"" + tableName);
                            String matchesPII = isPIIOrPHI(tableName);
                            if(matchesPII != null)
                                analysis.write(", *** Contains PII or PHI: " + matchesPII
                                  + " ***\n");
                            else
                                analysis.write("\n");

                            ResultSet columns = null;
                            try {
                               columns = dbmd.getColumns(null, null, tableName, null);
                               while (columns.next()) {
                                  final String columnName = columns.getString(4);
                                  analysis.write("  column name=\"" + columnName + "\"");
                                  analysis.write(", type=\"" + columns.getShort(5));
                                  matchesPII = isPIIOrPHI(columnName);
                                  if(matchesPII != null)
                                      analysis.write(", *** Contains PII or PHI: " 
                                        + matchesPII + " ***\n");
                                  else
                                      analysis.write("\n");
                               }
                            }
                            finally {
                                if (columns != null) columns.close();
                            }
                        }
                    }
                    finally {
                        tables.close();
                    }
                } else {
                   analysis.write("No 'schemaPattern' provided -- available schemas: ");
                   ResultSet rs = null;
                   try {
                        rs = dbmd.getSchemas();
                        while (rs.next()) {
                            analysis.write(", " + rs.getObject(1).toString());
                        }
                    }
                    finally {
                        if (rs != null) rs.close();
                    }
                    analysis.write("\n");
                }
            } else {
                analysis.write("No 'catalog' provided - listing available catalogs: ");
                ResultSet rs = null;
                try {
                    rs = dbmd.getCatalogs();
                    while (rs.next()) {
                        analysis.write(", " + rs.getObject(1).toString());
                    }
                }
                finally {
                    if (rs != null) rs.close();
                }
                analysis.write("\n");
            }
        } catch (SQLException e) {
            throw new BuildException(e);
        } catch (IOException e) {
            throw new BuildException(e);
        }

        // at this point we've captured the output and can e-mail or send it to
        // standard out; we're just going to write it out to a file for now
        FileWriter out = null;
        try {
            out = new FileWriter(output);
            out.write(analysis.toString());
        } catch (IOException e) {
            e.printStackTrace(); 
        }
    }
}

The execute() method of an Ant task is the method that will be called by the Ant project (see Listing 2) when the <scanner> tag ends. All the attributes will have been passed into the setters defined in Listing 4 and the execute method can then use the private member variables for its processing requirements.

The method starts by checking to see if a catalog and schemaPattern are provided. If either a catalog or schemaPattern is not provided, the method will simply list the catalogs and schemas available so that you can modify the project file in Listing 2 to choose the appropriate schema and catalog.

If a proper catalog and schemaPattern are provided, the code simply loops through all tables (inside each table it loops through all columns) and runs the PII and PHI check for each table name and column name using the method defined in Listing 5. As the method loops through the tables and columns, it stores messages to save to a file. At the end of the method the messages are collected and written to the file.

This solution is just a start, but you can see it saves you a lot of time as you conduct your search to protect privacy data.


Conclusion

It is vital to track PII and PHI throughout the enterprise. If you're not able to either manually or automatically scan for PII and PHI, you risk the integrity of confidential information, the loss of security of personal information (identity theft), potential theft of intellectual property, bad publicity, and costly breach notifications. Using the techniques outlined in this article, you can see that it is relatively simple to create working code that can inventory databases and spreadsheets so you are aware of which databases need private protection and attention.


Download

DescriptionNameSize
Article source codepiiphi.source.zip3KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Industries, Open source, Information Management
ArticleID=551340
ArticleTitle=Find databases with protected health information
publish-date=10192010