Find databases with protected health information

Inventory PII, PHI, and other sensitive data within databases and spreadsheets


PII and PHI data collection and storage is expanding

Identity theft is already a well-known problem in the finance industry, and companies are trying to take steps to counter it as best they can. Medical-related fraud is nascent at this time—currently less than 5 percent of reported breaches occurring within healthcare—but growth in healthcare fraud will continue to accelerate as government stimulus funds expand the use of electronic health records. When breaches for identity and medical theft occur, they do so because most information systems that have value to organizations contain some level of personally identifiable information (PII) or protected health information (PHI). PII and PHI are the most private kinds of data stored about people and, if they are breached or stolen, they cause adverse events like identity theft or medical fraud. For example, the Department of Defense, Department of Veterans' affairs, handlers of PII and PHI (such as an insurance or personal investment company), and employers (such as a hotel chain) have all reported significant losses of PII and PHI data, in some cases up to 25 million records. Protecting the information is not easy but it is vital.

PII is usually defined as any name or number that may be used, alone or in conjunction with any other information, to identify a specific individual. It is usually an individual's name, in combination with one or more of the following: driver's license number, state ID card number, Social Security number, account number, credit card number, or debit card number.

PHI, defined as any information that links a person with their health condition, is a little more complex. There are eighteen official identifiers defined by the Health Insurance Portability and Accountability Act (HIPAA) and they fall into the following groups:

  • Names
  • Dates
  • Numbers
  • Addresses
  • Graphics

If all these groups, or identifiers, are removed or somehow encrypted the information becomes de-identified or anonymized and is no longer considered PHI. The specific list of eighteen types of data defined in government regulations such as HIPAA consists of:

  1. Names
  2. All geographic subdivisions smaller than a state, including: street address, city, county, precinct, ZIP Code, and their equivalent geographical codes, except for the initial 3 digits of a ZIP Code if, according to the current publicly available data from the Bureau of the Census:
    1. The geographic unit formed by combining all ZIP Codes with the same three initial digits contains more than 20,000 people.
    2. The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000. According to the August 2002 Final Modifications to the Privacy Rule and utilizing the 2000 Census data, there are 17 restricted three-digit ZIP Codes that correspond to populations of 20,000 or fewer persons and must be changed to 000 to be de-identified: 036, 059, 063, 102, 203, 556, 692, 790, 821, 823, 830, 831, 878, 879, 884, 890, and 893.
  3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, and date of death. All ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
  4. Telephone number
  5. Fax number
  6. E-mail address
  7. Social Security number
  8. Medical record number
  9. Health plan beneficiary number
  10. Account numbers
  11. Certificate or license number
  12. Vehicle identifiers and serial numbers (including license plates)
  13. Device identifiers and serial numbers
  14. URL address
  15. IP address
  16. Biometric identifiers, such as fingerprints and voiceprints
  17. Full-face photos and any comparable images
  18. Any other unique identifying number, characteristic, or code

PHI appears in different forms: paper or hard copy records, electronic records, and verbal and sign language communications. It is easier to protect paper records and verbal communications because we've been using and protecting these types of records for hundreds of years and generally, we're good at it. Electronic communications are less than sixty years old and most health data is less than thirty years old, so we're only beginning to understand how to protect it. Unlike stolen paper records that are difficult and time consuming to copy and disseminate, stolen electronic records are easy to distribute. For this reason, this article focuses on finding and protecting electronic records.

A guide to protecting PII confidentiality

The National Institute of Standards and Technology (NIST) publication SP 800-122, "Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)" (see Related topics) is one of the best free and unbiased resources available to help you understand what privacy data is and how you protect it. Originally created to help federal organizations identify PII and determine the level of protection each case requires, this document is just as applicable to commercial and non-profit organizations. It advises what the appropriate protection and safeguards for PII are and how to handle data breaches. While the focus is on PII, it can support PHI for your own privacy and security plans as well.

To protect privacy data, you need to know where it is

You know you need to protect PII and PHI that reside in your databases. You also know that encryption in transit is a minimum requirement to secure private data. However, what data should be encrypted and protected? Declaring everything as private is not wise because data security costs are high.

Protecting large enterprise databases is straightforward—at least you know where the data is. The difficult task is how to handle the dozens, perhaps hundreds, of Lotus® Notes® documents and Microsoft® Word, Access®, and Excel® files. While one could treat these documents as simple file management problems they are more than that, they are real applications and real databases. Any reasonable analysis of HIPAA privacy concerns related to patient information in a sizable organization could uncover thousands of Microsoft Word, Access, and Excel files with health data needing protection.

Sensitive data discovery

This article presents some data discovery (inventorying) mechanisms to help identify PII and PHI that might reside in any data source that can be connected through JDBC or ODBC. The basic approach is as follows:

  • Look at the table, column, and other schema object names to see if they match known patterns and assume that all matching objects have PII or PHI data.
  • Look at the actual SQL or queries being conducted to scan the SQL Data Query Language (DQL) or results generated.
  • Look at data at rest within the database or files using string or pattern matching for specific words like do not distribute, confidential, \d{3}\-\d{2}\-\d{4} (SSN), and flag them.
  • Look at all content being transferred through firewalls or proxies using similar matching patterns.

How to classify data as sensitive or private

There are many techniques for classifying data, but they either:

  • Match deterministically on specific keywords or patterns in the actual data or the metadata that defines the data.
  • Algorithmically use document clustering with taxonomic or other classification, natural language analysis, Bayesian inference, or statistical analysis.

The algorithmic mechanisms are beyond the scope of this paper, but you'll see that it's not difficult to match on keywords in metadata.

Auto-discovery of PII or PHI columns in JDBC or ODBC data sources

Almost all popular relational databases are directly accessible through JDBC; Microsoft Office documents such as Access or Excel are accessible through JDBC-ODBC bridges. Using either the bridge model or direct JDBC along with the JDBC java.sql.DatabaseMetaData class gives you everything you need to do the quick scanning of database columns, tables, and other schema objects. Wrap that into an Ant task and you'll have a reusable command line tool that you can use on any platform connected to any database.

Defining the filters file

Define a text file that contains a list of different regular expressions to look for. Call it the filters list. You can create a file called filters.txt (see Download for the file) and it can start with the patterns shown in Listing 1. You will want to expand the file so that it has all your specific rules.

Listing 1. filters.txt contains sample regular expressions that identify patterns to locate

Using the Ant Task

To use the Ant task you will need to create an Ant Project File (see Listing 2). Notice that you'll be using the previously defined filters.txt file. You can have as many <scanner> tags as you need to search through all your different databases, Access and Excel documents, and other files. The full power of Ant is available to use variables, scripting, and so on.

Listing 2. scanner.xml—the Ant Project File that describes the JDBC connections and options
<?xml version="1.0" encoding="utf-8"?>
<project id="PII and PHI Scanner">
    <taskdef name="scanner" classname="PrivacyDataScanner"/>
    <scanner  catalog=""  schemaPattern=""


    <scanner  scanner  catalog=""  schemaPattern=""


If you want to search additional databases or documents you simply add additional <scanner> sections in the Ant project file. I suggest adding all the different database connections into one file or grouping different databases by system or department in separate files. Remember you can scan Microsoft Access or Excel documents by using the JDBC-ODBC bridge.

The Ant task Java source (

You could write the privacy data scanner code as a command line Java program or use various other techniques to make it scriptable. However, if you write the privacy scanner code as an Ant task it becomes easy to inject into the build automation and other scripts, and is more reusable across projects that use Ant (see Listing 3). Because your code is going to be using JDBC, it makes sense to extend the built-in Ant task class

Listing 3., part 1: The import statements

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

Note that you use Ant to import those classes, and you'll need Ant 1.6 or above in your CLASSPATH (see Listing 4).

Listing 4., part 2: The class header and setters
public class PrivacyDataScannerTask extends JDBCTask {

    // inherits attributes and setters for JDBC connection parameters
    private String catalog;
    private String schemaPattern;
    private File output;
    private File filtersFile;
    private List<String> filters = new ArrayList<String>();

    public void setCatalog(final String catalog) {
        this.catalog = catalog;

    public void setSchemaPattern(final String schemaPattern) {
        this.schemaPattern = schemaPattern;

    public void setOutput(final File output) {
        this.output = output;

    public void setFiltersFile(final File filtersFile) throws IOException {
        this.filtersFile = filtersFile;
        final FileReader fileReader = new FileReader(filtersFile);
        final BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = null;
        while ((line = bufferedReader.readLine()) != null)

Your task is extending a built-in Ant Task (JDBCTask) and it automatically inherits all the JDBC connection setting properties. The setCatalog, setSchemaPattern, setOutput, and setFiltersFile are all automatically called by Ant when the project references the XML attributes in the <scanner> tag shown in Listing 5.

Listing 5., part 3: The method that actually does the checking for PII or PHI
    public String isPIIOrPHI(final String text) {
        for (final String regex : filters) {
            if (text.matches(regex)) return regex;

        return null;

The real work of matching table names, column names, and other schema objects is handled by this one method. For now, you are simply checking to see if the text passed in (which is going to be the name of a column or table) and meets our rules. The rules were defined in filter.txt in Listing 1 and read in using the setFiltersFile method in Listing 4.

Of course, if you have a different way to specify the filters, such as get the filters from an HTTP service or retrieve them from a central database, you can modify this method to use your rules source and business rules (see Listing 6).

Listing 6., part 4: The main Ant method that performs the actual scanning.
    public void execute() throws BuildException {
       final Writer analysis = new StringWriter();
       final Connection conn = getConnection();
       try {
          final DatabaseMetaData dbmd = conn.getMetaData();

          analysis.write("** Privacy Analysis Scanner Output ** \n\n");
          analysis.write("driver: " + dbmd.getDriverName() + "\n");
		  analysis.write("driver-version: " + dbmd.getDriverVersion() + "\n");
          analysis.write("product: " + dbmd.getDatabaseProductName() + "\n");
          analysis.write("product-version: " + dbmd.getDatabaseProductVersion() + "\n");

            if (catalog != null && catalog.length() > 0) {
                analysis.write("catalog: " + catalog + "\n");
                if (schemaPattern != null && schemaPattern.length() > 0) {
                    analysis.write("schemaPattern: " + schemaPattern + "\n\n");

                    ResultSet tables = null;
                    try {
                        tables = dbmd.getTables(catalog, schemaPattern, null, 
                                  new String[]{"TABLE"});
                        while ( {
                            final String tableName = tables.getString(3);
                            analysis.write("table name=\"" + tableName);
                            String matchesPII = isPIIOrPHI(tableName);
                            if(matchesPII != null)
                                analysis.write(", *** Contains PII or PHI: " + matchesPII
                                  + " ***\n");

                            ResultSet columns = null;
                            try {
                               columns = dbmd.getColumns(null, null, tableName, null);
                               while ( {
                                  final String columnName = columns.getString(4);
                                  analysis.write("  column name=\"" + columnName + "\"");
                                  analysis.write(", type=\"" + columns.getShort(5));
                                  matchesPII = isPIIOrPHI(columnName);
                                  if(matchesPII != null)
                                      analysis.write(", *** Contains PII or PHI: " 
                                        + matchesPII + " ***\n");
                            finally {
                                if (columns != null) columns.close();
                    finally {
                } else {
                   analysis.write("No 'schemaPattern' provided -- available schemas: ");
                   ResultSet rs = null;
                   try {
                        rs = dbmd.getSchemas();
                        while ( {
                            analysis.write(", " + rs.getObject(1).toString());
                    finally {
                        if (rs != null) rs.close();
            } else {
                analysis.write("No 'catalog' provided - listing available catalogs: ");
                ResultSet rs = null;
                try {
                    rs = dbmd.getCatalogs();
                    while ( {
                        analysis.write(", " + rs.getObject(1).toString());
                finally {
                    if (rs != null) rs.close();
        } catch (SQLException e) {
            throw new BuildException(e);
        } catch (IOException e) {
            throw new BuildException(e);

        // at this point we've captured the output and can e-mail or send it to
        // standard out; we're just going to write it out to a file for now
        FileWriter out = null;
        try {
            out = new FileWriter(output);
        } catch (IOException e) {

The execute() method of an Ant task is the method that will be called by the Ant project (see Listing 2) when the <scanner> tag ends. All the attributes will have been passed into the setters defined in Listing 4 and the execute method can then use the private member variables for its processing requirements.

The method starts by checking to see if a catalog and schemaPattern are provided. If either a catalog or schemaPattern is not provided, the method will simply list the catalogs and schemas available so that you can modify the project file in Listing 2 to choose the appropriate schema and catalog.

If a proper catalog and schemaPattern are provided, the code simply loops through all tables (inside each table it loops through all columns) and runs the PII and PHI check for each table name and column name using the method defined in Listing 5. As the method loops through the tables and columns, it stores messages to save to a file. At the end of the method the messages are collected and written to the file.

This solution is just a start, but you can see it saves you a lot of time as you conduct your search to protect privacy data.


It is vital to track PII and PHI throughout the enterprise. If you're not able to either manually or automatically scan for PII and PHI, you risk the integrity of confidential information, the loss of security of personal information (identity theft), potential theft of intellectual property, bad publicity, and costly breach notifications. Using the techniques outlined in this article, you can see that it is relatively simple to create working code that can inventory databases and spreadsheets so you are aware of which databases need private protection and attention.

Downloadable resources

Related topics


Sign in or register to add and subscribe to comments.

Zone=Industries, Open source, Information Management
ArticleTitle=Find databases with protected health information