Resolving NumberFormat's parsing issues

Guard against standard parsing's potential data loss

The Java™ Standard Edition (SE) API's NumberFormat class lets a program parse formatted text that represents numeric values. It provides out-of-the-box localization with little effort and is a useful tool for every Java programmer. Unfortunately, the underlying DecimalFormat class can cause unexpected loss of signs and data without notification. In this article, Joe Sam Shirah explains the issues and provides code to handle them properly.

Joe Sam Shirah (joesam@conceptgo.com), Principal and developer, conceptGO

Joe Sam Shirah is a principal and developer at conceptGO. While trying to keep clients happy, he has authored several tutorials for developerWorks and the Sun developer site. A winner of the Java Community Award, he is also the moderator of the developerWorks Java Filter Forum and has managed jGuru's JDBC, I18N, and Java400 FAQs.



17 October 2006


Beginning programmers quickly discover that the textual representation of a number differs distinctly from a numeric variable on which programs can perform mathematical operations. For example, "123" is not the same as a true numeric value of 123 or hex 0x7B. Programs must use an algorithm or conversion routine to obtain a number from text -- especially if the text is formatted with grouping or decimal separators (such as commas and decimal points in U.S. number formats). Text-to-numeric conversion is primarily a concern for interactive programming but is also frequently encountered with HTML, XML, and many other file and communications formats that deal with data as text.

The Java SE API provides methods like Integer.parseInt() and Double.parseDouble() for conversion, but these methods expect their arguments in the form defined for literals in the Java Language Specification (see Resources). For purposes of this article, which looks primarily at integers and doubles, that format is essentially composed only of the following characters:

  • A leading minus sign (ASCII value 45 or hex 0x2D)
  • The digits 0 through 9 (ASCII value 48 through 57 or hex 0x30 through 0x39)
  • For floating-point values, a decimal point represented by a dot or period (ASCII value 46 or hex 0x2E)

That requirement is reasonable for programmers and code, but users expect to enter and view numbers in the common format for their local culture. The Java SE API's java.text.NumberFormat class includes a convenient parse(String source) method that most programmers use to parse locale-specific formatted text into numeric values. Unfortunately, that method can yield unexpected -- and inaccurate -- results. This article explains the rationale for NumberFormat, reviews its functionality, exposes the class's parsing gotchas, and offers guidelines for using it reliably.

Parsing without validation

This article's example program (see Download) -- NumberInput, shown in Figure 1 -- is a Swing application that lets you explore several methods for converting textual input to a numerical value. In addition to input fields, the program displays the default locale name, the values as originally keyed, their original length, and the number of positions parsed (when applicable). At startup, it loads the input fields with the double value 123456.7 and the integer value 1234567, respectively. Both values are formatted for the default locale as a native user would expect. Because I reside in the United States, the program displays "123,456.7" for the double value and "1,234,567" for the integer value.

Figure 1. NumberInput initial display

When you click on the NoCheck button, the program uses Double.parseDouble() and Integer.parseInt() for straight-ahead parsing with no attempt at validation. Note that leading and trailing spaces are removed from the input strings in actionPerformed() prior to invoking any other methods. Figure 2 shows the result:

Figure 2. Double.parseDouble() throws NumberFormatException for formatted text

The reason for the error is the comma used as a grouping separator for the U.S. locale. Once the comma is removed so that the input is "123456.7", the program accepts the double value.

What about negative values? Keying in a leading sign (and removing the comma) makes the program happy, but the result for a trailing sign is: NumberFormatException for input string: "123456.7-".

Integer parsing displays similar behavior, for the same reasons. The code invoked for the NoCheck button is in NumberInput's noCheckInput() method. It uses Double.parseDouble() for input from the JTextField jtD and Integer.parseInt() for input from the JTextField jtI. These outcomes are normal according to the rules and should be expected behavior to most Java programmers beyond the absolute beginner stage.
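
The NoCheck behavior described above can be reproduced outside the Swing application. The following is a minimal sketch, not the article's noCheckInput() method; the class name NoCheckDemo and the helper parsesAsDouble() are hypothetical names introduced only for illustration.

```java
class NoCheckDemo {
    // Hypothetical helper: reports whether Double.parseDouble() accepts the text.
    static boolean parsesAsDouble(String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException nfe) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Plain and leading-sign input are accepted; a grouping comma
        // or a trailing sign is rejected with NumberFormatException.
        String[] samples = { "123456.7", "-123456.7", "123,456.7", "123456.7-" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" accepted: " + parsesAsDouble(s));
        }
    }
}
```

Integer.parseInt() could be substituted in the helper to observe the same pattern for integer input.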


NumberFormat to the rescue

Completely aside from typos and other user-input errors, there's always been an uneasy relationship between displaying formatted numeric text and receiving numeric text input from the same field. We've probably all known programmers who decide that the solution is just to display a message that says approximately, "Key numbers with no commas and a leading minus sign." Heads-down, data-entry staff often don't have much of a problem with this approach, but users typically want to see formatted numbers and then often key directly over the formatted display, maintaining separators and groupings. Generally, after some grumbling, the U.S. programmer's first step toward resolving the issue is to write a routine that strips out commas and moves any trailing minus sign to the front of the input value. Many programs written roughly that way have had long lives in production. In some sense, this was even the programmer's first foray into internationalization (I18N) and localization (L10N) (see Resources). The problem is that this kind of code effectively localizes a program for only one or a limited set of locales.

Java programs are touted as being capable of running on any enabled platform, and many people take this to mean in any country and language too, in a familiar way. The Java SE SDK provides APIs to make much of this expectation a reality. However, a program written like the first effort I just described soon begins to break down when used outside of its assumed sphere. In various countries, the value 123456.7 can be formatted or keyed as "123.456,7", "123456,7", or "123'456,7", among other possibilities. Any program that assumes the same grouping and decimal separators (again, the U.S. example here uses "," and "." respectively) for all locales just won't work. In anticipation of this issue, the API includes java.text.NumberFormat. The class provides externally simple parse() and format() methods that are automatically locale aware, including knowledge of formatting symbols. In fact, NumberInput uses NumberFormat for formatting the values displayed in the input fields.
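
To see how the same value is rendered for different locales, a short sketch along the following lines may help. The class name LocaleFormatDemo and the format() helper are assumptions made for this example. The exact separators printed for some locales depend on the locale data shipped with the JDK in use, so only the U.S. and German results are noted in the comments.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleFormatDemo {
    // Hypothetical helper: formats a value with the default number pattern of a locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 123456.7;
        // Locale.US yields "123,456.7" and Locale.GERMANY yields "123.456,7".
        // Other locales may use spaces or apostrophes as grouping separators.
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.FRANCE };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + format(value, locale));
        }
    }
}
```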

A Java Locale object represents and identifies a specific combination of language and region or country. It does not, in and of itself, provide localized behavior; classes must provide localization themselves. However, the Java platform does support a consistent set of locales, and many of the standard classes implement consistent localized behavior. These classes usually have two versions of methods: one that takes a Locale argument and another that assumes the default. The default locale is automatically determined at program startup or overridden by arguments passed to the Java runtime.

NumberFormat is an abstract class, but it provides static factory getXXXInstance() methods for obtaining concrete implementations with predefined, localized formats. The underlying implementation is normally an instance of java.text.DecimalFormat. The code and discussion in this article use the defaults returned by NumberFormat.getNumberInstance() for formatting and parsing double values and NumberFormat.getIntegerInstance() for integer values.

It's worthwhile to note how little code it takes to fully localize parsing. The steps are:

  1. Get a NumberFormat instance.
  2. Parse a String to a Number.
  3. Grab the appropriate numeric value.

The benefit for so little effort is huge, and every Java programmer should use NumberFormat to handle formatted numeric conversions. To try it out for different locales, invoke the NumberInput application using the following command line, where lc is the ISO-639 language code and cc is the ISO-3166 country code:

java -Duser.language=lc -Duser.region=cc NumberInput

As of JDK 1.4, you can use the user.country system property instead of user.region. To determine locales supported by the Java platform, see Supported Locales in the Internationalization section of the JDK documentation (see Resources). A program can determine locale support at run time with java.util.Locale's static getAvailableLocales() method.
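
As a rough illustration of the run-time check mentioned above, the sketch below scans the array returned by Locale.getAvailableLocales(). The class name AvailableLocalesDemo and the isSupported() helper are hypothetical; they are not part of NumberInput.

```java
import java.util.Locale;

class AvailableLocalesDemo {
    // Hypothetical helper: true when the wanted locale is among the installed locales.
    static boolean isSupported(Locale wanted) {
        for (Locale candidate : Locale.getAvailableLocales()) {
            if (candidate.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Default locale: " + Locale.getDefault());
        System.out.println("Installed locales: " + Locale.getAvailableLocales().length);
        System.out.println("en_US supported: " + isSupported(Locale.US));
    }
}
```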

Listing 1 shows the relevant code for NumberInput's NFInput() method, which is invoked when the NF button is clicked. The method uses NumberFormat.parse(String) for validation and conversion.

Listing 1. The NFInput() method uses NumberFormat.parse(String)
...
NumberFormat nfDLocal = NumberFormat.getNumberInstance(),
             nfILocal = NumberFormat.getIntegerInstance();
...

  public void NFInput( String sDouble, String sInt )
  { // "standard" NumberFormat parsing
    double d;
    int    i;
    Number n;

    try
    {
      n = nfDLocal.parse( sDouble );
      d = n.doubleValue();
      ...
      n = nfILocal.parse( sInt );
      i = n.intValue();
      ...
    }
    catch( ParseException pe ) 
    { 
      ...
    }
  } // end NFInput
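
Because Listing 1 is an excerpt, a self-contained version of the same three steps may be useful for experimentation. This sketch passes an explicit Locale instead of relying on the default so that the result does not depend on the machine it runs on; the class name ThreeStepParse and the method parseDouble() are assumptions for illustration only.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class ThreeStepParse {
    // Hypothetical helper following the three steps listed earlier.
    static double parseDouble(String text, Locale locale) throws ParseException {
        NumberFormat nf = NumberFormat.getNumberInstance(locale); // 1. get an instance
        Number n = nf.parse(text);                                // 2. parse to a Number
        return n.doubleValue();                                   // 3. grab the value
    }

    public static void main(String[] args) throws ParseException {
        // Both lines print 123456.7; only the locale-specific separators differ.
        System.out.println(parseDouble("123,456.7", Locale.US));
        System.out.println(parseDouble("123.456,7", Locale.GERMANY));
    }
}
```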

The implementation of a NumberFormat instance

At this point, it's useful to review briefly both what happens when a NumberFormat is asked to getXXXInstance() and the role of the DecimalFormatSymbols class. The discussion is based on a review of the reference implementation source code shipped with J2SE 1.4 and is subject to change.

The basic flow of events is that NumberFormat consults an internal ListResourceBundle for an appropriate pattern based on the associated locale and returns a DecimalFormat object created with the pattern. When there's no explicit negative pattern, a leading minus sign is assumed in combination with the positive pattern. During the process, a locale-appropriate DecimalFormatSymbols object is created and the DecimalFormat instance obtains a reference to it. Because the NumberFormat.getXXXInstance() methods are based on a factory pattern, other implementations or future reference implementations may return a different class. For that reason, any custom code must ensure that a DecimalFormat is actually returned before attempting to access the associated DecimalFormatSymbols instance.

The DecimalFormatSymbols object contains information such as the appropriate decimal and grouping separators and minus sign symbol. NumberInput collects much of this information and displays it in a dialog when the Info buttons are clicked. Figure 3 displays an example using the en_US locale. This information is critical to parsing and validating a number in a localized format.
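
The instanceof check and symbol lookup described above might be sketched as follows. This is not the code behind NumberInput's Info buttons; the class name SymbolsDemo and the helper symbolsFor() are hypothetical, and returning null for a non-DecimalFormat instance is simply one possible convention.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

class SymbolsDemo {
    // Hypothetical helper: returns the symbols for a locale, or null when the
    // factory did not return a DecimalFormat.
    static DecimalFormatSymbols symbolsFor(Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        if (nf instanceof DecimalFormat) {
            return ((DecimalFormat) nf).getDecimalFormatSymbols();
        }
        return null;
    }

    public static void main(String[] args) {
        DecimalFormatSymbols dfs = symbolsFor(Locale.US);
        if (dfs == null) {
            System.out.println("No DecimalFormat available for this locale");
            return;
        }
        System.out.println("Decimal separator:  " + dfs.getDecimalSeparator());
        System.out.println("Grouping separator: " + dfs.getGroupingSeparator());
        System.out.println("Minus sign:         " + dfs.getMinusSign());
    }
}
```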

Figure 3. DecimalFormatSymbols data

Try clicking on the NF button and you'll see that the values are accepted properly even with the commas or other local grouping separators. Minus signs are also accepted when placed in accordance with the current pattern. What about when the minus sign is in another position? That, along with several other issues, is the topic of the next section and the impetus for this article.


UnexpectedResults.equals(bigTrouble)

Most articles on Java internationalization focus on NumberFormat's formatting capabilities and end any discussion of parsing after some variation of the information I've given so far. Unfortunately, testing and experimentation with the class (actually the concrete DecimalFormat subclass returned from NumberFormat.getXXXInstance()) reveal parsing warts that can be surprising: Under a number of common field conditions, NumberFormat.parse(String) cheerfully truncates data and loses signs with no indication to the programmer. The following conditions exhibit this behavior (the en_US locale was used unless otherwise specified):

  • Multiple contiguous or irregularly inserted grouping separators prior to a decimal separator are ignored.

    For example, "123,,,456.7" and "123,45,6.7" are accepted and both return 123456.7.

    After a great deal of thought, I came to the conclusion that, while this behavior is technically in error, no data is lost and any solution causes more work than it's worth. You should be aware of the behavior, but the NumberInput application doesn't correct it, and I won't refer to it further in this article.
  • A grouping separator that occurs after a decimal separator results in truncation.

    "123,456.7,85" is accepted as 123456.7.
  • Multiple decimal separators result in truncation.

    "123,456..7" is accepted as 123456.0; "12.3.456.7" is accepted as 12.3.
  • For patterns with a leading minus sign (negative prefix), truncation occurs at the point of a nondigit character, including embedded minus signs.

    "123,4r56.7" is accepted as 1234.0; "12-3,456.7" is accepted as 12.0 (positive value).
  • For patterns with a trailing minus sign (negative suffix), truncation occurs at the point of a nondigit character excluding embedded minus signs. Embedded minus signs are accepted but any additional data is truncated.

    For the Saudi Arabian locale (ar_SA) "123,4r56.7" is accepted as 1234.0; "12-3,456.7" is accepted as -12.0 (negative value).
  • If the pattern specifies a leading minus sign for negative input, a trailing minus sign is ignored.

    "123,456.7-" is accepted as 123456.7 (positive value), and "-123.456,7-" (Dutch locale nl_NL) is accepted as -123456.7 (negative value).

Figure 4 and Figure 5 show an example of some of these behaviors when the NF button is clicked. Although it can be difficult to interpret what was intended by the original entries, it's a safe bet that a double result of 1234.0 was not anticipated, nor did the user mean to have the last two digits dropped from the integer input. Again, no exception is thrown, and there's no indication that portions of the input were ignored.
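
Several of the en_US cases listed above can be checked with a few lines of code. The sketch below assumes the default (lenient) parsing mode and uses a hypothetical class SilentTruncationDemo with a helper lenientParse(); the NaN return value is only a marker for the unlikely case that a ParseException is thrown.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class SilentTruncationDemo {
    // Hypothetical helper: parses with NumberFormat.parse(String) for en_US.
    static double lenientParse(String text) {
        try {
            return NumberFormat.getNumberInstance(Locale.US).parse(text).doubleValue();
        } catch (ParseException pe) {
            return Double.NaN; // marker only; none of the samples below trigger it
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "123,456.7,85", // grouping separator after the decimal separator
            "123,456..7",   // multiple decimal separators
            "123,4r56.7",   // embedded nondigit character
            "12-3,456.7",   // embedded minus sign
            "123,456.7-"    // trailing minus sign with a leading-minus pattern
        };
        for (String s : samples) {
            // No exception is thrown; the value comes from the leading portion only.
            System.out.println("\"" + s + "\" -> " + lenientParse(s));
        }
    }
}
```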

Figure 4. Unexpected results with NumberFormat.parse(String)
Figure 5. NumberFormat.parse(String) accepts truncated value

These results, which were consistent over many tests with JDK 1.4 and 5.0, are difficult to understand given the amount of work that went into implementing the NumberFormat and DecimalFormat classes. On the other hand, code can't be much more straightforward than passing an argument to a method and examining the result. The only real clue is in the JDK documentation for NumberFormat.parse(String source), which says, without further explanation, "The method may not use the entire text of the given string."

Seeming anomalies like these are troublesome, and at first glance, it may seem better to return to the "Key it my way or else" method of programming. "Garbage in, garbage out" is a cliche in computing, but that only means that a program can never guarantee that data is correct; the programmer's obligation is to ensure, as far as possible, that all of the input is valid. Rather than being a bug, this behavior appears to be by design: NumberFormat.parse(String) returns a number from the leading portion of an input string if at all possible. Unfortunately, that design carries an unstated assumption that the data has already been validated. The end result is that the programmer cannot tell when the input is invalid, which breaks an implicit contract with users and with the data itself.

On discovering these issues several years ago, my first response was to write what amounted to a front-end preprocessor for the parse(String) method. That worked, but the cost was additional, partially redundant code and more time to process the data. Fortunately, it turns out that an existing NumberFormat method, when used with care, can resolve the problem.


Using ParsePosition for validation

The parse(String source, ParsePosition parsePosition) method is unusual in that it doesn't throw any exceptions. It's normally intended to be used when you're parsing multiple numbers from a single string. However, upon method return, the value from ParsePosition.getIndex() is the last position parsed in the input string plus one. If the code always begins with the index set to zero, after processing, the index value will equal the number of parsed characters. The key to using the method for validation is to compare the updated index to the length of the original input string.
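
The index comparison can be sketched in isolation as follows. The class name ParsePositionDemo and the helper parsedLength() are hypothetical, and the en_US locale is fixed so that the sample strings behave the same everywhere.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class ParsePositionDemo {
    // Hypothetical helper: returns how many characters the parser consumed.
    static int parsedLength(String text) {
        ParsePosition pp = new ParsePosition(0); // parsing starts at index zero
        NumberFormat.getNumberInstance(Locale.US).parse(text, pp);
        return pp.getIndex();
    }

    public static void main(String[] args) {
        String[] samples = { "123,456.7", "123,4r56.7", "123,456.7,85" };
        for (String s : samples) {
            int index = parsedLength(s);
            // Input is fully consumed only when the index equals the length.
            System.out.println("\"" + s + "\" length=" + s.length()
                + " index=" + index + " match=" + (index == s.length()));
        }
    }
}
```

For "123,4r56.7", parsing stops at the 'r', so the index is 5 while the length is 10, and the mismatch exposes the truncation.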

To avoid confusion, I should mention that ParsePosition also has a getErrorIndex() method. This method is essentially useless for the conditions discussed here because no errors are detected. In addition, when it is used, the error index must be reset to -1 before each parse operation; otherwise the result can be misleading.
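
A short sketch can illustrate why getErrorIndex() does not help here. The class name ErrorIndexDemo and the helper errorIndexAfterParse() are assumptions for this example; the second half of main() shows one way to reset a reused ParsePosition before each parse.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class ErrorIndexDemo {
    // Hypothetical helper: parses with a fresh ParsePosition and returns its error index.
    static int errorIndexAfterParse(String text) {
        ParsePosition pp = new ParsePosition(0);
        NumberFormat.getNumberInstance(Locale.US).parse(text, pp);
        return pp.getErrorIndex();
    }

    public static void main(String[] args) {
        // Part of this input is ignored, yet no error is recorded: prints -1.
        System.out.println(errorIndexAfterParse("123,4r56.7"));

        // When one ParsePosition is reused, clear both fields before each parse
        // so that a stale error index from an earlier failure is not misread.
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        ParsePosition pp = new ParsePosition(0);
        for (String s : new String[] { "abc", "123" }) {
            pp.setIndex(0);
            pp.setErrorIndex(-1);
            Number n = nf.parse(s, pp);
            System.out.println("\"" + s + "\" -> " + n + ", errorIndex=" + pp.getErrorIndex());
        }
    }
}
```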

The NumberInput application displays the ParsePosition index under the Length/PP column when either the NF or NFPP button is clicked. If the length of the original value is greater than zero and matches the index value, both are shown in green; otherwise the values are shown in red. This operation is done separately from the specific validation methods. If you look at Figure 4 again, you'll see that the values are in red, indicating an error, even though the NFInput() method associated with the NF button accepted the data.

For the final validation version, the NFPPInput() method is invoked when the NFPP button is clicked. This method uses parse(String, ParsePosition) to validate input and obtain numeric values. Figure 6 and Figure 7 show that the invalid input from Figure 4 is detected in NFPPInput(). In my testing, the method properly handled all of the conditions missed by NumberFormat.parse(String).

Figure 6. Detecting invalid double entries
Figure 7. Detecting invalid integer entries

You must follow several guidelines to ensure proper results with parse(String, ParsePosition):

  • Remember that the method never throws an exception.

    For clarity and demonstration purposes, the code here just displays Acceptable/Unacceptable dialogs. In a general-purpose case, you should throw a ParseException to be more in line with normal expectations.
  • Always reset the ParsePosition index to zero before invoking parse(String, ParsePosition).

    A reset is necessary because, with this method, parsing begins at the ParsePosition index within the input string.
  • Use NumberFormat.getNumberInstance() for parsing double values and NumberFormat.getIntegerInstance() for parsing integer values.

    If you don't use an integer instance (or, alternatively, apply setParseIntegerOnly(true) to a number instance) for integers, the method parses past any decimal separators to the end of the input string. The result is that the length and index match, and you have accepted invalid input.
  • In addition to comparing the length and index values for equality, you must also check for either a null Number after parsing or an empty input string ("" or length of zero).

    Clearing an input field produces an empty string. In this case, both the length and index values are zero, so they match, and the parse method returns null. This behavior differs from that of NumberFormat.parse(String source), which throws an "Unparseable number" ParseException for an empty string. Remember that parse(String source, ParsePosition parsePosition) never throws an exception! In NumberInput, the code snippet in Listing 2 is used to handle the possibilities:
    Listing 2. Checking for error conditions
    if( sDouble.length() != pp.getIndex() || 
        n == null )
    { /* error */ }
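
Two of the guidelines above, choosing the right instance and checking for null, can be observed with a small experiment. This is a sketch under the assumption of the en_US locale; the class name InstanceChoiceDemo and the helper indexAfterParse() are hypothetical.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class InstanceChoiceDemo {
    // Hypothetical helper: returns the ParsePosition index after parsing from zero.
    static int indexAfterParse(NumberFormat nf, String text) {
        ParsePosition pp = new ParsePosition(0);
        nf.parse(text, pp);
        return pp.getIndex();
    }

    public static void main(String[] args) {
        NumberFormat number  = NumberFormat.getNumberInstance(Locale.US);
        NumberFormat integer = NumberFormat.getIntegerInstance(Locale.US);
        String text = "12.5";

        // The number instance consumes all four characters, so "12.5" would
        // wrongly pass a length check for an integer field.
        System.out.println("number instance index:  " + indexAfterParse(number, text));
        // The integer instance stops at the decimal separator (index 2).
        System.out.println("integer instance index: " + indexAfterParse(integer, text));

        // Empty input: index and length are both zero, but the result is null.
        System.out.println("empty input result: " + number.parse("", new ParsePosition(0)));
    }
}
```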

To summarize, the steps for proper input processing are:

  1. Get an appropriate NumberFormat and define a ParsePosition variable.
  2. Set the ParsePosition index to zero.
  3. Parse the input value with parse(String source, ParsePosition parsePosition).
  4. Perform error operations if the input length and ParsePosition index value don't match or if the parsed Number is null.
  5. Otherwise, the value passed validation.

Listing 3 shows the relevant code:

Listing 3. The NFPPInput() method
... 
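NumberFormat  nfDLocal = NumberFormat.getNumberInstance(),
              nfILocal = NumberFormat.getIntegerInstance();

// Assumed initialization; this excerpt does not show where pp is created.
ParsePosition pp = new ParsePosition( 0 );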
...

  public void NFPPInput( String sDouble, 
                         String sInt )
  { // validate NumberFormat with ParsePosition 
    Number n;
    double d;
    int    i;

    pp.setIndex( 0 );
    n = nfDLocal.parse( sDouble, pp );

    if( sDouble.length() != pp.getIndex() || 
        n == null )
    {
      showErrorMsg( 
        "Double Input Not Acceptable\n" + 
         "\"" + sDouble + "\"");
    }
    else
    {
      d = n.doubleValue();
      jtD.setText( nfDLocal.format( d ) );
      showInfoMsg( "Double Accepted \n" + d );
    }

    pp.setIndex( 0 );
    n = nfILocal.parse( sInt, pp );
    if( sInt.length() != pp.getIndex()  || 
        n == null )
    {
      showErrorMsg( 
        "Int Input Not Acceptable \n" + 
         "\"" + sInt + "\"");
    }
    else
    {
      i = n.intValue();
      jtI.setText( nfILocal.format( i ) );
      showInfoMsg( "Int Accepted \n" + i );
    }
  } // end NFPPInput
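
For general-purpose use, the guidelines above suggest throwing a ParseException rather than displaying a dialog. One way this might look is sketched below. The class name StrictNumberParser, the method parseStrict(), and the wording of the exception message are all assumptions for illustration, not part of the NumberInput source. The caller is expected to trim the input first, as actionPerformed() does in NumberInput.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.ParsePosition;
import java.util.Locale;

class StrictNumberParser {
    // Hypothetical helper: accepts the text only when all of it was parsed.
    static Number parseStrict(NumberFormat nf, String text) throws ParseException {
        ParsePosition pp = new ParsePosition(0);   // always start at index zero
        Number n = nf.parse(text, pp);
        if (n == null || pp.getIndex() != text.length()) {
            // The index marks where parsing stopped, which is the first rejected character.
            throw new ParseException("Unacceptable number: \"" + text + "\"", pp.getIndex());
        }
        return n;
    }

    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        for (String s : new String[] { "123,456.7", "123,4r56.7", "123,456.7-", "" }) {
            try {
                System.out.println("\"" + s + "\" accepted: " + parseStrict(nf, s));
            } catch (ParseException pe) {
                System.out.println(pe.getMessage() + " at offset " + pe.getErrorOffset());
            }
        }
    }
}
```

Creating a new ParsePosition on each call avoids the need to reset a shared one; passing NumberFormat.getIntegerInstance() instead gives the same check for integer fields.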

Conclusion

A tremendous amount of work has been incorporated into the Java SE API to allow "write once, run anywhere" not only at the bytecode level, but also to accommodate internationalized and localized applications. NumberFormat and DecimalFormat are classes that Java programmers who intend to write world-class applications can't live without. However, as this article has shown, developers also can't live with the parse(String source) method as it stands, unless perfect input can be assumed -- something that is seldom the case in the real world. The information and code I've presented in this article give you the alternate technique of using parse(String source, ParsePosition parsePosition) to determine when entries are invalid and obtain correct results.


Download

Description                           Name                 Size
NumberInput source code and classes   j-numberformat.zip   8KB

Resources

Learn

Discuss
