Topic
  • 11 replies
  • Latest Post - ‏2012-12-07T05:05:53Z by SystemAdmin
SystemAdmin
SystemAdmin
17917 Posts

Pinned topic How to replace non-ascii characters with ascii characters

‏2008-11-25T22:04:00Z |
Hi,
I need to write a DB2 stored procedure for locating the non-ascii characters and replacing them with ascii characters. Could anyone help on this? We have DB2 V9.1 and AIX in our system.
Thank you very much in advance.
Updated on 2012-12-07T05:05:53Z at 2012-12-07T05:05:53Z by SystemAdmin
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-11-26T17:58:35Z  
    Probably the easiest way is to use the TRANSLATE function. Just specify the characters you want to translate in the third argument (it can be a hexadecimal string, if necessary) and the characters you want them translated to in the second argument; e.g., TRANSLATE(string, 'ABCD', X'01020304')
  • SystemAdmin
    SystemAdmin
    17917 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-11-26T19:53:04Z  
    • dlyko
    • ‏2008-11-26T17:58:35Z
    Probably the easiest way is to use the TRANSLATE function. Just specify the characters you want to translate in the third argument (it can be a hexadecimal string, if necessary) and the characters you want them translated to in the second argument; e.g., TRANSLATE(string, 'ABCD', X'01020304')
    dlyko, your suggestion could work for the cases that we have known where the non-ascii characters are located in a field. My real problem is that the non-ascii characters could be embedded in any fields of the tables, so I need to identify them and then replace them with ascii characters. There could be one or multiple non-ascii characters embedded into a field string. I am looking for two possible solutions: (1)Identify the fields with non-ascii characters and convert those fields to ascii characters. I know how to identify them but don't know how to convert them. There seems no function provided by IBM we could use to do this job in DB2 V9.1 and AIX environment. Hope I could be wrong. (2)Find the exact locations of the non-ascii characters in a field and then replace them with ascii characters. I have no idea how this option 2 could be done. Any help will be appreciated.
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-11-26T20:09:47Z  
    dlyko, your suggestion could work for the cases that we have known where the non-ascii characters are located in a field. My real problem is that the non-ascii characters could be embedded in any fields of the tables, so I need to identify them and then replace them with ascii characters. There could be one or multiple non-ascii characters embedded into a field string. I am looking for two possible solutions: (1)Identify the fields with non-ascii characters and convert those fields to ascii characters. I know how to identify them but don't know how to convert them. There seems no function provided by IBM we could use to do this job in DB2 V9.1 and AIX environment. Hope I could be wrong. (2)Find the exact locations of the non-ascii characters in a field and then replace them with ascii characters. I have no idea how this option 2 could be done. Any help will be appreciated.
    You don't need to know where in the column any non-ascii characters are, or even if there are any. You do need to be able to identify all the possible non-ascii characters that are of interest.

    Remember, there are 256 possible characters, each represented by a 2-hex-digit hexadecimal value. You need to identify all the non-ascii combinations and put them into a string, such as X'213a435b...' (these are just random digits). This string becomes the third argument of your TRANSLATE function. The second arguemnt of the TRANSLATE function is a string containing the values to which the corresponding values in the third argument are translated. Every occurrence of the first character in argument 3 is translated to the first character in argument 2, every occurrence of the second character in argument 3 is translated to the second character of argument 2, etc. Any characters in the input string not found in argument 3 remain unchanged. You can imbed this function in a stored procedure. if you like. Just apply it to any column of any table that contains the non-ascii characters you want to translate.
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-11-26T21:15:35Z  
    dlyko, your suggestion could work for the cases that we have known where the non-ascii characters are located in a field. My real problem is that the non-ascii characters could be embedded in any fields of the tables, so I need to identify them and then replace them with ascii characters. There could be one or multiple non-ascii characters embedded into a field string. I am looking for two possible solutions: (1)Identify the fields with non-ascii characters and convert those fields to ascii characters. I know how to identify them but don't know how to convert them. There seems no function provided by IBM we could use to do this job in DB2 V9.1 and AIX environment. Hope I could be wrong. (2)Find the exact locations of the non-ascii characters in a field and then replace them with ascii characters. I have no idea how this option 2 could be done. Any help will be appreciated.
    A good way to accomplish what you want to do is to create a user-defined function (see below). This sample function translates all the hex characters from X'00' to X'1F' to a dash ("-"). You can customize it to include as many non-ascii characters as you like. Note the use of the optional pad character as the 4th argument of the TRANSLATE function. If you want to translate each to a different character, itemize them in the second argument instead of using an empty string.

    ------> CREATE FUNCTION TRAN_ASCII
    ------> (STRING VARCHAR(4000))
    ------> RETURNS VARCHAR(4000)
    ------> LANGUAGE SQL
    ------> CONTAINS SQL
    ------> NO EXTERNAL ACTION
    ------> DETERMINISTIC
    ------> RETURN
    ------> TRANSLATE(STRING, '',
    ------> X'000102030405060708090A0B0C0D0E0F' ||
    ------> X'101112131415161718191A1B1C1D1E1F', '-');

    ------> CREATE TABLE TEST (COL_1 CHAR(10) NOT NULL);

    ------> INSERT INTO TEST VALUES ('ABC' || X'000102' || 'DEFG');

    ------> SELECT COL_1, TRAN_ASCII(COL_1) FROM TEST;

    ------> ABC"""DEFG ABC---DEFG

    The double quote characters in the first expression of the result represent the unprintable characters contained in the result value. Note how they have been translated to dashes in the second expression.
  • SystemAdmin
    SystemAdmin
    17917 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T15:26:19Z  
    • dlyko
    • ‏2008-11-26T21:15:35Z
    A good way to accomplish what you want to do is to create a user-defined function (see below). This sample function translates all the hex characters from X'00' to X'1F' to a dash ("-"). You can customize it to include as many non-ascii characters as you like. Note the use of the optional pad character as the 4th argument of the TRANSLATE function. If you want to translate each to a different character, itemize them in the second argument instead of using an empty string.

    ------> CREATE FUNCTION TRAN_ASCII
    ------> (STRING VARCHAR(4000))
    ------> RETURNS VARCHAR(4000)
    ------> LANGUAGE SQL
    ------> CONTAINS SQL
    ------> NO EXTERNAL ACTION
    ------> DETERMINISTIC
    ------> RETURN
    ------> TRANSLATE(STRING, '',
    ------> X'000102030405060708090A0B0C0D0E0F' ||
    ------> X'101112131415161718191A1B1C1D1E1F', '-');

    ------> CREATE TABLE TEST (COL_1 CHAR(10) NOT NULL);

    ------> INSERT INTO TEST VALUES ('ABC' || X'000102' || 'DEFG');

    ------> SELECT COL_1, TRAN_ASCII(COL_1) FROM TEST;

    ------> ABC"""DEFG ABC---DEFG

    The double quote characters in the first expression of the result represent the unprintable characters contained in the result value. Note how they have been translated to dashes in the second expression.
    dlyko, thanks for your suggestions and sorry not to reply you earlier.
    According to your suggestions I've created a new conversion function as follows:
    CREATE FUNCTION CONVERT_NONASCII
    (STRING VARCHAR(4000))
    RETURNS VARCHAR(4000)
    LANGUAGE SQL
    CONTAINS SQL
    NO EXTERNAL ACTION
    DETERMINISTIC
    RETURN
    TRANSLATE(STRING, '',
    x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F', '-')@

    I have also created another function for identifying the non-ascii characters as follows:
    CREATE FUNCTION CONTAINS_NONASCII
    (
    STR VARCHAR(4000)
    )
    RETURNS CHAR(1)
    DETERMINISTIC
    NO EXTERNAL ACTION
    RETURN
    (
    CASE WHEN TRANSLATE(STR, '', x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F') = ''
    THEN 'N' ELSE 'Y' END
    )@

    And here is sql command for testing these two functions:

    select ID, CONVERT_NONASCII(STATUSREASON) as STATUSREASON_new from TableTest group by ID, STATUSREASON
    having ID=1001 and STATUSREASON is not null and STATUSREASON '' and MAX(CONTAINS_NONASCII(STATUSREASON)) = 'Y'

    I inserted some non-ascii characters into STATUSREASON of TableTest. When I executed the select query, all ascii characters are converted to '-' and non-ascii charcters remain unchanged. Any ideas how to fix it so that the ascii characters remain unchanged and non-ascii characters should be converted?
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T15:59:54Z  
    dlyko, thanks for your suggestions and sorry not to reply you earlier.
    According to your suggestions I've created a new conversion function as follows:
    CREATE FUNCTION CONVERT_NONASCII
    (STRING VARCHAR(4000))
    RETURNS VARCHAR(4000)
    LANGUAGE SQL
    CONTAINS SQL
    NO EXTERNAL ACTION
    DETERMINISTIC
    RETURN
    TRANSLATE(STRING, '',
    x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F', '-')@

    I have also created another function for identifying the non-ascii characters as follows:
    CREATE FUNCTION CONTAINS_NONASCII
    (
    STR VARCHAR(4000)
    )
    RETURNS CHAR(1)
    DETERMINISTIC
    NO EXTERNAL ACTION
    RETURN
    (
    CASE WHEN TRANSLATE(STR, '', x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F') = ''
    THEN 'N' ELSE 'Y' END
    )@

    And here is sql command for testing these two functions:

    select ID, CONVERT_NONASCII(STATUSREASON) as STATUSREASON_new from TableTest group by ID, STATUSREASON
    having ID=1001 and STATUSREASON is not null and STATUSREASON '' and MAX(CONTAINS_NONASCII(STATUSREASON)) = 'Y'

    I inserted some non-ascii characters into STATUSREASON of TableTest. When I executed the select query, all ascii characters are converted to '-' and non-ascii charcters remain unchanged. Any ideas how to fix it so that the ascii characters remain unchanged and non-ascii characters should be converted?
    I'm not sure why you think you need the second function, the CONTAINS_NONASCII function. Also, if you really do need it, I don't think it's doing what you think it is.

    If you want to know whether an input string contains any non-ascii characters, you can use the first function in a predicate as follows:

    WHERE string ^= CONVERT_NONASCII(string) means that the string contains at least one non-ascii character, and

    WHERE string = CONVERT_NONASCII(string) means that the string contains NO non-ascii characters.

    Also, as I look at your first function, I'm not sure what you are defining as ascii and non-ascii characters, but if you mean to count all the numbers, letters and special characters as "ascii" characters and everything else as non-ascii, you need to change the hexadecimal string in your function definition. All the numbers, letters and special characters fall in the range x'20' thru x'7E' generally. These should all be removed from your hex string and all other hex characters added.

    Keep in mind, also, that the function, as written, is designed to translate all non-ascii characters to a dash. If you prefer something else, just change the pad character in the TRANSLATE function, and, if you want to translate some or all characters to more than one result, just eliminate the pad character and itemize the result characters in the 2d argument.
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T17:06:16Z  
    • dlyko
    • ‏2008-12-01T15:59:54Z
    I'm not sure why you think you need the second function, the CONTAINS_NONASCII function. Also, if you really do need it, I don't think it's doing what you think it is.

    If you want to know whether an input string contains any non-ascii characters, you can use the first function in a predicate as follows:

    WHERE string ^= CONVERT_NONASCII(string) means that the string contains at least one non-ascii character, and

    WHERE string = CONVERT_NONASCII(string) means that the string contains NO non-ascii characters.

    Also, as I look at your first function, I'm not sure what you are defining as ascii and non-ascii characters, but if you mean to count all the numbers, letters and special characters as "ascii" characters and everything else as non-ascii, you need to change the hexadecimal string in your function definition. All the numbers, letters and special characters fall in the range x'20' thru x'7E' generally. These should all be removed from your hex string and all other hex characters added.

    Keep in mind, also, that the function, as written, is designed to translate all non-ascii characters to a dash. If you prefer something else, just change the pad character in the TRANSLATE function, and, if you want to translate some or all characters to more than one result, just eliminate the pad character and itemize the result characters in the 2d argument.
    Just to complete my previous note, here is a hex string you can use as the 3d argument of your TRANSLATE function:

    ------> RETURN TRANSLATE(STRING, '',
    ------> X'000102030405060708090A0B0C0D0E0F' ||
    ------> X'101112131415161718191A1B1C1D1E1F' ||
    ------> X'7F' ||
    ------> X'808182838485868788898A8B8C8D8E8F' ||
    ------> X'909192939495969798999A9B9C9D9E9F' ||
    ------> X'A0A1A2A3A4A5A6A7A8A9AAABACADAEAF' ||
    ------> X'B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF' ||
    ------> X'C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF' ||
    ------> X'D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF' ||
    ------> X'E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF' ||
    ------> X'F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF', '-')

    Note that this is simply the entire 256-character set minus the characters from x'20' thru x'7E'. These are the characters that will be translated; the missing characters (the valid ascii characters) will remain unchanged. This string allows for the most general case, but, if you are sure that your input data will never contain some of these values, you can certainly omit them from the string, possibly resulting in a slight (though I'm sure, unmeasurable) performance improvement.
  • SystemAdmin
    SystemAdmin
    17917 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T17:15:28Z  
    • dlyko
    • ‏2008-12-01T15:59:54Z
    I'm not sure why you think you need the second function, the CONTAINS_NONASCII function. Also, if you really do need it, I don't think it's doing what you think it is.

    If you want to know whether an input string contains any non-ascii characters, you can use the first function in a predicate as follows:

    WHERE string ^= CONVERT_NONASCII(string) means that the string contains at least one non-ascii character, and

    WHERE string = CONVERT_NONASCII(string) means that the string contains NO non-ascii characters.

    Also, as I look at your first function, I'm not sure what you are defining as ascii and non-ascii characters, but if you mean to count all the numbers, letters and special characters as "ascii" characters and everything else as non-ascii, you need to change the hexadecimal string in your function definition. All the numbers, letters and special characters fall in the range x'20' thru x'7E' generally. These should all be removed from your hex string and all other hex characters added.

    Keep in mind, also, that the function, as written, is designed to translate all non-ascii characters to a dash. If you prefer something else, just change the pad character in the TRANSLATE function, and, if you want to translate some or all characters to more than one result, just eliminate the pad character and itemize the result characters in the 2d argument.
    dlyko, You're right. CONTAINS_NONASCII function may not be needed. I tested my sql query using WHERE string ^= CONVERT_NONASCII(string) as you suggested, got the same results.

    I also tested the range x'20' thru x'7E' for the numbers, letters and special characters, as you mentioned. But it seems that many ascii characters are not covered by the range. I could be wrong, but the one x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F'
    used in my function CONVERT_NONASCII seems capturing the non-acsii characters correctly. I'm a newbie to the function TRANSLATE. Could you please let me know how to fix CONVERT_NONASCII so that it may convert the non-ascii characters into ascii characters? Thank you very much.
  • dlyko
    dlyko
    50 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T17:35:14Z  
    dlyko, You're right. CONTAINS_NONASCII function may not be needed. I tested my sql query using WHERE string ^= CONVERT_NONASCII(string) as you suggested, got the same results.

    I also tested the range x'20' thru x'7E' for the numbers, letters and special characters, as you mentioned. But it seems that many ascii characters are not covered by the range. I could be wrong, but the one x'000102030405060708090A0B0C0D0E0F' ||
    x'101112131415161718191A1B1C1D1E1F' ||
    x'202122232425262728292A2B2C2D2E2F' ||
    x'303132333435363738393A3B3C3D3E3F' ||
    x'404142434445464748494A4B4C4D4E4F' ||
    x'505152535455565758595A5B5C5D5E5F' ||
    x'606162636465666768696A6B6C6D6E6F' ||
    x'707172737475767778797A7B7C7D7E7F'
    used in my function CONVERT_NONASCII seems capturing the non-acsii characters correctly. I'm a newbie to the function TRANSLATE. Could you please let me know how to fix CONVERT_NONASCII so that it may convert the non-ascii characters into ascii characters? Thank you very much.
    Our last two posts may have crossed in the mail. My last one contains all you should need. Note that the TRANSLATE function operates only on the specific characters included in the 3d argument string. Any characters not included are left unchanged. Thus, a string that excludes all the valid ascii characters (x'20' thru 'x'7e') causes all the remaining (non-ascii) characters to be translated, which is your desired result. Good luck.
  • SystemAdmin
    SystemAdmin
    17917 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2008-12-01T22:15:10Z  
    • dlyko
    • ‏2008-12-01T17:06:16Z
    Just to complete my previous note, here is a hex string you can use as the 3d argument of your TRANSLATE function:

    ------> RETURN TRANSLATE(STRING, '',
    ------> X'000102030405060708090A0B0C0D0E0F' ||
    ------> X'101112131415161718191A1B1C1D1E1F' ||
    ------> X'7F' ||
    ------> X'808182838485868788898A8B8C8D8E8F' ||
    ------> X'909192939495969798999A9B9C9D9E9F' ||
    ------> X'A0A1A2A3A4A5A6A7A8A9AAABACADAEAF' ||
    ------> X'B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF' ||
    ------> X'C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF' ||
    ------> X'D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF' ||
    ------> X'E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF' ||
    ------> X'F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF', '-')

    Note that this is simply the entire 256-character set minus the characters from x'20' thru x'7E'. These are the characters that will be translated; the missing characters (the valid ascii characters) will remain unchanged. This string allows for the most general case, but, if you are sure that your input data will never contain some of these values, you can certainly omit them from the string, possibly resulting in a slight (though I'm sure, unmeasurable) performance improvement.
    dlyko, this works for me now. Thank you very much for your help!!!!!!
  • SystemAdmin
    SystemAdmin
    17917 Posts

    Re: How to replace non-ascii characters with ascii characters

    ‏2012-12-07T05:05:53Z  
    • dlyko
    • ‏2008-11-26T21:15:35Z
    A good way to accomplish what you want to do is to create a user-defined function (see below). This sample function translates all the hex characters from X'00' to X'1F' to a dash ("-"). You can customize it to include as many non-ascii characters as you like. Note the use of the optional pad character as the 4th argument of the TRANSLATE function. If you want to translate each to a different character, itemize them in the second argument instead of using an empty string.

    ------> CREATE FUNCTION TRAN_ASCII
    ------> (STRING VARCHAR(4000))
    ------> RETURNS VARCHAR(4000)
    ------> LANGUAGE SQL
    ------> CONTAINS SQL
    ------> NO EXTERNAL ACTION
    ------> DETERMINISTIC
    ------> RETURN
    ------> TRANSLATE(STRING, '',
    ------> X'000102030405060708090A0B0C0D0E0F' ||
    ------> X'101112131415161718191A1B1C1D1E1F', '-');

    ------> CREATE TABLE TEST (COL_1 CHAR(10) NOT NULL);

    ------> INSERT INTO TEST VALUES ('ABC' || X'000102' || 'DEFG');

    ------> SELECT COL_1, TRAN_ASCII(COL_1) FROM TEST;

    ------> ABC"""DEFG ABC---DEFG

    The double quote characters in the first expression of the result represent the unprintable characters contained in the result value. Note how they have been translated to dashes in the second expression.
    Hi dlyko,

    You made my day 6 years later after your post :)

    I had a similar task to find all fields that have special non-printable control characters and break XML marshaling.

    Here is SQL that I used:
    
    select * from 
    "MYSCHEMA".
    "MYTABLE" where TRANSLATE(DESCRIPTION,
    '', X
    '000102030405060E0F' ||  X
    '101112131415161718191A1B1C1D1E1F') DESCRIPTION FOR READ ONLY WITH UR
    


    Regards,
    Alex the chef