ugrep Command

Purpose

Searches for a Unicode pattern in a file.

Syntax

ugrep unicode_hex_notation_pattern [ -i loose_match_unicode_hex_notation_pattern ] File ...

Description

The ugrep command searches an input file for characters that match the specified hexadecimal representation of the Unicode-defined code point of a character.

The author of a regular expression might not be working in a Unicode character set, and a keyboard cannot always produce the Unicode-defined code points for every character of the major written languages. Therefore, the pattern to search for must be specified as the hexadecimal representation of the Unicode-defined code point of each character.

Note: The function of the ugrep command is the same as the function of the grep command with the -U flag.

Flags

Table 1. Flags
Item Description
unicode_hex_notation_pattern Specifies the hexadecimal representation of the Unicode-defined code point of a character. For example, to represent the 𝄞 character, whose Unicode-defined code point is U+1D11E, the value of the unicode_hex_notation_pattern pattern can be the hexadecimal representation as \U0001D11E, \x{1D11E}, or \u{1D11E}.
-i loose_match_unicode_hex_notation_pattern Specifies that the search is based on a loose match of the specified Unicode hex notation pattern. Most regular expression engines offer case-insensitive matching as their only form of loose matching. Such an engine must account for the large range of cased Unicode characters outside of the ASCII range.
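
The following Python sketch illustrates the two ideas in the table above: that the different hexadecimal notations name the same code point, and that a loose (case-insensitive) match has to work for cased characters outside ASCII. It does not call ugrep. The helper names code_point and loose_equal are hypothetical, and Python's str.casefold is only a stand-in for whatever loose-matching rules ugrep actually applies.

```python
import re

# Hypothetical helper patterns for the notations mentioned in the Flags table.
_NOTATIONS = (
    re.compile(r"\\U([0-9A-Fa-f]{8})"),          # backslash, U, eight hex digits
    re.compile(r"\\[xu]\{([0-9A-Fa-f]{1,6})\}"),  # backslash, x or u, braces
    re.compile(r"\\u([0-9A-Fa-f]{4})"),          # backslash, u, four hex digits
)


def code_point(notation: str) -> int:
    """Return the code point named by one hexadecimal notation."""
    for pattern in _NOTATIONS:
        match = pattern.fullmatch(notation)
        if match:
            value = int(match.group(1), 16)
            if value > 0x10FFFF:
                raise ValueError(f"beyond the Unicode range: {notation}")
            return value
    raise ValueError(f"unrecognized notation: {notation}")


def loose_equal(a: str, b: str) -> bool:
    """Assumed model of a loose match: compare case-folded text."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # All three spellings of U+1D11E resolve to the same integer.
    for text in (r"\U0001D11E", r"\x{1D11E}", r"\u{1D11E}"):
        print(text, hex(code_point(text)))
    # U+10425 and U+1044D are an uppercase/lowercase pair outside ASCII.
    print(loose_equal("\U00010425", "\U0001044D"))
```

Case folding is used here instead of a simple lowercase comparison because it is the Unicode-defined operation for caseless comparison; whether ugrep uses the same operation is not stated in this document.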

Exit Status

This command returns the following exit values:

Table 2. Exit status
Item Description
0 A match was found.
1 No match was found.
>1 A syntax error was found or a file was inaccessible (even if matches were found).
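
A script that calls ugrep can branch on these exit values. The following Python sketch shows one way to do so. The function names are hypothetical, and run_ugrep assumes that a ugrep executable with the syntax described in this document is on the PATH.

```python
import subprocess


def describe_exit_status(status: int) -> str:
    """Map an exit value from Table 2 to a short description."""
    if status == 0:
        return "match found"
    if status == 1:
        return "no match found"
    if status > 1:
        return "syntax error or inaccessible file"
    return "unexpected status"


def run_ugrep(pattern: str, path: str) -> int:
    """Run ugrep (assumed to be installed) and return its exit value."""
    completed = subprocess.run(
        ["ugrep", pattern, path],
        capture_output=True,
        text=True,
    )
    return completed.returncode
```

Because a value greater than 1 can be returned even when matches were found, a caller should treat it as an error first and not infer that no line matched.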

Examples

  1. To search the regex_test.txt file for the character 我, whose Unicode-defined code point is U+6211 and the hexadecimal representation is \u6211, enter the following command:
    ugrep "\u6211" regex_test.txt
    To search for multiple characters, list the hexadecimal representations of their Unicode-defined code points without any spaces between them. For example, to search for the characters घ and र in the regex_test.txt file, enter the following command:
    ugrep "\u0918\u0930" regex_test.txt
  2. To search the regex_test.txt file for any character in the range between the code points U+6200 and U+6300, enter the following command:
    ugrep "[\u6200-\u6300]" regex_test.txt
    To search the regex_test.txt file for any character in the range between the code points U+0000 and U+10FFFF that is not an uppercase letter (the -- operator subtracts the \p{Lu} set from the range), enter the following command:
    ugrep "[\u0000-\U0010FFFF--\p{Lu}]" regex_test.txt
  3. To perform a loose match search for the character 𐐥, whose Unicode-defined code point is U+10425 and the hexadecimal representation is \U00010425, enter the following command:
    ugrep -i "\U00010425" regex_test.txt
  4. To search the regex_test.txt file for decimal digits, enter the following command:
    ugrep "\p{Nd}" regex_test.txt
    where Nd is the Unicode character property for decimal digits.
  5. To search the regex_test.txt file for Japanese Hiragana characters, enter the following command:
    ugrep "\p{Hiragana}" regex_test.txt
  6. To search the regex_test.txt file for uppercase letters, lowercase letters, or numbers by using Unicode properties, enter the following commands:
    • ugrep "\p{Ll}" regex_test.txt
      where the property Ll matches lowercase letters in Unicode and includes lowercase letters from all languages.
    • ugrep "\p{Lu}" regex_test.txt
      where the property Lu matches uppercase letters in Unicode and includes uppercase letters from all languages.
    • ugrep "\p{L}" regex_test.txt

      Or

      ugrep "\p{letter}" regex_test.txt
      where the properties L and letter match all letters in Unicode. The search by using the L property includes uppercase letters, lowercase letters, and connector characters. However, the search by using the letter property includes only the uppercase and lowercase letters.
    • ugrep "[\p{L}||\p{Nd}]" regex_test.txt
      where the property Nd matches numeric digits in Unicode and includes numeric digits from all languages.
  7. To search the regex_test.txt file for characters of the Latin script by using the script property, enter the following command:
    ugrep "\p{script=Latin}" regex_test.txt

    You can search for characters of any script by setting the value of the script property to the name of that script.
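
To see which characters the property-based examples above would select, the following Python sketch classifies characters with the standard unicodedata module. It is an illustration of the Unicode properties only, not of how ugrep is implemented. The function names and the sample string are hypothetical, and the script property is omitted because the Python standard library does not expose it.

```python
import unicodedata


def is_decimal_digit(ch: str) -> bool:
    """General category Nd, as in the \\p{Nd} example."""
    return unicodedata.category(ch) == "Nd"


def is_uppercase_letter(ch: str) -> bool:
    """General category Lu, as in the \\p{Lu} example."""
    return unicodedata.category(ch) == "Lu"


def is_lowercase_letter(ch: str) -> bool:
    """General category Ll, as in the \\p{Ll} example."""
    return unicodedata.category(ch) == "Ll"


def is_letter(ch: str) -> bool:
    """Any Unicode general category that begins with L."""
    return unicodedata.category(ch).startswith("L")


def in_range(ch: str, low: int, high: int) -> bool:
    """Code point range test, as in the [\\u6200-\\u6300] example."""
    return low <= ord(ch) <= high


def select(text: str, predicate) -> list:
    """Return the characters of text for which predicate is true."""
    return [ch for ch in text if predicate(ch)]


# Hypothetical sample: Latin a, U+6211, Latin Z, Devanagari digit two,
# Hiragana letter a.
SAMPLE = "a\u6211Z\u0968\u3042"

if __name__ == "__main__":
    print(select(SAMPLE, is_decimal_digit))
    print(select(SAMPLE, is_letter))
    print(select(SAMPLE, lambda ch: in_range(ch, 0x6200, 0x6300)))
```

The union in the "[\p{L}||\p{Nd}]" example corresponds to combining two predicates with a logical or, for example select(SAMPLE, lambda ch: is_letter(ch) or is_decimal_digit(ch)).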

Files

Item Description
/usr/bin/ugrep Contains the ugrep command.