Text processing with Ruby

Harness the power of Ruby for efficient text processing

Santhosh Krishnamoorthy (santhoshk@in.ibm.com), Staff Software Engineer, IBM
Santhosh Krishnamoorthy is a test engineer with the TXSeries team, in IBM Software Labs, Bangalore, working in the area of intersystem communications and Java technology. His interests include Ruby, Ruby on Rails, and Python programming.

Summary:  Ruby is a feature-rich, free, simple, extensible, portable, and object-oriented scripting language. As a powerful text processing language, it has immense capability. With powerful built-in libraries and a set of external libraries, Ruby is a viable option for a solution to any mundane text processing task that you might encounter.

Date:  18 Aug 2009
Level:  Intermediate

Like Perl and Python, Ruby is well suited to text processing. This article briefly covers Ruby's facilities for processing textual data and shows how you can use them to handle different formats efficiently, whether CSV data or XML data.

Ruby strings

Frequently used acronyms

  • CSV: Comma Separated Values
  • REXML: Ruby Electric XML
  • XML: Extensible Markup Language

Strings in Ruby are a powerful way to hold, compare, and manipulate textual data. In Ruby, String is a class that you can instantiate by invoking String::new or by just assigning a literal value.

When you assign values to Strings, you can enclose the value in a pair of single quotes (') or a pair of double quotes ("). The two forms differ in a few ways. Double-quoted strings allow escape sequences that use a leading backslash ( \ ) and also evaluate expressions embedded in the string with the #{} interpolation syntax. Single-quoted strings are simple, straight literals: the only escapes they recognize are \\ and \'.

Listing 1 is an example.


Listing 1. Working with Ruby strings: Defining strings


message = 'Heal the World…'

puts message

message1 = "Take home Rs #{100*3/2} "

puts message1

Output :

# ./string1.rb

# Heal the World…

# Take home Rs 150


Here, the first string is defined with a pair of single quotes. The second one uses a pair of double quotes. In the second example, the expression within #{} is evaluated before display.
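The difference between the two quoting styles is easiest to see when the same text is written both ways. The following is a minimal sketch; the variable names amount, single, and double are illustrative and do not come from the article's sample code.

```ruby
amount = 100 * 3 / 2

# Single quotes: no interpolation, and \n stays as a backslash followed by n.
single = 'Rs #{amount}\n'

# Double quotes: the expression is evaluated and \n becomes a newline.
double = "Rs #{amount}\n"

p single
p double

# The only escapes a single-quoted string recognizes are \\ and \'.
p 'It\'s'
```

Using p rather than puts prints each string in its inspected form, so the escape sequences remain visible.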

Another useful way to define a string, generally used for multi-line strings, is the here document (heredoc).

From here on, I will use the interactive Ruby console, irb, for my explanations; its prompt appears as irb>> in the listings. It is normally installed along with Ruby. If it is not, I suggest you get the irb Ruby gem and install it. It is a very useful tool for learning about Ruby and its modules. Once it is installed, you can start it with the irb command.


Listing 2. Working with Ruby strings : Defining multiline strings


irb>> str = <<EOF
irb>> hello, world
irb>> how do you feel?
irb>> how r u?
irb>> EOF

"hello, world\nhow do you feel?\nhow r u?\n"

irb>> puts str

hello, world
how do you feel?
how r u?


In Listing 2, everything between <<EOF and the closing EOF is considered part of the string, including the \n (newline) characters.

The Ruby String class has a powerful set of methods to manipulate and process data stored in them. The examples in Listings 3, 4, and 5 illustrate a few of them.


Listing 3. Working with Ruby strings : Concatenating


irb>> str = "The world for a horse"    # String initialized with a value

The world for a horse

irb>> str*2                     # Multiplying by an integer returns a
                                         # new string that repeats the original
                                         # that many times.

The world for a horseThe world for a horse

irb>> str + " Who said it ? "      # Concatenation of strings using the '+' operator

The world for a horse Who said it ?

irb>> str<<" is it? "	  # Concatenation using the '<<' operator

The world for a horse is it?
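One detail in Listing 3 is worth spelling out: + builds a new string, while << appends to the existing one. The sketch below shows this under the assumption that it runs as a script on a current Ruby; String.new is used so that the string is explicitly mutable, and the names joined and before_id are illustrative.

```ruby
str = String.new("The world for a horse")

joined = str + " Who said it ? "   # a new String; str itself is unchanged
p str
p joined

before_id = str.object_id
str << " is it? "                  # appends in place, same object as before
p str
p str.object_id == before_id
```

Because << changes str itself, re-assign str to its original value before moving on to examples that expect the unmodified text.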


Extracting substrings and manipulating parts of the string


Listing 4. Working with Ruby Strings : Extracting and manipulating


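irb>> str = "The world for a horse"   # Start again from the original string

irb>> str[0]	# The '[]' operator can be used to extract substrings, just 
                        # like accessing entries in an array.
                        # The index starts from 0.
84			# In Ruby 1.8, a single index returns the ASCII value
                        # of the character at that position; Ruby 1.9 and
                        # later return the one-character string "T" instead.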

irb>> str[0,5]	# a range can be specified as a pair. The first is the starting 
                        # index , second is the length of the substring from the
                        # starting index.

The w

irb>> str[16,5]="Ferrari"   # The same '[]' operator can be used
                                  # to replace substrings in a string
                                  # by using the assignment like '[]='
irb>>str

The world for a Ferrari

irb>> str[10..22]		# The range can also be specified using [x1..x2] 

for a Ferrari

irb>> str[" Ferrari"]=" horse"	# A substring can be specified to be replaced by a new
                               # string. Ruby strings are intelligent enough to adjust the
                               # size of the string to make up for the replacement string.

irb>> str

The world for a horse

irb>> str.split	       # split divides the string on the given delimiter
                               # (whitespace by default) and returns an array of strings.

["The", "world", "for", "a", "horse"]

irb>> str.each_line(' ') { |word| p word.chomp(' ') }

                               # each_line processes the string in a block,
                               # splitting it on the given record separator.
                               # (Ruby 1.8 also offered this as String#each,
                               # which was removed in Ruby 1.9.)
                               # Here, I use chomp() to cut off the trailing space

"The"
"world"
"for"
"a"
"horse"


Many other utility methods are available with the Ruby String class, including methods to change case, get the length, remove record separators, scan through the string, and compute a one-way hash of it (crypt). Another useful method is freeze, which makes a string immutable: after you invoke str.freeze, str cannot be modified.
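A few of these utility methods can be tried together. This is a minimal sketch rather than a full tour; the exception class raised on modifying a frozen string depends on the Ruby version (FrozenError on recent versions), so the example rescues StandardError and prints whichever class it receives.

```ruby
str = "The world for a horse".freeze

puts str.length          # number of characters
puts str.swapcase        # returns a copy with the case of each letter flipped
p str.scan(/\w+/)        # every run of word characters, as an array
p str.frozen?

begin
  str << " is it?"       # any in-place modification is rejected
rescue StandardError => e
  puts "Cannot modify a frozen string (#{e.class})"
end
```

Methods that return a new string, such as swapcase, still work on a frozen string because they leave the original untouched.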

Ruby also has what are called destructive methods. A method whose name ends with an exclamation point (!) modifies the string it is invoked on, in place. Normal methods (those without the exclamation point at the end) leave the original untouched and return a modified copy of it.


Listing 5. Working with Ruby strings : Modifying a string permanently


irb>> str = "hello, world"

hello, world

irb>> str.upcase

HELLO, WORLD

irb>>str		       # str, remains as is.

hello, world

irb>> str.upcase!	       # here, str gets modified by the '!' at the end of 
                               # upcase.
HELLO, WORLD

irb>> str

HELLO, WORLD


In Listing 5, the string in str is modified by the upcase! method, whereas the plain upcase method returns a copy of the string with the case changed and leaves str as it was. These ! methods are sometimes very useful.
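One caveat with the ! methods is their return value. The sketch below shows that upcase! returns nil when it has nothing to change, so its result should not be chained blindly. The variable names copy, first, and second are illustrative.

```ruby
str = String.new("hello, world")

copy = str.upcase      # a new string; str is still lowercase here
first = str.upcase!    # modifies str in place and returns str
second = str.upcase!   # str is already uppercase, so this returns nil

p copy
p first.equal?(str)
p second
```

A common pattern is therefore to call the ! method for its effect and then use the variable itself, rather than the method's return value.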

Ruby Strings are very powerful. Once you have your data captured in Strings, you are on your way to process them in a very easy and efficient manner using a plethora of methods at your disposal.


Handling CSV files

A CSV file is a very common way to represent tabular data, most commonly used as the format for data exported from a spreadsheet (such as a list of contacts with their contact details).

Ruby has a powerful library to handle and process such files: the csv module, which has methods to create, read, and parse them.

The example in Listing 6 shows how to create such a CSV file and then parse it using the Ruby csv module.


Listing 6. Handling CSV files : Create and parse a CSV file


require 'csv'

writer = CSV.open('mycsvfile.csv','w')

begin

	print "Enter Contact Name: "

	name = STDIN.gets.chomp

	print "Enter Contact No: "

	num = STDIN.gets.chomp

	s = name+" "+num

	row1 = s.split

	writer << row1

	print "Do you want to add more ? (y/n): "

	ans = STDIN.gets.chomp

end while ans != "n"

writer.close

file = File.new('mycsvfile.csv')

lines = file.readlines

parsed = CSV.parse(lines.to_s)

p parsed

puts ""

puts "Details of Contacts stored are as follows..."

puts ""

puts "-------------------------------"

puts "Contact Name | Contact No"

puts "-------------------------------"

puts ""

CSV.open('mycsvfile.csv','r') do |row|

	puts row[0] + " | " + row[1]	

	puts ""
end


Listing 7 shows the output:


Listing 7. Handling CSV files : Create and parse a CSV file output


Enter Contact Name: Santhosh

Enter Contact No: 989898

Do you want to add more ? (y/n): y

Enter Contact Name: Sandy

Enter Contact No: 98988

Do you want to add more ? (y/n): n

Details of Contacts stored are as follows...

---------------------------------
Contact Name | Contact No
---------------------------------

Santhosh | 989898

Sandy | 98988


Let's quickly review the example.

First, include the csv module (require 'csv').

To create a new CSV file named mycsvfile.csv, open it using the CSV.open() call. This returns a writer object.

This example creates a CSV file that holds a simple contact list, storing each person's name along with a phone number. In the loop, the user is asked to enter the name of the contact and the phone number. The name and the phone number are concatenated into a single string and then split on whitespace into an array of two strings. (A name that itself contains spaces would be split into more than two fields, so this approach assumes single-word names.) This array is passed to the writer object to be written into the CSV file. Thus, one pair of CSV values is stored as a single line in the file.

Once the loop ends, close the writer so that the data is saved to the file.

The next step is to parse the CSV file that is created.

One way to open and parse the file is to create a new File object that uses the name of the new CSV file.

Call the readlines method to read all the lines in the file into an array called lines.

Convert the lines array into a String object by calling lines.to_s and pass the string to the CSV.parse method, which parses the CSV data and returns the content as an array of arrays.

Next, you see another way to open and parse the file. Open the file again using the CSV.open call in read mode, this time passing a block. With the csv library used in this article, each row (one line of the file) is passed to the block in turn as an array of fields. Print each row with some formatting to display the contact details.

As you can see, Ruby provides a powerful module for working with CSV files and data.
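Listing 6 was written against the csv library that shipped with Ruby 1.8. On Ruby 1.9 and later, the standard csv library is a different, FasterCSV-based implementation, and both the lines.to_s conversion and the block form of CSV.open in read mode behave differently there. The following is a minimal sketch of the same write-then-read flow for those versions. It assumes the csv library is available (on the newest Ruby releases it is distributed as a gem), it uses a fixed contacts array instead of prompting, and the second contact name is a hypothetical value chosen to show a name containing a space.

```ruby
require 'csv'
require 'tmpdir'

# Sample rows; "Sandy Smith" is a made-up name to show that spaces are safe.
contacts = [["Santhosh", "989898"], ["Sandy Smith", "98988"]]

path = File.join(Dir.mktmpdir, 'mycsvfile.csv')

# Write: passing each row as an array avoids the concatenate-and-split step.
CSV.open(path, 'w') do |csv|
  contacts.each { |row| csv << row }
end

# Read, way 1: parse the whole file content into an array of arrays.
parsed = CSV.parse(File.read(path))
p parsed

# Read, way 2: iterate row by row without loading everything at once.
CSV.foreach(path) do |row|
  puts "#{row[0]} | #{row[1]}"
end
```

The block form of CSV.open closes the file automatically when the block ends, so no explicit close call is needed.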


Working with XML files

For working with XML files, Ruby has a powerful built-in library called REXML. This can be used to read and parse XML documents.

Consider a simple XML file that lists the contents of a typical shopping cart in an online shopping mall, which you will parse using Ruby and REXML. It has the following elements and attributes:

  • cart - the root element; its id attribute identifies the user who is shopping
  • item - an item the user has added to the cart; its code attribute holds the item ID
  • price and qty - sub-elements of item that hold the unit price and the quantity

Listing 8 shows the structure of the XML:


Listing 8. Working with XML Files : Sample XML File


<cart id="userid">

	<item code="item-id">

		<price>price-per-unit</price>

		<qty>number-of-units</qty>

	</item>

</cart>


Go to Download for the sample XML file. Now, load this XML file and parse through the tree using REXML.


Listing 9. Working with XML files : Parsing XML files


require 'rexml/document'

include REXML

file = File.new('shoppingcart.xml')

doc = Document.new(file)

root = doc.root

puts ""

puts "Hello, #{root.attributes['id']}, Find below the bill generated for your purchase..."

puts ""

sumtotal = 0

puts "-----------------------------------------------------------------------"

puts "Item\t\tQuantity\t\tPrice/unit\t\tTotal"

puts "-----------------------------------------------------------------------"

root.each_element('//item') { |item| 

  code = item.attributes['code']

  # strip removes any whitespace surrounding the element text

  qty = item.elements["qty"].text.strip

  price = item.elements["price"].text.strip

  total = price.to_i * qty.to_i

  puts "#{code}\t\t  #{qty}\t\t          #{price}\t\t         #{total}"

  puts ""

  sumtotal += total

}

puts "-----------------------------------------------------------------------"

puts "\t\t\t\t\t\t     Sum total : " + sumtotal.to_s

puts "-----------------------------------------------------------------------"


Listing 10 shows the output.


Listing 10. Working with XML files : Parsing XML files output


Hello, santhosh, Find below the bill generated for your purchase...

-------------------------------------------------------------------------
Item            Quantity                Price/unit              Total
-------------------------------------------------------------------------
CS001             2                          100                      200

CS002             5                          200                     1000

CS003             3                          500                     1500

CS004             5                          150                      750

-------------------------------------------------------------------------
                                                         Sum total : 3450
--------------------------------------------------------------------------


The example in Listing 9 parses the shopping cart XML file and generates a bill with the individual item totals and the sum total for the purchase (Listing 10).

Let's quickly go through it.

First, require rexml/document and include the REXML module. This provides the classes and methods needed to parse the XML file.

Open the shoppingcart.xml file and create a Document object from it. This Document object is the one which contains the parsed XML file.

Assign the root of the document to the element object root. This will now point to the cart tag in your XML.

Each element object has an attributes object, which behaves like a hash with the element's attribute names as keys and their values as values. Here, root.attributes['id'] gives the value of the id attribute of the root element, which in this case is the user ID.

Next, initialize the sumtotal to 0 and print the headers.

Each element object also has an object called elements, with each and [] methods to access the sub-elements. The block runs through all the elements named item below the root element, as selected by the XPath expression //item. Each element object also has a text method that returns the textual value of that element.

Next, get the item element's code attribute and the text value of the price and qty elements and calculate the total for the item. Print the details into the bill. Also, add the item total to the sumtotal.

Finally, print the sum total.
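The steps above can also be tried without the downloadable file by parsing an XML string directly. This is a minimal sketch: the inline document is an assumption reconstructed from the first two rows of Listing 10, not the actual shoppingcart.xml, and it needs the rexml library to be installed (it ships with standard Ruby installations, as a gem on recent versions).

```ruby
require 'rexml/document'

# Hypothetical cart, modeled on the structure in Listing 8.
xml = <<~XML
  <cart id="santhosh">
    <item code="CS001"><price>100</price><qty>2</qty></item>
    <item code="CS002"><price>200</price><qty>5</qty></item>
  </cart>
XML

doc = REXML::Document.new(xml)
root = doc.root

puts "Hello, #{root.attributes['id']}"

sumtotal = 0
root.each_element('//item') do |item|
  code  = item.attributes['code']
  price = item.elements['price'].text.to_i
  qty   = item.elements['qty'].text.to_i
  total = price * qty
  puts "#{code}\t#{qty}\t#{price}\t#{total}"
  sumtotal += total
end

puts "Sum total : #{sumtotal}"
```

Converting the text to integers once, right where it is read, keeps the arithmetic and the printing working from the same values.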

This example shows how easy and simple it is to parse XML files with REXML and Ruby. It is as easy to generate XML files on the fly, and to add and delete elements and their attributes.


Listing 11. Working with XML files : Generate XML files


require 'rexml/document'

include REXML

doc = Document.new

doc.add_element("cart1", {"id" => "user2"})

cart = doc.root                 # the cart1 element that was just added

item = Element.new("item")

item.add_element("price")

item.elements["price"].text = "100"

item.add_element("qty")

item.elements["qty"].text = "4"

cart.elements << item           # attach the populated item to the cart


The snippet in Listing 11 creates the XML structure by creating a cart element, and an item element and its sub-elements. It populates them with values and adds them to the Document root.

Similarly, to delete elements and attributes, use the delete_element and delete_attribute methods of the Element object.

The above is an example of what is called tree parsing. Another way of parsing XML documents is known as stream parsing. It is generally faster than tree parsing, because it does not build the whole document tree in memory, and can be used where speed is imperative. Stream parsing is event-based and works with listeners. When a tag is encountered, the listener is called and it does the processing.
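Deleting works along the same lines as adding. The sketch below removes a sub-element and an attribute from a small cart and prints the result; the inline XML is a hypothetical document with the same shape as the one built in Listing 11.

```ruby
require 'rexml/document'

doc = REXML::Document.new(
  '<cart id="user2"><item code="CS001"><price>100</price><qty>4</qty></item></cart>'
)

item = doc.root.elements['item']

item.delete_element('qty')      # remove the qty sub-element of this item
item.delete_attribute('code')   # remove the code attribute of this item

puts doc.to_s                   # serialize the modified document
```

Both calls are made on the element that owns the child or attribute being removed, here the item element.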

Listing 12 shows an example.


Listing 12. Working with XML files : Stream parsing


require 'rexml/document'

require 'rexml/streamlistener'

include REXML

class Listener

  include StreamListener

  def tag_start(name, attributes)

    puts "Start #{name}"

  end

  def tag_end(name)

    puts "End #{name}"

  end

end

listener = Listener.new

parser = Parsers::StreamParser.new(File.new("shoppingcart.xml"), listener)

parser.parse


Listing 13 shows the output.


Listing 13. Working with XML files : Stream parsing output


Start cart

Start item

Start price

End price

Start qty

End qty

End item

Start item

Start price

End price

Start qty

End qty

End item

Start item

Start price

End price

Start qty

End qty

End item

Start item

Start price

End price

Start qty

End qty

End item

End cart


Thus, REXML and Ruby provide a powerful combination for you to work with and manipulate XML data in a very efficient and intuitive way.
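A stream listener can do more than report tags. As a rough illustration, the sketch below also handles text events to compute the cart's sum total without building a tree. The BillListener class, its instance variables, and the inline XML (modeled on the first two rows of Listing 10) are assumptions made for this example, not part of the article's sample code.

```ruby
require 'rexml/document'
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

class BillListener
  include REXML::StreamListener

  attr_reader :sumtotal

  def initialize
    @sumtotal = 0
    @current = nil   # name of the element whose text is being read
    @values = {}     # price and qty of the item being processed
  end

  def tag_start(name, _attributes)
    @current = name
    @values = {} if name == 'item'
  end

  # Called for character data; only price and qty text is of interest.
  def text(data)
    @values[@current] = data.strip.to_i if %w[price qty].include?(@current)
  end

  def tag_end(name)
    @sumtotal += @values['price'].to_i * @values['qty'].to_i if name == 'item'
    @current = nil
  end
end

xml = '<cart id="santhosh">' \
      '<item code="CS001"><price>100</price><qty>2</qty></item>' \
      '<item code="CS002"><price>200</price><qty>5</qty></item>' \
      '</cart>'

listener = BillListener.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts "Sum total : #{listener.sumtotal}"
```

The listener keeps only the values of the item currently being read, which is what lets stream parsing handle documents that are too large to hold comfortably as a tree.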


Summary

Ruby has a great set of built-in and external libraries for quick, powerful, and efficient text processing. You can harness this capability to simplify and enhance a variety of textual data processing needs that you might encounter. This article just touches upon a few of the aspects of this ability of Ruby. You can achieve a lot more.

Ruby is definitely a great tool that you'll want in your toolbox.



Download

Description | Name | Size | Download method
Sample code for the article | sample-code.zip | 2KB | HTTP
