Exploring alternative methods of accounting data input

Use open source OCR and barcode methods to Input data into an application

When the opportunity cost of time is high, data input into an accounting program from printed receipts and invoices can be tedious and subject to error, especially when long alphanumeric strings are involved. Alternative methods are offered by technologies like optical character recognition (OCR) and barcodes. This article examines the context of the problem and explores creative open source methods of data input, including reading a printed receipt using OCR methods and using encryption and decryption of QR codes. Finally, the article uses PHP to extract the results and format a prototypical XML output file.

Share:

Colin Beckingham, Author, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a variety of fields, including banking, horticulture, horse racing, teaching, civil service, retail, and travel/tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP and voice-control applications on Linux. You can reach him at colbec@start.ca.



30 November 2010

Also available in Japanese

The problem

Procrastination is the enemy of accounting data input. You throw all your receipts and invoices into a shoebox for later attention, then bookkeeping day comes and the data has to be massaged into your software, at which time it takes most of a day to catch up. Some data input can come from a Quicken Interchange Format (QIF) or Open Financial Exchange (OFX) file downloaded from a financial institution, but this rarely captures all data input required.

A technical specialist could look at this issue and pose the question: Can you use technology to make this task less onerous? The process is the same for each piece of paper:

  1. Read the invoice.
  2. Decide if the data will be read in from another source (for example, a bank download statement).
  3. Assess whether it is a personal or business transaction.
  4. Map out what subaccounts are involved.
  5. Begin, continue, and conclude the transaction entries.
  6. Check that the overall sum of debits and credits for the transaction match.
  7. Ensure that all detail is recorded.

On the one hand there is some human interpretation required, and on the other there is simple mechanical reading and recording. Technology should be able to assist in making this process less painful and more accurate. Two approaches that look as if they can help are OCR and barcode scanning.


OCR and barcodes

OCR and barcodes are only two of many technologies that could help input accounting data. For OCR, the developer is trying to machine-read the text in a printed receipt. For a barcode, you expect that the vendor will print a barcode on the receipt with appropriate information encoded so it can be scanned later. For these tools, the question might be in two parts: Given an arbitrary receipt, can you scan it sufficiently well to extract data? And then, given that data, can you import it directly into the accounting application? Its usefulness as an aid then depends on whether the alternative process of data input is less onerous than the keyboard approach. If not, the application is still one of many solutions looking for a problem.

Scanning images and barcodes is not usually difficult. However, receipts present special problems. They are often printed on a small printer with a fading ribbon on poor quality rough paper (which may curl and have a poorly reflective surface), resulting in an image that can be read by a human, but less easily by a machine. The ability to scan receipts for OCR purposes can vary from good to quite impossible. Scanning barcodes is generally highly successful provided the quality of the printed barcode is good.

Scripting raw text input into something that the final accounting application can use is simplified by the fact that, in general, receipts follow a pattern. First come the name and contact details of the vendor, perhaps followed by the date, then one or more lines of itemized data containing a description, the price, and what taxes are applied. Then follows a subtotal, additional charges such as shipping, handling, brokerage, taxes, and the total. At some point, a tax registration number, method of payment, and other data may appear. This order may vary but generally only between vendors. One vendor's point of sale (POS) machines are usually consistently the same from location to location.


A practical example

Figure 1 shows an example of a fictional receipt. It is not representative of the average readability, but is superior from a scanning point of view. Some receipts in my collection scan poorly or not at all.

Figure 1. The raw receipt image
Example receipt

A human reader would scan this receipt, decide on the data it contains, then enter the transaction via the keyboard. Table 1 shows some example double-entry detail for this receipt.

Table 1. Bookkeeping entries from receipt
AccountDebitsCredits
Miscellaneous expenses64.68
Tax Paid6.47
MasterCard Account71.15
Totals71.1571.15

To make the link with the final accounting application, you need to make an assumption: If you can create an XML file in the prototypical format (see Listing 1), the application will be able to import it.

Listing 1. Simplified XML format for invoice
<invoice>
  <transaction date="xxxx-xx-xx" currency="XXX">
    <vendor>
      <name></name>
      <phone></phone>
      <tax_reg></tax_reg>
    </vendor>
    <entry>
      <detail></detail>
      <account></account>
      <amount></amount>
    </entry>
  </transaction>
</invoice>

In this format, an invoice creates one or more transactions, each of which consists of one vendor block with information about the vendor, and multiple entries for each transaction (associated with the rows in Table 1). Each entry contains a detail element, the account to which the amount is applied and the amount which can be positive or negative. An output in OFX or Organization for the Advancement of Structured Information Standards (OASIS) Invoice format is simply a matter of adjusting names of elements and the schema.


The OCR approach

This article uses Tesseract (see Resources) as an example open source character recognition application. While Tesseract basically follows the usual ./configure, make, make install process, it is important to follow the instructions in the bundled INSTALL file carefully — there are a lot of points to bear in mind to achieve a successful install.

The following instruction asks Tesseract to examine Figure 1 stored in Tagged Image File Format (TIFF) format and then use the second argument, wm, as the base of a file name to receive output. The output automatically goes to the wm.txt file in the current directory.

tesseract wm.tif wm

The wm.txt file then contains the output shown in Listing 2.

Listing 2. The output from Tesseract
ACME HARDWARE
88 MAIN STREET
ANYTOWN, ST 12345-67890
123-555-6789
TAX NO - 987654-321
CUSTOMER - CASH SALE
ORDER - 000456
DATE - 2010-08-07
DESC - SKU
ITEM - 12345
2 @ 12 34 = 24 68
SECOND - 98765
2 @ 15 00 = 30 00
THIRD - 44887744
2 @ 5 00 = 10 00
SUBTOTAL = 64 68
TAX 10% = 6 47
TOTAL = 71 15
PAYMENT - MASTERCARD
TRANS - 0678453
REGISTER - 22
EMPLOYEE - 456
THANKYOU EOR SHOPPING AT
ACME

Tesseract makes only two types of errors in this case: missing all the decimal points and mistaking one F for an E. I took special care to try to make Figure 1 as compatible with Tesseract as possible, using Verdana font size 12, only uppercase characters, and only single spaces.

At this point, you just need a script that cleans up the text, extracts the information, outputs the required XML, and you are done. However, because some of this might be duplicated for the barcodes example, I will delay detailing the script until barcodes have been considered.


The barcode approach

Barcodes are specialist images that are designed to convey information cleanly and clearly with attention to error checking. Many vendors already print a 1-D barcode on receipts for their own purposes. This section considers one such format, the QR code, for which open source tools are readily available and readers are commonly implemented on portable electronics such as cell phones.

QR codes are 2-D images designed to carry alphanumeric and other information. Able to encode a maximum of about 4,000 alphanumeric characters in a square image, they lend themselves nicely to placement on a narrow receipt slip. They have already found a niche use on wine bottle labels, with information encoded that allows a potential purchaser or drinker to use a scanner built into many mobile phones to find more information about the contents of the bottle. Using barcodes in the context of receipts presupposes that the vendor is willing to add the barcodes to the printed receipt. But then many vendors are willing to consider ways to distinguish their quality of service to customers, so this is not a far stretch.

The barcode in Figure 2 contains the text information from Figure 1.

Figure 2. Example of QR code
Example QR code

Note that the barcode does not have to contain the raw information directly, but can contain a (much shorter) URL pointing back to the vendor's server, retrieving the information indirectly from a database.

There are a number of open source tools able to create and read QR codes. This article uses qrencode to encode and the Java™-based Open Source QR Code Library (qrcode) to decode (see Resources). You can do the same things using many other free testing services.

The installation of qrencode is straightforward with the usual ./configure, make, make install. Make sure that you have the javac routine available on your system, then unpack the code from QR code and run the examples as indicated in the bundled QUICKSTART.txt file.

The following command encodes the text from Figure 1 when stored as regular text in the acmesrc.txt file and stores it in an image file called acmeqr.png:

> qrencode -o acmeqr.png < acme.txt

In a similar way, you can get QR code to check that the content of acmeqr.png is satisfactory:

> java -cp classes:lib/qrcode.jar example.QRCodeDecoderCUIExample \
      /path-to-image/acmeqr.png

This instruction from the qrcode directory should repeat back the content of the acme.txt file. Note that the size of this square image containing 440 characters is close to the 8 cm width of a POS printer slip. Encoding more information is certainly possible but implies a wider receipt to accommodate the larger image.


Mapping to XML

The next step is to take the results of the scan (whether from OCR or QR code) and put it into the XML format ready for import into the accounting application. For this, you require a scripting language: Python, Perl, PHP, and others are easily capable of this. This example uses PHP with SimpleXML and Document Object Model (DOM) XML routines. I chose the Tesseract output from Listing 2, with its faults, as the test case for input into the script. You can also use input from a barcode in the same way, probably with less error checking required.

The script needs to do a number of things:

  1. Read the raw input.
  2. Decide on the vendor format.
  3. Correct any errors.
  4. Map the data into variables intelligently.
  5. Output the XML.

Listing 3 shows a sample script.

Listing 3. Example mapping script
<?php
// mapper - Colin Beckingham 2010
//
// firstly load the data
if (!$infile = $argv[1]) die("Missing input file!\n");
$raw = file_get_contents($infile);
$inarray = explode("\n",$raw);
$vendor["name"] = $inarray[0];
$sumcheck=0;
switch ($vendor["name"]) {
  case "ACME HARDWARE":
    $invoid = substr($inarray[6],8);
    $invdate = substr($inarray[7],7);
    $index["sku"] = 8;
    $index["subt"] = array_search_substr("SUBTOTAL",$inarray);
    $index["tax"] = $index["subt"]+1;
    $index["tot"] = $index["subt"]+2;
    $index["method"] = $index["subt"]+3;
    $vendor["phone"] = $inarray[3];
    $vendor["tax_reg"] = substr($inarray[4],9);
    $detail = $vendor["name"]." $invoid";
    $items["expdetail"]=$detail;
    $items["expaccount"]=1;
    $merch = substr($inarray[$index["subt"]],strpos($inarray[$index["subt"]],"=")+2);
    $merch = str_replace(" ",".",$merch);
    $items["expamount"]=($merch)*-1;
      $sumcheck += $items["expamount"];
    $items["taxdetail"]=$detail;
    $items["taxaccount"]=2;
    $tax = substr($inarray[$index["tax"]],strpos($inarray[$index["tax"]],"=")+2);
    $tax = str_replace(" ",".",$tax);
    $items["taxamount"]=($tax)*-1;
      $sumcheck += $items["taxamount"];
    $items["astdetail"]=$detail;
    $items["astaccount"]=3;
    $tot = substr($inarray[$index["tot"]],strpos($inarray[$index["tot"]],"=")+2);
    $tot = str_replace(" ",".",$tot);
    $items["astamount"]=$tot;
      $sumcheck += $items["astamount"];
      if ($sumcheck != 0) echo "Unbalanced entries! ($sumcheck)\n";
  break;
  default:
    die("Vendor not recognized!\n");
  break;
}
// secondly create XML and substitute data
$basic = "<?xml version=\"1.0\" ?>";
$basic .= "<invoice></invoice>";
$xml = simplexml_load_string($basic);
$xml->addchild("transaction");
$xml->transaction->addAttribute("date",$invdate);
$xml->transaction->addAttribute("currency","CAD");
$xml->transaction->addchild("vendor");
$xml->transaction->vendor->addchild("name",$vendor["name"]);
$xml->transaction->vendor->addchild("phone",$vendor["phone"]);
$xml->transaction->vendor->addchild("tax_reg",$vendor["tax_reg"]);
$xml->transaction->addchild("entry");
$j=0;
$xml->transaction->entry[$j]->addchild("detail",$items["expdetail"]);
$xml->transaction->entry[$j]->addchild("account",$items["expaccount"]);
$xml->transaction->entry[$j]->addchild("amount",$items["expamount"]);
if ($items["taxamount"] != 0) {
  $xml->transaction->addchild("entry");
  $j++;
  $xml->transaction->entry[$j]->addchild("detail",$items["taxdetail"]);
  $xml->transaction->entry[$j]->addchild("account",$items["taxaccount"]);
  $xml->transaction->entry[$j]->addchild("amount",$items["taxamount"]);
}
$xml->transaction->addchild("entry");
$j++;
$xml->transaction->entry[$j]->addchild("detail",$items["astdetail"]);
$xml->transaction->entry[$j]->addchild("account",$items["astaccount"]);
$xml->transaction->entry[$j]->addchild("amount",$items["astamount"]);
show_nice_xml($xml);
//
function show_nice_xml($xml) {
  $doc = new DOMDocument('1.0');
  $doc->formatOutput = true;
  $domnode = dom_import_simplexml($xml);
  $domnode = $doc->importNode($domnode, true);
  $domnode = $doc->appendChild($domnode);
  echo $doc->saveXML();
}
function array_search_substr($needle,$haystack) {
  foreach ($haystack as $k=>$h) {
    if (substr($h,0,strlen($needle)) == $needle) return $k;
  }
  return false;
}
?>

The first section looks for the required argument supplied in the command line, which gives the name of the text file output from Tesseract in Listing 2. The script loads the file text into a string variable, explodes it into an array using new lines, and determines the name of the vendor from the first element of the array. If the vendor name is recognized, the switch statement follows the expected pattern for that vendor. If the vendor is unknown, it stops the script. The vendor detail, date, and order ID number are in expected places in the array, but the array indexes of the subtotal and subsequent lines depend on the number of items detailed. You can determine the index of the SUBTOTAL element by searching for the substring SUBTOTAL. When you have determined this index, you then know the number of items, and you can determine the indexes of the elements occurring after the subtotal line as offsets. You can use a for statement to iterate over multiple items to extract fine detail, but I omitted this for simplicity.

Now that you know the indexes, you can extract the values for the merchandise subtotal, taxes, and final total from the array. Because I anticipate missing decimal points in amounts for this vendor and method, the script inserts them if necessary. The script keeps a running total of the amounts and issues a warning if they do not sum to zero at the end. It also uses the vendor name and the order ID for the detail requirements.

Having assigned the needed variables into the items array, the script now generates the XML. It begins by creating the root invoice element as a string and loading it into a SimpleXML object. It then adds the required transaction element with its two attributes and the vendor element with its children. Then it uses the information to generate three entry elements (one each for the expense, tax, and asset categories), each of which has detail, account, and amount children. If the item is not taxable, it does not create an entry element for tax. Finally, the script outputs a clean indented version of the XML, a step which is only necessary when testing.

You can call the script in the following way to produce the final mapped output:

> php /path-to/mapper.php /path-to/acme.txt

Listing 4 shows the XML output, which is in the format required for input into the accounting application (compare with Listing 1).

Listing 4. XML results of mapping script
<?xml version="1.0"?>
<invoice>
  <transaction date="2010-08-07" currency="CAD">
    <vendor>
      <name>ACME HARDWARE</name>
      <phone>123-555-6789</phone>
      <tax_reg>987654-321</tax_reg>
    </vendor>
    <entry>
      <detail>ACME HARDWARE 000456</detail>
      <account>1</account>
      <amount>-64.68</amount>
    </entry>
    <entry>
      <detail>ACME HARDWARE 000456</detail>
      <account>2</account>
      <amount>-6.47</amount>
    </entry>
    <entry>
      <detail>ACME HARDWARE 000456</detail>
      <account>3</account>
      <amount>71.15</amount>
    </entry>
  </transaction>
</invoice>

Using the QR code output requires much less error checking. The basic cycle is now complete.

Further possibilities are clear, the most important of which is a need for the script to examine the individual items one at a time and map them to known accounts, which is something that the script can "learn" over time by matching new items to those on previous invoices.


Conclusion

This article demonstrates that it is possible, admittedly using some assumptions, to use OCR and QR codes to input accounting data. Whether such alternative methods of input are worth the effort depends on how clean the output from OCR is, how much effort is saved, and how accuracy is improved. Where complex alphanumeric items have to be keyed in (such as SKU numbers and tax registration details) and there is a possibility of batch input of multiple receipts, an automated input pays for itself rapidly.

As developers, the goal is to make products as useful as possible for the end users. If a major objective of the printed receipt is to allow end users to scan and read with OCR, then there are a number of points to keep in mind:

  • Paper should be as white as possible, printing as black as possible, and printing as consistent as possible.
  • It is helpful to both human readers and machine readers if the date is close to the top of the receipt.
  • Fonts should be compatible with a wide range of OCR decoders.
  • Vendor name on a printed receipt should be in a decodable format. Many vendors have fancy icons that are recognizable to humans but not to OCR. If you want to use your icon that's fine, but repeat the vendor name in a standard font as well.
  • Advertising should be separate from the receipt information. It is fair game to add an advertisement to the end of the receipt, but if the advertising is inserted into the body of the receipt it makes it more difficult for both human readers and OCR to find pertinent information
  • Clearly separate non-numeric symbols such as currency signs and tax application markers from the currency values.

OCR and QR codes are not by any means the only alternatives. Another 2-D barcode format is PDF417, which in macro mode is able to split one large barcode into multiple smaller barcodes, but few non-specialist portable devices support this format. Another cool possibility is to have a human reader input the detail using an intelligent voice interaction routine.

Resources

Learn

Get products and technologies

  • Download Tesseract.
  • Check out qrencode, a QR encoder.
  • Explore the Open Source QR Code Library.
  • Learn about GOCR, which is a free OCR program.
  • OCRopus is another free OCR system you can investigate.
  • Drupal.org is the official website of Drupal, an open source content management platform.
  • WordPress is a state-of-the-art publishing platform with a focus on aesthetics, web standards, and usability. WordPress is both free and priceless at the same time.
  • Mambo is a full-featured, award-winning content management system that can be used for everything from simple websites to complex corporate applications.
  • SilverStripe is an innovative open source CMS and framework.
  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
  • Download IBM product evaluation versions or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source
ArticleID=588432
ArticleTitle=Exploring alternative methods of accounting data input
publish-date=11302010