Chitika

Monday, December 13, 2004

POI for Excel parser

It's been over a month since my last post. Sorry guys. It's been busy at work, and no internet connection. I had a nice two weeks vacation though, and it was fun.. :D

Anyway, during the vacation, I happened to help a friend of mine finish up his project. I helped him to create a component to parse resume documents. These documents can be in Word, Excel, HTML or plain text. He wanted me to be able to parse the content of the documents, removing all the formatting code, and then index the document and store it into the database for google-like search later on.

Again, Java and the Open Source community has been a great help in achieving this goal. There are many options for parsing documents, but here are my final choices:
- Apache Jakarta POI, for parsing Excel documents
- TextMining, for parsing Word documents
- HTMLParser, for parsing HTML documents
These open source libraries have helped me in a great deal to complete the component.

Here's a sample code for parsing an Excel document:

public interface Parser {
  public void setFileName (String fileName);
  public String getParsedText ();
  public void parse ();
}

public class ExcelParser implements Parser {
  private String fileName;
  private String parsedText;

  /**
   * Using HSSF in POI library, this method will parse
   * the Excel file, and extract all the values for
   * every sheet, and combine them into a single
   * String.
   */
  public void parse () {
    // initializing the InputStream from a file using
    // POIFSFileSystem, before converting the result
    // into an HSSFWorkbook instance
    HSSFWorkbook wb = null;
    try {
      InputStream is = this.getClass()
        .getResourceAsStream (fileName);
      POIFSFileSystem fs = new POIFSFileSystem (is);
      wb = new HSSFWorkbook (fs);
    } catch (IOException e) {
      e.printStackTrace();
    }

    // start appending the values
    StringBuffer sb = new StringBuffer ();

    // loop for every worksheet in the workbook
    int numOfSheets = wb.getNumberOfSheets();
    for (int i=0; i<numofsheets; i++) {
      HSSFSheet sheet = wb.getSheetAt (i);

      // loop for every row in each worksheet
      for (Iterator rows = sheet.rowIterator();
              rows.hasNext(); ) {
        HSSFRow row = (HSSFRow) rows.next ();
        short c1 = row.getFirstCellNum();
        short c2 = row.getLastCellNum();

        // loop for every cell in each row
        for (short c=c1; c<c2; c++) {
          HSSFCell cell = row.getCell(c);
          if (cell != null) {
            String cellValue = getCellValue (cell);
            if (cellValue != null &&
                cellValue.trim().length() > 0) {

              // append the value of the cell separated
              // by a SPACE
              b.append (TextFilter.SPACE)
               .append (cellValue);
            }
          }
        }
      }
    }

    // store the parsed Text
    parsedText = sb.toString().trim();

    // filter the parsed Text, replacing the forbidden
    // characters with SPACE
    parsedText = TextFilter.filterForbiddenCharacters (
      parsedText,
      TextFilter.DEFAULT_FORBIDDEN_CHARACTERS);
    }


  /**
   * This is a helper method to retrieve the value of a
   * cell regardles of its type, which will be converted
   * into a String.
   *
   * @param cell
   * @return
   */
  private String getCellValue (HSSFCell cell) {
    if (cell == null) return null;

    String result = null;

    int cellType = cell.getCellType();
    switch (cellType) {
      case HSSFCell.CELL_TYPE_BLANK:
        result = "";
        break;
      case HSSFCell.CELL_TYPE_BOOLEAN:
        result = cell.getBooleanCellValue() ?
          "true" : "false";
        break;
      case HSSFCell.CELL_TYPE_ERROR:
        result = "ERROR: " + cell.getErrorCellValue();
        break;
      case HSSFCell.CELL_TYPE_FORMULA:
        result = cell.getCellFormula();
        break;
      case HSSFCell.CELL_TYPE_NUMERIC:
        HSSFCellStyle cellStyle = cell.getCellStyle();
        short dataFormat = cellStyle.getDataFormat();

        // assumption is made that dataFormat = 15,
        // when cellType is HSSFCell.CELL_TYPE_NUMERIC
        // is equal to a DATE format.
        if (dataFormat == 15) {
          result = cell.getDateCellValue().toString();
        } else {
          result = String.valueOf (
            cell.getNumericCellValue());
        }

        break;
      case HSSFCell.CELL_TYPE_STRING:
        result = cell.getStringCellValue();
        break;
      default: break;
    }

    return result;
  }

  public void setFileName(String fileName) {
    this.fileName = fileName;
  }

  public String getParsedText() {
    return parsedText;
  }
}


public class TextFilter {
  public static final char CH07 = (char) 7;
  public static final char CH10 = (char) 10;
  public static final char CH13 = (char) 13;
  public static final char[] DEFAULT_FORBIDDEN_CHARACTERS =
    { CH07, CH10, CH13 };
  public static final char SPACE = (char) 32;

  public static String filterForbiddenCharacters (
    String s, char[] forbidden) {
    if (s == null || s.trim().length() <= 0) return s;
      for (int i = 0; i<forbidden.length; i++) {
        char c = forbidden[i];
        s = s.replace (c, SPACE);
      }
      return s;
    }
  }
}

I'll try sharing more codes later.. :D

5 comments:

  1. Do people out there have a problem with HSSFSheet.rowsIterator() returning all 65536 rows of a spreadsheet, even though most rows were empty? Did I do something wrong in Excel when I created this sheet?
    -Anon

    ReplyDelete
  2. I could not find the excel sheet for the corresponding program.

    ReplyDelete
  3. This is such a good tutorial. It's exactly what I am looking for. I'll give it a try. Thank you very much for doing a good job.

    ReplyDelete
  4. This is such a good tutorial. It's exactly what I am looking for. Thanks for doing a good job.

    ReplyDelete
  5. This what exactly I was looking for, but in my program it works for little changes (May be due to different jar file)

    didn't worked --->
    InputStream is = this.getClass()
    .getResourceAsStream (fileName);

    I have used --->
    InputStream is = new FileInputStream(fileName);

    Its working fine, thanks for the help

    ReplyDelete