07 May 2010

Java: parsing fixed-length files (2)

http://jsapar.tigris.org/

Mission

The goal of this project is to create a Java library that contains a parser for flat files and CSV files. The library should be simple to use and easy to extend.

Existing features

  • Support for flat files with fixed positions.
  • Support for CSV files.
  • The schema can be expressed in XML notation or created directly in Java code.
  • The parser can either produce a Document class, representing the content of the file, or you can choose to receive events for each line that has been successfully parsed.
  • Can handle huge files without loading everything into memory.
  • The output Document class contains a list of lines, each of which contains a list of cells.
  • The Document class can be transformed into Java objects (via reflection) if the schema is carefully written.
  • It is also possible to produce Java objects directly from the parser.
  • It is possible to convert a list of Java objects into a file according to a schema if the schema is carefully written.
  • The Document class can be built from an XML file (according to an internal XML schema).
  • Input and output are given as java.io.Reader and java.io.Writer, which means that what is parsed or generated does not have to be a file.
  • The file parsing schema contains information about how to parse each cell regarding data type and syntax.
  • Parsing errors can either be reported by an exception thrown at the first error, or collected during parsing so that they can be dealt with later.
  • JUnit tests for most classes within the library.
  • Support for localisation.
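
The two error-handling modes listed above (fail at the first error, or collect errors for later) can be illustrated with a minimal sketch. It does not use JSaPar; the class ErrorModeSketch and the method parseInts are hypothetical names invented for this example, and the convention that a null error list selects fail-fast mode is an assumption of the sketch only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names belong to JSaPar. */
final class ErrorModeSketch {

    /**
     * Parses every cell as an integer.
     * If errors is null, the first bad cell raises an exception (fail fast).
     * Otherwise bad cells are skipped and a message is added to errors.
     */
    static List<Integer> parseInts(List<String> cells, List<String> errors) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < cells.size(); i++) {
            try {
                values.add(Integer.parseInt(cells.get(i).trim()));
            } catch (NumberFormatException e) {
                String message = "Cell " + i + " is not an integer: " + cells.get(i);
                if (errors == null) {
                    throw new IllegalArgumentException(message, e);
                }
                errors.add(message);
            }
        }
        return values;
    }
}
```

Collecting errors suits batch imports where one bad line should not stop the whole run; failing fast suits cases where any error makes the input unusable.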

Java Schema Parser

The javadoc within the package contains more comprehensive documentation regarding the classes mentioned below.

The JSaPar package is a Java library that provides a parser for flat and CSV (Comma Separated Values) files. The concept is that a schema class describes how a file should be parsed or written. The schema can be built from an XML document or constructed programmatically in Java code. The output of the parser is usually an org.jsapar.Document object that contains a list of org.jsapar.Line objects, each of which contains a list of org.jsapar.Cell objects.

Supported file formats:
  • Fixed width - Also referred to as flat file. Each cell is described only by its position within the line. The type of the line is denoted by its position within the file.
  • Fixed width control value - The same as Fixed width above, except that each line type is denoted by a control value in the leading characters of each line.
  • CSV - (Comma Separated Values) Each cell is delimited by a separator character (or characters). The type of the line is denoted by its position within the file.
  • CSV control value - The same as CSV above, except that each line type is denoted by a control value in the leading cell of each line.
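
To make the fixed-width formats above concrete, here is a minimal sketch in plain Java. It is not the JSaPar API: FixedWidthSketch, CellDef, parseLine and parseByControlValue are hypothetical names, and representing a line type as a list of (name, start, length) entries is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these types belong to JSaPar. */
final class FixedWidthSketch {

    /** One cell of a line type: a name plus its position within the line. */
    static final class CellDef {
        final String name;
        final int start;
        final int length;

        CellDef(String name, int start, int length) {
            this.name = name;
            this.start = start;
            this.length = length;
        }
    }

    /** Cuts a line into named cells purely by character position. */
    static Map<String, String> parseLine(String line, List<CellDef> cells) {
        Map<String, String> result = new LinkedHashMap<>();
        for (CellDef cell : cells) {
            // Clamp to the line length so short lines yield empty cells.
            int begin = Math.min(cell.start, line.length());
            int end = Math.min(cell.start + cell.length, line.length());
            result.put(cell.name, line.substring(begin, end).trim());
        }
        return result;
    }

    /** Control-value variant: the leading characters select the line type. */
    static Map<String, String> parseByControlValue(
            String line, int controlLength, Map<String, List<CellDef>> lineTypes) {
        String control = line.substring(0, Math.min(controlLength, line.length()));
        List<CellDef> cells = lineTypes.get(control);
        if (cells == null) {
            throw new IllegalArgumentException("Unknown control value: " + control);
        }
        return parseLine(line, cells);
    }
}
```

In the plain fixed-width case the caller would pick the cell list by line number; in the control-value case the line itself says which cell list applies.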

Events for each line

For very large files it can be a problem to build the complete org.jsapar.Document in memory before further processing. It may simply take up too much memory. In that case you may choose to receive an event for each parsed line instead. You do that by registering a subclass of org.jsapar.ParsingEventListener with the org.jsapar.input.Parser. That way you can process one line at a time, freeing memory as you go along.
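
The idea of receiving one event per line can be sketched without the library. The code below is only an illustration of the pattern: LineEventSketch and LineListener are hypothetical names, not org.jsapar.ParsingEventListener, and the simple split on a separator is an assumption that ignores quoting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not the JSaPar listener API. */
final class LineEventSketch {

    /** Callback invoked once per parsed line. */
    interface LineListener {
        void onLine(long lineNumber, String[] cells);
    }

    /** Reads the input line by line and hands each line to the listener. */
    static long parse(Reader input, String separator, LineListener listener) {
        try {
            BufferedReader reader = new BufferedReader(input);
            String quoted = Pattern.quote(separator); // treat separator literally
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                count++;
                // Only the current line is held in memory at this point.
                listener.onLine(count, line.split(quoted, -1));
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because nothing is accumulated inside parse, memory use stays flat regardless of file size; whatever the listener keeps is the caller's decision.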

Converter

If you are only interested in converting a file from one format into another, you can use the org.jsapar.io.Converter, where you specify the input and the output schema for the conversion. The converter uses the event mechanism under the hood, so it reads, converts and writes one line at a time. This keeps memory usage very low.
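
A line-at-a-time conversion of this kind might look roughly like the following sketch, which turns fixed-width input into separated output. It is not org.jsapar.io.Converter: LineConverterSketch and its convert method are hypothetical, the widths array stands in for an input schema, and no escaping of separators inside cell values is attempted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical illustration only; this is not org.jsapar.io.Converter. */
final class LineConverterSketch {

    /** Reads fixed-width lines and writes separated lines; returns the line count. */
    static int convert(Reader input, Writer output, int[] widths, String separator) {
        try {
            BufferedReader reader = new BufferedReader(input);
            int lines = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                StringBuilder out = new StringBuilder();
                int pos = 0;
                for (int i = 0; i < widths.length; i++) {
                    int begin = Math.min(pos, line.length());
                    int end = Math.min(pos + widths[i], line.length());
                    if (i > 0) {
                        out.append(separator);
                    }
                    out.append(line.substring(begin, end).trim());
                    pos += widths[i];
                }
                // Each converted line is written before the next one is read.
                output.write(out.append('\n').toString());
                lines++;
            }
            output.flush();
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the method takes a Reader and a Writer, the same code works for files, sockets or in-memory strings.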

Building java objects

Use the method org.jsapar.Parser.buildJava() to build Java objects for each line in a file (or other input). Note that to use this feature, the schema has to be carefully written. For instance, the line type (name) of the line within the schema has to contain the complete class name of the Java class to build for each line.
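
The requirement that the line type carries the complete class name suggests a reflection-based mechanism. The sketch below shows one way such a mechanism could work; it is a guess at the general technique, not the library's implementation. BeanBuildingSketch, build and the sample Person bean are hypothetical, and only String setters are handled (no type conversion).

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical illustration only; this is not how JSaPar is known to work. */
final class BeanBuildingSketch {

    /** Sample bean used for the illustration. */
    public static class Person {
        private String firstName;

        public Person() {}

        public String getFirstName() { return firstName; }

        public void setFirstName(String firstName) { this.firstName = firstName; }
    }

    /**
     * Creates an instance of the class named by lineType and calls a
     * String setter for each cell, e.g. cell "firstName" -> setFirstName.
     */
    static Object build(String lineType, Map<String, String> cells) {
        try {
            Class<?> type = Class.forName(lineType);
            Object bean = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> cell : cells.entrySet()) {
                String name = cell.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method method = type.getMethod(setter, String.class);
                method.invoke(bean, cell.getValue());
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot build " + lineType, e);
        }
    }
}
```

This also shows why the schema must be written carefully: a cell name without a matching setter, or a line type that is not a loadable class, makes the build fail.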

Converting java objects into a file

Use the class org.jsapar.input.JavaBuilder to convert Java objects into an org.jsapar.Document, which can then be used to produce the output file according to a schema.
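
The reverse direction, from Java objects to a line of output, can be sketched in the same spirit. This is not org.jsapar.input.JavaBuilder: BeanWritingSketch, formatLine and the sample Item bean are hypothetical, the property list and widths stand in for a schema, and widths are assumed to be positive.

```java
import java.lang.reflect.Method;

/** Hypothetical illustration only; this is not org.jsapar.input.JavaBuilder. */
final class BeanWritingSketch {

    /** Sample bean used for the illustration. */
    public static class Item {
        private final String code;
        private final String quantity;

        public Item(String code, String quantity) {
            this.code = code;
            this.quantity = quantity;
        }

        public String getCode() { return code; }

        public String getQuantity() { return quantity; }
    }

    /** Reads each property through its getter and emits a fixed-width line. */
    static String formatLine(Object bean, String[] properties, int[] widths) {
        StringBuilder line = new StringBuilder();
        try {
            for (int i = 0; i < properties.length; i++) {
                String name = properties[i];
                String getter = "get" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method method = bean.getClass().getMethod(getter);
                String value = String.valueOf(method.invoke(bean));
                // Pad on the right to the cell width, then cut off any overflow.
                String padded = String.format("%-" + widths[i] + "s", value);
                line.append(padded, 0, widths[i]);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read " + bean.getClass().getName(), e);
        }
        return line.toString();
    }
}
```

Silent truncation of overlong values is a choice made for brevity here; a real writer would more likely report it as an error.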

Using xml as input

It is possible to build an org.jsapar.Document from an XML document that conforms to the XMLDocumentFormat.xsl (http://jsapar.tigris.org/XMLDocumentFormat/1.0). Use the class org.jsapar.input.XmlDocumentParser to convert an XML document into an org.jsapar.Document.

1 comment:

Wilhelm said...

What is the purpose of this blog? I would like to learn JBoss Seam, but these are just references. Any advice?