fileparser


README


fileparser is a Go library for parsing various tabular data formats. It provides a unified interface for reading CSV, TSV, LTSV, Parquet, and XLSX files, with optional compression support.

This package is designed to be used by filesql, fileprep, and fileframe.

  • fileprep: struct-tag preprocessing and validation for CSV/TSV/LTSV, Parquet, Excel.
  • filesql: SQL driver for CSV, TSV, LTSV, Parquet, and Excel with compression support.
  • fileframe: DataFrame API for CSV/TSV/LTSV, Parquet, Excel.

Features

  • Multiple formats: CSV, TSV, LTSV, Parquet, XLSX
  • Compression support: gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4
  • Type inference: Automatically detects column types (TEXT, INTEGER, REAL, DATETIME)
  • File type detection: Detects file format from path extension
  • Pure Go: No CGO required for any compression format

Installation

go get github.com/nao1215/fileparser

Usage

Parsing CSV
csvData := `name,age,score
Alice,30,85.5
Bob,25,92.0
Charlie,35,78.5`

result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("Records:", len(result.Records))
fmt.Println("First row:", result.Records[0])

Output:

Headers: [name age score]
Records: 3
First row: [Alice 30 85.5]

Parsing TSV
tsvData := `id	product	price
1	Laptop	999.99
2	Mouse	29.99
3	Keyboard	79.99`

result, err := fileparser.Parse(strings.NewReader(tsvData), fileparser.TSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("Records:", len(result.Records))

Output:

Headers: [id product price]
Records: 3
Parsing LTSV
ltsvData := `host:192.168.1.1	method:GET	path:/index.html
host:192.168.1.2	method:POST	path:/api/users`

result, err := fileparser.Parse(strings.NewReader(ltsvData), fileparser.LTSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("First row:", result.Records[0])

Output:

Headers: [host method path]
First row: [192.168.1.1 GET /index.html]

Auto-detect File Type
paths := []string{
    "data.csv",
    "data.csv.gz",
    "report.xlsx",
    "logs.ltsv.zst",
    "analytics.parquet",
}

for _, path := range paths {
    ft := fileparser.DetectFileType(path)
    fmt.Printf("%s -> %s\n", path, ft)
}

Output:

data.csv -> CSV
data.csv.gz -> CSV (gzip)
report.xlsx -> XLSX
logs.ltsv.zst -> LTSV (zstd)
analytics.parquet -> Parquet

Check Compression
types := []fileparser.FileType{
    fileparser.CSV,
    fileparser.CSVGZ,
    fileparser.Parquet,
    fileparser.ParquetZSTD,
}

for _, ft := range types {
    fmt.Printf("%s compressed: %v\n", ft, fileparser.IsCompressed(ft))
}

Output:

CSV compressed: false
CSV (gzip) compressed: true
Parquet compressed: false
Parquet (zstd) compressed: true

Get Base File Type
types := []fileparser.FileType{
    fileparser.CSV,
    fileparser.CSVGZ,
    fileparser.TSVBZ2,
    fileparser.ParquetZSTD,
}

for _, ft := range types {
    base := fileparser.BaseFileType(ft)
    fmt.Printf("%s -> %s\n", ft, base)
}

Output:

CSV -> CSV
CSV (gzip) -> CSV
TSV (bzip2) -> TSV
Parquet (zstd) -> Parquet

Convert String Values to Typed Values
// Integer column
intVal := fileparser.ParseValue("42", fileparser.TypeInteger)
fmt.Printf("Integer: %v (%T)\n", intVal, intVal)

// Real column
realVal := fileparser.ParseValue("3.14", fileparser.TypeReal)
fmt.Printf("Real: %v (%T)\n", realVal, realVal)

// Text column
textVal := fileparser.ParseValue("hello", fileparser.TypeText)
fmt.Printf("Text: %v (%T)\n", textVal, textVal)

// Empty value returns nil
nilVal := fileparser.ParseValue("", fileparser.TypeInteger)
fmt.Printf("Empty: %v\n", nilVal)

Output:

Integer: 42 (int64)
Real: 3.14 (float64)
Text: hello (string)
Empty: <nil>

Automatic Column Type Inference
csvData := `id,name,score,date
1,Alice,85.5,2024-01-15
2,Bob,92.0,2024-01-16
3,Charlie,78.5,2024-01-17`

result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
if err != nil {
    log.Fatal(err)
}

for i, header := range result.Headers {
    fmt.Printf("%s: %s\n", header, result.ColumnTypes[i])
}

Output:

id: INTEGER
name: TEXT
score: REAL
date: DATETIME

Supported File Types

Format Extension Compressed Variants
CSV .csv .csv.gz, .csv.bz2, .csv.xz, .csv.zst, .csv.z, .csv.snappy, .csv.s2, .csv.lz4
TSV .tsv .tsv.gz, .tsv.bz2, .tsv.xz, .tsv.zst, .tsv.z, .tsv.snappy, .tsv.s2, .tsv.lz4
LTSV .ltsv .ltsv.gz, .ltsv.bz2, .ltsv.xz, .ltsv.zst, .ltsv.z, .ltsv.snappy, .ltsv.s2, .ltsv.lz4
Parquet .parquet .parquet.gz, .parquet.bz2, .parquet.xz, .parquet.zst, .parquet.z, .parquet.snappy, .parquet.s2, .parquet.lz4
XLSX .xlsx .xlsx.gz, .xlsx.bz2, .xlsx.xz, .xlsx.zst, .xlsx.z, .xlsx.snappy, .xlsx.s2, .xlsx.lz4
ACH .ach Compression not supported

ACH (NACHA) Support - Experimental

Warning: ACH file support is experimental. The API may change or be removed in future versions.

The ach subpackage provides support for parsing ACH (Automated Clearing House) files following the NACHA format. ACH files are converted to multiple TableData structures for SQL querying.

Supported Tables
Table Name Description
file_header File header information (immediate destination, origin, etc.)
batches Batch header and control information
entries Entry detail records (transactions)
addenda Standard addenda records (02, 05, 98, 99, etc.)
iat_entries IAT (International ACH Transaction) entry details
iat_addenda IAT addenda records (10-18, 98, 99)
Limitations

Read-only fields: The following fields are exposed for viewing, but changes to them are not written back to ACH files:

  • IAT Addenda sequence numbers (entry_detail_sequence_number, sequence_number)

Addenda05 index behavior: When an entry has multiple addenda types (e.g., Addenda02 + Addenda05), the addenda_index represents the position within all addenda for that entry, not the index within Addenda05 specifically. For updates, use addenda_type = '05' to filter correctly.

Validation: Modifying ACH data via SQL may create invalid ACH files. The moov-io/ach library's Create() method will validate the file, but users should ensure data consistency (e.g., AddendaRecordIndicator matches actual addenda presence).

Usage
import "github.com/nao1215/fileparser/ach"

// Parse ACH file
file, err := os.Open("payment.ach")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

tableSet, err := ach.Parse(file)
if err != nil {
    log.Fatal(err)
}

// Access tables
entries := tableSet.GetEntriesTable()
fmt.Printf("Found %d entries\n", len(entries.Records))

Compression Formats

Format Extension Library
gzip .gz compress/gzip (standard library)
bzip2 .bz2 compress/bzip2 (standard library)
xz .xz github.com/ulikunitz/xz
zstd .zst github.com/klauspost/compress/zstd
zlib .z compress/zlib (standard library)
Snappy .snappy github.com/klauspost/compress/snappy
S2 .s2 github.com/klauspost/compress/s2
LZ4 .lz4 github.com/pierrec/lz4/v4
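
As a minimal sketch of how compression fits into parsing (assuming, as in the Parse example in the documentation below, that Parse receives the still-compressed stream and the FileType tells it how to decompress), a gzip-compressed CSV held in memory can be parsed like this using the standard library's bytes and compress/gzip packages:

// Compress a small CSV in memory.
var buf bytes.Buffer
gz := gzip.NewWriter(&buf)
if _, err := gz.Write([]byte("name,age\nAlice,30\nBob,25\n")); err != nil {
    log.Fatal(err)
}
if err := gz.Close(); err != nil {
    log.Fatal(err)
}

// Pass the compressed stream; CSVGZ instructs the parser to gunzip it first.
result, err := fileparser.Parse(&buf, fileparser.CSVGZ)
if err != nil {
    log.Fatal(err)
}
fmt.Println("Headers:", result.Headers)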

Column Types

The parser automatically infers column types based on the data:

Type Description
TypeText String/text data
TypeInteger Integer numbers
TypeReal Floating-point numbers
TypeDatetime Date and time values

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Documentation

Overview

Package fileparser provides file parsing functionality for various tabular data formats. It supports CSV, TSV, LTSV, XLSX, and Parquet files, with optional compression (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4).

This package can be used by filesql, fileprep, fileframe, or any application that needs to parse tabular data files.

Memory Considerations

All parsing functions in this package load the entire dataset into memory. This design is intentional for simplicity and compatibility with formats that require random access (Parquet, XLSX), but has implications for large files:

  • CSV/TSV/LTSV: Entire file content is read into memory
  • XLSX: Entire workbook is loaded (Excel files can be large even with few rows)
  • Parquet: Entire file is read into memory for random access

For files larger than available memory, consider:

  • Using streaming APIs for CSV/TSV (see the sketch after this list)
  • Pre-filtering or splitting large files before processing
  • Increasing available memory for the process
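
For illustration, here is a minimal streaming sketch using only the standard library's encoding/csv package; it bypasses fileparser entirely and processes one row at a time ("large.csv" is a hypothetical path):

f, err := os.Open("large.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

r := csv.NewReader(f)
header, err := r.Read() // first row as header
if err != nil {
    log.Fatal(err)
}
fmt.Println("Columns:", header)

for {
    record, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    _ = record // process one row at a time instead of holding the whole file
}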

Example usage

f, err := os.Open("data.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

result, err := fileparser.Parse(f, fileparser.CSV)
if err != nil {
    log.Fatal(err)
}
fmt.Println("Columns:", result.Headers)
fmt.Println("Rows:", len(result.Records))

Type Conversion

Use ParseValue to convert string records to typed Go values based on ColumnType.
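
Continuing the sketch above, a whole record can be converted in one pass (illustrative only):

row := result.Records[0]
typed := make([]any, len(row))
for i, v := range row {
    // ParseValue yields int64, float64, string, or nil depending on the column type.
    typed[i] = fileparser.ParseValue(v, result.ColumnTypes[i])
}
fmt.Println(typed)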


Constants

const (
	ExtCSV     = ".csv"
	ExtTSV     = ".tsv"
	ExtLTSV    = ".ltsv"
	ExtParquet = ".parquet"
	ExtXLSX    = ".xlsx"
	ExtGZ      = ".gz"
	ExtBZ2     = ".bz2"
	ExtXZ      = ".xz"
	ExtZSTD    = ".zst"
	ExtZLIB    = ".z"
	ExtSNAPPY  = ".snappy"
	ExtS2      = ".s2"
	ExtLZ4     = ".lz4"
)

File extensions
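
As a small illustration (the composed path is hypothetical), the extension constants can be combined to build paths that DetectFileType recognizes:

path := "data" + fileparser.ExtCSV + fileparser.ExtGZ // "data.csv.gz"
fmt.Println(fileparser.DetectFileType(path))          // CSV (gzip)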

Variables

This section is empty.

Functions

func IsCompressed

func IsCompressed(ft FileType) bool

IsCompressed returns true if the file type is compressed.

Example
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.Parquet,
		fileparser.ParquetZSTD,
	}

	for _, ft := range types {
		fmt.Printf("%s compressed: %v\n", ft, fileparser.IsCompressed(ft))
	}
}
Output:

CSV compressed: false
CSV (gzip) compressed: true
Parquet compressed: false
Parquet (zstd) compressed: true

func ParseValue

func ParseValue(value string, colType ColumnType) any

ParseValue converts a string value to the appropriate Go type based on ColumnType. This function is useful for converting string records from TableData to typed values.

Conversion rules:

  • TypeInteger: returns int64, or original string if parsing fails
  • TypeReal: returns float64, or original string if parsing fails
  • TypeDatetime: returns string (caller can parse with time.Parse if needed)
  • TypeText: returns string as-is
  • Empty values return nil

Example
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	// Integer column
	intVal := fileparser.ParseValue("42", fileparser.TypeInteger)
	fmt.Printf("Integer: %v (%T)\n", intVal, intVal)

	// Real column
	realVal := fileparser.ParseValue("3.14", fileparser.TypeReal)
	fmt.Printf("Real: %v (%T)\n", realVal, realVal)

	// Text column
	textVal := fileparser.ParseValue("hello", fileparser.TypeText)
	fmt.Printf("Text: %v (%T)\n", textVal, textVal)

	// Empty value returns nil
	nilVal := fileparser.ParseValue("", fileparser.TypeInteger)
	fmt.Printf("Empty: %v\n", nilVal)
}
Output:

Integer: 42 (int64)
Real: 3.14 (float64)
Text: hello (string)
Empty: <nil>

Types

type ColumnType

type ColumnType int

ColumnType represents the inferred type of a column.

const (
	// TypeText represents text/string column type.
	TypeText ColumnType = iota
	// TypeInteger represents integer column type.
	TypeInteger
	// TypeReal represents floating-point column type.
	TypeReal
	// TypeDatetime represents datetime column type.
	TypeDatetime
)

func (ColumnType) String

func (ct ColumnType) String() string

String returns the string representation of ColumnType.
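
As the column-type outputs earlier in this document show, the string forms are the SQL-style type names; a quick illustration:

fmt.Println(fileparser.TypeText)     // TEXT
fmt.Println(fileparser.TypeInteger)  // INTEGER
fmt.Println(fileparser.TypeReal)     // REAL
fmt.Println(fileparser.TypeDatetime) // DATETIME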

type FileType

type FileType int

FileType represents supported file types including compression variants.

const (
	// CSV represents CSV file type.
	CSV FileType = iota
	// TSV represents TSV file type.
	TSV
	// LTSV represents LTSV (Labeled Tab-Separated Values) file type.
	LTSV
	// Parquet represents Apache Parquet file type.
	Parquet
	// XLSX represents Excel XLSX file type.
	XLSX

	// CSVGZ represents gzip-compressed CSV file type.
	CSVGZ
	// CSVBZ2 represents bzip2-compressed CSV file type.
	CSVBZ2
	// CSVXZ represents xz-compressed CSV file type.
	CSVXZ
	// CSVZSTD represents zstd-compressed CSV file type.
	CSVZSTD

	// TSVGZ represents gzip-compressed TSV file type.
	TSVGZ
	// TSVBZ2 represents bzip2-compressed TSV file type.
	TSVBZ2
	// TSVXZ represents xz-compressed TSV file type.
	TSVXZ
	// TSVZSTD represents zstd-compressed TSV file type.
	TSVZSTD

	// LTSVGZ represents gzip-compressed LTSV file type.
	LTSVGZ
	// LTSVBZ2 represents bzip2-compressed LTSV file type.
	LTSVBZ2
	// LTSVXZ represents xz-compressed LTSV file type.
	LTSVXZ
	// LTSVZSTD represents zstd-compressed LTSV file type.
	LTSVZSTD

	// ParquetGZ represents gzip-compressed Parquet file type.
	ParquetGZ
	// ParquetBZ2 represents bzip2-compressed Parquet file type.
	ParquetBZ2
	// ParquetXZ represents xz-compressed Parquet file type.
	ParquetXZ
	// ParquetZSTD represents zstd-compressed Parquet file type.
	ParquetZSTD

	// XLSXGZ represents gzip-compressed XLSX file type.
	XLSXGZ
	// XLSXBZ2 represents bzip2-compressed XLSX file type.
	XLSXBZ2
	// XLSXXZ represents xz-compressed XLSX file type.
	XLSXXZ
	// XLSXZSTD represents zstd-compressed XLSX file type.
	XLSXZSTD

	// CSVZLIB represents zlib-compressed CSV file type.
	CSVZLIB
	// TSVZLIB represents zlib-compressed TSV file type.
	TSVZLIB
	// LTSVZLIB represents zlib-compressed LTSV file type.
	LTSVZLIB
	// ParquetZLIB represents zlib-compressed Parquet file type.
	ParquetZLIB
	// XLSXZLIB represents zlib-compressed XLSX file type.
	XLSXZLIB

	// CSVSNAPPY represents snappy-compressed CSV file type.
	CSVSNAPPY
	// TSVSNAPPY represents snappy-compressed TSV file type.
	TSVSNAPPY
	// LTSVSNAPPY represents snappy-compressed LTSV file type.
	LTSVSNAPPY
	// ParquetSNAPPY represents snappy-compressed Parquet file type.
	ParquetSNAPPY
	// XLSXSNAPPY represents snappy-compressed XLSX file type.
	XLSXSNAPPY

	// CSVS2 represents s2-compressed CSV file type.
	CSVS2
	// TSVS2 represents s2-compressed TSV file type.
	TSVS2
	// LTSVS2 represents s2-compressed LTSV file type.
	LTSVS2
	// ParquetS2 represents s2-compressed Parquet file type.
	ParquetS2
	// XLSXS2 represents s2-compressed XLSX file type.
	XLSXS2

	// CSVLZ4 represents lz4-compressed CSV file type.
	CSVLZ4
	// TSVLZ4 represents lz4-compressed TSV file type.
	TSVLZ4
	// LTSVLZ4 represents lz4-compressed LTSV file type.
	LTSVLZ4
	// ParquetLZ4 represents lz4-compressed Parquet file type.
	ParquetLZ4
	// XLSXLZ4 represents lz4-compressed XLSX file type.
	XLSXLZ4

	// Unsupported represents unsupported file type.
	Unsupported
)

func BaseFileType

func BaseFileType(ft FileType) FileType

BaseFileType returns the base file type without compression.

Example
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.TSVBZ2,
		fileparser.ParquetZSTD,
	}

	for _, ft := range types {
		base := fileparser.BaseFileType(ft)
		fmt.Printf("%s -> %s\n", ft, base)
	}
}
Output:

CSV -> CSV
CSV (gzip) -> CSV
TSV (bzip2) -> TSV
Parquet (zstd) -> Parquet

func DetectFileType

func DetectFileType(path string) FileType

DetectFileType detects file type from path extension, including compression variants.

Example
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	paths := []string{
		"data.csv",
		"data.csv.gz",
		"report.xlsx",
		"logs.ltsv.zst",
		"analytics.parquet",
	}

	for _, path := range paths {
		ft := fileparser.DetectFileType(path)
		fmt.Printf("%s -> %s\n", path, ft)
	}
}
Output:

data.csv -> CSV
data.csv.gz -> CSV (gzip)
report.xlsx -> XLSX
logs.ltsv.zst -> LTSV (zstd)
analytics.parquet -> Parquet

func (FileType) String

func (ft FileType) String() string

String returns a human-readable string representation of the FileType.
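
The strings match those shown in the examples above, with the compression variant in parentheses; for instance:

fmt.Println(fileparser.CSV)         // CSV
fmt.Println(fileparser.CSVGZ)       // CSV (gzip)
fmt.Println(fileparser.TSVBZ2)      // TSV (bzip2)
fmt.Println(fileparser.ParquetZSTD) // Parquet (zstd)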

type TableData

type TableData struct {
	// Headers contains the column names in order.
	Headers []string
	// Records contains the data rows. Each record is a slice of string values.
	Records [][]string
	// ColumnTypes contains the inferred types for each column.
	// The length matches Headers.
	ColumnTypes []ColumnType
}

TableData contains the parsed data from a file.

Example (ColumnTypes)
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `id,name,score,date
1,Alice,85.5,2024-01-15
2,Bob,92.0,2024-01-16
3,Charlie,78.5,2024-01-17`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	for i, header := range result.Headers {
		fmt.Printf("%s: %s\n", header, result.ColumnTypes[i])
	}
}
Output:

id: INTEGER
name: TEXT
score: REAL
date: DATETIME

func Parse

func Parse(reader io.Reader, fileType FileType) (result *TableData, err error)

Parse reads data from an io.Reader and returns parsed results. The fileType parameter specifies the format and compression of the data.

Example:

f, _ := os.Open("data.csv.gz")
defer f.Close()
result, err := fileparser.Parse(f, fileparser.CSVGZ)

Example (Csv)
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `name,age,score
Alice,30,85.5
Bob,25,92.0
Charlie,35,78.5`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
	fmt.Println("First row:", result.Records[0])
}
Output:

Headers: [name age score]
Records: 3
First row: [Alice 30 85.5]

Example (Ltsv)
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	ltsvData := `host:192.168.1.1	method:GET	path:/index.html
host:192.168.1.2	method:POST	path:/api/users`

	result, err := fileparser.Parse(strings.NewReader(ltsvData), fileparser.LTSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("First row:", result.Records[0])
}
Output:

Headers: [host method path]
First row: [192.168.1.1 GET /index.html]

Example (Tsv)
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	tsvData := `id	product	price
1	Laptop	999.99
2	Mouse	29.99
3	Keyboard	79.99`

	result, err := fileparser.Parse(strings.NewReader(tsvData), fileparser.TSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
}
Output:

Headers: [id product price]
Records: 3

Directories

Path Synopsis
ach  Package ach provides bidirectional conversion between ACH files and TableData.
