fileparser

package module

v0.3.1 (latest) Published: Dec 14, 2025 License: MIT Imports: 23 Imported by: 0

Repository

github.com/nao1215/fileparser

README ¶

fileparser

fileparser is a Go library for parsing various tabular data formats. It provides a unified interface for reading CSV, TSV, LTSV, Parquet, and XLSX files, with optional compression support.

This package is designed to be used by filesql, fileprep, and fileframe.

fileprep: struct-tag preprocessing and validation for CSV/TSV/LTSV, Parquet, Excel.
filesql: SQL driver for CSV, TSV, LTSV, Parquet, Excel with compression support.
fileframe: DataFrame API for CSV/TSV/LTSV, Parquet, Excel.

Features

Multiple formats: CSV, TSV, LTSV, Parquet, XLSX
Compression support: gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4
Type inference: Automatically detects column types (TEXT, INTEGER, REAL, DATETIME)
File type detection: Detects file format from path extension
Pure Go: No CGO required for any compression format

Installation

go get github.com/nao1215/fileparser

Usage

Parsing CSV

csvData := `name,age,score
Alice,30,85.5
Bob,25,92.0
Charlie,35,78.5`

result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("Records:", len(result.Records))
fmt.Println("First row:", result.Records[0])

Output:

Headers: [name age score]
Records: 3
First row: [Alice 30 85.5]

Parsing TSV

tsvData := `id	product	price
1	Laptop	999.99
2	Mouse	29.99
3	Keyboard	79.99`

result, err := fileparser.Parse(strings.NewReader(tsvData), fileparser.TSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("Records:", len(result.Records))

Output:

Headers: [id product price]
Records: 3

Parsing LTSV

ltsvData := `host:192.168.1.1	method:GET	path:/index.html
host:192.168.1.2	method:POST	path:/api/users`

result, err := fileparser.Parse(strings.NewReader(ltsvData), fileparser.LTSV)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Headers:", result.Headers)
fmt.Println("First row:", result.Records[0])

Output:

Headers: [host method path]
First row: [192.168.1.1 GET /index.html]
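
To make the LTSV layout concrete, the following stdlib-only sketch splits one line into labels and values. It is an illustration of the format, not the library's parser; `parseLTSVLine` is a hypothetical helper, and splitting each field at its first colon is an assumption that keeps values such as `/index.html` or IP addresses intact.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLTSVLine is a hypothetical helper: it splits one LTSV line into
// labels and values, preserving field order.
func parseLTSVLine(line string) (labels, values []string, err error) {
	for _, field := range strings.Split(line, "\t") {
		// Cut at the first colon only, so values may contain colons.
		label, value, ok := strings.Cut(field, ":")
		if !ok {
			return nil, nil, fmt.Errorf("field %q has no label separator", field)
		}
		labels = append(labels, label)
		values = append(values, value)
	}
	return labels, values, nil
}

func main() {
	labels, values, err := parseLTSVLine("host:192.168.1.1\tmethod:GET\tpath:/index.html")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Labels:", labels)
	fmt.Println("Values:", values)
}
```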

Auto-detect File Type

paths := []string{
    "data.csv",
    "data.csv.gz",
    "report.xlsx",
    "logs.ltsv.zst",
    "analytics.parquet",
}

for _, path := range paths {
    ft := fileparser.DetectFileType(path)
    fmt.Printf("%s -> %s\n", path, ft)
}

Output:

data.csv -> CSV
data.csv.gz -> CSV (gzip)
report.xlsx -> XLSX
logs.ltsv.zst -> LTSV (zstd)
analytics.parquet -> Parquet
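
Extension-based detection of this kind can be sketched with the standard library alone. The code below is not the library's implementation: `describeFileType`, both lookup tables, the lowercasing of the path, and the "Unsupported" label are assumptions made for illustration. It peels off a compression extension first, then looks up the format extension.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative lookup tables built from the extensions listed in this README.
var compressionExts = map[string]string{
	".gz": "gzip", ".bz2": "bzip2", ".xz": "xz", ".zst": "zstd",
	".z": "zlib", ".snappy": "snappy", ".s2": "s2", ".lz4": "lz4",
}

var formatExts = map[string]string{
	".csv": "CSV", ".tsv": "TSV", ".ltsv": "LTSV",
	".parquet": "Parquet", ".xlsx": "XLSX",
}

// describeFileType is a hypothetical helper returning a label such as
// "CSV (gzip)", or "Unsupported" when the format extension is unknown.
func describeFileType(path string) string {
	lower := strings.ToLower(path) // Assumption: matching is case-insensitive.
	ext := filepath.Ext(lower)
	compression := ""
	if name, ok := compressionExts[ext]; ok {
		compression = name
		lower = strings.TrimSuffix(lower, ext)
		ext = filepath.Ext(lower)
	}
	format, ok := formatExts[ext]
	if !ok {
		return "Unsupported"
	}
	if compression != "" {
		return format + " (" + compression + ")"
	}
	return format
}

func main() {
	for _, path := range []string{"data.csv.gz", "logs.ltsv.zst", "notes.txt"} {
		fmt.Printf("%s -> %s\n", path, describeFileType(path))
	}
}
```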

Check Compression

types := []fileparser.FileType{
    fileparser.CSV,
    fileparser.CSVGZ,
    fileparser.Parquet,
    fileparser.ParquetZSTD,
}

for _, ft := range types {
    fmt.Printf("%s compressed: %v\n", ft, fileparser.IsCompressed(ft))
}

Output:

CSV compressed: false
CSV (gzip) compressed: true
Parquet compressed: false
Parquet (zstd) compressed: true

Get Base File Type

types := []fileparser.FileType{
    fileparser.CSV,
    fileparser.CSVGZ,
    fileparser.TSVBZ2,
    fileparser.ParquetZSTD,
}

for _, ft := range types {
    base := fileparser.BaseFileType(ft)
    fmt.Printf("%s -> %s\n", ft, base)
}

Output:

CSV -> CSV
CSV (gzip) -> CSV
TSV (bzip2) -> TSV
Parquet (zstd) -> Parquet

Convert String Values to Typed Values

// Integer column
intVal := fileparser.ParseValue("42", fileparser.TypeInteger)
fmt.Printf("Integer: %v (%T)\n", intVal, intVal)

// Real column
realVal := fileparser.ParseValue("3.14", fileparser.TypeReal)
fmt.Printf("Real: %v (%T)\n", realVal, realVal)

// Text column
textVal := fileparser.ParseValue("hello", fileparser.TypeText)
fmt.Printf("Text: %v (%T)\n", textVal, textVal)

// Empty value returns nil
nilVal := fileparser.ParseValue("", fileparser.TypeInteger)
fmt.Printf("Empty: %v\n", nilVal)

Output:

Integer: 42 (int64)
Real: 3.14 (float64)
Text: hello (string)
Empty: <nil>

Automatic Column Type Inference

csvData := `id,name,score,date
1,Alice,85.5,2024-01-15
2,Bob,92.0,2024-01-16
3,Charlie,78.5,2024-01-17`

result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
if err != nil {
    log.Fatal(err)
}

for i, header := range result.Headers {
    fmt.Printf("%s: %s\n", header, result.ColumnTypes[i])
}

Output:

id: INTEGER
name: TEXT
score: REAL
date: DATETIME
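
Records are stored as strings, so values in a DATETIME column still need to be parsed by the caller when a `time.Time` is required. A minimal sketch using the standard `time` package follows; `parseDate` is a hypothetical helper, and the `2006-01-02` layout is an assumption that matches the sample data above, so other date formats would need their own layout.

```go
package main

import (
	"fmt"
	"time"
)

// parseDate is a hypothetical helper that converts a date string such as
// "2024-01-15" into a time.Time using the assumed layout.
func parseDate(value string) (time.Time, error) {
	return time.Parse("2006-01-02", value)
}

func main() {
	for _, raw := range []string{"2024-01-15", "2024-01-16", "not-a-date"} {
		parsed, err := parseDate(raw)
		if err != nil {
			fmt.Printf("%s: cannot parse (%v)\n", raw, err)
			continue
		}
		fmt.Printf("%s -> year %d, day of year %d\n", raw, parsed.Year(), parsed.YearDay())
	}
}
```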

Supported File Types

Format	Extension	Compressed Variants
CSV	`.csv`	`.csv.gz`, `.csv.bz2`, `.csv.xz`, `.csv.zst`, `.csv.z`, `.csv.snappy`, `.csv.s2`, `.csv.lz4`
TSV	`.tsv`	`.tsv.gz`, `.tsv.bz2`, `.tsv.xz`, `.tsv.zst`, `.tsv.z`, `.tsv.snappy`, `.tsv.s2`, `.tsv.lz4`
LTSV	`.ltsv`	`.ltsv.gz`, `.ltsv.bz2`, `.ltsv.xz`, `.ltsv.zst`, `.ltsv.z`, `.ltsv.snappy`, `.ltsv.s2`, `.ltsv.lz4`
Parquet	`.parquet`	`.parquet.gz`, `.parquet.bz2`, `.parquet.xz`, `.parquet.zst`, `.parquet.z`, `.parquet.snappy`, `.parquet.s2`, `.parquet.lz4`
XLSX	`.xlsx`	`.xlsx.gz`, `.xlsx.bz2`, `.xlsx.xz`, `.xlsx.zst`, `.xlsx.z`, `.xlsx.snappy`, `.xlsx.s2`, `.xlsx.lz4`
ACH	`.ach`	Not supported

ACH (NACHA) Support - Experimental

Warning: ACH file support is experimental. The API may change or be removed in future versions.

The ach subpackage provides support for parsing ACH (Automated Clearing House) files following the NACHA format. ACH files are converted to multiple TableData structures for SQL querying.

Supported Tables

Table Name	Description
`file_header`	File header information (immediate destination, origin, etc.)
`batches`	Batch header and control information
`entries`	Entry detail records (transactions)
`addenda`	Standard addenda records (02, 05, 98, 99, etc.)
`iat_entries`	IAT (International ACH Transaction) entry details
`iat_addenda`	IAT addenda records (10-18, 98, 99)

Limitations

Read-only fields: The following fields are exported for viewing but changes are not written back to ACH files:

IAT Addenda sequence numbers (entry_detail_sequence_number, sequence_number)

Addenda05 index behavior: When an entry has multiple addenda types (e.g., Addenda02 + Addenda05), the addenda_index represents the position within all addenda for that entry, not the index within Addenda05 specifically. For updates, use addenda_type = '05' to filter correctly.

Validation: Modifying ACH data via SQL may create invalid ACH files. The moov-io/ach library's Create() method will validate the file, but users should ensure data consistency (e.g., AddendaRecordIndicator matches actual addenda presence).

Usage

import "github.com/nao1215/fileparser/ach"

// Parse ACH file
file, err := os.Open("payment.ach")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

tableSet, err := ach.Parse(file)
if err != nil {
    log.Fatal(err)
}

// Access tables
entries := tableSet.GetEntriesTable()
fmt.Printf("Found %d entries\n", len(entries.Records))

Compression Formats

Format	Extension	Library
gzip	`.gz`	`compress/gzip` (standard library)
bzip2	`.bz2`	`compress/bzip2` (standard library)
xz	`.xz`	`github.com/ulikunitz/xz`
zstd	`.zst`	`github.com/klauspost/compress/zstd`
zlib	`.z`	`compress/zlib` (standard library)
Snappy	`.snappy`	`github.com/klauspost/compress/snappy`
S2	`.s2`	`github.com/klauspost/compress/s2`
LZ4	`.lz4`	`github.com/pierrec/lz4/v4`
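
The general pattern behind compressed input is to wrap the raw reader in a decompressing reader before handing it to the format parser. The sketch below shows that pattern for gzip and CSV using only the standard library; `gzipBytes` and `readGzipCSV` are hypothetical helpers, and the code is not taken from this package.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
	"fmt"
)

// gzipBytes compresses data in memory so the example needs no files.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readGzipCSV decompresses the payload on the fly and parses it as CSV.
func readGzipCSV(compressed []byte) ([][]string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return csv.NewReader(zr).ReadAll()
}

func main() {
	compressed, err := gzipBytes([]byte("name,age\nAlice,30\nBob,25\n"))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	rows, err := readGzipCSV(compressed)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Header:", rows[0])
	fmt.Println("Data rows:", len(rows)-1)
}
```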

Column Types

The parser automatically infers column types based on the data:

Type	Description
`TypeText`	String/text data
`TypeInteger`	Integer numbers
`TypeReal`	Floating-point numbers
`TypeDatetime`	Date and time values
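
One plausible way to infer a column type is to test every non-empty value against progressively looser parsers. The sketch below is a simplified illustration and not the library's algorithm: `inferColumnType` is hypothetical, it recognizes only the `2006-01-02` date layout, and skipping empty cells is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// inferColumnType is a hypothetical, simplified inference routine. It returns
// "INTEGER", "REAL", "DATETIME", or "TEXT" for a column of string values.
func inferColumnType(values []string) string {
	isInt, isReal, isDate := true, true, true
	seen := false
	for _, v := range values {
		if v == "" {
			continue // Assumption: empty cells do not influence the result.
		}
		seen = true
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isReal = false
		}
		if _, err := time.Parse("2006-01-02", v); err != nil {
			isDate = false
		}
	}
	switch {
	case !seen:
		return "TEXT"
	case isInt:
		return "INTEGER"
	case isReal:
		return "REAL"
	case isDate:
		return "DATETIME"
	default:
		return "TEXT"
	}
}

func main() {
	columns := []struct {
		name   string
		values []string
	}{
		{"id", []string{"1", "2", "3"}},
		{"name", []string{"Alice", "Bob", "Charlie"}},
		{"score", []string{"85.5", "92.0", "78.5"}},
		{"date", []string{"2024-01-15", "2024-01-16", "2024-01-17"}},
	}
	for _, col := range columns {
		fmt.Printf("%s: %s\n", col.name, inferColumnType(col.values))
	}
}
```

Checking the narrowest type first means a column of whole numbers is reported as INTEGER even though each value would also parse as a float.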

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Documentation ¶

Overview ¶

Package fileparser provides file parsing functionality for various tabular data formats. It supports CSV, TSV, LTSV, XLSX, and Parquet files, with optional compression (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4).

This package can be used by filesql, fileprep, fileframe, or any application that needs to parse tabular data files.

Memory Considerations ¶

All parsing functions in this package load the entire dataset into memory. This design is intentional for simplicity and compatibility with formats that require random access (Parquet, XLSX), but has implications for large files:

CSV/TSV/LTSV: Entire file content is read into memory
XLSX: Entire workbook is loaded (Excel files can be large even with few rows)
Parquet: Entire file is read into memory for random access

For files larger than available memory, consider:

Using a streaming reader for CSV/TSV (for example, the standard library's encoding/csv) instead of loading the whole file
Pre-filtering or splitting large files before processing
Increasing available memory for the process
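
As an illustration of the streaming option, the sketch below reads a CSV source row by row with the standard library's `encoding/csv` and keeps only a running total in memory. It does not use this package; `sumColumn` and `sumColumnFromString` are hypothetical helpers, and the sample data mirrors the README example.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// sumColumn streams CSV rows from r and sums the named column without
// holding all rows in memory at once.
func sumColumn(r io.Reader, column string) (float64, error) {
	cr := csv.NewReader(r) // For TSV input, set cr.Comma = '\t'.
	header, err := cr.Read()
	if err != nil {
		return 0, fmt.Errorf("read header: %w", err)
	}
	idx := -1
	for i, name := range header {
		if name == column {
			idx = i
			break
		}
	}
	if idx < 0 {
		return 0, fmt.Errorf("column %q not found", column)
	}
	var total float64
	for {
		record, err := cr.Read()
		if errors.Is(err, io.EOF) {
			return total, nil
		}
		if err != nil {
			return 0, err
		}
		v, err := strconv.ParseFloat(record[idx], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", record[idx], err)
		}
		total += v
	}
}

// sumColumnFromString is a small convenience wrapper for in-memory data.
func sumColumnFromString(data, column string) (float64, error) {
	return sumColumn(strings.NewReader(data), column)
}

func main() {
	data := "name,age,score\nAlice,30,85.5\nBob,25,92.0\nCharlie,35,78.5\n"
	total, err := sumColumnFromString(data, "score")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Total score:", total)
}
```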

Example usage ¶

f, err := os.Open("data.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
result, err := fileparser.Parse(f, fileparser.CSV)
if err != nil {
    log.Fatal(err)
}
fmt.Println("Columns:", result.Headers)
fmt.Println("Rows:", len(result.Records))

Type Conversion ¶

Use ParseValue to convert string records to typed Go values based on ColumnType.

Index ¶

Constants
func IsCompressed(ft FileType) bool
func ParseValue(value string, colType ColumnType) any
type ColumnType
- func (ct ColumnType) String() string
type FileType
- func BaseFileType(ft FileType) FileType
- func DetectFileType(path string) FileType
- func (ft FileType) String() string
type TableData
- func Parse(reader io.Reader, fileType FileType) (result *TableData, err error)

Constants ¶

const (
	ExtCSV     = ".csv"
	ExtTSV     = ".tsv"
	ExtLTSV    = ".ltsv"
	ExtParquet = ".parquet"
	ExtXLSX    = ".xlsx"
	ExtGZ      = ".gz"
	ExtBZ2     = ".bz2"
	ExtXZ      = ".xz"
	ExtZSTD    = ".zst"
	ExtZLIB    = ".z"
	ExtSNAPPY  = ".snappy"
	ExtS2      = ".s2"
	ExtLZ4     = ".lz4"
)

File extensions

Variables ¶

This section is empty.

Functions ¶

func IsCompressed ¶

func IsCompressed(ft FileType) bool

IsCompressed returns true if the file type is compressed.

Example ¶

package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.Parquet,
		fileparser.ParquetZSTD,
	}

	for _, ft := range types {
		fmt.Printf("%s compressed: %v\n", ft, fileparser.IsCompressed(ft))
	}
}

Output:

CSV compressed: false
CSV (gzip) compressed: true
Parquet compressed: false
Parquet (zstd) compressed: true

func ParseValue ¶

func ParseValue(value string, colType ColumnType) any

ParseValue converts a string value to the appropriate Go type based on ColumnType. This function is useful for converting string records from TableData to typed values.

Conversion rules:

TypeInteger: returns int64, or original string if parsing fails
TypeReal: returns float64, or original string if parsing fails
TypeDatetime: returns string (caller can parse with time.Parse if needed)
TypeText: returns string as-is
Empty values return nil
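
Because the result may be an int64, a float64, a string (including the fallback when numeric parsing fails), or nil, callers typically need a type switch before doing arithmetic. The sketch below shows one way to handle those shapes; `asFloat` is a hypothetical helper, and the literals in `main` merely stand in for values a conversion step might return.

```go
package main

import "fmt"

// asFloat is a hypothetical helper: it widens int64 and float64 values to
// float64 and reports false for strings, nil, and any other type.
func asFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int64:
		return float64(n), true
	case float64:
		return n, true
	default:
		return 0, false
	}
}

func main() {
	// Stand-ins for converted values: integer, real, failed parse, empty.
	inputs := []any{int64(42), 3.14, "n/a", nil}
	for _, in := range inputs {
		f, ok := asFloat(in)
		fmt.Printf("%v -> %v (numeric: %v)\n", in, f, ok)
	}
}
```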

Example ¶

package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	// Integer column
	intVal := fileparser.ParseValue("42", fileparser.TypeInteger)
	fmt.Printf("Integer: %v (%T)\n", intVal, intVal)

	// Real column
	realVal := fileparser.ParseValue("3.14", fileparser.TypeReal)
	fmt.Printf("Real: %v (%T)\n", realVal, realVal)

	// Text column
	textVal := fileparser.ParseValue("hello", fileparser.TypeText)
	fmt.Printf("Text: %v (%T)\n", textVal, textVal)

	// Empty value returns nil
	nilVal := fileparser.ParseValue("", fileparser.TypeInteger)
	fmt.Printf("Empty: %v\n", nilVal)
}

Output:

Integer: 42 (int64)
Real: 3.14 (float64)
Text: hello (string)
Empty: <nil>

Types ¶

type ColumnType ¶

type ColumnType int

ColumnType represents the inferred type of a column.

const (
	// TypeText represents text/string column type.
	TypeText ColumnType = iota
	// TypeInteger represents integer column type.
	TypeInteger
	// TypeReal represents floating-point column type.
	TypeReal
	// TypeDatetime represents datetime column type.
	TypeDatetime
)

func (ColumnType) String ¶

func (ct ColumnType) String() string

String returns the string representation of ColumnType.

type FileType ¶

type FileType int

FileType represents supported file types including compression variants.

const (
	// CSV represents CSV file type.
	CSV FileType = iota
	// TSV represents TSV file type.
	TSV
	// LTSV represents LTSV (Labeled Tab-Separated Values) file type.
	LTSV
	// Parquet represents Apache Parquet file type.
	Parquet
	// XLSX represents Excel XLSX file type.
	XLSX

	// CSVGZ represents gzip-compressed CSV file type.
	CSVGZ
	// CSVBZ2 represents bzip2-compressed CSV file type.
	CSVBZ2
	// CSVXZ represents xz-compressed CSV file type.
	CSVXZ
	// CSVZSTD represents zstd-compressed CSV file type.
	CSVZSTD

	// TSVGZ represents gzip-compressed TSV file type.
	TSVGZ
	// TSVBZ2 represents bzip2-compressed TSV file type.
	TSVBZ2
	// TSVXZ represents xz-compressed TSV file type.
	TSVXZ
	// TSVZSTD represents zstd-compressed TSV file type.
	TSVZSTD

	// LTSVGZ represents gzip-compressed LTSV file type.
	LTSVGZ
	// LTSVBZ2 represents bzip2-compressed LTSV file type.
	LTSVBZ2
	// LTSVXZ represents xz-compressed LTSV file type.
	LTSVXZ
	// LTSVZSTD represents zstd-compressed LTSV file type.
	LTSVZSTD

	// ParquetGZ represents gzip-compressed Parquet file type.
	ParquetGZ
	// ParquetBZ2 represents bzip2-compressed Parquet file type.
	ParquetBZ2
	// ParquetXZ represents xz-compressed Parquet file type.
	ParquetXZ
	// ParquetZSTD represents zstd-compressed Parquet file type.
	ParquetZSTD

	// XLSXGZ represents gzip-compressed XLSX file type.
	XLSXGZ
	// XLSXBZ2 represents bzip2-compressed XLSX file type.
	XLSXBZ2
	// XLSXXZ represents xz-compressed XLSX file type.
	XLSXXZ
	// XLSXZSTD represents zstd-compressed XLSX file type.
	XLSXZSTD

	// CSVZLIB represents zlib-compressed CSV file type.
	CSVZLIB
	// TSVZLIB represents zlib-compressed TSV file type.
	TSVZLIB
	// LTSVZLIB represents zlib-compressed LTSV file type.
	LTSVZLIB
	// ParquetZLIB represents zlib-compressed Parquet file type.
	ParquetZLIB
	// XLSXZLIB represents zlib-compressed XLSX file type.
	XLSXZLIB

	// CSVSNAPPY represents snappy-compressed CSV file type.
	CSVSNAPPY
	// TSVSNAPPY represents snappy-compressed TSV file type.
	TSVSNAPPY
	// LTSVSNAPPY represents snappy-compressed LTSV file type.
	LTSVSNAPPY
	// ParquetSNAPPY represents snappy-compressed Parquet file type.
	ParquetSNAPPY
	// XLSXSNAPPY represents snappy-compressed XLSX file type.
	XLSXSNAPPY

	// CSVS2 represents s2-compressed CSV file type.
	CSVS2
	// TSVS2 represents s2-compressed TSV file type.
	TSVS2
	// LTSVS2 represents s2-compressed LTSV file type.
	LTSVS2
	// ParquetS2 represents s2-compressed Parquet file type.
	ParquetS2
	// XLSXS2 represents s2-compressed XLSX file type.
	XLSXS2

	// CSVLZ4 represents lz4-compressed CSV file type.
	CSVLZ4
	// TSVLZ4 represents lz4-compressed TSV file type.
	TSVLZ4
	// LTSVLZ4 represents lz4-compressed LTSV file type.
	LTSVLZ4
	// ParquetLZ4 represents lz4-compressed Parquet file type.
	ParquetLZ4
	// XLSXLZ4 represents lz4-compressed XLSX file type.
	XLSXLZ4

	// Unsupported represents unsupported file type.
	Unsupported
)

func BaseFileType ¶

func BaseFileType(ft FileType) FileType

BaseFileType returns the base file type without compression.

Example ¶

package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.TSVBZ2,
		fileparser.ParquetZSTD,
	}

	for _, ft := range types {
		base := fileparser.BaseFileType(ft)
		fmt.Printf("%s -> %s\n", ft, base)
	}
}

Output:

CSV -> CSV
CSV (gzip) -> CSV
TSV (bzip2) -> TSV
Parquet (zstd) -> Parquet

func DetectFileType ¶

func DetectFileType(path string) FileType

DetectFileType detects file type from path extension, including compression variants.

Example ¶

package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	paths := []string{
		"data.csv",
		"data.csv.gz",
		"report.xlsx",
		"logs.ltsv.zst",
		"analytics.parquet",
	}

	for _, path := range paths {
		ft := fileparser.DetectFileType(path)
		fmt.Printf("%s -> %s\n", path, ft)
	}
}

Output:

data.csv -> CSV
data.csv.gz -> CSV (gzip)
report.xlsx -> XLSX
logs.ltsv.zst -> LTSV (zstd)
analytics.parquet -> Parquet

func (FileType) String ¶

func (ft FileType) String() string

String returns a human-readable string representation of the FileType.

type TableData ¶

type TableData struct {
	// Headers contains the column names in order.
	Headers []string
	// Records contains the data rows. Each record is a slice of string values.
	Records [][]string
	// ColumnTypes contains the inferred types for each column.
	// The length matches Headers.
	ColumnTypes []ColumnType
}

TableData contains the parsed data from a file.
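
Since headers and records are parallel slices, a common follow-up step is to pair each value with its column name. The sketch below does this with a local stand-in type so it runs without the package; `table` and `rowsAsMaps` are hypothetical names, and only the Headers and Records fields shown above are mirrored.

```go
package main

import "fmt"

// table is a local stand-in mirroring the Headers and Records fields.
type table struct {
	Headers []string
	Records [][]string
}

// rowsAsMaps is a hypothetical helper that keys each record's values by
// column name.
func rowsAsMaps(data table) []map[string]string {
	rows := make([]map[string]string, 0, len(data.Records))
	for _, record := range data.Records {
		row := make(map[string]string, len(data.Headers))
		for i, header := range data.Headers {
			if i < len(record) { // Guard against records shorter than the header.
				row[header] = record[i]
			}
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	data := table{
		Headers: []string{"name", "age", "score"},
		Records: [][]string{{"Alice", "30", "85.5"}, {"Bob", "25", "92.0"}},
	}
	for _, row := range rowsAsMaps(data) {
		fmt.Printf("%s is %s years old\n", row["name"], row["age"])
	}
}
```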

Example (ColumnTypes) ¶

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `id,name,score,date
1,Alice,85.5,2024-01-15
2,Bob,92.0,2024-01-16
3,Charlie,78.5,2024-01-17`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	for i, header := range result.Headers {
		fmt.Printf("%s: %s\n", header, result.ColumnTypes[i])
	}
}

Output:

id: INTEGER
name: TEXT
score: REAL
date: DATETIME

func Parse ¶

func Parse(reader io.Reader, fileType FileType) (result *TableData, err error)

Parse reads data from an io.Reader and returns parsed results. The fileType parameter specifies the format and compression of the data.

Example:

f, err := os.Open("data.csv.gz")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
result, err := fileparser.Parse(f, fileparser.CSVGZ)

Example (Csv) ¶

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `name,age,score
Alice,30,85.5
Bob,25,92.0
Charlie,35,78.5`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
	fmt.Println("First row:", result.Records[0])
}

Output:

Headers: [name age score]
Records: 3
First row: [Alice 30 85.5]

Example (Ltsv) ¶

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	ltsvData := `host:192.168.1.1	method:GET	path:/index.html
host:192.168.1.2	method:POST	path:/api/users`

	result, err := fileparser.Parse(strings.NewReader(ltsvData), fileparser.LTSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("First row:", result.Records[0])
}

Output:

Headers: [host method path]
First row: [192.168.1.1 GET /index.html]

Example (Tsv) ¶

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	tsvData := `id	product	price
1	Laptop	999.99
2	Mouse	29.99
3	Keyboard	79.99`

	result, err := fileparser.Parse(strings.NewReader(tsvData), fileparser.TSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
}

Output:

Headers: [id product price]
Records: 3

Directories ¶

Path	Synopsis
ach	Package ach provides bidirectional conversion between ACH files and TableData.
