Documentation ¶
Overview ¶
Package fileparser provides file parsing functionality for various tabular data formats. It supports CSV, TSV, LTSV, XLSX, and Parquet files, with optional compression (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4).
This package can be used by filesql, fileprep, fileframe, or any application that needs to parse tabular data files.
Memory Considerations ¶
All parsing functions in this package load the entire dataset into memory. This design is intentional for simplicity and compatibility with formats that require random access (Parquet, XLSX), but has implications for large files:
- CSV/TSV/LTSV: Entire file content is read into memory
- XLSX: Entire workbook is loaded (Excel files can be large even with few rows)
- Parquet: Entire file is read into memory for random access
For files larger than available memory, consider:
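- Using a streaming reader for CSV/TSV instead of this package (for example, the standard library's encoding/csv, which reads one record at a time)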
- Pre-filtering or splitting large files before processing
- Increasing available memory for the process
Example usage ¶
f, err := os.Open("data.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

result, err := fileparser.Parse(f, fileparser.CSV)
if err != nil {
	log.Fatal(err)
}
fmt.Println("Columns:", result.Headers)
fmt.Println("Rows:", len(result.Records))
Type Conversion ¶
Use ParseValue to convert string records to typed Go values based on ColumnType.
Index ¶
Examples ¶
Constants ¶
const (
	ExtCSV     = ".csv"
	ExtTSV     = ".tsv"
	ExtLTSV    = ".ltsv"
	ExtParquet = ".parquet"
	ExtXLSX    = ".xlsx"
	ExtGZ      = ".gz"
	ExtBZ2     = ".bz2"
	ExtXZ      = ".xz"
	ExtZSTD    = ".zst"
	ExtZLIB    = ".z"
	ExtSNAPPY  = ".snappy"
	ExtS2      = ".s2"
	ExtLZ4     = ".lz4"
)
File extensions
Variables ¶
This section is empty.
Functions ¶
func IsCompressed ¶
IsCompressed returns true if the file type is compressed.
Example ¶
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.Parquet,
		fileparser.ParquetZSTD,
	}
	for _, ft := range types {
		fmt.Printf("%s compressed: %v\n", ft, fileparser.IsCompressed(ft))
	}
}

Output:
CSV compressed: false
CSV (gzip) compressed: true
Parquet compressed: false
Parquet (zstd) compressed: true
func ParseValue ¶
func ParseValue(value string, colType ColumnType) any
ParseValue converts a string value to the appropriate Go type based on ColumnType. This function is useful for converting string records from TableData to typed values.
Conversion rules:
- TypeInteger: returns int64, or original string if parsing fails
- TypeReal: returns float64, or original string if parsing fails
- TypeDatetime: returns string (caller can parse with time.Parse if needed)
- TypeText: returns string as-is
- Empty values return nil
Example ¶
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	// Integer column
	intVal := fileparser.ParseValue("42", fileparser.TypeInteger)
	fmt.Printf("Integer: %v (%T)\n", intVal, intVal)

	// Real column
	realVal := fileparser.ParseValue("3.14", fileparser.TypeReal)
	fmt.Printf("Real: %v (%T)\n", realVal, realVal)

	// Text column
	textVal := fileparser.ParseValue("hello", fileparser.TypeText)
	fmt.Printf("Text: %v (%T)\n", textVal, textVal)

	// Empty value returns nil
	nilVal := fileparser.ParseValue("", fileparser.TypeInteger)
	fmt.Printf("Empty: %v\n", nilVal)
}

Output:
Integer: 42 (int64)
Real: 3.14 (float64)
Text: hello (string)
Empty: <nil>
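Because ParseValue returns any, callers usually need a type switch to consume the result. The sketch below shows one way to do that, including the fallback-to-string case and an optional time.Parse step for datetime strings. To stay self-contained it uses hand-built stand-in values rather than calling ParseValue; the helper name describe and the "2006-01-02" layout are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// describe reports how a caller might treat a value of the kinds listed in
// the conversion rules. It is a hypothetical helper, not part of fileparser.
func describe(v any) string {
	switch x := v.(type) {
	case nil:
		// Empty input values are documented to come back as nil.
		return "NULL"
	case int64:
		return fmt.Sprintf("integer %d", x)
	case float64:
		return fmt.Sprintf("real %g", x)
	case string:
		// Datetime values stay strings, so the caller parses them.
		// The date-only layout here is an assumption for this example.
		if t, err := time.Parse("2006-01-02", x); err == nil {
			return "date " + t.Format("Jan 2, 2006")
		}
		// Text, or a numeric column value that failed to parse.
		return fmt.Sprintf("text %q", x)
	default:
		return fmt.Sprintf("unexpected %T", x)
	}
}

func main() {
	// Stand-ins for converted values; real code would obtain them
	// from ParseValue.
	values := []any{int64(42), 3.14, "2024-01-15", "abc", nil}
	for _, v := range values {
		fmt.Println(describe(v))
	}
}
```

Handling the string case explicitly matters for integer and real columns too, since a value that fails to parse is returned as the original string rather than as an error.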
Types ¶
type ColumnType ¶
type ColumnType int
ColumnType represents the inferred type of a column.
const (
	// TypeText represents text/string column type.
	TypeText ColumnType = iota
	// TypeInteger represents integer column type.
	TypeInteger
	// TypeReal represents floating-point column type.
	TypeReal
	// TypeDatetime represents datetime column type.
	TypeDatetime
)
func (ColumnType) String ¶
func (ct ColumnType) String() string
String returns the string representation of ColumnType.
type FileType ¶
type FileType int
FileType represents supported file types including compression variants.
const (
	// CSV represents CSV file type.
	CSV FileType = iota
	// TSV represents TSV file type.
	TSV
	// LTSV represents LTSV (Labeled Tab-Separated Values) file type.
	LTSV
	// Parquet represents Apache Parquet file type.
	Parquet
	// XLSX represents Excel XLSX file type.
	XLSX
	// CSVGZ represents gzip-compressed CSV file type.
	CSVGZ
	// CSVBZ2 represents bzip2-compressed CSV file type.
	CSVBZ2
	// CSVXZ represents xz-compressed CSV file type.
	CSVXZ
	// CSVZSTD represents zstd-compressed CSV file type.
	CSVZSTD
	// TSVGZ represents gzip-compressed TSV file type.
	TSVGZ
	// TSVBZ2 represents bzip2-compressed TSV file type.
	TSVBZ2
	// TSVXZ represents xz-compressed TSV file type.
	TSVXZ
	// TSVZSTD represents zstd-compressed TSV file type.
	TSVZSTD
	// LTSVGZ represents gzip-compressed LTSV file type.
	LTSVGZ
	// LTSVBZ2 represents bzip2-compressed LTSV file type.
	LTSVBZ2
	// LTSVXZ represents xz-compressed LTSV file type.
	LTSVXZ
	// LTSVZSTD represents zstd-compressed LTSV file type.
	LTSVZSTD
	// ParquetGZ represents gzip-compressed Parquet file type.
	ParquetGZ
	// ParquetBZ2 represents bzip2-compressed Parquet file type.
	ParquetBZ2
	// ParquetXZ represents xz-compressed Parquet file type.
	ParquetXZ
	// ParquetZSTD represents zstd-compressed Parquet file type.
	ParquetZSTD
	// XLSXGZ represents gzip-compressed XLSX file type.
	XLSXGZ
	// XLSXBZ2 represents bzip2-compressed XLSX file type.
	XLSXBZ2
	// XLSXXZ represents xz-compressed XLSX file type.
	XLSXXZ
	// XLSXZSTD represents zstd-compressed XLSX file type.
	XLSXZSTD
	// CSVZLIB represents zlib-compressed CSV file type.
	CSVZLIB
	// TSVZLIB represents zlib-compressed TSV file type.
	TSVZLIB
	// LTSVZLIB represents zlib-compressed LTSV file type.
	LTSVZLIB
	// ParquetZLIB represents zlib-compressed Parquet file type.
	ParquetZLIB
	// XLSXZLIB represents zlib-compressed XLSX file type.
	XLSXZLIB
	// CSVSNAPPY represents snappy-compressed CSV file type.
	CSVSNAPPY
	// TSVSNAPPY represents snappy-compressed TSV file type.
	TSVSNAPPY
	// LTSVSNAPPY represents snappy-compressed LTSV file type.
	LTSVSNAPPY
	// ParquetSNAPPY represents snappy-compressed Parquet file type.
	ParquetSNAPPY
	// XLSXSNAPPY represents snappy-compressed XLSX file type.
	XLSXSNAPPY
	// CSVS2 represents s2-compressed CSV file type.
	CSVS2
	// TSVS2 represents s2-compressed TSV file type.
	TSVS2
	// LTSVS2 represents s2-compressed LTSV file type.
	LTSVS2
	// ParquetS2 represents s2-compressed Parquet file type.
	ParquetS2
	// XLSXS2 represents s2-compressed XLSX file type.
	XLSXS2
	// CSVLZ4 represents lz4-compressed CSV file type.
	CSVLZ4
	// TSVLZ4 represents lz4-compressed TSV file type.
	TSVLZ4
	// LTSVLZ4 represents lz4-compressed LTSV file type.
	LTSVLZ4
	// ParquetLZ4 represents lz4-compressed Parquet file type.
	ParquetLZ4
	// XLSXLZ4 represents lz4-compressed XLSX file type.
	XLSXLZ4
	// Unsupported represents unsupported file type.
	Unsupported
)
func BaseFileType ¶
BaseFileType returns the base file type without compression.
Example ¶
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	types := []fileparser.FileType{
		fileparser.CSV,
		fileparser.CSVGZ,
		fileparser.TSVBZ2,
		fileparser.ParquetZSTD,
	}
	for _, ft := range types {
		base := fileparser.BaseFileType(ft)
		fmt.Printf("%s -> %s\n", ft, base)
	}
}

Output:
CSV -> CSV
CSV (gzip) -> CSV
TSV (bzip2) -> TSV
Parquet (zstd) -> Parquet
func DetectFileType ¶
DetectFileType detects file type from path extension, including compression variants.
Example ¶
package main

import (
	"fmt"

	"github.com/nao1215/fileparser"
)

func main() {
	paths := []string{
		"data.csv",
		"data.csv.gz",
		"report.xlsx",
		"logs.ltsv.zst",
		"analytics.parquet",
	}
	for _, path := range paths {
		ft := fileparser.DetectFileType(path)
		fmt.Printf("%s -> %s\n", path, ft)
	}
}

Output:
data.csv -> CSV
data.csv.gz -> CSV (gzip)
report.xlsx -> XLSX
logs.ltsv.zst -> LTSV (zstd)
analytics.parquet -> Parquet
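To make the idea of extension-based detection concrete, here is a minimal sketch that peels off a trailing compression extension and then reads the format extension underneath it. It is not the library's implementation: the helper splitExtensions is hypothetical, and the case-insensitive matching is an assumption. Only the extension strings are taken from the constants listed earlier.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// compressionExts mirrors the compression extensions in the package's
// extension constants (ExtGZ, ExtBZ2, ExtXZ, ExtZSTD, and so on).
var compressionExts = map[string]bool{
	".gz": true, ".bz2": true, ".xz": true, ".zst": true,
	".z": true, ".snappy": true, ".s2": true, ".lz4": true,
}

// splitExtensions is a hypothetical helper, not part of fileparser.
// It returns the format extension and, if present, the compression
// extension that follows it. Both are lowercased (an assumption).
func splitExtensions(path string) (format, compression string) {
	ext := filepath.Ext(path)
	if compressionExts[strings.ToLower(ext)] {
		compression = strings.ToLower(ext)
		// Drop the compression suffix and look at what remains.
		path = strings.TrimSuffix(path, ext)
		ext = filepath.Ext(path)
	}
	return strings.ToLower(ext), compression
}

func main() {
	for _, p := range []string{"data.csv", "data.csv.gz", "logs.ltsv.zst"} {
		format, compression := splitExtensions(p)
		fmt.Printf("%s -> format=%q compression=%q\n", p, format, compression)
	}
}
```

A real detector would then map the pair of extensions to a single FileType value, falling back to Unsupported when the format extension is not recognized.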
type TableData ¶
type TableData struct {
// Headers contains the column names in order.
Headers []string
// Records contains the data rows. Each record is a slice of string values.
Records [][]string
// ColumnTypes contains the inferred types for each column.
// The length matches Headers.
ColumnTypes []ColumnType
}
TableData contains the parsed data from a file.
Example (ColumnTypes) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `id,name,score,date
1,Alice,85.5,2024-01-15
2,Bob,92.0,2024-01-16
3,Charlie,78.5,2024-01-17`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for i, header := range result.Headers {
		fmt.Printf("%s: %s\n", header, result.ColumnTypes[i])
	}
}

Output:
id: INTEGER
name: TEXT
score: REAL
date: DATETIME
func Parse ¶
Parse reads data from an io.Reader and returns parsed results. The fileType parameter specifies the format and compression of the data.
Example:
f, err := os.Open("data.csv.gz")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
result, err := fileparser.Parse(f, fileparser.CSVGZ)
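For intuition about what a compressed file type such as CSVGZ asks of a parser, the following sketch decompresses a gzip stream and parses the result as CSV using only the standard library. It is an illustration under stated assumptions, not how fileparser is implemented; the helpers readGzipCSV, gzipBytes, and roundTrip are hypothetical names, and the input is compressed in memory so that no file on disk is needed.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
	"fmt"
	"io"
)

// readGzipCSV decompresses r as gzip and parses the result as CSV.
// Like the package described here, it loads every record into memory.
func readGzipCSV(r io.Reader) ([][]string, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return csv.NewReader(zr).ReadAll()
}

// gzipBytes compresses data in memory for the demonstration.
func gzipBytes(data string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(data)); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// roundTrip compresses CSV text and reads it back through readGzipCSV.
func roundTrip(data string) ([][]string, error) {
	compressed, err := gzipBytes(data)
	if err != nil {
		return nil, err
	}
	return readGzipCSV(bytes.NewReader(compressed))
}

func main() {
	rows, err := roundTrip("name,age\nAlice,30\nBob,25\n")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Header:", rows[0])
	fmt.Println("Data rows:", len(rows)-1)
}
```

The decompressing reader wraps the original io.Reader, so the CSV parser never needs to know that the underlying bytes were compressed.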
Example (Csv) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	csvData := `name,age,score
Alice,30,85.5
Bob,25,92.0
Charlie,35,78.5`

	result, err := fileparser.Parse(strings.NewReader(csvData), fileparser.CSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
	fmt.Println("First row:", result.Records[0])
}

Output:
Headers: [name age score]
Records: 3
First row: [Alice 30 85.5]
Example (Ltsv) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	// LTSV fields are separated by tabs, written here as \t escapes.
	ltsvData := "host:192.168.1.1\tmethod:GET\tpath:/index.html\n" +
		"host:192.168.1.2\tmethod:POST\tpath:/api/users"

	result, err := fileparser.Parse(strings.NewReader(ltsvData), fileparser.LTSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Headers:", result.Headers)
	fmt.Println("First row:", result.Records[0])
}

Output:
Headers: [host method path]
First row: [192.168.1.1 GET /index.html]
Example (Tsv) ¶
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileparser"
)

func main() {
	// TSV fields are separated by tabs, written here as \t escapes.
	tsvData := "id\tproduct\tprice\n" +
		"1\tLaptop\t999.99\n" +
		"2\tMouse\t29.99\n" +
		"3\tKeyboard\t79.99"

	result, err := fileparser.Parse(strings.NewReader(tsvData), fileparser.TSV)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Headers:", result.Headers)
	fmt.Println("Records:", len(result.Records))
}

Output:
Headers: [id product price]
Records: 3