Parsing Comma-Separated Text

Source: Internet. Editor: 程序博客网. Time: 2024/05/09 07:28


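In this section, I will look at two simple ways of parsing comma-separated text. The comma-separated values (CSV) format is a way of storing data as text, with commas delimiting the values held in each column. It is one of the oldest text formats and was already in use before the PC had any standard specification. The genuinely complicated part of comma-separated text is handling escape characters and columns that themselves contain commas; in that case, you have to work out how to decode the column so that it does not produce a new column. We will not consider that case here, however, because we will be using parser tools such as fslex.exe, fsyacc.exe, and FParsec.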
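Purely to illustrate why embedded commas are the awkward case, here is a minimal sketch. The sample line is invented for this illustration and does not come from the original text. It shows that splitting on every comma cuts a quoted field into two pieces:

```fsharp
// A hypothetical CSV line (invented for illustration): the first field is
// quoted and itself contains a comma.
let sampleLine = "\"Smith, John\",42"

// Naive splitting treats every comma as a column separator,
// so the quoted field is cut into two pieces.
let naiveColumns = sampleLine.Split(',')

printfn "%d columns: %A" naiveColumns.Length naiveColumns
```

The line logically holds two columns, but the naive split reports three.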

A useful comma-separated text parser must return the file's data in tabular form. In F# this takes only a few lines:

open System.IO

// read a CSV file and return its rows as a sequence of string arrays
let parseFile filename =
    let lines = File.ReadAllLines filename
    seq { for line in lines -> line.Split(',') }

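This example uses the .NET BCL to read the file, then uses an F# sequence expression and the string Split method to create a sequence of arrays holding the column values. Although this approach is simple and works reasonably well, it has some shortcomings. First, there is no guarantee that every row has the same number of columns. Second, every value in the result is a string, so users of the code must handle the conversion of values themselves.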
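The first shortcoming can at least be detected after the fact. The following is a minimal sketch, not part of the original text: findRaggedRows is a hypothetical helper name, and the in-memory sample rows are invented to stand in for the output of parseFile.

```fsharp
// Hypothetical helper (not from the original text): returns the rows whose
// column count differs from that of the first row, each paired with its
// zero-based row index.
let findRaggedRows (rows: seq<string[]>) =
    let indexed = rows |> Seq.indexed |> Seq.toList
    match indexed with
    | [] -> []
    | (_, first) :: rest ->
        rest |> List.filter (fun (_, row) -> row.Length <> first.Length)

// In-memory sample data (invented) standing in for the output of parseFile.
let sampleRows =
    [ "1,2,3"; "4,5"; "6,7,8" ]
    |> Seq.map (fun line -> line.Split(','))

let ragged = findRaggedRows sampleRows
printfn "%A" ragged
```

Here only the second row (index 1) is reported, because it has two columns where the first row has three. A check like this only flags the problem; it does nothing about the second shortcoming, the values all being strings.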

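To address both problems, you can create a CSV parser class that uses F# reflection to build strongly typed tuples dynamically. The basic idea is this: you give the class a type parameter, which must be a tuple type; the class uses reflection to determine which elements the tuple contains and how each should be parsed; it then uses this information to parse every line of the given file; finally, it implements the IEnumerable<'a> interface so that client code can easily enumerate the resulting list of tuples. Because it relies on reflection, the implementation of the class is not trivial: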

open System
open System.IO
open System.Collections
open System.Reflection
open Microsoft.FSharp.Reflection

// a class to read CSV values from a file
type CsvReader<'a>(filename) =
    // fail if the generic type is not a tuple
    do if not (FSharpType.IsTuple(typeof<'a>)) then
        failwith "Type parameter must be a tuple"

    // get the elements of the tuple
    let elements = FSharpType.GetTupleElements(typeof<'a>)

    // create a function to parse an element of type t
    let getParseFunc t =
        match t with
        | _ when t = typeof<string> ->
            // for string types return a function that
            // boxes the string as an object
            fun (x: string) -> box x
        | _ ->
            // for all other types test to see if they have a
            // static Parse method and use that
            let parse = t.GetMethod("Parse",
                                    BindingFlags.Static |||
                                    BindingFlags.Public,
                                    null,
                                    [| typeof<string> |],
                                    null)
            fun (s: string) ->
                parse.Invoke(null, [| box s |])

    // create a list of parse functions from the tuple's elements
    let funcs = Seq.map getParseFunc elements

    // read all lines from the file
    let lines = File.ReadAllLines filename

    // create a function to parse each row
    let parseRow row =
        let items =
            Seq.zip (List.ofArray row) funcs
            |> Seq.map (fun (ele, parser) -> parser ele)
        FSharpValue.MakeTuple(Array.ofSeq items, typeof<'a>)

    // parse each row and cast the value to the type of the given tuple
    let items =
        lines
        |> Seq.map (fun x -> (parseRow (x.Split(','))) :?> 'a)
        |> Seq.toList

    // implement the generic IEnumerable<'a> interface
    interface seq<'a> with
        member x.GetEnumerator() =
            let seq = Seq.ofList items
            seq.GetEnumerator()

    // implement the non-generic IEnumerable interface
    interface IEnumerable with
        member x.GetEnumerator() =
            let seq = Seq.ofList items
            seq.GetEnumerator() :> IEnumerator

Once the class has been implemented, however, working with the data in a comma-separated text file in a strongly typed way becomes very simple.

System.IO.Directory.SetCurrentDirectory(__SOURCE_DIRECTORY__)

let values = new CsvReader<int*int*int>("numbers.csv")

for x, y, z in values do
    assert (x + y = z)
    printfn "%i + %i = %i" x y z

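[
The first statement is not in the original text. Adding it makes debugging easier: you only need to put the text file numbers.csv in the same directory as the source file. Otherwise, the system looks for the file in the %temp% directory.
]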

0 0