Sally Configuration File Explained

Source: Internet | Published by: 数据流程图例题 | Editor: 程序博客网 | Time: 2024/04/29 18:51

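A Sally configuration file generally consists of three parts: the input configuration, the feature configuration, and the output configuration.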

Input configuration

input_format: tells sally how to open the input; each format is read by a different method. The sample configuration at the end of this post lists the supported types "dir", "arc", "lines" and "fasta".
chunk_size: To enable efficient processing of large data sets, sally processes strings in chunks. This parameter defines the number of strings in one of these chunks. Depending on the lengths of the strings, it can be adjusted to balance loading and processing of data.
decode_str: If this parameter is set to 1, sally automatically decodes URI-encoded elements. That is, substrings of the form %XX are replaced with the byte represented by the hexadecimal number XX. This feature comes in handy if binary data is provided using the textual input format "lines". For example, HTTP requests can be stored in a single line if line breaks are represented by "%0a%0d".
reverse_str: If this parameter is set to 1, the input strings are reversed. Such reversing might help in situations where the reading direction of the input strings is unspecified.
fasta_regex: a regular expression for extracting labels from FASTA descriptions. In bioinformatics, the FASTA format (also known as the Pearson format) is a text-based format for representing nucleotide or amino acid sequences. Nucleotides or amino acids are encoded as single letters, and a sequence name and comments may precede each sequence. For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
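The input options above can be mimicked in a few lines of Python. This is only a sketch of my reading of the descriptions, not Sally's actual implementation: the function names `preprocess`, `read_fasta` and `chunks` are made up for illustration, and returning the regex match as a stripped string is an assumption about how labels are handled.

```python
import re
from urllib.parse import unquote_to_bytes


def preprocess(line, decode_str=False, reverse_str=False):
    """Apply the effects described for decode_str and reverse_str to one string."""
    # decode_str: every %XX is replaced by the byte with hexadecimal value XX.
    data = unquote_to_bytes(line) if decode_str else line.encode("utf-8")
    # reverse_str: the string is processed back to front.
    return data[::-1] if reverse_str else data


def read_fasta(text, fasta_regex=r" (\+|-)?[0-9]+"):
    """Yield (label, sequence) pairs from FASTA text.

    The label is whatever fasta_regex matches in the '>' description line
    (None if nothing matches). Keeping it as a string is an assumption.
    """
    label, seq, started = None, [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            if started:
                yield label, "".join(seq)
            match = re.search(fasta_regex, line)
            label = match.group(0).strip() if match else None
            seq, started = [], True
        elif line and started:
            seq.append(line.replace(" ", ""))
    if started:
        yield label, "".join(seq)


def chunks(strings, chunk_size=4096):
    """Split a list of strings into chunks of at most chunk_size items."""
    for i in range(0, len(strings), chunk_size):
        yield strings[i:i + chunk_size]
```

For instance, `preprocess("GET /%0a%0d", decode_str=True)` turns the two escape sequences back into the raw bytes 0x0a and 0x0d, which is how a multi-line HTTP request can live on a single input line.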

Feature configuration

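ngram_len: specifies n, the number of symbols grouped into one n-gram.
granularity: ngram_len specifies how many symbols make up one feature, while granularity specifies what counts as a symbol ("bytes" or "tokens"). Pay particular attention to the last form described below, token n-grams: several consecutive tokens are combined into one group, and that group is used as a feature.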

tokens: The strings are partitioned into substrings (tokens) using a set of delimiter characters. Such partitioning is typical for natural language processing, where the delimiters are usually defined as white-space and punctuation symbols. An embedding using tokens is selected by choosing tokens as granularity (granularity), defining a set of delimiter characters (token_delim) and setting the n-gram length to 1 (ngram_len).

byte n-grams: The strings are characterized by all possible byte sequences of a fixed length n (byte n-grams). These features are frequently used if no information about the structure of strings is available, such as in bioinformatics or computer security. An embedding using byte n-grams is selected by choosing bytes as granularity (granularity) and defining the n-gram length (ngram_len).

token n-grams: The strings are characterized by all possible token sequences of a fixed length n (token n-grams). These features require the definition of a set of delimiters and a length n. They are often used in natural language processing as a coarse way for capturing structure of text. An embedding using token n-grams is selected by choosing tokens as granularity (granularity), defining a set of delimiter characters (token_delim) and choosing an n-gram length (ngram_len).
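The three feature types described above can be illustrated with a small Python sketch. The function names are hypothetical, and joining the tokens of a token n-gram with a space is my own choice for readability; how Sally represents such a group internally is not stated in the manual excerpt.

```python
def byte_ngrams(data, n):
    """All byte sequences of length n in data (granularity "bytes")."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tokenize(text, delim):
    """Split text at any character contained in delim, dropping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in delim:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def token_ngrams(text, delim, n, sep=" "):
    """All sequences of n consecutive tokens (granularity "tokens")."""
    tokens = tokenize(text, delim)
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With `n = 1`, `token_ngrams` reduces to plain tokens, which matches the first case in the excerpt; with an input shorter than n, both n-gram functions return an empty list.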

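ngram_delim: the delimiter characters used to split strings into tokens (the manual excerpt above calls this parameter token_delim). According to the comment in the sample configuration at the end of this post, an empty string triggers byte n-grams.
ngram_pos: whether positions are taken into account when computing values; if enabled, only n-grams at the same position (or within a given offset range of that position) are counted as matching.
pos_shift: specifies the offset range.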

The parameter ngram_pos can be used to enable positional n-grams. In contrast to regular n-grams, these substrings of length n are associated with a position in the originating string. Positional n-grams thus only match if they appear at the same location in a string. The additional parameter pos_shift can be used to add a shift to the n-grams. If the parameter is set to k, multiple positional n-grams are extracted with a shift from -k to +k.
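To make the effect of ngram_pos and pos_shift concrete, here is a minimal Python sketch based only on the description above. Representing a positional n-gram as a `(position, n-gram)` tuple and the name `positional_ngrams` are assumptions for illustration; Sally's internal encoding may differ.

```python
def positional_ngrams(data, n, pos_shift=0):
    """Byte n-grams tagged with their position.

    With pos_shift = k, each n-gram at position i is emitted once for every
    shifted position i - k, ..., i + k, so that n-grams at nearby positions
    in two strings can still match.
    """
    features = []
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        for shift in range(-pos_shift, pos_shift + 1):
            features.append((i + shift, gram))
    return features


# "ab" sits at position 1 in the first string and at position 0 in the second.
a, b = b"xab", b"abx"
shared_strict = set(positional_ngrams(a, 2)) & set(positional_ngrams(b, 2))
shared_shifted = set(positional_ngrams(a, 2, 1)) & set(positional_ngrams(b, 2, 1))
```

Without a shift the two strings share no positional feature, because "ab" appears at different offsets; with a shift of 1 the shifted copies of "ab" overlap and the strings share features again.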

vect_embed: the embedding mode for vectors; three values are supported: "cnt", "bin" and "tfidf".

This parameter specifies how the features are embedded in the vector space. Supported values are "bin" for associating each dimension with a binary value, "cnt" for associating each dimension with a count value and "tfidf" for using a TF-IDF weighting.
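The difference between the three embedding modes can be sketched as follows. Note the assumptions: the TF-IDF variant here is the textbook one (count times log(N / document frequency)), which may not be the exact weighting Sally applies, and `embed` and `idf_weights` are hypothetical helper names.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Textbook IDF, log(N / df), for every feature seen in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for features in documents:
        doc_freq.update(set(features))
    return {f: math.log(n_docs / df) for f, df in doc_freq.items()}


def embed(features, vect_embed="cnt", idf=None):
    """Map a list of extracted features to a sparse vector (a dict)."""
    counts = Counter(features)
    if vect_embed == "bin":
        # Each present feature gets the value 1, regardless of frequency.
        return {f: 1.0 for f in counts}
    if vect_embed == "cnt":
        # Each feature gets its number of occurrences.
        return {f: float(c) for f, c in counts.items()}
    if vect_embed == "tfidf":
        if idf is None:
            raise ValueError("tfidf embedding needs IDF weights")
        # Counts are scaled down for features that occur in many documents.
        return {f: c * idf.get(f, 0.0) for f, c in counts.items()}
    raise ValueError("unknown embedding mode: " + vect_embed)
```

In this sketch a feature that occurs in every document gets weight 0 under "tfidf", while "bin" and "cnt" treat it like any other feature.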

Output configuration

output_format: the output format; "text", "libsvm", "json" and the matlab format are supported, among others.

"text": The feature vectors of the embedded strings are stored as plain text. Each feature vector is represented as a list of dimensions, which is written to output in the following form

    dimension:feature:value,... source

dimension specifies the index of the dimension, feature a textual representation of the feature and value the value at the dimension. If parameter explicit_hash is not enabled in the configuration, the field feature is empty.

"stdout": The feature vectors of the embedded strings are written to standard output (stdout) as text. Each feature vector is represented as a list of dimensions in the same form as for "text", and the field feature is likewise empty if explicit_hash is not enabled.

"matlab": The feature vectors of the embedded strings are stored in Matlab format (v5). The vectors are stored as a 1 x n struct array with the fields: data, src, label and feat. The name of the output file is given as output to sally. Note that great care is required to efficiently operate with sparse vectors in Matlab. If the sparse representation is lost during computations, excessive run-time and memory requirements are likely.

"cluto": The feature vectors of the embedded strings are stored as a sparse matrix suitable for the clustering tool Cluto. The first line of the file is a header for Cluto, while the remaining lines correspond to feature vectors. The name of the output file is given as output to sally. Note that Cluto can not handle arbitrarily large vector spaces and thus hash_bits should be set to values below 24.

"json": The feature vectors of the embedded strings are stored as JSON objects. Each object contains a list of dimension indices denoted dim and corresponding values denoted as val. Depending on the configuration the source for each object as well as the actual string feature associated with each dimension are also stored in the JSON object.
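Reading the "text" and "json" outputs back in could look roughly like the sketch below. Several things are assumptions: the sample lines in the usage note are invented, the dimension index is taken to be a decimal integer, features are assumed to contain no spaces or commas, and the JSON object is assumed to hold `dim` and `val` as two parallel lists as the description suggests. The function names are hypothetical.

```python
import json


def parse_text_line(line):
    """Parse one line of the form 'dimension:feature:value,... source'."""
    vector_part, _, source = line.strip().partition(" ")
    entries = []
    for item in vector_part.split(","):
        if not item:
            continue
        # The first ':' ends the dimension, the last ':' starts the value;
        # whatever lies in between is the (possibly empty) feature text.
        dim, rest = item.split(":", 1)
        feature, value = rest.rsplit(":", 1)
        entries.append((int(dim), feature, float(value)))
    return entries, source


def parse_json_object(text):
    """Turn one JSON object with 'dim' and 'val' lists into a sparse dict."""
    obj = json.loads(text)
    return dict(zip(obj["dim"], obj["val"]))
```

For example, `parse_text_line("12:ab:1,907:bc:2 input.txt")` yields three-field entries plus the source name, and a line written without explicit_hash, such as `"12::1 input.txt"`, simply gives an empty feature string.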

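(PS: Having read this far, I finally understand what the output .sally file actually means, so the day I spent reading the source code was not wasted. The next step is to think about why this approach manages to extract features, and how features extracted this way can be used afterwards. Below are illustrations of the "text" and "json" output formats.)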

Output format "text":

[Figure: sample "text" output; image not preserved]

Output format "json":

[Figure: sample "json" output; image not preserved]

Below is an example of a Sally configuration file:

# Input configuration
input = {
        # Input format. Supported types: "dir", "arc", "lines", "fasta"
        input_format    = "lines";

        # Number of strings to process in each chunk
        chunk_size      = 4096;

        # Regex for extracting labels from FASTA descriptions
        fasta_regex     = " (\\+|-)?[0-9]+";

        # Regex for extracting labels from text lines
        lines_regex     = "^[ ]+";
};

# Feature configuration
features = {
        # Length of n-grams.
        ngram_len       = 1;

    # d = '()<>@,:;\\\"/[]?={}&?\n\r \t'; from urllib import quote; quote(d)
        # Delimiters for n-grams. An empty string triggers byte n-grams.
        # ngram_delim     = "%0a%0d%20%09";

        # Embedding mode for vectors. Supported types "cnt", "bin", "tfidf"
        vect_embed      = "bin";

        # Normalization mode for vectors. Supported types "l1", "l2", "none".
        vect_norm       = "none";

        # Number of hash bits to use.
        hash_bits       = 32;

        # Explicit hash table instead of hashing features only.
        explicit_hash   = 1;

        # File to store weighting vector for TFIDF embedding.
        tfidf_file      = "tfidf.fv";
};

# Configuration of output
output = {
        # Output format. Supported formats: "libsvm", "text", "matlab"
        output_format   = "text";
};
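The comment in the features section hints at how the delimiter string was produced: with the Python 2 call `urllib.quote`. A rough Python 3 equivalent is sketched below. It percent-encodes every byte in lowercase hex so that the result looks like the commented-out ngram_delim value; the helper name `encode_delims` is hypothetical, and encoding all characters (rather than only the unsafe ones, as `quote` does) is my own simplification.

```python
from urllib.parse import unquote_to_bytes


def encode_delims(delims):
    """Percent-encode every byte of a delimiter string as %xx (lowercase hex)."""
    return "".join(f"%{byte:02x}" for byte in delims.encode("utf-8"))


# Line feed, carriage return, space and tab, as in the commented-out setting.
whitespace_delims = encode_delims("\n\r \t")

# Decoding the encoded string gives back the original delimiter bytes.
round_trip = unquote_to_bytes(whitespace_delims)
```

The same helper can be applied to the longer delimiter set from the comment (`'()<>@,:;\\"/[]?={}&?\n\r \t'`) to build a value for ngram_delim.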

0 0