How do I read in a character variable with varying length in a space delimited dataset?

来源:互联网 发布:网络舆情特点 编辑:程序博客网 时间:2024/06/10 09:11


-------------------------------------------------------------------------------

SAS FAQ
How do I read in a character variable with varying length in a space delimited dataset?
This FAQ page demonstrates the use of traditional methods and introduces SAS special characters for reading in (messy) data with a character variable of varying length.
When the character variable contains only a single word
This half of the page shows how to read in a character variable with a single word with varying length when the dataset is space delimited. For our example we have a hypothetical website dataset with the following variables: age of page (age), the url (site) and the number of hits the site received (hits).
We start by reading in the dataset where our character variable, site, is read in with the default character format given by $.
data web;
input age site $ hits;
datalines;
12 http://www.site1.org/default.htm 123456
130 http://www.site2.com/index.htm 97654
254 http://www.site3.edu/department/index.htm 987654
;
proc print;
run;
 
Obs    age      site       hits
 
1      12    http://w    123456
2     130    http://w     97654
3     254    http://w    987654
Using the default method the variable site wasread only to the 8th character, the default length for character variables,which was not what we want. Next, we reformat the site variable by setting its format to be maximum length for the character variable across the observations, 41 columns wide. The format is specified by $41. after site in the input statement.
data web;
input age site $41. hits;
datalines;
12 http://www.site1.org/default.htm 123456
130 http://www.site2.com/index.htm 97654
254 http://www.site3.edu/department/index.htm 987654
;
proc print;
run;
 
Obs    age                      site                         hits
 
1      12    http://www.site1.org/default.htm 123456         130
2     254    http://www.site3.edu/department/index.htm    987654
This approach didn't work either; it read in two observations. 

There are three possible methods we can try next.
Method 1: The first method requires that prior to the input statement we use a length statement where we define the format of the character variable, and then in the input statement we format site with just $.
data web;
length site $41;
input age site $ hits;
datalines;
12 http://www.site1.org/default.htm 123456
130 http://www.site2.com/index.htm 97654
254 http://www.site3.edu/department/index.htm 987654
;
proc print;
run;
 
Obs                      site                       age     hits
 
1     http://www.site1.org/default.htm              12    123456
2     http://www.site2.com/index.htm               130     97654
3     http://www.site3.edu/department/index.htm    254    987654
Method 2: For the second method we use the SAS special character, the colon modifier ( : ), for the site variable format, :$41..The colon modifier tells SAS when it reads in site to do it until there is a break in the character and then stop. Note, when a character variable has more than one word, the colon modifier will take only the first word.
data web;
input age site :$41. hits;
datalines;
12 http://www.site1.org/default.htm 123456
130 http://www.site2.com/index.htm 97654
254 http://www.site3.edu/department/index.htm 987654
;
proc print;
run;
 
Obs    age                      site                        hits
 
1      12    http://www.site1.org/default.htm             123456
2     130    http://www.site2.com/index.htm                97654
3     254    http://www.site3.edu/department/index.htm    987654
Method 3: The final method, similar to the first, uses a SAS special character. The special character, & (ampersand), is set up in the same fashion as the colon special character. However, the special character assumes that the character variable ends only when it encounter a blank space that is two or more spaces long. Hence, a single space used to differentiate the character variable and the adjacent variable will be treated as one variable (and the data will be incorrectly read in). When the space to differentiate two variables is greater than two spaces, SAS begins to read in the next variable. The rationale for this rule is evident when the word contains one or more words. For this example, we make a slight modification to the raw data and put two or more spaces between the entries for site and the adjacent variable hits.
data web;
input age site & $41. hits;
datalines;
12  http://www.site1.org/default.htm   123456
130 http://www.site2.com/index.htm  97654
254  http://www.site3.edu/department/index.htm    987654
;
proc print;
run;
 
Obs    age                      site                        hits
 
1      12    http://www.site1.org/default.htm             123456
2     130    http://www.site2.com/index.htm                97654
3     254    http://www.site3.edu/department/index.htm    987654


When the character variable contains one or more words
The second half of this page shows how to read in a character variable when the character contains one or more words with varying length and the dataset is space delimited. For this example we create a hypothetical dataset containing the following variables; zip-code (zip), fruits produced in the zip code (produce) and pounds of fruit produced in the zip-code (pound).
The first example reads in the data from an external text file. Below is raw data used and the program used to read it in. Note that the quote marks around the character variable.
10034 "apples, grapes kiwi" 123456
92626 "oranges" 97654
25414 "pears apple" 987654
data fruit;
infile 'C:\messy.txt' delimiter = ' ';
length fruit $22;
input zip fruit $ pounds;
proc print;
run;
 
Obs    fruit         zip     pounds
 
1     "apples,     10034         .
2     "oranges"    92626     97654
3     "pears       25414         .
Clearly, our SAS data step did not correctly read in the data.Next we add the dsd option in the infile statement. The dsd option tells SAS that our delimiter, spaces, can be embedded in our character variable.
data fruit;
infile 'C:\messy.txt' delimiter = ' ' dsd;
length fruit $22;
input zip fruit $ pounds;
proc print;
run;
 
Obs    fruit                   zip     pounds
 
1     apples, grapes kiwi    10034    123456
2     oranges                92626     97654
3     pears apple            25414    987654
For the second example, we are going to read the data in within SAS and use the special character &. Once more, the special character assumes that the character variable ends only when it encounters a blank space that is two or more spaces long. Hence, a single space to differentiate the character variable and the adjacent variable will be ignored and the two variables will be treated as one variable. When the space to differentiate variables is greater than or equal to two spaces, SAS begins to read in the next variable. We make a slight modification to the raw data and put two or more spaces between the entries for fruit and pounds.
data fruit;
input zip fruit & $22. pounds;
datalines;
10034 apples, grapes kiwi  123456
92626  oranges  97654
25414 pears apple      987654
;
proc print;
run;
 
Obs     zip     fruit                  pounds
 
1     10034    apples, grapes kiwi    123456
2     92626    oranges                 97654
3     25414    pears apple            987654

0 0