How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

Asked 10/6, 2015 at 20:48 Answered 4/10, 2015 at 14:10

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically expands adds an extra (empty) byte to every observation of every string variable. For example, in this data set:

data test1;
    input cust_id   $ 1
          month       3-8
          category  $ 10-12 
          status    $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;

proc export data = test1
    file = "test1.dta"
    dbms = stata replace;
quit;

the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands their data types to str2, str4, and str2 data type in the outputted Stata file.

This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.

Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.

Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.

Distinctly answered 10/6, 2015 at 20:48 Comment(9)

I suspect SAS is adding a string terminator, though why I have no clue. – Ibbetson 10/6, 2015 at 21:42

Looking at the Stata documentation, it does support \0 string terminator (for "Varchar" type operation). I suspect SAS just puts that after every string, if I had to guess. I would recommend putting a support ticket in with SAS tech support; they can likely a) confirm this and b) let you know if there is a workaround. I don't see one based on a brief look. – Ibbetson 10/6, 2015 at 21:56

And if you do hear back from SAS tech support - please add an answer with whatever information you get, so it's available for other searchers! – Ibbetson 10/6, 2015 at 21:57

Adding stata in case there is a stata-side workaround better than compress. – Ibbetson 10/6, 2015 at 21:58

@Ibbetson I contacted SAS tech support, so I'll post their solution (if any) when they get back to me. It seems clear that the problem is with SAS, because it occurs on the SAS side before I ever interact with Stata (SAS creates the file itself; it doesn't rely on Stata to create the Stata file). – Distinctly 10/6, 2015 at 22:17

Just out of curiosity, if you make the fields too long - say make them all $30. - does it still add a 31st byte? Or are they all then 30, or the same length as in the above test even though they are SAS length 30. – Ibbetson 10/6, 2015 at 22:17

Both SAS and STATA can in/export fixed field width text files. If passing your numerical data in digits with sufficient precision does not take too much space, this solves your problem. – Refrigeration 8/7, 2015 at 7:41

@DirkHorsten Is there an easy way to export a SAS dataset to a fixed-width data file without having to manually specify the width for each variable? I have datasets with hundreds of variables, and it's tedious to have to type out the put statements for each dataset (since they differ) when I need to export them. SAS associates a length with each variable, so is there a way to export and automatically use this length? – Distinctly 3/9, 2015 at 14:33

I suppose this is possible on an IBM mainframe, but that will probably not help you. It is however possible to write a SAS program that creates the correct put statement for you. Would that be the solution to your original problem? – Refrigeration 3/9, 2015 at 15:40

This is how you can do it using existing functions.

filename FT41F001 temp;
data _null_;
   file FT41F001;
   set test1;
   put 256*' ' @;
   __s=1;
   do while(1);
      length __name $32.;
      call vnext(__name);
      if missing(__name) or __name eq: '__' then leave;
      substr(_FILE_,__s) = vvaluex(__name);
      putlog _all_;
      __s = sum(__s,vformatwx(__name));
      end;
   _file_ = trim(_file_);
   put;
   format month f6.;
   run;

To avoid the use of _FILE_;

data _null_;
   file FT41F001;
   set test1;
   __s=1;
   do while(1);
      length __name $32. __value $128 __w 8;
      call vnext(__name);
      if missing(__name) or __name eq: '__' then leave;
      __value = vvaluex(__name);
      __w = vformatwx(__name);
      put __value $varying128. __w @;
      end;
   put;
   format month f6.;
   run;

Impossible answered 4/10, 2015 at 14:10 Comment(3)

call vnext and vformatwx are the puzzle pieces I was missing - it's much harder to put everything together neatly without them. However, it would still be better to avoid using _file_ as this limits each output line to 32767 characters. – Molokai 4/10, 2015 at 17:34

I would also suggest using substr instead of trim for the final tidying up step, as this will avoid removing any valid trailing whitespace from the last variable in the source dataset. – Molokai 4/10, 2015 at 17:41

Looking very good now - the only tweak I'd make would be to set lrecl=1000000 or similar to avoid long output lines being truncated to the default length. – Molokai 4/10, 2015 at 20:16

If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:

data test1;
    input cust_id   $ 1
          month       3-8
          category  $ 10-12 
          status    $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD  X
B 199912 D   C
;
run;

data _null_;
file "/folders/myfolders/test.txt";
set test1;
put @;
_FILE_ = cat(of _all_);
put;
run;

/* Print contents of the file to the log (for debugging only)*/
data _null_;
 infile "/folders/myfolders/test.txt";
 input;
 put _infile_;
run;

This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the cat function in the data step environment- the lower 200 character limit doesn't apply, as that's only when you use cat to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to only cat together a limited number of variables at a time - a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never actually come up.

Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).

Molokai answered 4/10, 2015 at 11:22 Comment(2)

I like your idea of getting CAT to format the line but it does not work for numeric variables. You can't count on the numeric to character conversion to produce values with the same width. What would make this work (I think) is a CAT function version that formats all vars using the VVALUE function. – Impossible 4/10, 2015 at 13:46

@data_null_ - unfortunately no such hypothetical catv function exists, and vvalue and similar functions can't be used within a proc fcmp definition as they are only valid within a data step. Some arrays and macro logic would be needed to convert numeric vars to fixed width text. – Molokai 4/10, 2015 at 17:59

Recommended topics

Hot tags