All Articles

Getting Started with SAS Programming - Exploring Data

Explore and validate data with SAS procedures

Procedures for data exploration:

PROC PRINT

PROC PRINT lists all columns and rows in the input table. (The OBS option limits the number of rows listed. The VAR statement limits and orders the column listed.)

PROC PRINT DATA=input-table(OBS=n);
    VAR col-name(s);
RUN;

2021 09 09 21 57 56

Example:

proc print data=pg1.storm_summary (obs=10);
run;

obs=10 will limit the number of rows to 10. It also includes all the columns in the data. If we want to limit the columns, we can use the var statement:

/* list first 10 rows */
proc print data=pg1.storm_summary (obs=10);
    var Season Name Basin MaxWindMPH MinPressure StartDate EndDate;
run;

We can also use Ctrl + click and drag to paste the columns.

2021 09 09 22 07 10

PROC MEANS

It generates simple summary statistics for all the numeric columns in the input data.

PROC MEANS DATA=input-table;
    VAR col-name(s);
RUN;

2021 09 09 21 58 19

Example:

/* calculate summary statistics */
proc means data=pg1.storm_summary (obs=10);
    var MaxWindMPH MinPressure;
run;

2021 09 09 22 14 01

The columns in the var statement must be numeric for the PROC MEANS procedure.

PROC UNIVARIATE

It generates summary statistics for each numeric column in the data by default, but it includes more detailed statistics related to distribution and extreme values.

PROC UNIVARIATE DATA=input-table;
   VAR col-name(s);
run;

2021 09 09 21 59 25

Example:

/* examine extreme values */
proc univariate data=pg1.storm_summary (obs=10);
    var MaxWindMPH MinPressure;
run;

2021 09 09 22 17 25

2021 09 09 22 18 03 2021 09 09 22 18 46

Similar result we will get for MinWindPressure as well.

PROC FREQ

It creates a frequency table for each variable in the input table. Note the usage of TABLE (as opposed to VAR for the above procedures).

PROC FREQ DATA=input-table;
   TABLE col-name(s) <options>;
RUN;

2021 09 09 22 00 08

Example:

/* list unique values and frequencies */
proc freq data=pg1.storm_summary;
    tables Basin Type Season;
run;

2021 09 09 22 22 31

Filtering Rows

WHERE Expression

The WHERE expression is used to filter rows.

PROC procedure-name ...;
    WHERE expression;
run;

A WHERE expression consists of column, operator, and value.

Operators:

  • = or EQ (equal)
  • ^= or ~= or NE (not equal)
  • > or GT (greater than)
  • < or LT (less than)
  • >= or GE (greater than or equal)
  • <= or LE (less than or equal)

Examples:

  • Type = "SUV" or "Type EQ "SUV"
  • MSRP <= 30000 or MSRP LE 30000
  • Character values are case-sensitive and must be in (single or double) quotation marks.
  • Numeric values shall not be in quotation marks and must only include digits, decimal points, and negative signs
  • SAS date constant: "ddmmyyyy"d

    • Examples:
    • where date > “1jan15”d;
    • where date > “01JAN2015”d;

Combining Expressions in a WHERE statement

Multiple statements can be combined using the keywords AND or OR. Example:

proc print data=sashelp.cars;
    var Make Model Type MSRP MPG_City MPG_Highway
    where type="SUV" and MSRP <= 30000;
run;

2021 09 09 23 55 53

IN operator can be used to compare a list of values.

Examples:

  • where Type in ("SUV", "Truck", "Wagon"); instead of where Type="SUV" or Type="Truck" or Type="Wagon";
  • where Type not in ("SUV", "Truck", "Wagon"); to exclude the columns

Usage of WHERE ALSO

WHERE can clause can be augmented using the WHERE ALSO keyword. Without WHERE ALSO the previous WHERE clauses will be replaced.

PROC procedure-name ...;
    WHERE expression;
    where also where-expression; *First where condition is still valid
run;

Using Special WHERE Operators

  • To check missing values: WHERE col-name is missing (e.g., where age is missing)
  • To check values not missing WHERE col-name is not missing
  • There is also an is null operator (e.g., where Item is null;)
  • To filter values in a range, we can use BETWEEN AND operator (e.g., where Age between 20 and 39;). The endpoints of the ranges are inclusive
  • The LIKE operator enables us to do pattern matching: WHERE col-name LIKE "value";

    • % symbol is wildcard for any number of characters
    • _ is wildcard for a single character
    • Examples:
    • where City like "New%" will return: New York, New Delhi, Newport, Newcastle, New
    • where City like "Sant_%" will return: Santa Clara, Santa Cruz, Santo Domingo, Santo Tomas

Creating and Using Macro Variables

Macro variables help to use the values of same character/numeric data used in several procedures.

Creating the macro variable %LET macto-variable=value; (note that we don’t enclose with quotation marks). When using the macro variable in program, we have to use the syntax &macro-variable.

Example:

%let Cartype=Wagon;

proc print data=sashelp.cars;
    where Type="&CarType";
    var Type Make Model MSRP;
run;

proc means data=sashelp.cars;
    where Type="&CarType";
    var MSRP MPG_Highway;
run;

proc freq data=sashelp.cars;
    where Type="&CarType";
    tables Origin Make;
run;

The ampersand triggers SAS to look up the text string stored in the CarType macro variable and replace it with Wagon before it executes the code.

Notes:

  • Like libraries macro variables are temporary, so when we exit SAS they are deleted.
  • Macro variable must be placed at the beginning of the program.
  • Macro variable makes the program dynamic
  • Macro variable must follow SAS naming rules
  • If a macro variable reference is used inside quotation marks, double quotation marks must be used

Example of macro variable within a WHERE statement:

%let WindSpeed=100;
%let BasinCode=SP;
%let Date=01JAN2010;

proc print data=pg1.storm_summary;
    where MaxWindMPH>=&WindSpeed and Basin="&BasinCode" and StartDate>="&Date"d;
    var Basin Name StartDate EndDate MaxWindMPH;
run;

proc means data

Formatting Columns

Formatting Data Values in Results

PROC PRINT DATA=input-table;
    FORMAT col-name(s) format;
RUN;

Format statement does not change the data, it only impacts the values are displayed.

We specify the format using this syntax:

<$>format-name<w>.<d>

  • <$> indicates character format
  • <w> specifies the total format width
  • the period (.) is a requited delimiter
  • <d> denotes the number of decimal places

Example:

proc print data=pg1.class_birthdate;
    format Height Weight 3. Birthdate date9.;
run;

2021 09 10 01 45 14

Common Formats for Numeric Values

Format Name Example Value Format Applied Formatted Value Decimal Places Actual Width Maximum Width
w.d 12345.67 5. 12345 0 5 5
w.d 12345.67 8.1 12345.7 5 7 8
COMMAw.d 12345.67 COMMA8.1 12,345.7 1 8 8
DOLLARw.d 12345.67 DOLLAR10.2 $12,345.67 1 10 10
DOLLARw.d 12345.67 DOLLAR10. $12,346 0 7 10
YENw.d 12345.67 YEN7. ¥12,346 0 7 7
EUROXw.d 12345.67 EUROX10.2 €12,345.67 2 10 10

Note that, w must include the width for the currency symbol ($123.23 has a width of 6). Also, comma is there by default for the currency formats

/*the variable n_of_trees is numeric*/
format n_of_trees notf 5.0;
  • Zw.d format pads right-justified output with zero. For example, 1350 with Z8 will display 00001350.

Common Formats for Date Value

when the value is 21199

Value Format Formatted Value
21199 DATE7. 15JAN18
21199 DATE9. 15JAN2018
21199 MMDDYY10. 01/15/2018
21199 DDMMYY8. 15/01/18
21199 MONYY7. JAN2018
21199 MONNAME. January
21199 WEEKDATE. Monday, January 15, 2018

Sort a table

We may want to sort a table because of various reasons:

  • improve visual arrangement of the data
  • identify and remove duplicate rows
  • prepare data for certain data processing steps

PROC SORT

Using the SORT procedure we can sort one or more numeric or character columns:

PROC SORT DATA=input-table <OUT=output-table>;
    BY <DESCENDING> col-name(s);
RUN;
  • If we don’t include the OUT option, PROC SORT changes rows in the input table.
  • By default, SAS sorts in ascending order.
  • Columns that are in the BY statement are sometimes called BY columns or BY variables.
  • PROC SORT doesn’t generate a printed output

Example:

proc sort data=pg1.class_test2 out=test_sort;
    by Name TestScore;
run;

2021 09 10 19 05 27

proc sort data=pg1.class_test2 out=test_sort;
    by Name descending TestScore;
run;

2021 09 10 19 06 21

Identifying and Removing Duplicate rows in a table

By adding options to the PROC SORT statement, we can identify and remove duplicate rows.

PROC SORT DATA=input-table <OUT=output-table>
          NODUPRECS <DUPOUT=output-table>;
    BY _ALL_;
RUN;
  • The NODUPRECS option removes adjacent rows that are entirely duplicated. In other words, it removes rows that are next to each other in the data where the values for every column match. When using this option, it’s good to use the keyword _ALL_ in the BY statement instead of a column name. This sorts the data by all columns so that entirely duplicated rows are next to each other, and then the NODUPRECS option can do its job.
  • The DUPOUT= option creates an output table containing duplicates removed.

Example:

proc sort data=pg1.class_test3
          out=test_clean
          noduprecs
          dupout=test_dups;
    by _all_;
run;

2021 09 10 19 14 23

Another available option is NODUPKEY.

PROC SORT DATA=input-table <OUT=output-table>
          NODUPKEY <DUPOUT=output-table>;
    BY <DESCENDING> col-name(s);
RUN;

NODUPKEY option keeps only the first row for each unique value of the column(s) listed in the BY statement.

Example:

proc sort data=pg1.class_test2
          out=test_clean
          dupout=test_dups
          nodupkey;
    by Name;
run;

2021 09 10 19 18 49