Explore and validate data with SAS procedures
Procedures for data exploration:
PROC PRINT
PROC PRINT lists all columns and rows in the input table. (The OBS option limits the number of rows listed. The VAR statement limits and orders the column listed.)
PROC PRINT DATA=input-table(OBS=n);
VAR col-name(s);
RUN;
Example:
proc print data=pg1.storm_summary (obs=10);
run;
obs=10
will limit the number of rows to 10. It also includes all the columns in
the data. If we want to limit the columns, we can use the var
statement:
/* list first 10 rows */
proc print data=pg1.storm_summary (obs=10);
var Season Name Basin MaxWindMPH MinPressure StartDate EndDate;
run;
We can also use Ctrl + click and drag to paste the columns.
PROC MEANS
It generates simple summary statistics for all the numeric columns in the input data.
PROC MEANS DATA=input-table;
VAR col-name(s);
RUN;
Example:
/* calculate summary statistics */
proc means data=pg1.storm_summary (obs=10);
var MaxWindMPH MinPressure;
run;
The columns in the var
statement must be numeric for the PROC MEANS procedure.
PROC UNIVARIATE
It generates summary statistics for each numeric column in the data by default, but it includes more detailed statistics related to distribution and extreme values.
PROC UNIVARIATE DATA=input-table;
VAR col-name(s);
run;
Example:
/* examine extreme values */
proc univariate data=pg1.storm_summary (obs=10);
var MaxWindMPH MinPressure;
run;
Similar result we will get for MinWindPressure as well.
PROC FREQ
It creates a frequency table for each variable in the input table. Note the usage
of TABLE
(as opposed to VAR
for the above procedures).
PROC FREQ DATA=input-table;
TABLE col-name(s) <options>;
RUN;
Example:
/* list unique values and frequencies */
proc freq data=pg1.storm_summary;
tables Basin Type Season;
run;
Filtering Rows
WHERE Expression
The WHERE expression is used to filter rows.
PROC procedure-name ...;
WHERE expression;
run;
A WHERE expression consists of column, operator, and value.
Operators:
=
or EQ (equal)^=
or~=
or NE (not equal)>
or GT (greater than)<
or LT (less than)>=
or GE (greater than or equal)<=
or LE (less than or equal)
Examples:
Type = "SUV"
or"Type EQ "SUV"
MSRP <= 30000
orMSRP LE 30000
- Character values are case-sensitive and must be in (single or double) quotation marks.
- Numeric values shall not be in quotation marks and must only include digits, decimal points, and negative signs
-
SAS date constant:
"ddmmyyyy"d
- Examples:
- where date > “1jan15”d;
- where date > “01JAN2015”d;
Combining Expressions in a WHERE statement
Multiple statements can be combined using the keywords AND or OR. Example:
proc print data=sashelp.cars;
var Make Model Type MSRP MPG_City MPG_Highway
where type="SUV" and MSRP <= 30000;
run;
IN
operator can be used to compare a list of values.
Examples:
where Type in ("SUV", "Truck", "Wagon");
instead ofwhere Type="SUV" or Type="Truck" or Type="Wagon";
where Type not in ("SUV", "Truck", "Wagon");
to exclude the columns
Usage of WHERE ALSO
WHERE can clause can be augmented using the WHERE ALSO keyword. Without WHERE ALSO the previous WHERE clauses will be replaced.
PROC procedure-name ...;
WHERE expression;
where also where-expression; *First where condition is still valid
run;
Using Special WHERE Operators
- To check missing values:
WHERE col-name is missing
(e.g.,where age is missing
) - To check values not missing
WHERE col-name is not missing
- There is also an
is null
operator (e.g.,where Item is null;
) - To filter values in a range, we can use
BETWEEN AND
operator (e.g.,where Age between 20 and 39;
). The endpoints of the ranges are inclusive -
The
LIKE
operator enables us to do pattern matching:WHERE col-name LIKE "value";
%
symbol is wildcard for any number of characters_
is wildcard for a single character- Examples:
where City like "New%"
will return: New York, New Delhi, Newport, Newcastle, Newwhere City like "Sant_%"
will return: Santa Clara, Santa Cruz, Santo Domingo, Santo Tomas
Creating and Using Macro Variables
Macro variables help to use the values of same character/numeric data used in several procedures.
Creating the macro variable %LET macto-variable=value;
(note that we don’t enclose with quotation marks).
When using the macro variable in program, we have to use the syntax ¯o-variable
.
Example:
%let Cartype=Wagon;
proc print data=sashelp.cars;
where Type="&CarType";
var Type Make Model MSRP;
run;
proc means data=sashelp.cars;
where Type="&CarType";
var MSRP MPG_Highway;
run;
proc freq data=sashelp.cars;
where Type="&CarType";
tables Origin Make;
run;
The ampersand triggers SAS to look up the text string stored in the CarType macro variable and replace it with Wagon before it executes the code.
Notes:
- Like libraries macro variables are temporary, so when we exit SAS they are deleted.
- Macro variable must be placed at the beginning of the program.
- Macro variable makes the program dynamic
- Macro variable must follow SAS naming rules
- If a macro variable reference is used inside quotation marks, double quotation marks must be used
Example of macro variable within a WHERE statement:
%let WindSpeed=100;
%let BasinCode=SP;
%let Date=01JAN2010;
proc print data=pg1.storm_summary;
where MaxWindMPH>=&WindSpeed and Basin="&BasinCode" and StartDate>="&Date"d;
var Basin Name StartDate EndDate MaxWindMPH;
run;
proc means data
Formatting Columns
Formatting Data Values in Results
PROC PRINT DATA=input-table;
FORMAT col-name(s) format;
RUN;
Format statement does not change the data, it only impacts the values are displayed.
We specify the format using this syntax:
<$>format-name<w>.<d>
<$>
indicates character format<w>
specifies the total format width- the period (
.
) is a requited delimiter <d>
denotes the number of decimal places
Example:
proc print data=pg1.class_birthdate;
format Height Weight 3. Birthdate date9.;
run;
Common Formats for Numeric Values
Format Name | Example Value | Format Applied | Formatted Value | Decimal Places | Actual Width | Maximum Width |
---|---|---|---|---|---|---|
w.d | 12345.67 | 5. | 12345 | 0 | 5 | 5 |
w.d | 12345.67 | 8.1 | 12345.7 | 5 | 7 | 8 |
COMMAw.d | 12345.67 | COMMA8.1 | 12,345.7 | 1 | 8 | 8 |
DOLLARw.d | 12345.67 | DOLLAR10.2 | $12,345.67 | 1 | 10 | 10 |
DOLLARw.d | 12345.67 | DOLLAR10. | $12,346 | 0 | 7 | 10 |
YENw.d | 12345.67 | YEN7. | ¥12,346 | 0 | 7 | 7 |
EUROXw.d | 12345.67 | EUROX10.2 | €12,345.67 | 2 | 10 | 10 |
Note that, w
must include the width for the currency symbol ($123.23 has a width of 6). Also, comma is there by default for the currency formats
/*the variable n_of_trees is numeric*/
format n_of_trees notf 5.0;
- Zw.d format pads right-justified output with zero. For example, 1350 with Z8 will display 00001350.
Common Formats for Date Value
when the value is 21199
Value | Format | Formatted Value |
---|---|---|
21199 | DATE7. | 15JAN18 |
21199 | DATE9. | 15JAN2018 |
21199 | MMDDYY10. | 01/15/2018 |
21199 | DDMMYY8. | 15/01/18 |
21199 | MONYY7. | JAN2018 |
21199 | MONNAME. | January |
21199 | WEEKDATE. | Monday, January 15, 2018 |
Sort a table
We may want to sort a table because of various reasons:
- improve visual arrangement of the data
- identify and remove duplicate rows
- prepare data for certain data processing steps
PROC SORT
Using the SORT procedure we can sort one or more numeric or character columns:
PROC SORT DATA=input-table <OUT=output-table>;
BY <DESCENDING> col-name(s);
RUN;
- If we don’t include the OUT option, PROC SORT changes rows in the input table.
- By default, SAS sorts in ascending order.
- Columns that are in the BY statement are sometimes called BY columns or BY variables.
- PROC SORT doesn’t generate a printed output
Example:
proc sort data=pg1.class_test2 out=test_sort;
by Name TestScore;
run;
proc sort data=pg1.class_test2 out=test_sort;
by Name descending TestScore;
run;
Identifying and Removing Duplicate rows in a table
By adding options to the PROC SORT statement, we can identify and remove duplicate rows.
PROC SORT DATA=input-table <OUT=output-table>
NODUPRECS <DUPOUT=output-table>;
BY _ALL_;
RUN;
- The NODUPRECS option removes adjacent rows that are entirely duplicated. In other words, it removes rows that
are next to each other in the data where the values for every column match. When
using this option, it’s good to use the keyword
_ALL_
in the BY statement instead of a column name. This sorts the data by all columns so that entirely duplicated rows are next to each other, and then the NODUPRECS option can do its job. - The DUPOUT= option creates an output table containing duplicates removed.
Example:
proc sort data=pg1.class_test3
out=test_clean
noduprecs
dupout=test_dups;
by _all_;
run;
Another available option is NODUPKEY.
PROC SORT DATA=input-table <OUT=output-table>
NODUPKEY <DUPOUT=output-table>;
BY <DESCENDING> col-name(s);
RUN;
NODUPKEY option keeps only the first row for each unique value of the column(s) listed in the BY statement.
Example:
proc sort data=pg1.class_test2
out=test_clean
dupout=test_dups
nodupkey;
by Name;
run;