While the index notation for subsetting data frames is powerful for retrieving observations in known rows or columns, it is often more desirable to return rows corresponding to observations that meet a given criteria. For example, say we wanted to known which Canadian province or territory has a "Criminal Code traffic" rate that is greater than 500 per 100,000.
To begin, we will return the "Criminal Code Traffic" column and simply read off the corresponding rows:
>
|
data[ `Criminal Code Traffic` ];
|
Now this approach is fine for smaller data frames, but it is much easier to simply query a DataFrame using an element-wise logical operator to first see which (if any) observations match the criteria:
>
|
data[ `Criminal Code Traffic` ] >~ 500;
|
This returns a truth table, whose entries return a true, false, or FAIL result depending on if the given observation meets the criteria. In addition, if the DataFrame is indexed by a truth table, a filtered subset is returned:
>
|
data[ data[ `Criminal Code Traffic` ] >~ 500 ];
|
| (10) |
The with command is useful for simplifying the syntax for querying DataFrames. with creates named variables corresponding to each of the column labels in a given DataFrame.
| (11) |
Each column of the DataFrame can be called using its variable name:
>
|
`Criminal Code Traffic`;
|
With bound labels, the following returns the rows where the "Federal Statute" rate is less than or equal to 300 per 100,000:
>
|
`Federal Statute` <=~ 300;
|
>
|
data[ `Federal Statute` <=~ 300 ];
|
| (14) |
It is also possible to filter the DataFrame using multiple queries. When combining queries, the logical operators and and or are used to find either the intersection or union of truth tables, respectively. For example, the following returns the province or territory with "Violent Crime" less than 1000 and "Property Crime" greater than 3000.
>
|
`Violent Crime` <~ 1000 and `Property Crime` >~ 3000;
|
From the truth table, only Prince Edward Island matches this criteria.
>
|
data [ `Violent Crime` <~ 1000 and `Property Crime` >~ 3000 ];
|
| (16) |
It can be useful to find the union of queries by using the or logical operator. For example, the following returns observations for which the "Other Criminal Code" rate is greater than 2500 per 100,000 or the observations for which the "Criminal Code Traffic" rate is greater than 500 per 100,000:
>
|
data [ `Other Criminal Code` >~ 2500 or `Criminal Code Traffic` >~ 500 ];
|
| (17) |