vendors.csv
:
dtypes
correspond to similar Python types.
Strings are called str
in Python and object
in pandas.
Integers are called int
in Python and int64
in pandas, indicating that
pandas stores integers as 64-bit numbers.
Floating point numbers are called float
in Python and float64
in pandas,
also indicating that they are stored with 64 bits.
A boolean value, named for logician George Boole, can be either True or False.
These are called bool
in Python and bool
in pandas.
Pandas includes some data types with no corresponding native Python type:
datetime64
for date and time values, timedelta[ns]
for storing the
difference between two times as a number of nanoseconds, and category
where
each item is one of a list of strings.
Here we import the vendor data file and show the dtypes:
float
and not an int
. 2015 and 2016 sales, percent
growth, and open orders are stored as objects and not numbers. The month, day,
and year values should be converted to datetime64
, and the active column
should be converted to a boolean.
The data can be converted with the astype()
function, custom functions, or
pandas functions such as to_numeric()
or to_datetime()
.
astype()
function can convert the Vendor Number column to int
:
astype()
returns a copy, so an assignment statement will convert the
original data. This can be checked by showing the dtypes
.
float
or the Open Orders
column to an int
returns an error.
bool
completes with no
errors, but converts both Y and N values to True
.
astype()
works if the data is clean and can be interpreted simply as a
number, or if you want to convert a number to a string. Other conversions
require custom functions or pandas functions such as to_numeric()
or
to_datetime()
.
float
by first removing the comma (,
) and dollar sign ($
)
characters.
apply()
function:
dtypes
:
np.where()
function is a good way to convert the Active column to
bool
. This code converts “Y” values to True
and all other values to
False
, then shows the dtypes
:
astype()
on this column would produce an error, but the pd.to_numeric()
function
built in to pandas will convert the numeric values to numbers and any other
values to the “not a number” or “NaN” value built in to the floating point
number standard:
fillna()
:
pd.to_datetime()
function built in to pandas can convert the
Month Day and Year columns to datetime64[ns]
:
dtypes
:
dtype
and converters
in the pd.read_csv()
function.
Defining dtype
is like performing astype()
on the data.
A dtype
or a converter
can only be applied once to a specified column.
If you try to apply both to the same column, the dtype
is skipped.
After converting as much of the data as possible in pd.read_csv()
, use code
similar to the previous examples to convert the rest.
left
to use keys from the left frame only,
right
to use keys from the right frame only, outer
to use the union of
keys from both frames, and the default inner
to use the intersection of keys
from both frames.
This merge using the default inner join omits key combinations found
in only one of the source DataFrames:
key1
and key2
set to K0, K1
, K2, K1
, or K2, K0
.
Joins also copy information when necessary. The left DataFrame had one row with
the keys set to K1, K0
and the right DataFrame had two. The output DataFrame
has two, with the information from the left DataFrame copied into both rows.
The next example shows the results of a left, right, and outer merge on the same
inputs. Empty cells are filled in with NaN values.
MultiIndex
:
_x
and _y
but you can customize
them:
lsuffix
and rsuffix
.
combine_first()
method:
update()
overwrites values in a frame with values from another
frame:
query
method with a
boolean expression based on the column names.