NumPy
- NumPy (or Numpy) is a Linear Algebra Library for Python. The reason it is so important for Data Science with Python is that almost all of the libraries in the PyData Ecosystem rely on NumPy as one of their main building blocks.
- Numpy is also incredibly fast, as it has bindings to C libraries. For more info on why you would want to use Arrays instead of lists, check out this great [StackOverflow post](http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists).
Installation
It is highly recommended that you install Python using the Anaconda distribution, to make sure all underlying dependencies (such as Linear Algebra libraries) sync up with the use of a conda install.
If you have Anaconda, install NumPy by:
conda install numpy
If you are not using Anaconda distribution:
pip install numpy
Then, to use it:
import numpy as np
arr = np.arange(0,10)
Arrays
The examples in this section assume:

import numpy as np

Methods for creating NumPy Arrays

From a Python List

array(): We can create an array by directly converting a list or a list of lists.
my_list = [1,2,3]
np.array(my_list)
# Output: array([1, 2, 3])
From Built-in NumPy Methods

arange(): Return evenly spaced values within a given interval.
np.arange(0,10)

zeros(): Generate arrays of zeros.
np.zeros(3)

ones(): Generate arrays of ones.
np.ones(3)

linspace(): Return evenly spaced numbers over a specified interval.
np.linspace(0,10,3)

eye(): Creates an identity matrix.
np.eye(4)
random

rand(): Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
np.random.rand(2)
np.random.rand(5,5)
# Another way to invoke the function:
from numpy.random import rand
# Then you can call the function directly:
rand(5,5)

randn(): Return a sample (or samples) from the "standard normal" distribution, unlike rand, which is uniform.
np.random.randn(2)

randint(): Return random integers from low (inclusive) to high (exclusive).
np.random.randint(1,100)

seed(): Sets the random seed of the NumPy pseudo-random number generator. It provides an essential input that enables NumPy to generate pseudo-random numbers for random processes. See s1 and s2 for an explanation.
np.random.seed(101)
Other Array Attributes and Methods

reshape(): Returns an array containing the same data with a new shape.
arr.reshape(5,5)

max(), min(), argmax(), argmin(): Find the max or min values, or find their index locations using argmax or argmin.
arr.max()

shape: An attribute that arrays have (not a method). It is a tuple with the size of each dimension; for example, arr2d.shape[1] gives the number of columns of a 2D array.
arr_length = arr2d.shape[1]

dtype: An attribute holding the data type of the elements in the array.
arr.dtype
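A short sketch of these two attributes (the arr2d name is just illustrative):

import numpy as np

arr2d = np.zeros((2,5))   # 2 rows, 5 columns
arr2d.shape               # (2, 5): a tuple with the size of each dimension
arr2d.shape[1]            # 5: the number of columns
arr2d.dtype               # dtype('float64')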
Indexing and Selection

Creating sample arrays for the following examples:

import numpy as np

# 1D Array:
arr = np.arange(0,11)
#Show
arr
# Output: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

# 2D Array:
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
#Show
arr_2d
# Output:
array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])
Bracket Indexing and Selection (Slicing)

Note: When we create a sub-array by slicing an array (slice_of_arr = arr[0:6]), data is not copied; it's a view of the original array! This avoids memory problems. To get a copy, you need to use the copy() method. See the important note below.

#Get a value at an index
arr[8]

#Get values in a range
arr[1:5]
slice_of_arr = arr[0:6]

#2D
arr_2d[1]
arr_2d[1][0]
arr_2d[1,0] # Same as the line above

#Shape (2,2) from top right corner
arr_2d[:2,1:]
#Output:
array([[10, 15],
       [25, 30]])

#Shape bottom row
arr_2d[2,:]
Fancy Indexing: Fancy indexing allows you to select entire rows or columns out of order. Example:

# Set up matrix
arr2d = np.zeros((10,10))
# Length of array
arr_length = arr2d.shape[1]
# Set up array
for i in range(arr_length):
    arr2d[i] = i
arr2d
# Output:
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
# Fancy indexing allows the following
arr2d[[6,4,2,7]]
# Output:
array([[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])
Broadcasting

NumPy arrays differ from normal Python lists because of their ability to broadcast.

Setting a value with an index range:
arr[0:5]=100
#Show
arr
# Output: array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])

#Setting all the values of an Array:
arr[:]=99
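Broadcasting also works between arrays of different shapes; a minimal sketch (the grid and row names are illustrative):

import numpy as np

grid = np.zeros((3,4))   # shape (3, 4)
row = np.arange(4)       # shape (4,)
grid + row               # the 1D row is broadcast across all 3 rows of grid
# Output:
# array([[0., 1., 2., 3.],
#        [0., 1., 2., 3.],
#        [0., 1., 2., 3.]])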
Get a copy of an Array

copy(): Slicing returns a view of the original array, not a copy (see the notes above and below). To get an independent copy of the data, use the copy() method.
arr_copy = arr.copy()
Important notes on Slices

arr = np.arange(0,11)  # reset the array
slice_of_arr = arr[0:6]
#Show slice
slice_of_arr
# Output: array([0, 1, 2, 3, 4, 5])

#Making changes in slice_of_arr:
slice_of_arr[:]=99
#Show slice
slice_of_arr
# Output: array([99, 99, 99, 99, 99, 99])

#Now note the changes also occur in our original array!
#Show
arr
# Output: array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])

#When we create a sub-array by slicing an array (slice_of_arr = arr[0:6]), data is not copied; it's a view of the original array! This avoids memory problems!
#To get a copy, we need to use the copy() method.
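For contrast, the same experiment with copy(): changes to the copy leave the original untouched.

arr = np.arange(0,11)
arr_copy = arr.copy()
arr_copy[:] = 99
arr_copy   # array([99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99])
arr        # array([ 0,  1,  2, ..., 10]): unchanged, because copy() owns its own data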
Using brackets for selection based on comparison operators and booleans
arr = np.arange(1,11)
arr > 4
# Output:
array([False, False, False, False, True, True, True, True, True,
True])
bool_arr = arr>4
bool_arr
# Output:
array([False, False, False, False, True, True, True, True, True,
True])
arr[bool_arr]
# Output:
array([ 5, 6, 7, 8, 9, 10])
arr[arr>2]
# Output:
array([ 3, 4, 5, 6, 7, 8, 9, 10])
x = 2
arr[arr>x]
# Output:
array([ 3, 4, 5, 6, 7, 8, 9, 10])
Arithmetic operations

You can easily perform array-with-array arithmetic (arr + arr, arr * arr, arr**3, ...). Division by zero produces a warning, but not an error: 0/0 gives nan and 1/0 gives inf.
import numpy as np
arr = np.arange(0,10)
arr + arr
# Output:
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
arr**3
# Output:
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
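A quick sketch of the division-by-zero behaviour mentioned above (continuing with arr = np.arange(0,10)):

arr / arr   # 0/0 gives nan, with a RuntimeWarning rather than an error
# array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

1 / arr     # 1/0 gives inf, again only a warning
# array([       inf, 1.        , 0.5       , 0.33333333, 0.25      ,
#        0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111])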
Universal Array Functions

NumPy comes with many universal array functions, which are essentially just mathematical operations applied element-wise across the array:

np.sqrt(arr): Taking square roots.

np.exp(arr): Calculating the exponential (e^x).

np.max(arr): Max; the same as arr.max().

np.sin(arr): Sine.
np.sin(arr)
# Output:
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])

np.log(arr): Natural logarithm.
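A few of the other ufuncs in action (outputs abbreviated; still assuming arr = np.arange(0,10)):

np.sqrt(arr)   # array([0.        , 1.        , 1.41421356, ...])
np.exp(arr)    # array([1.        , 2.71828183, 7.3890561 , ...])
np.log(arr)    # array([      -inf, 0.        , 0.69314718, ...]); log(0) warns and gives -inf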
Pandas
You can think of pandas as an extremely powerful version of Excel, with a lot more features. In this section of the course, you should go through the notebooks in this order:
Series
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
The examples in this section assume:

import numpy as np
import pandas as pd
Creating Pandas Series

You can convert a list, NumPy array, or dictionary into a Series.

From a List: pd.Series(my_list)
# Creating some test data:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(data=my_list)
pd.Series(my_list)
pd.Series(arr)
# Output:
0 10
1 20
2 30
dtype: int64
pd.Series(data=my_list,index=labels)
pd.Series(my_list,labels)
pd.Series(arr,labels)
pd.Series(d)
# Output:
a 10
b 20
c 30
dtype: int64
From a NumPy Array: pd.Series(arr)

From a Dictionary: pd.Series(d)
Data in a Series

A pandas Series can hold a variety of object types, even functions (although it is unlikely that you will use this):

pd.Series(data=labels)
# Output:
0    a
1    b
2    c
dtype: object

# Holding functions in a Series:
pd.Series([sum,print,len])
# Output:
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
Index in Series

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information (it works like a hash table or dictionary).

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
# Output:
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser1['USA']
# Output:
1
# Operations are then also done based off of index:
ser1 + ser2
# Output:
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
DataFrame Columns are just Series:
type(df['W'])
# Output:
pandas.core.series.Series
Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

Standard syntax: df['']

# Pass a list of column names:
df[['W','Z']]
# Output:
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
SQL syntax (NOT RECOMMENDED!): df.W
Selecting Rows: df.loc['']

df.loc['A']
# Output:
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

# Or select based off of position instead of label:
df.iloc[2]
Selecting a subset of rows and columns: df.loc['','']
df.loc['B','Y']
# Output:
-0.84807698340363147
df.loc[['A','B'],['W','Y']]
# Output:
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df>0
# Output:
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
df[df>0]
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]['Y']
# Output:
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
# Output:
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
# For two conditions you can use | and & with parenthesis:
df[(df['W']>0) & (df['Y'] > 1)]
# Output:
W X Y Z
E 0.190794 1.978757 2.605967 0.683509
Creating a new column

df['new'] = df['W'] + df['Y']
Removing Columns: df.drop()
df.drop('new',axis=1)
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Not inplace unless specified!
df
# Output:
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
df.drop('new',axis=1,inplace=True)
df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Can also drop rows this way:
df.drop('E',axis=0,inplace=True)
df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
Resetting the index

Reset to the default index (0, 1, ..., n): df.reset_index()
df
# Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df.reset_index()
# Output:
index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
Setting the index to something else: df.set_index('')
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
# Output:
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States')
# Output:
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
df
# Output:
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
# Note: set_index() is not in place by default; we need to add inplace=True:
df.set_index('States',inplace=True)
df
# Output:
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
Multi-Indexed DataFrame

Creating a Multi-Indexed DataFrame:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
# Output:
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
# Output:
A B
G1 1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
G2 1 -1.549671 0.435253
2 1.259904 -0.447898
3 0.266207 0.412580
Multi-Index and Index Hierarchy
df.loc['G1']
# Output:
A B
1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
df.loc['G1'].loc[1]
# Output:
A 0.153661
B 0.167638
Name: 1, dtype: float64
df.index.names
# Output:
FrozenList([None, None])
df.index.names = ['Group','Num']
df
# Output:
A B
Group Num
G1 1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
G2 1 -1.549671 0.435253
2 1.259904 -0.447898
3 0.266207 0.412580
df.xs('G1')
# Output:
A B
Num
1 0.153661 0.167638
2 -0.765930 0.962299
3 0.902826 -0.537909
df.xs(['G1',1])
# Output:
A 0.153661
B 0.167638
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
# Output:
A B
Group
G1 0.153661 0.167638
G2 -1.549671 0.435253
Missing Data
https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/
Pandas will recognise a value as null if it is a np.nan object, which will print as NaN in the DataFrame.
Let's show a few convenient methods to deal with Missing Data in pandas.
dropna(): allows the user to analyze and drop rows/columns with null values in different ways:
DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

fillna(): allows you to fill null fields with a given value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
# Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
'''By default, dropna() drops any row containing a null value, keeping only the rows without nulls:'''
df.dropna()
df.dropna(axis=0) # Same as default
# Output:
A B C
0 1.0 5.0 1
'''To drop any column containing a null value, keeping only the columns without nulls:'''
df.dropna(axis=1)
'''To keep only the rows that have at least 2 non-null values:'''
df.dropna(thresh=2)
# Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
'''Rows with at least 3 non-null values:'''
df.dropna(thresh=3)
# Output:
A B C
0 1.0 5.0 1
'''You can also use df.isnull() to check for null values:'''
df.isnull()
# Output:
A B C
0 False False False
1 False True False
2 True True False
'''To fill null fields with a given value:'''
df.fillna(value='FILL VALUE')
# Output:
A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
'''But many times what we want is to replace these null fields with, for example, the mean of the column. We can do it this way:'''
df['A'].fillna(value=df['A'].mean())
# Output:
0 1.0
1 2.0
2 1.5 # *
Name: A, dtype: float64
'''* The Null field has been filled with the mean of the column'''
GroupBy

The groupby method allows you to group rows of data together and call aggregate functions on them.

You can use the .groupby() method to group rows together based on a column name; for instance, let's group based off of Company. This will create a DataFrameGroupBy object:
Shared setup for the examples below:

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
# Output:
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350
GroupBy: df.groupby('')

For instance, let's group based off of Company:

df.groupby('Company')
# Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2027fdd470>
'''You can save this object as a new variable:'''
by_comp = df.groupby("Company")
'''And then call aggregate methods off the object:'''
mean():
df.groupby('Company').mean()
by_comp.mean()
# Output:
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
std():
by_comp.std()
# Output:
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
min(), max():
by_comp.min()
# Output:
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
by_comp.max()
count():
by_comp.count()
# Output:
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
describe():
by_comp.describe()
# Output:
Sales
Company
FB count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
GOOG count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
MSFT count 2.000000
mean 232.000000
std 152.735065
min 124.000000
25% 178.000000
50% 232.000000
75% 286.000000
max 340.000000
by_comp.describe().transpose()
by_comp.describe().transpose()['GOOG']
# Output:
count mean std min 25% 50% 75% max
Sales 2.0 160.0 56.568542 120.0 140.0 160.0 180.0 200.0
Concatenation - Merging - Joining
Concatenation: pd.concat()

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together.

Example:

import pandas as pd
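# df1, df2, df3 are not defined in the original notes; definitions
# consistent with the outputs shown below would be (assumption):
df1 = pd.DataFrame({'A': ['A0','A1','A2','A3'],
                    'B': ['B0','B1','B2','B3'],
                    'C': ['C0','C1','C2','C3'],
                    'D': ['D0','D1','D2','D3']},
                   index=[0,1,2,3])
df2 = pd.DataFrame({'A': ['A4','A5','A6','A7'],
                    'B': ['B4','B5','B6','B7'],
                    'C': ['C4','C5','C6','C7'],
                    'D': ['D4','D5','D6','D7']},
                   index=[4,5,6,7])
df3 = pd.DataFrame({'A': ['A8','A9','A10','A11'],
                    'B': ['B8','B9','B10','B11'],
                    'C': ['C8','C9','C10','C11'],
                    'D': ['D8','D9','D10','D11']},
                   index=[8,9,10,11])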
pd.concat([df1,df2,df3])
pd.concat([df1,df2,df3],ignore_index=True)
# Output:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
pd.concat([df1,df2,df3],axis=1)
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
Merging: pd.merge()

The merge function allows you to merge DataFrames together using a similar logic to merging SQL tables.

Example 1:

import pandas as pd
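# left and right are not defined in the original notes; definitions
# consistent with the output below would be (assumption):
left = pd.DataFrame({'key': ['K0','K1','K2','K3'],
                     'A': ['A0','A1','A2','A3'],
                     'B': ['B0','B1','B2','B3']})
right = pd.DataFrame({'key': ['K0','K1','K2','K3'],
                      'C': ['C0','C1','C2','C3'],
                      'D': ['D0','D1','D2','D3']})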
pd.merge(left,right,how='inner',on='key')
# Output:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
Example 2:
import pandas as pd
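# left and right are not defined in the notes for this example either;
# definitions consistent with the outputs below would be (assumption):
left = pd.DataFrame({'key1': ['K0','K0','K1','K2'],
                     'key2': ['K0','K1','K0','K1'],
                     'A': ['A0','A1','A2','A3'],
                     'B': ['B0','B1','B2','B3']})
right = pd.DataFrame({'key1': ['K0','K1','K1','K2'],
                      'key2': ['K0','K0','K0','K0'],
                      'C': ['C0','C1','C2','C3'],
                      'D': ['D0','D1','D2','D3']})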
pd.merge(left, right, on=['key1', 'key2'])
# Output:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
pd.merge(left, right, how='outer', on=['key1', 'key2'])
# Output:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='right', on=['key1', 'key2'])
# Output:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
# Output:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
Joining: df.join()

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

Example:
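# left and right are not defined in the notes; index-based definitions
# consistent with the outputs below would be (assumption):
import pandas as pd
left = pd.DataFrame({'A': ['A0','A1','A2'],
                     'B': ['B0','B1','B2']},
                    index=['K0','K1','K2'])
right = pd.DataFrame({'C': ['C0','C2','C3'],
                      'D': ['D0','D2','D3']},
                     index=['K0','K2','K3'])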
left.join(right)
# Output:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
# Output:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
Comparison with SQL
- https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html
- https://www.edureka.co/blog/sql-joins-types
Some operations

Shared setup for the examples below:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
# Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
Selecting Data

Select from a DataFrame using criteria from multiple columns:

newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
# Output:
col1 col2 col3
3 4 444 xyz
Info on Unique Values: unique(), nunique()
df['col2'].unique()
# Output:
array([444, 555, 666])
df['col2'].nunique()
# Output:
3
Count values: value_counts()
df['col2'].value_counts()
# Output:
444 2
555 1
666 1
Name: col2, dtype: int64
Removing a Column: del df['col'] (this works in place, unlike drop())
del df['col1']
df
# Output:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
Get column and index names: df.columns, df.index
df.columns
# Output:
Index(['col2', 'col3'], dtype='object')
df.index
# Output:
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame: df.sort_values()
df.sort_values(by='col2') #inplace=False by default
# Output:
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi
Applying Functions: df[''].apply(some_function)

We can define our own function and apply it to a column:

def times2(x):
    return x*2
df['col1'].apply(times2)
# Output
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
Built-in functions like len also work:
df['col3'].apply(len)
# Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
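You can also pass a lambda instead of defining a named function; a small sketch using col2 (so it still works after col1 was deleted above):

df['col2'].apply(lambda x: x*2)
# Output:
# 0     888
# 1    1110
# 2    1332
# 3     888
# Name: col2, dtype: int64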
Sum values in a column: df[''].sum()

df['col1'].sum()
# Output:
10
Creating a pivot table: df.pivot_table()
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
# Output:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
# Output:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
Data Input and Output
CSV
CSV Input
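This subsection was left empty in the notes; a minimal sketch, assuming a file named example.csv exists in the working directory:

import pandas as pd
df = pd.read_csv('example.csv')   # reads the CSV file into a DataFrame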
CSV Output
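Likewise empty; a minimal sketch writing the frame back out (the output file name is our choice):

df.to_csv('example_out.csv', index=False)   # index=False avoids saving the index as an extra column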
Excel
Excel Input
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
dart = pd.read_excel('dart.xls', index_col=0)
Excel Output
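Also left empty; a minimal sketch, reusing the dart DataFrame from above with an output file name of our choosing:

dart.to_excel('dart_out.xlsx', sheet_name='Sheet1')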
HTML
HTML Input
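Left empty in the notes as well. pd.read_html parses every <table> element on a page into a list of DataFrames (it needs an HTML parser such as lxml installed); the URL below is just an illustrative placeholder:

tables = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')  # hypothetical example page
df_html = tables[0]   # first table found on the page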
SQL
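The SQL subsection is empty too; a minimal sketch using SQLAlchemy with a throwaway in-memory SQLite database (all names here are illustrative):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///:memory:')    # temporary in-memory database
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # hypothetical data
df.to_sql('data', engine, index=False)          # write the DataFrame to a SQL table
sql_df = pd.read_sql('data', con=engine)        # read the table back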