[Pandas] 판다스 요약정리

Updated: April 26, 2021

판다스(Pandas)

판다스는 파이썬에서 데이터 처리를 위해 존재하는 가장 인기 있는 라이브러리이다. 행과 열로 이루어진 2차원 데이터를 효율적으로 가공/처리할 수 있는 다양하고 훌륭한 기능을 제공한다.

판다스의 주요 구성 요소

DataFrame: Column X Rows 2차원 데이터셋
- 각 row를 구별할 수 있는 Key값 객체가 있음.
Series: 1개의 Column값으로만 구성된 1차원 데이터셋 (Column명이 없다.)

기본 API

read_csv(): csv 파일을 편리하게 DataFrame으로 로딩해준다.
head(): DataFrame의 맨 앞 일부 데이터만 추출한다.
pd.DataFrame(dic): 딕셔너리 자료형을 통해 DataFrame 생성 가능
컬럼영과 인덱스 반환 명령어
- df.columns
- df.index
- df.index.values
DataFrame에서 Series 추출
- series = titanic_df[‘Name’]
DataFrame에서 DataFrame 추출 (리스트로 감싸면 된다.)
- filtered_df = titanic_df[[‘Name’, ‘Age’]]
- one_col_df = titanic_df[[‘Name’]]
shape (행과 열 크기 반환)
- titanic_df.shape
info (DataFrame내의 컬럼명, 데이터 타입, Null건수, 데이터 건수 정보를 제공)
- titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

describe (평균, 표준편차, 4분위 분포도를 제공)
- titanic_df.describe()

value_counts() - 데이터값의 분포도를 확인할 때 많이 쓴다.
- value_counts()는 Series객체에서만 호출 될 수 있어서 DataFrame을 단일 컬럼으로 입력하여 Series로 변환한 뒤 호출한다.
- value_counts = titanic_df[‘Pclass’].value_counts()
sort_values() by=정렬컬럼, ascending=True 또는 False로 오름차순/내림차순으로 정렬
- titanic_df.sort_values(by=’Pclass’, ascending=False)
- titanic_df[[‘Name’, ‘Age’]].sort_values(by=’Age’)
- titanic_df[[‘Name’, ‘Age’, ‘Pclass’]].sort_values(by=[‘Pclass’, ‘Age’])

DataFrame과 리스트, 딕셔너리, 넘파이 ndarray 상호 변환

변환 형태	설명
리스트를 DataFrame으로 변환	df_list1 = pd.DataFrame(list, columns=col_name1)
ndarray를 DataFrame으로 변환	df_array2 = pd.DataFrame(array2, columns=col_name2)
딕셔너리를 DataFrame으로 변환	dict = {‘col1’:[1, 11], ‘col2’:[2, 22], ‘col3’: [3,33]}
DataFrame을 ndarray로 변환	DataFrame 객체의 values 속성을 이용하여 ndarray 변환
DataFrame을 리스트로 변환	DataFrame 객체의 values 속성을 이용하여 먼저 ndarray로 변환 후 tolist()를 이용하여 list로 변환
DataFrame을 딕셔너리로 변환	DataFrame 객체의 to_dict()를 이용하여 변환

DataFrame 데이터 삭제

drop()
- DataFrame.drop(labels=None, axis=0, index= None, columns=None, level=None, inplace=False, errors=’raise’)
- axis: 로우를 삭제할 때는 axis=0, 컬럼을 삭제할 떄는 axis=1로 설정
- inplace: 새롭게 객체 변수로 받고 싶다면 inplace=False, 드롭된 결과를 적용할 경우에는 inplace=True를 적용.
- titanic_df = titanic_df.drop(‘Age_0’, axis=1, inplace=False)

Index

판다스의 Index 객체는 RDBMS의 PK와 유사하게 DataFrame, Series의 레코드를 고유하게 식별하는 객체
Index 객체 추출하려면 DataFrame.index 또는 Series.index 속성을 통해 가능하다.
Series 객체는 Index 객체를 포함하지만 Series 객체에 연산 함수를 적용할 때 Index는 연산에서 제외된다. Index는 오직 식별용으로만 사용된다.
DataFrame 및 Series에 reset_index() 메서드를 수행하면 새롭게 인덱스를 연속 숫자 형으로 할당하며 기존 인덱스는 ‘index’라는 새로운 컬럼 명으로 추가한다.

데이터 셀렉션 및 필터링

- 컬럼 기반 틸터링 또는 불린 인덱싱 필터링 제공
ix[ ], loc[ ], iloc[ ] - 명칭/위치 기반 인덱싱을 제공
불린 인덱싱 - 조건식에 따른 필터링을 제공

ix, loc. iloc

ix[ ] - 명칭 기반과 위치 기반 인덱싱을 함께 제공
loc[ ] - 명칭 기반 인덱싱
iloc[ ] - 위치 기반 인덱싱

Aggregation 함수

sum, max, min, count 등의 함수는 DataFrame/Series에서 집합연산을 수행.
DataFrame의 경우 DataFrame에서 바로 aggregation을 호출할 경우 모든 컬럼에 해당aggregation을 적용

DataFrame Group By

DataFrame은 Group by 연산을 위해 groupby() 메소드를 제공한다.
groupby( ) 메소드는 by 인자로 group by 하려는 컬럼명을 입력 받으면 DataFrameGroupBy 객체를 반환한다.
이렇게 반환된 DataFrameGroupBy 객체에 aggregation함수를 수행한다.

# agg 함수를 함께 적용
titanic_df.groupby('Pclass')['Age'].agg([max, min])

agg_format = {'Age': 'max', 'SibSp': 'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)

결손 데이터 처리하기

isna(): DataFrame의 isna() 메소드는 주어진 컬럼값들이 NaN인지 True/False값을 반환한다.
fillna(): Missing 데이터를 인자로 주어진 값으로 대체한다.

titanic_df.isna().head(3)

# 컬럼별로 NaN 건수 구하기 
titanic_df.isna().sum()

판다스 람다식 적용하기

판다스는 apply 함수에 lambda 식을 결합해 DataFrame이나 Series의 레코드별로 데이터를 가공하는 기능을 제공한다. 판다스의 경우 컬럼에 일괄적으로 데이터 가공을 하는 것이 속도 면에서 더 빠르나 복잡한 데이터 가공이 필요할 경우 어쩔 수 없이 apply lambda를 이용한다.

titanic_df['Name_len'] =  titanic_df['Name'].apply(lambda x : len(x))

titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <= 15 else 'Adult')

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else ('Adult' if x <= 60 else 'Elderly'))

def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else: cat = 'Elderly'

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))

참고 자료

파이썬 머신러닝 완벽가이드

Share on

Twitter Facebook LinkedIn

이연걸