Week 3, Day 3 TIL Summary

Woongjin STARTERS Bootcamp, 2023. 2. 22. 17:57

On day 3 of week 3, I studied data preprocessing with dplyr and data visualization.

※ dplyr package ※

- Sample data

library(dplyr)   # provides %>%, select(), filter(), mutate(), group_by(), summarize()
id <- as.character(20211001:20211010)
math <- c(100,54,36,76,54,94,15,6,34,64)
english <- c(95,23,11,89,50,53,70,13,60,90)
science <- c(99,56,43,90,34,77,43,3,85,72)
exam <- data.frame(id,math,english,science)
print(exam)

1. Pipe operator ( %>% )

a <- c(10,22,33)
print(paste((round(sqrt(a)+0.2,1)-1)/5,'%',sep=''))

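# The code above and the code below are equivalent

a <- c(10,22,33)
div <- function(x,y){return(x/y)}   # the division and multiplication operators cannot be written directly after %>%, so division is wrapped in a helper function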
a %>% sqrt() %>% +0.2 %>% round(1) %>% -1 %>% div(5) %>% paste('%',sep='') %>% print()

- %>% (shortcut: Ctrl + Shift + M) passes the result of each step to the next, so an expression can be written as a sequence of steps.
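As a small sketch of how the pipe passes values along, assuming the dplyr package is installed: by default the left-hand side becomes the first argument of the next call, and a dot (.) places it somewhere else. The variable names nested, piped, and share are only for illustration.

```r
library(dplyr)

a <- c(10, 22, 33)

# Nested calls are read from the inside out
nested <- round(sqrt(a), 1)

# The piped form reads left to right and gives the same result
piped <- a %>% sqrt() %>% round(1)

# By default the left-hand side becomes the first argument.
# A dot places it somewhere else, here as the divisor.
share <- a %>% sum() %>% `/`(a, .)

print(identical(nested, piped))
print(share)
```

Writing the operator in backticks calls it like an ordinary function, which is another way to get division into a pipe.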

2. select function: select( )

# select(): keep only the listed columns, in the order given
select(exam,science,english)
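A minimal sketch of a few other ways to call select(), assuming dplyr is installed. The exam data frame is rebuilt here so the snippet runs on its own, and the result names (sci_eng, no_math, scores) are only for illustration.

```r
library(dplyr)

# Rebuild the sample data so this snippet is self-contained
exam <- data.frame(
  id      = as.character(20211001:20211010),
  math    = c(100, 54, 36, 76, 54, 94, 15, 6, 34, 64),
  english = c(95, 23, 11, 89, 50, 53, 70, 13, 60, 90),
  science = c(99, 56, 43, 90, 34, 77, 43, 3, 85, 72)
)

# Columns come back in the order they are listed
sci_eng <- exam %>% select(science, english)

# A minus sign drops a column and keeps the rest
no_math <- exam %>% select(-math)

# A range keeps every column between two names
scores <- exam %>% select(math:science)

print(names(sci_eng))
print(names(no_math))
print(names(scores))
```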

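3. filter function: filter( )

filter(exam,math>60 & english>60)   # rows where both conditions hold
filter(exam,math>60 | english>60)   # rows where at least one condition holds

- Using filter

# Using filter(): numeric id of the students with science above 70 and math below 50
# pull() extracts id as a vector; select(id) would return a data frame instead
exam %>% filter(science>70 & math<50) %>% pull(id) %>% as.numeric() %>% print()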
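The filtering step can be sketched a little further, assuming dplyr is installed. The data frame is rebuilt so the snippet is self-contained, and the result names are hypothetical.

```r
library(dplyr)

# Rebuild the sample data so this snippet is self-contained
exam <- data.frame(
  id      = as.character(20211001:20211010),
  math    = c(100, 54, 36, 76, 54, 94, 15, 6, 34, 64),
  english = c(95, 23, 11, 89, 50, 53, 70, 13, 60, 90),
  science = c(99, 56, 43, 90, 34, 77, 43, 3, 85, 72)
)

# Rows where science is above 70 and math is below 50
low_math_high_sci <- exam %>% filter(science > 70 & math < 50)

# pull() returns the column as a plain vector, so it converts cleanly
ids <- low_math_high_sci %>% pull(id) %>% as.numeric()

# Separating conditions with commas is the same as joining them with &
same_rows <- exam %>% filter(science > 70, math < 50)

# %in% matches against a set of values
picked <- exam %>% filter(id %in% c("20211001", "20211004"))

print(ids)
print(picked)
```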

4. mutate function: mutate( )

mutate(exam,average=round((math+english+science)/3,1))   # not assigned, so exam itself is unchanged
exam <- mutate(exam,pnp=ifelse((math+english+science)/3>=70,'pass','fail'))

exam
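As a sketch of adding several columns in one mutate() call, assuming dplyr is installed: a column created earlier in the call can be used by a later one. The names exam_graded, average, and result are only for illustration, and here the pass/fail check reuses the rounded average rather than the raw mean.

```r
library(dplyr)

# Rebuild the sample data so this snippet is self-contained
exam <- data.frame(
  id      = as.character(20211001:20211010),
  math    = c(100, 54, 36, 76, 54, 94, 15, 6, 34, 64),
  english = c(95, 23, 11, 89, 50, 53, 70, 13, 60, 90),
  science = c(99, 56, 43, 90, 34, 77, 43, 3, 85, 72)
)

exam_graded <- exam %>%
  mutate(
    average = round((math + english + science) / 3, 1),
    # average was created on the line above and is already usable here
    result  = ifelse(average >= 70, "pass", "fail")
  )

print(exam_graded)
```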

- Extended sample data

# Extended sample data
id <- as.character(rep(c(20211001:20211010),times=2))
mid_math <- c(100,54,36,76,54,94,15,6,34,64)
final_math <- c(90,80,23,67,44,72,10,45,87,55)
math <- c(mid_math,final_math)
mid_english <- c(95,23,11,89,50,53,70,13,60,90)
final_english <- c(90,32,4,74,90,23,83,52,43,70)
english <- c(mid_english,final_english)
mid_science <- c(99,56,43,90,34,77,43,3,85,72)
final_science <- c(100,79,25,65,63,75,73,66,50,83)
science <- c(mid_science,final_science)
examTerm <- rep(c('midterm','final'),times=c(10,10))
exam2 <- data.frame(id,math,english,science,examTerm)
print(exam2)

- group_by, summarize: group_by() groups the rows, and summarize() then computes one aggregate value per group.

# Average math score for the midterm and for the final
group_exam <- group_by(exam2,examTerm)
summarize(group_exam,mathAvg=mean(math))
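The same grouping can be sketched as a single pipe that returns several summaries at once, assuming dplyr is installed. Only the math scores are rebuilt here to keep the snippet short, and the names term_summary, n, mathAvg, and mathMax are only for illustration.

```r
library(dplyr)

# Rebuild just the columns needed so this snippet is self-contained
exam2 <- data.frame(
  id       = rep(as.character(20211001:20211010), times = 2),
  math     = c(100, 54, 36, 76, 54, 94, 15, 6, 34, 64,   # midterm
               90, 80, 23, 67, 44, 72, 10, 45, 87, 55),  # final
  examTerm = rep(c("midterm", "final"), each = 10)
)

term_summary <- exam2 %>%
  group_by(examTerm) %>%
  summarize(
    n       = n(),         # number of rows in each group
    mathAvg = mean(math),  # average math score per term
    mathMax = max(math)    # highest math score per term
  )

print(term_summary)
```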

※ Plotting without ggplot ※

# Plot, by year, the three countries in the Americas whose population was at least 30 million in all 12 survey years

## Assign the rows for the Americas to the America variable
library(gapminder)   # provides the gapminder data frame
America <- gapminder %>% filter(continent=='Americas')

## For each country in the Americas, count the years with a population of at least 30 million
America %>% filter(pop>=30000000) %>% count(country,sort=T)

## → Plot by year
## (named pop_min / pop_max so that base min() and max() are not shadowed)
top3 <- gapminder %>% filter(country %in% c('Brazil','Mexico','United States'))
pop_min <- min(top3$pop)
pop_max <- max(top3$pop)

top3 %>% filter(country=='Brazil') %>% select(year,pop) %>% plot(type='o',col='red',ylim=c(pop_min,pop_max))
top3 %>% filter(country=='Mexico') %>% select(year,pop) %>% lines(type='o',col='blue')
top3 %>% filter(country=='United States') %>% select(year,pop) %>% lines(type='o',col='green')

legend('topleft',legend=c('Brazil','Mexico','US'),fill=c('red','blue','green'))

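※ Visualization with ggplot2 ※

1. The layer structure of ggplot2
- Layer 7. Theme: decoration unrelated to the data, such as the plot title and axis labels
- Layer 6. Coordinate: manipulates the coordinate system
- Layer 5. Statistics: visualizes statistical summaries
- Layer 4. Facets: splits the plot into panels, one per category value of a chosen column
- Layer 3. Geometries: defines the visual elements (points/lines, histograms, etc.)
- Layer 2. Aesthetics: how the data is mapped (which column goes on the x-axis, the line style, etc.)
- Layer 1. Data: the data layer (must be a data frame)
(Layers 1, 2, and 3 are required!)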

Layer	Description	Main functions
Theme	Plot decoration	ggtitle(), theme(..), theme_gray(), theme_bw()
Coordinate	Coordinate system transformation	coord_cartesian(xlim,ylim), coord_flip(), coord_polar()
Statistics	Visualizing statistics on the plot	stat_smooth(), stat_summary()
Facets	Subplots for categorical data	facet_wrap(), facet_grid()
Geometries	Definition of the visual shapes	geom_point(), geom_line(), geom_bar(), geom_boxplot(), geom_histogram()
Aesthetics	Mapping data to the x- and y-axes, line style, etc.	aes(x=, y=)
Data	Data frame
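To make the layer list concrete, here is a rough sketch that stacks all seven layers in one plot, assuming ggplot2 is installed. The choice of columns, the y-axis limit, and the title are arbitrary and only for illustration.

```r
library(ggplot2)

p <- ggplot(diamonds[1:1000, ], aes(x = carat, y = price)) +  # layers 1-2: data + aesthetics
  geom_point(alpha = 0.4) +                                   # layer 3: geometry
  facet_wrap(~cut) +                                          # layer 4: facets
  stat_smooth(method = "lm", formula = y ~ x) +               # layer 5: statistics
  coord_cartesian(ylim = c(0, 5000)) +                        # layer 6: coordinates
  theme_bw() +                                                # layer 7: theme
  ggtitle("Price by carat, split by cut")

print(p)
```

Only the first three lines are needed to get a plot; each later line can be removed without breaking the rest.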

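2. Plotting quickly without understanding layers: qplot()

# Sample data: diamonds ships with ggplot2
library(ggplot2)
diamond=diamonds

# Note: qplot() is deprecated as of ggplot2 3.4.0
qplot(data=diamonds,              # data to use: diamonds
      x=cut,                      # x-axis data: the cut column
      y=price,                    # y-axis data: the price column
      geom='boxplot',             # type of plot: boxplot
      colour=clarity)             # colour is split by the values of the clarity column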

3. Plotting with ggplot()

# Layer 1: add the data → nothing is drawn
ggplot(diamonds)

# Layer 2: add the df and the x/y mapping → only the axes appear
ggplot(diamonds,aes(x=carat,y=price))

# Layer 3: add the geometry to draw with (point: scatter plot)
ggplot(diamonds,aes(x=carat,y=price))+ 
  geom_point()

# Layer 3, extended: styling the geom
## Set the colour, shape, and fill colour
ggplot(diamonds,aes(x=carat,y=price))+ 
  geom_point(color='pink',shape=21,fill='red')

## Mapping it in ggplot() declares it globally (every layer inherits it).
ggplot(diamonds,aes(x=carat,y=price,col=cut))+
  geom_point()

## Mapping it in a geom declares it locally (only that layer uses it).
ggplot(diamonds,aes(x=carat,y=price))+
  geom_point(shape=11,mapping = aes(col=cut))
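The difference between a global and a local mapping only becomes visible once a second layer is added. A minimal sketch, assuming ggplot2 is installed (the object names small, p_global, and p_local are only for illustration):

```r
library(ggplot2)

small <- diamonds[1:500, ]

# Colour mapped in ggplot(): the points and the fitted lines are both split by cut
p_global <- ggplot(small, aes(x = carat, y = price, colour = cut)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped inside geom_point(): only the points are coloured,
# and geom_smooth() fits a single line through all of the data
p_local <- ggplot(small, aes(x = carat, y = price)) +
  geom_point(aes(colour = cut)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_global)
print(p_local)
```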
  
# Other geoms such as line, bar, histogram, and boxplot can be used as well.

# geom_line
ggplot(diamonds[1:100,],aes(x=carat,y=price))+ 
  geom_line(color='blue',linewidth=1.3)   # linewidth replaces size for lines as of ggplot2 3.4.0

# geom_bar
ggplot(diamonds,aes(x=cut,fill=clarity))+ 
  geom_bar()

  # Note that only x is mapped; geom_bar() counts the rows for the bar height

# geom_histogram
ggplot(diamonds,aes(x=carat,fill=cut))+ 
  geom_histogram()

# geom_histogram with a bin width
ggplot(diamonds,aes(x=carat,fill=cut))+ 
  geom_histogram(binwidth = 0.5)

# geom_boxplot
ggplot(diamonds,aes(x=cut,y=carat))+ 
  geom_boxplot()

# geom_boxplot with a fill mapping
ggplot(diamonds,aes(x=cut,y=carat,fill=clarity))+ 
  geom_boxplot()

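# Layer 4: facet settings

## facet_wrap: split the plot into panels
ggplot(diamonds[1:1000,],mapping=aes(x=carat, y=price)) + 
  geom_point() + 
  facet_wrap(~cut,labeller=label_both,nrow=1,ncol=5)

# Layer 4: facet settings

## facet_grid: split by two categorical columns
## columns of df: 최고기온 (daily high temperature), 최고체감온도 (highest feels-like temperature), 자외선지수 (UV index), 폭염영향예보 (heat wave impact forecast)
ggplot(df,mapping=aes(x=최고기온,y=최고체감온도,col=자외선지수))+
  geom_point()+
  facet_grid(폭염영향예보~자외선지수,labeller=label_both)

(Shown with a different dataset because of an error)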
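For reference, a facet_grid() call can also be sketched on the diamonds data used in the rest of these notes, assuming ggplot2 is installed. The choice of cut for the rows and color for the columns is arbitrary.

```r
library(ggplot2)

# Rows of the grid come from cut, columns from color (rows ~ columns)
p <- ggplot(diamonds[1:1000, ], aes(x = carat, y = price, colour = color)) +
  geom_point() +
  facet_grid(cut ~ color, labeller = label_both)

print(p)
```

Unlike facet_wrap(), which wraps the panels of one variable into rows, facet_grid() lays out every combination of the two variables, so some panels may be empty.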

# Layer 5:
## stat_summary: draws a simple summary statistic of y for each x value

ggplot(diamonds[1:100,],aes(x=carat,y=price)) + 
  geom_line(color='blue',linewidth=1.3) +
  stat_summary(fun = mean,color='red',size=2,geom='point')   # fun replaces the deprecated fun.y

# Layer 5:
## stat_smooth: draws a smoothed regression line through the data. The level argument adjusts the confidence interval.

ggplot(diamonds[1:100,],aes(x=carat,y=price)) + 
  geom_line(color='blue',linewidth=1.3) +
  stat_smooth(level=0.95)
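By default stat_smooth() picks a smoothing method on its own. A minimal sketch of asking for a straight-line fit instead, assuming ggplot2 is installed (the 99% level and the object names small, p, and fit are only for illustration):

```r
library(ggplot2)

small <- diamonds[1:100, ]

# method = "lm" draws a linear regression line; level widens the band to 99%
p <- ggplot(small, aes(x = carat, y = price)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, level = 0.99)

print(p)

# The drawn line corresponds to an ordinary lm() fit on the same data
fit <- lm(price ~ carat, data = small)
print(coef(fit))
```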

# Layer 6: coordinate layer functions
# coord_polar
ggplot(diamonds[1:1000,],mapping=aes(x='',fill=cut))+
  geom_bar(width=1)   # counts the rows per cut and stacks them into one bar
#  coord_polar(theta = 'y')
  # The plot before polar coordinates are applied

ggplot(diamonds[1:1000,],mapping=aes(x='',fill=cut))+
  geom_bar(width=1)+
  coord_polar(theta = 'y')
  # The y-axis is bent around a single point, which turns the bar into a pie

# Layer 7: theme layer functions
ggplot(diamonds[1:100,],aes(x=carat,y=price)) + 
  geom_line(color='blue',linewidth=1.3) +
  stat_smooth(level=0.95) +
  theme_bw() +
  theme(axis.title.x = element_text(color='red'),plot.title=element_text(color='blue'))+
  ggtitle('Diamond price increase per carat')

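※ Group discussion and mentoring notes ※

1. Group discussion: We shared and discussed how to use facet_grid() and how to comment out several lines at once. The facet_grid() example in the lecture material had been written with facet_wrap() by mistake; we confirmed this but could not resolve it. We also shared the trick for commenting out multiple lines ( if(FALSE) ' ' ).

2. Mentoring: We did not get through all of the data mining material, so I only took notes on the information I needed. I worked through practice exercises on today's visualization and dplyr topics to internalize them.

3. Review: ggplot2 shares terminology with matplotlib, but a good deal is different, which made it tricky. I concluded that I need to practice on my own.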
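The multi-line comment trick from the group discussion can be sketched as follows. This is my own reconstruction of the idea rather than the exact code that was shared: R has no block-comment syntax, so code placed behind if (FALSE) is parsed but never run.

```r
x <- 1

# Everything inside the braces is parsed but never executed
if (FALSE) {
  x <- 100
  print("this line is skipped")
}

# The string form: the text is just an unused string literal,
# so it does not have to be valid R (it must not contain a single quote)
if (FALSE) '
x <- 200
any text at all can go here
'

print(x)  # x is still 1
```

In RStudio, selecting the lines and pressing Ctrl + Shift + C toggles ordinary # comments, which is usually the simpler option.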

ABOUT ME

WoodenStella WoodenStella
