9.2 The `readr` package

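Works almost the same as read.table() and read.csv(), covered in Section ??, but generally reads and writes data much faster than the base R input/output functions. More recently, the feather package (Wickham 2019a) offers data input/output that is faster still than the functions in readr
When minor problems come up while reading data, readr issues a warning and reports the affected rows and observations \(\rightarrow\) useful for debugging data
Main functions¹⁰
- read_table(), write_delim()
- read_csv(), write_csv()
See the readr vignette for more detailed examples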
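
A minimal sketch of the warning-and-report behaviour described above, using inline data instead of a file. The column names and values are made up for illustration, and passing literal data through I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data: the value "oops" cannot be parsed as an integer
csv_text <- I("id,name\n1,a\noops,b\n3,c\n")

# Force id to integer so that the bad value triggers a parsing warning
dat <- read_csv(csv_text, col_types = "ic")

# problems() returns a tibble describing each parsing failure
# (row, column, expected type, actual value)
probs <- problems(dat)
probs

# The value that could not be parsed is stored as NA
dat$id
```

Checking problems() right after reading is a cheap way to catch silent NA values before they propagate into an analysis.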

read_csv(
  file, # file name (path)
  col_names = TRUE, # whether to treat the first row as variable names
                    # same role as the header argument of read.table(), read.csv()
  col_types = NULL, # data type of each column (variable)
                    # types are guessed automatically by default, but the guess
                    # can be wrong depending on the form of the input text,
                    # so this argument occasionally has to be supplied
                    # specified with col_* functions or a compact string
                    # c=character, i=integer, n=number, d=double, 
                    # l=logical, f=factor, D=date, T=date time, t=time
                    # ?=guess, _/- skip column
  progress = show_progress() # whether to show a progress bar while reading
)
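
As a rough illustration of the col_types argument, the sketch below specifies the same column types once as a compact string and once with col_* functions. The inline data and its column names are hypothetical, and I() for literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data
csv_text <- I("id,group,score\n1,a,3.5\n2,b,4.25\n")

# Compact string: one character per column (i = integer, f = factor, d = double)
compact <- read_csv(csv_text, col_types = "ifd")

# Equivalent specification with col_* functions
verbose <- read_csv(
  csv_text,
  col_types = cols(
    id = col_integer(),
    group = col_factor(),
    score = col_double()
  )
)

# Both calls yield the same column classes
sapply(compact, class)
sapply(verbose, class)

# "_" (or "-") skips a column
read_csv(csv_text, col_types = "i_d")
```

The compact string is convenient for files with few columns; the cols() form is easier to read and maintain when there are many.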

Example

library(readr)
# Load dataset/titanic3.csv
titanic <- read_csv("dataset/titanic3.csv")

Rows: 1309 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
dbl (7): pclass, survived, age, sibsp, parch, fare, body

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

titanic

# Compare with read.csv
head(read.csv("dataset/titanic3.csv", header = TRUE), 10)

# Change the column types
titanic2 <- read_csv("dataset/titanic3.csv", 
                     col_types = "iicfdiicdcfcic")
titanic2

# Read only selected variables
titanic3 <- read_csv("dataset/titanic3.csv", 
                     col_types = cols_only(
                       pclass = col_integer(), 
                       survived = col_integer(), 
                       sex = col_factor(), 
                       age = col_double()
                     ))
titanic3

# Compare read times on a large dataset
# install.packages("feather") # feather package
require(feather)

Loading required package: feather

system.time(pulse <- read.csv("dataset/pulse.csv", header = TRUE))

   user  system elapsed 
  4.790   0.055   4.847

write_feather(pulse, "dataset/pulse.feather")
# assumes output/pulse.rds was saved beforehand, e.g. with saveRDS(pulse, "output/pulse.rds")
system.time(pulse <- readRDS("output/pulse.rds"))

   user  system elapsed 
  0.088   0.000   0.088

system.time(pulse <- read_csv("dataset/pulse.csv"))

Rows: 69 Columns: 20000
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (20000): V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V1...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

   user  system elapsed 
 15.478  81.200  70.006

system.time(pulse <- read_feather("dataset/pulse.feather"))

   user  system elapsed 
  0.292   0.000   0.293
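
The timings above depend on files that are not included here. As a self-contained sketch, a similar comparison can be run on simulated data written to temporary files. The data dimensions and object names below are assumptions for illustration, and show_col_types assumes readr 2.0 or later. Actual timings vary by machine and by the shape of the data: in the pulse.csv output above, with 20000 columns, read_csv() was slower than read.csv().

```r
library(readr)

set.seed(1)
# Simulated data: 50,000 rows x 10 numeric columns (sizes chosen arbitrarily)
sim <- as.data.frame(matrix(rnorm(5e5), ncol = 10))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write_csv(sim, csv_path)
saveRDS(sim, rds_path)

# Time each reader on the same data; results differ from machine to machine
system.time(d_base <- read.csv(csv_path, header = TRUE))
system.time(d_readr <- read_csv(csv_path, show_col_types = FALSE, progress = FALSE))
system.time(d_rds <- readRDS(rds_path))

# All three objects have the same dimensions
dim(d_base)
dim(d_readr)
dim(d_rds)

# Clean up the temporary files
unlink(c(csv_path, rds_path))
```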

9.2.1 The tibble package

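Data read from external files with functions in readr or readxl show subtle differences from an ordinary data frame when inspected
The top of the printed output shows a line of the form A tibble: followed by the dimensions of the data
A tibble is the data frame used in the tidyverse ecosystem \(\rightarrow\) a version of the data frame modified to be somewhat faster and easier to use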

Creating a tibble

Like the as.* family of functions in base R, as_tibble() converts an existing ordinary data frame into a tibble

head(iris)

as_tibble(iris)
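
A short sketch of the conversion in both directions, since some functions still expect a plain data frame. The object names iris_tbl and iris_df are chosen here for illustration.

```r
library(tibble)

# Convert a data frame to a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)   # tibble classes sit on top of "data.frame"

# Convert back to a plain data frame when a function requires one
iris_df <- as.data.frame(iris_tbl)
class(iris_df)
```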

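A tibble can be created from individual vectors
A variable created earlier in the same call can be referenced
Character input is never coerced to factor, with no extra option required; data.frame() did this by default before R 4.0.0 (stringsAsFactors = TRUE)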

# Create a tibble object from vectors
tibble(x = letters, y = rnorm(26), z = y^2)

# What happens if the same is done with a data frame?
data.frame(x = letters, y = rnorm(26), z = y^2)

Error in eval(expr, envir, enclos): object 'y' not found

# Vectors of different lengths
# Only length-1 vectors are recycled
tibble(x = 1, y = rep(0:1, each = 4), z = 2)
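
To make the recycling rule concrete, the sketch below contrasts data.frame() and tibble() on vectors of length 2 and 4. The tryCatch() wrapper is only there so the example runs to completion; the names df and res are illustrative.

```r
library(tibble)

# data.frame() recycles the shorter vector when the lengths divide evenly
df <- data.frame(x = 1:2, y = 1:4)
nrow(df)

# tibble() refuses: only length-1 vectors are recycled
res <- tryCatch(
  tibble(x = 1:2, y = 1:4),
  error = function(e) "error"
)
res
```

The stricter rule tends to surface length mismatches early instead of silently repeating values.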

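# Non-syntactic names can be used as variable names by wrapping them in backticks (``)
# tibble() keeps them as-is, whereas data.frame() rewrites them unless check.names = FALSE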
tibble(`2000` = "year", 
       `:)` = "smile", 
       `:(` = "sad")

Using tribble(): short for transposed tibble; useful for entering data by hand

tribble(
   ~x, ~y,   ~z,
  "M", 172,  69,
  "F", 156,  45, 
  "M", 165,  73, 
)

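Differences between `tibble()` and `data.frame()`

The biggest differences are data processing speed and how the data are printed
A tibble prints more compactly than a data frame while showing more information
The data types that str() would report are shown in the printout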

head(iris)

dd <- as_tibble(iris)
dd
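
As a small sketch of where the type abbreviations in a tibble printout come from, the code below extracts them per column with type_sum() from the pillar package. This assumes pillar is installed, which is normally the case because tibble depends on it.

```r
library(tibble)

dd <- as_tibble(iris)

# Abbreviated type of each column, as printed under the column names
types <- vapply(dd, pillar::type_sum, character(1))
types

# str() reports the same types in long form
str(iris)
```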

References

Wickham, Hadley. 2019a. Feather: R Bindings to the Feather ’API’. https://CRAN.R-project.org/package=feather.

¹⁰ The main functions are used in much the same way, so only read_csv() is covered here
