Dingrui’s Blog

Data Science · Accounting & Finance · Random Thoughts

Tag: data

I found a very interesting package: drake, "A Pipeline Toolkit for Reproducible Computation at Scale"

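To start, here is the official manual. The authors have also thoughtfully written a whole book that teaches you how to use the package.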

I have only been using this package for a short time, so what follows covers roughly 5% of what it can do.

The three functions I use most are drake_plan, make, and vis_drake_graph.

Here is a simple example.

library(drake)
library(data.table)

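# Write a few dummy functions.
# Try to make the data frame the only parameter of each one.

# Add an ID column to any data.table.
# copy() keeps := from modifying the upstream target by reference.
add_id <- function(dt){
  return(copy(dt)[,ID:=.I][])
}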

# For the iris dataset, take the minimum of each measurement per species
get_min_measures <- function(dt){
  return(dt[,lapply(.SD,min),by=.(Species),.SDcols=c(1:4)])
}

# Build the workflow plan

my_plan <- drake_plan(raw_data = fread(file_in('iris.csv')), # read the input; file_in() tells drake this file is an input
                      indexed_dt = add_id(raw_data), # use the name of the previous target as the argument
                      min_measures_species = get_min_measures(raw_data),
                      output = fwrite(indexed_dt,file_out('iris100.csv'))) # likewise, file_out() tells drake this file is an output
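The excerpt stops before the plan is run, so here is a minimal sketch of how the other two functions mentioned above, make and vis_drake_graph, might be used with it. It rests on some assumptions that are not in the original post: iris.csv is taken to be the built-in iris dataset written to disk, a recent drake release is installed (one where vis_drake_graph accepts the plan directly), and the visNetwork package is available for the graph. The helper functions are repeated so the block runs on its own.

```r
library(drake)
library(data.table)

# Assumption: iris.csv is simply the built-in iris dataset saved as CSV.
fwrite(iris, "iris.csv")

# Helpers, repeated here so the sketch is self-contained.
add_id <- function(dt) {
  # copy() so the upstream target is not modified by reference
  copy(dt)[, ID := .I][]
}

get_min_measures <- function(dt) {
  dt[, lapply(.SD, min), by = .(Species), .SDcols = 1:4]
}

my_plan <- drake_plan(
  raw_data = fread(file_in("iris.csv")),
  indexed_dt = add_id(raw_data),
  min_measures_species = get_min_measures(raw_data),
  output = fwrite(indexed_dt, file_out("iris100.csv"))
)

# Build every target in dependency order and cache the results.
make(my_plan)

# A second call should find all targets up to date and skip the work.
make(my_plan)

# Read a finished target back from the cache.
min_measures <- readd(min_measures_species)
print(min_measures)

# Build the interactive dependency graph (needs the visNetwork package).
# Print `graph` in an interactive session to view it.
if (requireNamespace("visNetwork", quietly = TRUE)) {
  graph <- vis_drake_graph(my_plan)
}
```

If one of the helper functions or iris.csv is edited afterwards, another make(my_plan) should rebuild only the targets that depend on the change, which is the main reason to describe the workflow as a plan instead of a plain script.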


The Key Value of Data Analysis

A funny joke

Some people ask, ‘What kind of data analysis is best? Big data? Small data? Or structured data?’

Honestly, I still can’t answer this question.

But I will ask another question: ‘Which is better, the black cat or the white cat?’
