Dingrui’s Blog

Data Science · Accounting & Finance · Random Thoughts

Category: R

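Excel Array Formulas, Alteryx, VBA and R

This week, after starting work at PwC, I went back to my old trade for the first time: writing VBA and building (Excel-based) templates.

When building a template, nothing is more painful than designing it. On the one hand, you need to give users as much useful information as possible; on the other, you have to think about how users will actually use the template (I learned a lot about this from my colleague Casey while working at the University of Sydney). PwC also cares a great deal about efficiency, so you have to factor in the time and effort that adding or removing features will cost later. All in all, it gave me a vivid taste of the pain of coding in a non-developer department.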

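  1. No product design manual
    Let's keep it simple and forget about a manual or documentation. It would already be good if the requirements were stated clearly, let alone written down on paper with a guarantee that nobody goes back on them later. The more I say, the more it hurts.
  2. Unreasonable expectations
    Coding is not a cure-all, especially in such a fast-paced environment, where development is often expected to be finished within 4 or 5 hours, which is not very realistic. And that is before mentioning the crappy experience of coding in VBA, which is pure agony.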

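Rant over. Now for the parts I find interesting.

Array Formulas

I had not written array formulas for a long time, and only in the last few days did it hit me that writing array formulas feels much like writing R, especially if you are used to R's vectorized style.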
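To make the comparison concrete, here is a minimal sketch that puts two common Excel array formulas next to their vectorized R equivalents. The vectors `qty` and `price` are hypothetical stand-ins for the ranges A1:A3 and B1:B3; they are not from the original post.

```r
# Hypothetical data standing in for the Excel ranges A1:A3 and B1:B3
qty   <- c(2, 5, 3)
price <- c(10, 4, 7)

# Excel array formula: {=SUM(A1:A3*B1:B3)}
# R multiplies the two vectors element by element, then sums the result
total <- sum(qty * price)            # 20 + 20 + 21 = 61

# Excel array formula: {=SUM(IF(A1:A3>2, B1:B3))}
# The logical vector qty > 2 plays the role of the IF() condition
big_qty_price <- sum(price[qty > 2]) # 4 + 7 = 11
```

In both cases the whole range is treated as a single object, which is the same mental model as entering a formula with Ctrl+Shift+Enter in Excel.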

Read More

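An Interesting Package I Found: drake – A Pipeline Toolkit for Reproducible Computation at Scale

First, here is the official manual. The authors were also thoughtful enough to write a whole book teaching you how to use it.

Since I have only been using this package for a short time, what follows covers roughly 5% of what it can do.

The three functions I use most are drake_plan, make, and vis_drake_graph.

Here is a simple example.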

library(drake)
library(data.table)

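# Write a few dummy functions
# Try to make a data frame the only parameter

# Add an ID column to any data.table
add_id <- function(dt){
  # copy() first: `:=` modifies by reference and would otherwise alter the upstream target
  return(copy(dt)[,ID:=.I])
}

# iris dataset: take the minimum of each measurement per species
get_min_measures <- function(dt){
  return(dt[,lapply(.SD,min),by=.(Species),.SDcols=c(1:4)])
}

# Build the workflow plan

my_plan <- drake_plan(raw_data = fread(file_in('iris.csv')), # read the input; file_in() tells drake this file is an input
                      indexed_dt = add_id(raw_data), # use the name of the previous target as the argument
                      min_measures_species = get_min_measures(raw_data),
                      output = fwrite(indexed_dt,file_out('iris100.csv'))) # likewise, file_out() tells drake this file is an output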
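The plan on its own does not run anything; the other two functions mentioned above, make and vis_drake_graph, do the work. The following is a trimmed, self-contained sketch of how they might be called. It assumes a reasonably recent drake version in which vis_drake_graph accepts a plan directly, and it creates iris.csv from the built-in iris data purely so that the example can run on its own; that step is an assumption, not part of the original post.

```r
library(drake)
library(data.table)

# Assumption: write the built-in iris data to disk so the input file exists
fwrite(iris, "iris.csv")

# Same helper as in the post: minimum of each measurement per species
get_min_measures <- function(dt) {
  dt[, lapply(.SD, min), by = .(Species), .SDcols = 1:4]
}

# A shortened version of the plan with two targets
my_plan <- drake_plan(
  raw_data = fread(file_in("iris.csv")),
  min_measures_species = get_min_measures(raw_data)
)

# Build the targets; results are stored in a .drake cache in the working directory
make(my_plan)

# Read a finished target back from the cache
result <- readd(min_measures_species)

# Draw the dependency graph (needs the visNetwork package)
if (requireNamespace("visNetwork", quietly = TRUE)) {
  vis_drake_graph(my_plan)
}
```

Calling make(my_plan) a second time without changing the input file or the functions should skip both targets, since drake only rebuilds what is out of date.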

Read More

Functional Programming in R (Using purrr Package)

I wrote a short article about the purrr package before.

Now I think it’s time to write a better article introducing the purrr package.

You can find the official website through this link.

Map Family

The map family applies one or more functions to each element of a list or vector.

The “primary” function is the map function.

library(purrr)
# Remember: map() always returns a list rather than a vector
test_list <- list(a=c(1,2,3),
                  b=c(2,3,4),
                  c=c(3,4,5))

map(test_list,mean)
#> $a
#> [1] 2
#> 
#> $b
#> [1] 3
#> 
#> $c
#> [1] 4
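
Because map always returns a list, purrr also provides typed variants for when a plain vector is more convenient. A minimal sketch using map_dbl on the same test_list (the variable names `means` and `ranges` are only for illustration):

```r
library(purrr)

test_list <- list(a = c(1, 2, 3),
                  b = c(2, 3, 4),
                  c = c(3, 4, 5))

# map_dbl() returns a named double vector instead of a list
means <- map_dbl(test_list, mean)
means
#> a b c 
#> 2 3 4 

# The formula shorthand ~ .x defines an anonymous function inline
ranges <- map_dbl(test_list, ~ max(.x) - min(.x))
ranges
#> a b c 
#> 2 2 2 
```

map_dbl throws an error if the function does not return a single number for every element, which makes it a useful sanity check as well.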

Read More

Datacamp Certificates

I will show some certificates from Datacamp.

Datacamp is a really good website for studying data science, whether you want to learn R or Python.

Certificates: (The last course was finished on 21 Jan 2019. 26 courses were finished in total.)

Dingrui’s Useful R scripts

This blog will be updated from time to time. Please check it regularly.

All scripts will be based on the following packages. I really appreciate the authors who develop these packages that make my life and work both interesting and easy.
  • tidyverse
  • data.table
  • readxl
  • writexl
  • lubridate
  • RMySQL
  • RSelenium (If you have trouble installing RSelenium, please go to this link for further reference.)

Read More

R Scripts for Combining Excel Files

Excel files might be the most common format you will face in a business or accounting job. Here are some tips on how to combine Excel files using R.

Preparation

We need two packages, tidyverse and readxl, both created by Hadley Wickham.

If you want to learn more about them, feel free to read their documentation: readxl and tidyverse.

Let’s Do It!
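
The full walkthrough sits behind the link below. As a rough idea of what combining files can look like, here is one common pattern, not necessarily the one used in the post: list the workbooks, read each with read_excel, and stack them with bind_rows. The demo folder, the file names jan.xlsx and feb.xlsx, and the columns id and amount are all hypothetical, and writexl is used only to create sample files so that the sketch runs on its own. purrr and dplyr are both part of the tidyverse.

```r
library(readxl)
library(writexl)  # only used here to create the demo files
library(purrr)
library(dplyr)

# Hypothetical demo data: two small workbooks in a temporary folder
demo_dir <- file.path(tempdir(), "excel_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_xlsx(data.frame(id = 1:2, amount = c(10, 20)),
           file.path(demo_dir, "jan.xlsx"))
write_xlsx(data.frame(id = 3:4, amount = c(30, 40)),
           file.path(demo_dir, "feb.xlsx"))

# Find every .xlsx file in the folder
files <- list.files(demo_dir, pattern = "\\.xlsx$", full.names = TRUE)

# Name each path by its file name so the source survives the merge
names(files) <- basename(files)

# Read each workbook (first sheet) and stack the rows;
# .id adds a column recording which file each row came from
combined <- bind_rows(map(files, read_excel), .id = "source_file")
```

This only works cleanly when every file shares the same column names and types; files with differing layouts need to be harmonized before stacking.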

Read More