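II. R Language Syntax Basics

Data analysis practice of regional water environment pollution

苏命、王为东
College of Resources and Environment, University of Chinese Academy of Sciences
Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences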

2024-04-09

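Data types

Numeric

Numeric data in R can be integers or floating-point numbers. A plain literal such as 10 is stored as a double; the L suffix (10L) makes it an integer.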

(x <- 10)

[1] 10

(y <- 1.23e-2)

[1] 0.0123

(z <- pi)

[1] 3.141593
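
As a small illustration of the integer/double distinction, the sketch below inspects storage types with typeof(). The variable names a and b are made up for this example.

```r
# Numeric literals are doubles unless marked with the L suffix.
a <- 10      # stored as double
b <- 10L     # stored as integer

typeof(a)    # "double"
typeof(b)    # "integer"

# Both count as "numeric" for most purposes.
is.numeric(a) && is.numeric(b)   # TRUE

# Mixing the two promotes the result to double.
typeof(a + b)   # "double"
```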

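Strings

Strings in R are enclosed in quotes; double quotes are recommended.
Chinese text is usually encoded as GBK or UTF-8, and a mismatch between the two produces garbled characters. RStudio uses UTF-8 by default, and strings in a running R session are generally stored as UTF-8.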

(str <- "Hello, World!")

[1] "Hello, World!"

(str <- 'Hello, World!')

[1] "Hello, World!"

(str <- 'He was very angry, and shouted: "Stop!"')

[1] "He was very angry, and shouted: \"Stop!\""

Logical

c(TRUE, FALSE)

[1]  TRUE FALSE

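Special values

NA: the most common missing-value marker; its own type is logical, and it is coerced to other types as needed
NA_integer_: missing value of integer type
NA_real_: missing value of real (double) type
NA_character_: missing value of character type
NA_complex_: missing value of complex type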

pi

[1] 3.141593

NA

[1] NA

NA_character_

[1] NA

Inf

[1] Inf
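
The typed NA constants can be told apart with typeof(), and missing values propagate through most computations. A minimal sketch follows; the vector conc is a hypothetical set of measurements invented for this example.

```r
# Each NA constant has its own storage type.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Hypothetical measurements with one missing value.
conc <- c(0.12, NA, 0.30)

is.na(conc)               # FALSE  TRUE FALSE
mean(conc)                # NA: the missing value propagates
mean(conc, na.rm = TRUE)  # 0.21: NA dropped before averaging
```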

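In R, Inf denotes positive infinity and -Inf denotes negative infinity. These values arise in numerical computation, for example when a nonzero number is divided by zero or when the logarithm of zero is taken. (The logarithm of a negative number is NaN, not an infinity.)

# Positive infinity
(x <- Inf)

[1] Inf

# Negative infinity
(y <- -Inf)

[1] -Inf

# Operations that produce infinity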
(a <- 5 / 0)

[1] Inf

(b <- log(0))

[1] -Inf
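
Undefined results are reported as NaN rather than Inf. The sketch below contrasts the two and shows the predicates that distinguish them; the vector vals is made up for illustration.

```r
# Undefined operations give NaN ("not a number").
nan1 <- suppressWarnings(log(-1))  # NaN (log(-1) also emits a warning)
nan2 <- 0 / 0                      # NaN
nan3 <- Inf - Inf                  # NaN

vals <- c(1, Inf, -Inf, NaN, NA)

is.finite(vals)  #  TRUE FALSE FALSE FALSE FALSE
is.nan(vals)     # FALSE FALSE FALSE  TRUE FALSE
is.na(vals)      # FALSE FALSE FALSE  TRUE  TRUE  (NaN also counts as NA)
```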

Variable assignment

In R, a value can be assigned to a variable with either the <- or the = operator; <- is recommended.

# Using the `<-` operator
(x <- 10)

[1] 10

(y <- "hello")

[1] "hello"

# Using the `=` operator
(z = c(1, 2, 3))

[1] 1 2 3

# Assigning a vector
(vec <- c(1, 2, 3, 4, 5))

[1] 1 2 3 4 5

# Assigning a matrix
(mat <- matrix(1:9, nrow = 3))

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Assigning a data frame

(df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Married = c(TRUE, FALSE, TRUE)
))

     Name Age Married
1   Alice  25    TRUE
2     Bob  30   FALSE
3 Charlie  35    TRUE

Assigning a list

(lst <- list(
  numbers = c(1, 2, 3),
  strings = c("a", "b", "c"),
  matrix = matrix(1:9, nrow = 3)
))

$numbers
[1] 1 2 3

$strings
[1] "a" "b" "c"

$matrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

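Math functions

round(pi, digits = 3)

[1] 3.142

log(10)

[1] 2.302585

abs(x): absolute value of x
sqrt(x): square root of x
exp(x): exponential function, e raised to the power x
log(x, base): logarithm of x to the given base; the default base is e
log10(x): base-10 logarithm
log2(x): base-2 logarithm
floor(x): largest integer not greater than x
ceiling(x): smallest integer not less than x
sin(x), cos(x), tan(x): sine, cosine, and tangent of x, with x in radians
asin(x), acos(x), atan(x): inverse sine, cosine, and tangent of x; the result is in radians
sinh(x), cosh(x), tanh(x): hyperbolic sine, cosine, and tangent of x
asinh(x), acosh(x), atanh(x): inverse hyperbolic sine, cosine, and tangent of x
round(x, digits): x rounded to digits decimal places
trunc(x): x truncated toward zero, i.e. with the fractional part dropped
sign(x): sign of x (-1, 0, or 1)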
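
The rounding functions differ mainly on negative numbers and on values ending in .5. A short sketch on a made-up vector v shows the difference. Note that round() rounds exact halves to the nearest even digit, so round(2.5) is 2.

```r
v <- c(-2.7, -0.5, 0.5, 1.5, 2.5)

floor(v)    # -3 -1  0  1  2   (toward minus infinity)
ceiling(v)  # -2  0  1  2  3   (toward plus infinity)
trunc(v)    # -2  0  0  1  2   (toward zero)
round(v)    # -3  0  0  2  2   (exact halves go to the even neighbour)
sign(v)     # -1 -1  1  1  1
```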

Statistical functions

x <- c(5, 10, 15, 20, 25)
# Mean of the vector
mean(x)

[1] 15

# Median of the vector
median(x)

[1] 15

# Minimum of the vector
min(x)

[1] 5

# Maximum of the vector
max(x)

[1] 25

# Sum of the vector
sum(x)

[1] 75

# Standard deviation of the vector
sd(x)

[1] 7.905694

# Variance of the vector
var(x)

[1] 62.5

# Quantiles of the vector
quantile(x, probs = c(0.25, 0.5, 0.75))

25% 50% 75% 
 10  15  20

# Frequency table of the vector
(frequency <- table(x))

x
 5 10 15 20 25 
 1  1  1  1  1

t.test performs a one-sample or two-sample t-test; the call below is a two-sample (Welch) test.

y <- c(3, 8, 13, 18, 23)
t.test(x, y)


    Welch Two Sample t-test

data:  x and y
t = 0.4, df = 8, p-value = 0.6996
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.530021 13.530021
sample estimates:
mean of x mean of y 
       15        13
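
The printed report is a list-like object of class "htest", so individual numbers can be pulled out by name. The sketch below reuses the x and y from the slide; the object names res and one, and the value mu = 12, are chosen only for illustration.

```r
x <- c(5, 10, 15, 20, 25)
y <- c(3, 8, 13, 18, 23)

# Two-sample (Welch) test, stored instead of printed.
res <- t.test(x, y)

class(res)      # "htest"
res$statistic   # the t value
res$p.value     # the p-value
res$conf.int    # 95 percent confidence interval for the difference in means

# One-sample test of the mean of x against a hypothetical reference value.
one <- t.test(x, mu = 12)
one$parameter   # degrees of freedom: length(x) - 1
```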

Wilcoxon-Mann-Whitney test

wilcox.test(x, y)


    Wilcoxon rank sum exact test

data:  x and y
W = 15, p-value = 0.6905
alternative hypothesis: true location shift is not equal to 0

Histogram of a vector

hist(x)

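Function calls: exercise

Exercise: given a data set x containing some integers, write R code that computes and prints the following statistics:

Mean
Median
Maximum
Minimum
Sum of all elements
Standard deviation
The data set is: x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

Requirement: write an R function that takes the data set x as its argument and returns the values of the statistics above.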
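
One possible solution is sketched below, not the only acceptable one. The function name summarise_vector and the choice of a named numeric vector as the return value are assumptions made for this example.

```r
# Return the six requested statistics as a named numeric vector.
summarise_vector <- function(x) {
  c(
    mean   = mean(x),
    median = median(x),
    max    = max(x),
    min    = min(x),
    sum    = sum(x),
    sd     = sd(x)
  )
}

x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
stats <- summarise_vector(x)
print(stats)

# Individual values can be picked out by name.
stats[["mean"]]   # 55
stats[["sum"]]    # 550
```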

Control flow

if-else statement

x <- 10

if (x > 10) {
  print("x is greater than 10")
} else {
  print("x is not greater than 10")
}

[1] "x is not greater than 10"

for loop

for (i in 1:5) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
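
Two related idioms are worth knowing alongside if and for: seq_along() for safe loop indices, and ifelse() for element-wise conditions. The sketch below uses made-up data (sites, empty, x).

```r
sites <- c("A", "B", "C")   # hypothetical site names

# seq_along() gives 1, 2, ..., length(sites).
for (i in seq_along(sites)) {
  cat(i, sites[i], "\n")
}

# With an empty vector, 1:length() counts down from 1 to 0,
# so a loop over it would run twice; seq_along() is empty.
empty <- character(0)
length(1:length(empty))    # 2
length(seq_along(empty))   # 0

# if () tests a single condition; ifelse() works element by element.
x <- c(4, 10, 16)
ifelse(x > 10, "high", "low")   # "low"  "low"  "high"
```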

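Custom functions

Defining a function

A function is defined with the function keyword, and return() hands back the result. If return() is omitted, the value of the last evaluated expression is returned.

my_function <- function(x, y) {
  return(x + y)
}

Calling a function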

result <- my_function(3, 4)
print(result)

[1] 7
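
Arguments can be given default values and can be passed by name. The sketch below shows both, together with an implicit return; the function add_scaled and its parameters are invented for this example.

```r
# y and weight have defaults; the last expression is the return value.
add_scaled <- function(x, y = 1, weight = 2) {
  x + weight * y
}

add_scaled(3)                       # 3 + 2 * 1 = 5
add_scaled(3, 4)                    # 3 + 2 * 4 = 11
add_scaled(3, weight = 0.5, y = 4)  # named arguments in any order: 5
```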

Data structures

Vectors

A vector is a one-dimensional array whose elements all have the same type.

(v <- c(1, 2, 3, 4, 5))

[1] 1 2 3 4 5

Lists

A list can contain elements of different types.

(l <- list(a = 1, b = "hello", c = TRUE))

$a
[1] 1

$b
[1] "hello"

$c
[1] TRUE

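Numeric vectors

What is a numeric vector?

The vector is a basic data structure in R.
A numeric vector holds numeric elements of a single type.

Creating a numeric vector

# Create a numeric vector with c()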
(numeric_vector <- c(1, 2, 3, 4, 5))

[1] 1 2 3 4 5

Vector arithmetic

# Create two numeric vectors
(vector1 <- c(1, 2, 3))

[1] 1 2 3

(vector2 <- c(4, 5, 6))

[1] 4 5 6

# Element-wise addition
(result <- vector1 + vector2)

[1] 5 7 9

# Element-wise multiplication
(result <- vector1 * vector2)

[1]  4 10 18

Sum of a vector

# Create a numeric vector
vector <- c(1, 2, 3, 4, 5)

# Sum
(sum_result <- sum(vector))

[1] 15

Mean of a vector

# Create a numeric vector
vector <- c(1, 2, 3, 4, 5)

# Mean
(mean_result <- mean(vector))

[1] 3

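Operations: numeric (exercise)

Given two numbers a and b, compute:

The square of a.
The cube of b.
The quotient and remainder of a divided by b.

Requirement: write an R function that takes a and b as arguments and returns the results above.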
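
One possible solution is sketched below, using ^ for powers, %/% for integer division and %% for the remainder. The function name power_and_divide, the list layout, and the test inputs 17 and 5 are assumptions made for this example.

```r
power_and_divide <- function(a, b) {
  list(
    a_squared = a^2,
    b_cubed   = b^3,
    quotient  = a %/% b,  # integer quotient
    remainder = a %% b    # remainder
  )
}

res <- power_and_divide(17, 5)
res$a_squared   # 289
res$b_cubed     # 125
res$quotient    # 3
res$remainder   # 2
```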

Operations: logical

all(c(FALSE, 2, 1:3, 3) > 1)

[1] FALSE

any(c(FALSE, 2, 1:3, 3) > 1)

[1] TRUE

(flag1 <- FALSE)

[1] FALSE

(flag2 <- (3 > 2))

[1] TRUE

(flag3 <- TRUE * TRUE)

[1] 1

(flag4 <- TRUE * FALSE)

[1] 0

(flag5 <- TRUE & FALSE)

[1] FALSE

(flag6 <- TRUE | FALSE)

[1] TRUE

which

which(c(FALSE, TRUE, TRUE, FALSE, NA))

[1] 2 3

which((11:15) > 12)

[1] 3 4 5

identical

identical(c(1,2,3), c(1,2,NA))

[1] FALSE

identical(c(1L,2L,3L), c(1,2,3))

[1] FALSE
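
identical() demands an exact match of both type and value, which is often too strict for floating-point results. A common alternative is all.equal(), which compares within a small tolerance. A minimal sketch:

```r
# Floating-point arithmetic is not exact.
0.1 + 0.2 == 0.3                    # FALSE
identical(0.1 + 0.2, 0.3)           # FALSE

# all.equal() allows a small numerical tolerance; wrap it in isTRUE()
# because it returns a description string when the values differ.
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# After converting integer to double, the two vectors are identical.
identical(as.numeric(c(1L, 2L, 3L)), c(1, 2, 3))   # TRUE
```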

Operations: character

Special characters

c("abc", "", 'a cat', NA, '李明', "\n")

[1] "abc"   ""      "a cat" NA      "李明"  "\n"

paste

(users <- paste("ruser", 1:9))

[1] "ruser 1" "ruser 2" "ruser 3" "ruser 4" "ruser 5" "ruser 6" "ruser 7"
[8] "ruser 8" "ruser 9"

paste(users, collapse = ", ")

[1] "ruser 1, ruser 2, ruser 3, ruser 4, ruser 5, ruser 6, ruser 7, ruser 8, ruser 9"

Case conversion

letters[1:5]

[1] "a" "b" "c" "d" "e"

toupper(letters[6:9])

[1] "F" "G" "H" "I"

tolower(month.abb)

 [1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov" "dec"

stringr::str_to_title(c("monday", "tuesday"))

[1] "Monday"  "Tuesday"

Substring extraction

substr("Monday", 1, 3)

[1] "Mon"

stringr::str_sub("Monday", 1, 3)

[1] "Mon"

Type conversion

[1] 100

as.character(100)

[1] "100"

as.numeric(c("0100", "0101"))

[1] 100 101

sprintf('renamedfile%03d.png', c(3, 99, 100))

[1] "renamedfile003.png" "renamedfile099.png" "renamedfile100.png"

String replacement

(mystr <- "He was wrong!")

[1] "He was wrong!"

gsub("wrong", "right", mystr)

[1] "He was right!"
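
gsub() replaces every match and treats its pattern as a regular expression, while sub() replaces only the first match. The sketch below shows how that matters for a pattern containing a dot; the string s is made up for illustration.

```r
s <- "a.b.c"

# fixed = TRUE treats the pattern as a literal string.
sub(".", "-", s, fixed = TRUE)    # "a-b.c"  (first match only)
gsub(".", "-", s, fixed = TRUE)   # "a-b-c"  (all matches)

# Without fixed = TRUE, "." is a regex that matches any character.
gsub(".", "-", s)                 # "-----"

# Escaping the dot gives the literal meaning back.
gsub("\\.", "-", s)               # "a-b-c"
```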

Indexing

Vectors

# Create a vector
vector <- c("apple", "banana", "cherry", "date")
# Access the third element
vector[3]

[1] "cherry"

# Access several elements
vector[c(2, 4)]

[1] "banana" "date"

vector[c(2:4)]

[1] "banana" "cherry" "date"

# Everything except the second element
vector[-2]

[1] "apple"  "cherry" "date"

# Out-of-range index
vector[100]

[1] NA

# Assigning beyond the end pads the gap with NA
vector[7] <- "New Data"
vector

[1] "apple"    "banana"   "cherry"   "date"     NA         NA         "New Data"

(x <- 1:10)

 [1]  1  2  3  4  5  6  7  8  9 10

x[x > 6]

[1]  7  8  9 10

x[x < 3] <- 99
x

 [1] 99 99  3  4  5  6  7  8  9 10

# which
which(x > 10)

[1] 1 2

which.max(x)

[1] 1

which.min(x)

[1] 3

Lists

# Create a list
my_list <- list(fruit = c("apple", "banana", "cherry"),
                numbers = c(1, 2, 3, 4, 5))

# Access the second element of the list
my_list[[2]]

[1] 1 2 3 4 5
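
Single brackets and double brackets behave differently on lists: [ ] returns a shorter list, while [[ ]] and $ return the element itself. A minimal sketch on the same my_list:

```r
my_list <- list(fruit = c("apple", "banana", "cherry"),
                numbers = c(1, 2, 3, 4, 5))

class(my_list[2])         # "list": a one-element list
class(my_list[[2]])       # "numeric": the element itself

# Elements can also be reached by name, then indexed further.
my_list$fruit[2]          # "banana"
my_list[["numbers"]][3]   # 3
```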

Data frames

# Create a data frame
df <- data.frame(fruit = c("apple", "banana", "cherry"),
                 quantity = c(5, 7, 3))

# Access the element in the first row and first column
df[1, 1]

[1] "apple"

# Rows 2 to 3
df[2:3, ]

   fruit quantity
2 banana        7
3 cherry        3
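
Rows can also be selected with a logical condition or reordered with order(). The sketch below reuses the same df; the threshold 4 and the names big and sorted are arbitrary choices for illustration.

```r
df <- data.frame(fruit = c("apple", "banana", "cherry"),
                 quantity = c(5, 7, 3))

# Keep only the rows where the condition is TRUE.
big <- df[df$quantity > 4, ]
big$fruit                       # "apple"  "banana"

# Select one column of the filtered rows directly.
df[df$quantity > 4, "fruit"]    # "apple"  "banana"

# Sort the rows by quantity, ascending.
sorted <- df[order(df$quantity), ]
sorted$fruit                    # "cherry" "apple"  "banana"
```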

Dates and times

`base` package

as.Date("2024-01-01")

[1] "2024-01-01"

as.POSIXct(1)

[1] "1970-01-01 08:00:01 CST"

as.Date(c("12/6/2022", "1/1/2023"), format="%m/%d/%Y")

[1] "2022-12-06" "2023-01-01"
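
Once parsed, Date objects support arithmetic and formatting in base R. A minimal sketch, with dates chosen only for illustration:

```r
d <- as.Date(c("2024-01-01", "2024-03-01"))

# Subtracting dates gives a difftime; as.numeric() extracts the day count.
as.numeric(d[2] - d[1])           # 60 (2024 is a leap year)

# Adding a number shifts a date by that many days.
d[1] + 30                         # "2024-01-31"

# format() turns a Date back into a string with a chosen layout.
format(d[1], "%Y/%m/%d")          # "2024/01/01"

# seq() builds a regular sequence of dates.
months <- seq(d[1], by = "month", length.out = 3)
format(months)                    # "2024-01-01" "2024-02-01" "2024-03-01"
```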

`lubridate` package

lubridate::today()

[1] "2024-04-09"

library(lubridate)
now()

[1] "2024-04-09 08:21:45 CST"

ymd(c(20200321, 240404, "20181231"))

[1] "2020-03-21" "2024-04-04" "2018-12-31"

mdy(c("3-10-1998", "01-17-2018", "Feb 3, 2024"))

[1] "1998-03-10" "2018-01-17" "2024-02-03"

ymd_hms("1998-03-16 13:15:45", tz = "Asia/Shanghai")

[1] "1998-03-16 13:15:45 CST"

make_date(2028, 1, 30)

[1] "2028-01-30"

as_date("2000-01-01")

[1] "2000-01-01"

as_datetime("2000-01-01", tz = "Asia/Shanghai")

[1] "2000-01-01 CST"

as_datetime("2024-02-01 8:00:00", tz = "Asia/Shanghai")

[1] "2024-02-01 08:00:00 CST"

year(today())

[1] 2024

wday(today())

[1] 3

hour(now())

[1] 8

(x <- now())

[1] "2024-04-09 08:21:45 CST"

floor_date(x, unit = "day")

[1] "2024-04-09 CST"

floor_date(x, unit = "hour")

[1] "2024-04-09 08:00:00 CST"

floor_date(x, unit = "10 minutes")

[1] "2024-04-09 08:20:00 CST"

ceiling_date(x, unit = "10 minutes")

[1] "2024-04-09 08:30:00 CST"

Factors

What is a factor?

In R, a factor is a special data type for representing categorical data.
It divides the data into levels, each of which stands for one category.

Creating a factor

# Create a factor
gender <- factor(c("Male", "Female", "Female", "Male"))
# Inspect the levels of the factor
levels(gender)

[1] "Female" "Male"

# Set the level order explicitly (here it matches the default alphabetical order)
gender <- factor(gender, levels = c("Female", "Male"))
summary(gender) # number of observations in each level

Female   Male 
     2      2

as.numeric(gender) # underlying integer codes of the factor

[1] 2 1 1 2

as.character(gender) # convert to character

[1] "Male"   "Female" "Female" "Male"
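
Because as.numeric() returns the level codes, it gives surprising results on a factor whose labels look like numbers. The usual remedy is to convert to character first. A minimal sketch, with the factor f made up for illustration:

```r
f <- factor(c("10", "5", "20"))

levels(f)                     # "10" "20" "5"  (sorted as strings)
as.numeric(f)                 # 1 3 2          (level codes, not the values)
as.numeric(as.character(f))   # 10  5 20       (the intended numbers)
```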

Labels of a factor

(x <- factor(1:12, labels = month.abb))

 [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

factor(x, levels = month.abb[c(2:12, 1)])

 [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan

Binning

cut(1:20, breaks=c(0, 5, 10, 15, 18, 20))

 [1] (0,5]   (0,5]   (0,5]   (0,5]   (0,5]   (5,10]  (5,10]  (5,10]  (5,10] 
[10] (5,10]  (10,15] (10,15] (10,15] (10,15] (10,15] (15,18] (15,18] (15,18]
[19] (18,20] (18,20]
Levels: (0,5] (5,10] (10,15] (15,18] (18,20]
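
cut() can also attach readable labels, and the right argument controls which end of each interval is closed. The sketch below uses a hypothetical vector depth and made-up class names.

```r
depth <- c(0.5, 2, 5, 7.5, 10)   # hypothetical values

# Default intervals are (0,2], (2,5], (5,10]; labels rename the levels.
bins <- cut(depth, breaks = c(0, 2, 5, 10),
            labels = c("shallow", "middle", "deep"))
as.character(bins)   # "shallow" "shallow" "middle" "deep" "deep"
table(bins)          # counts per class: 2, 1, 2

# right = FALSE gives [0,2), [2,5), [5,10); the value 10 then falls
# outside every interval and becomes NA.
left_closed <- cut(depth, breaks = c(0, 2, 5, 10), right = FALSE)
as.character(left_closed)   # "[0,2)" "[2,5)" "[5,10)" "[5,10)" NA
```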

Matrices

1:20

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

(A <- matrix(1:20, nrow = 4, byrow = TRUE))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

(B <- matrix(1:20, nrow = 4, byrow = FALSE))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

nrow(A)

[1] 4

ncol(B)

[1] 5

Higher-dimensional arrays

X <- array(1:12, dim = c(3, 2, 2))
dim(X)

[1] 3 2 2

X[1, , ]

     [,1] [,2]
[1,]    1    7
[2,]    4   10

X[1, , 1]

[1] 1 4

`cbind`、`rbind`

cbind(X[1, , ], X[2, , ], X[3, , ])

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    7    2    8    3    9
[2,]    4   10    5   11    6   12

rbind(X[1, , ], X[2, , ], X[3, , ])

     [,1] [,2]
[1,]    1    7
[2,]    4   10
[3,]    2    8
[4,]    5   11
[5,]    3    9
[6,]    6   12

cbind(c(1,2), c(3,4), c(5,6))

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Data frames

The most important data structure for data analysis.

# Create a data frame
(df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Married = c(TRUE, FALSE, TRUE)
))

     Name Age Married
1   Alice  25    TRUE
2     Bob  30   FALSE
3 Charlie  35    TRUE

names(df)

[1] "Name"    "Age"     "Married"

colnames(df)

[1] "Name"    "Age"     "Married"

ncol(df); nrow(df)

[1] 3

[1] 3

df[1, 1]

[1] "Alice"

df[2, ]

  Name Age Married
2  Bob  30   FALSE

df[, 1]

[1] "Alice"   "Bob"     "Charlie"

df$Age

[1] 25 30 35

df[["Age"]]

[1] 25 30 35

df[, "Age"]

[1] 25 30 35

X <- matrix(1:9, nrow = 3)
class(X)

[1] "matrix" "array"

(Y <- as.data.frame(X))

  V1 V2 V3
1  1  4  7
2  2  5  8
3  3  6  9

names(Y)

[1] "V1" "V2" "V3"

names(Y) <- c("colA", "colB", "colC")

Discussion welcome!

苏命 | https://drwater.rcees.ac.cn; https://drwater.rcees.ac.cn/bcard

mingsu@rcees.ac.cn
