Oversampling method using R

I'm studying oversampling methods in R. Suppose I want to oversample the data frame df:

df <- data.frame(y=rep(as.factor(c('Yes', 'No')), times=c(90, 10)), x1=rnorm(100), x2=rnorm(100))

df has 90 Yes's and 10 No's, so y is imbalanced. I tried to use the ubBalance function to balance y, but it seems I cannot use it because I am on R version 4. Is there an easy way to do oversampling in R version 4?

1 Answer

You could use random walk oversampling with the rwo function from the imbalance package:

Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.
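To give a feel for the idea, here is a rough base-R sketch of a random-walk step on numeric columns. It assumes the perturbation rule x' = x - (sd / sqrt(n)) * r with r drawn from a standard normal, which is how I understand the method; rwo_sketch is a hypothetical helper, not part of the imbalance package, and the package's actual implementation may differ.

```r
# Hypothetical helper for illustration only; not the imbalance package's code.
# Assumes every column of `minority` is numeric.
rwo_sketch <- function(minority, n_new) {
  stopifnot(is.data.frame(minority),
            all(vapply(minority, is.numeric, logical(1))))
  n <- nrow(minority)
  # Pick (with replacement) the real minority rows each synthetic row starts from.
  base <- minority[sample.int(n, n_new, replace = TRUE), , drop = FALSE]
  for (col in names(base)) {
    # Assumed step size: column standard deviation scaled by 1 / sqrt(n).
    step <- stats::sd(minority[[col]]) / sqrt(n)
    base[[col]] <- base[[col]] - step * stats::rnorm(n_new)
  }
  rownames(base) <- NULL
  base
}

set.seed(1)  # only so the sketch is repeatable
minority <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
synthetic <- rwo_sketch(minority, n_new = 50)
dim(synthetic)
```

Each synthetic row is a small random step away from a real minority row, so the new points stay close to the original minority distribution instead of being exact duplicates.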

Here is a reproducible example:

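library(imbalance)

df <- data.frame(y = rep(as.factor(c('Yes', 'No')), times = c(90, 10)), x1 = rnorm(100), x2 = rnorm(100))
# rwo() looks for the class column named by classAttr, which defaults to "Class"
colnames(df) <- c("Class", "x1", "x2")
# Generate 50 synthetic minority ("No") rows; numInstances = 80 would fully balance 90 vs. 10
new_df <- rwo(df, numInstances = 50)
new_df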
#> Class x1 x2
#> 1 No -1.16439984 0.21856395
#> 2 No 1.20744623 0.28858048
#> 3 No 1.56528275 -0.07579441
#> 4 No -1.03733411 0.01835535
#> 5 No -0.70526984 -2.01477788
#> 6 No -0.80978490 0.64829995
#> 7 No 0.32493643 -0.05699719
#> 8 No -0.98764951 -1.72838623
#> 9 No -0.42004551 0.79171386
#> 10 No -2.02128473 0.41171867
#> 11 No -0.84667118 -1.31055008
#> 12 No -0.41447116 0.73619119
#> 13 No -0.59519331 -2.12420980
#> 14 No -1.87381529 0.36029347
#> 15 No -1.71772198 -0.67236749
#> 16 No -1.91984498 0.30281031
#> 17 No -0.30854811 1.07314736
#> 18 No -2.09342702 -0.33375116
#> 19 No -0.57984243 0.94788328
#> 20 No -1.04299574 0.97960623
#> 21 No -0.48914322 1.09651605
#> 22 No 1.95909036 0.62301445
#> 23 No 0.32071004 -2.08889830
#> 24 No -0.98998047 0.45250458
#> 25 No 0.78258023 -0.57429362
#> 26 No 0.04426842 -1.48160646
#> 27 No -1.61386524 -0.07911380
#> 28 No -0.54491597 0.24783255
#> 29 No -1.55084192 0.44819029
#> 30 No 0.40391743 -2.00554911
#> 31 No -0.57996600 -1.70075786
#> 32 No 0.34502429 -0.11452995
#> 33 No -1.42240697 -0.15749236
#> 34 No 0.56406328 -1.96536380
#> 35 No -0.99870646 0.16643333
#> 36 No 0.29262027 -1.86874500
#> 37 No 1.44551833 0.35333586
#> 38 No 1.69167557 0.16451481
#> 39 No -0.63712453 -2.37375325
#> 40 No -1.13339974 0.25853248
#> 41 No 1.60384482 0.21507984
#> 42 No -0.76946285 0.27068821
#> 43 No 0.58484861 -2.48727381
#> 44 No -1.33939478 -0.11824381
#> 45 No -1.01812834 -1.85177192
#> 46 No 0.57773883 -0.29486029
#> 47 No -1.11804972 -1.39796677
#> 48 No -1.79134432 -0.07027661
#> 49 No -0.56362892 -1.66805640
#> 50 No -1.61152940 0.06337827
plotComparison(df, rbind(df, new_df), attrs = names(new_df)[1:3])

Created on 2022-07-10 by the reprex package (v2.0.1)
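If you would rather avoid extra packages, plain random oversampling can be done in base R. Note that this is a different and simpler technique than rwo: it duplicates existing minority rows instead of generating synthetic ones. A minimal sketch, assuming the class column is named Class as above (the variable names are just for illustration):

```r
set.seed(42)  # only so the sketch is repeatable
df <- data.frame(Class = rep(as.factor(c("Yes", "No")), times = c(90, 10)),
                 x1 = rnorm(100), x2 = rnorm(100))

counts <- table(df$Class)
minority <- names(counts)[which.min(counts)]  # "No"
n_needed <- max(counts) - min(counts)         # 90 - 10 = 80 extra rows

# Resample minority row indices with replacement and append the duplicates.
minority_rows <- which(df$Class == minority)
picked <- minority_rows[sample.int(length(minority_rows), n_needed, replace = TRUE)]
balanced <- rbind(df, df[picked, ])
rownames(balanced) <- NULL

table(balanced$Class)  # both classes now have 90 rows
```

The same count (80) could be passed as numInstances to rwo if you want synthetic rows rather than duplicates.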
