Problem:
The data set soep2000.csv contains, among other variables, the time until the end of unemployment (dauer) as well as a dummy variable for females (female). Use R to read the data and to solve the following tasks:
Estimate separately for each gender the survivor functions for the duration of unemployment (in months). Interpret the estimates and compare the two curves.
For both genders, obtain the median time until end of unemployment and the respective 95% confidence intervals.
Compute a log-rank test for the comparison of the two groups.
Note: Here, experiencing an event means getting a job.
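# Read the data (assumes a comma-separated file in the working directory; use read.csv2() for a semicolon-separated file)
soep2000 <- read.csv("soep2000.csv")
head(soep2000)
library(survival)
# status is the event indicator (assumed coding: 1 = found a job, 0 = censored)
km <- survfit(Surv(dauer, status) ~ female, data = soep2000)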
str(km)
List of 17
$ n : int [1:2] 1206 794
$ time : num [1:72] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:72] 1206 1036 901 785 680 ...
$ n.event : num [1:72] 138 109 92 81 38 48 38 24 21 15 ...
$ n.censor : num [1:72] 32 26 24 24 23 18 17 16 17 16 ...
$ surv : num [1:72] 0.886 0.792 0.711 0.638 0.602 ...
$ std.err : num [1:72] 0.0104 0.0149 0.0186 0.0222 0.0241 ...
$ cumhaz : num [1:72] 0.114 0.22 0.322 0.425 0.481 ...
$ std.chaz : num [1:72] 0.00974 0.01402 0.0176 0.02101 0.02288 ...
$ strata : Named int [1:2] 36 36
..- attr(*, "names")= chr [1:2] "female=0" "female=1"
$ type : chr "right"
$ logse : logi TRUE
$ conf.int : num 0.95
$ conf.type: chr "log"
$ lower : num [1:72] 0.868 0.77 0.686 0.611 0.575 ...
$ upper : num [1:72] 0.904 0.816 0.738 0.666 0.632 ...
$ call : language survfit(formula = Surv(dauer, status) ~ female, data = soep2000)
- attr(*, "class")= chr "survfit"
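The surv and cumhaz entries in the listing above can be reproduced by hand from n.risk and n.event. The following is a minimal sketch in base R, assuming only the first five values printed by str(km) for the first stratum (female=0); the variable names surv_hand and cumhaz_hand are illustrative and not part of the original analysis.

```r
# First five event times of the female=0 stratum, copied from str(km)
n_risk  <- c(1206, 1036, 901, 785, 680)
n_event <- c(138, 109, 92, 81, 38)

# Kaplan-Meier (product-limit) estimate: S(t) = prod(1 - d_i / n_i)
surv_hand <- cumprod(1 - n_event / n_risk)

# Nelson-Aalen cumulative hazard: H(t) = sum(d_i / n_i)
cumhaz_hand <- cumsum(n_event / n_risk)

print(round(surv_hand, 3))
print(round(cumhaz_hand, 3))
```

To three decimals, both vectors agree with the surv and cumhaz values shown by str(km), which confirms that each step of the curve only depends on the number at risk and the number of events at that time.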
library(GGally)
ggsurv(km)+
geom_hline(yintercept = 0.5, linetype = 3)+
geom_vline(xintercept=8,linetype=3)+
geom_vline(xintercept=14,linetype=3)+
labs(x="Time (months)",y="S(t)",title = "Duration of unemployment by gender")+
ylim(c(0,1))+
theme_bw()
summary_km<-(summary(km,rmean = "none"))$table
summary_km
         records n.max n.start events median 0.95LCL 0.95UCL
female=0    1206  1206    1206    726      8       7      10
female=1     794   794     794    396     14      12      16
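Analysis: The estimated median duration of unemployment is 8 months for males (female=0), with a 95% confidence interval of 7 to 10 months, and 14 months for females (female=1), with a 95% confidence interval of 12 to 16 months. In other words, an estimated 50% of the unemployed males had found a job after 8 months, whereas it took 14 months until 50% of the unemployed females had found one. The two confidence intervals do not overlap, which suggests that males leave unemployment faster than females.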
quantile_25<-as.data.frame(quantile(km,.25))
colnames(quantile_25)<-c("estimate","lower band","upper band")
quantile_50<-as.data.frame(quantile(km,.50))
colnames(quantile_50)<-c("estimate","lower band","upper band")
#another method:
#
#q25<-do.call(cbind.data.frame,quantile(km,.25))
#
quantile_25
quantile_50
After 3 months, 25% of the unemployed males had found a job; for the unemployed females this took 5 months.
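The quantiles reported here follow a simple rule: the p-quantile of the duration is the first time at which the estimated survivor function has dropped to 1 - p or below. The sketch below illustrates that rule in base R using the first five survival values printed by str(km) for males. The helper km_quantile is hypothetical (it is not a function of the survival package), and it ignores the special case in which the curve equals 1 - p exactly over an interval.

```r
# Smallest time at which the survival curve has dropped to 1 - p or below;
# returns NA if the curve never gets that low in the supplied values.
km_quantile <- function(time, surv, p) {
  hit <- which(surv <= 1 - p)
  if (length(hit) == 0) NA_real_ else time[hit[1]]
}

# First five months of the female=0 curve, copied from str(km)
time_m <- c(1, 2, 3, 4, 5)
surv_m <- c(0.886, 0.792, 0.711, 0.638, 0.602)

q25_m <- km_quantile(time_m, surv_m, 0.25)  # first time with S(t) <= 0.75
med_m <- km_quantile(time_m, surv_m, 0.50)  # first time with S(t) <= 0.50

print(q25_m)
print(med_m)
```

Here q25_m is 3, matching the 3 months reported for males. med_m is NA only because the five printed values stop at S(5) = 0.602, before the curve reaches 0.5; with the full curve the same rule yields the median.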
logrank<-survdiff(Surv(dauer,status)~ female,data= soep2000)
logrank
Call:
survdiff(formula = Surv(dauer, status) ~ female, data = soep2000)
N Observed Expected (O-E)^2/E (O-E)^2/V
female=0 1206 726 651 8.62 22.1
female=1 794 396 471 11.92 22.1
Chisq= 22.1 on 1 degrees of freedom, p= 3e-06
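# rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test,
# which puts more weight on early differences than the log-rank test (rho = 0)
Wilcoxon<-survdiff(Surv(dauer,status)~ female,data= soep2000,rho=1)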
Wilcoxon
Call:
survdiff(formula = Surv(dauer, status) ~ female, data = soep2000,
rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
female=0 1206 536 473 8.31 28
female=1 794 273 336 11.69 28
Chisq= 28 on 1 degrees of freedom, p= 1e-07
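In both tests the p-value is far below 0.05 and 0.01 (p = 3e-06 for the log-rank test and p = 1e-07 for the test with rho = 1), so we reject the null hypothesis that the survival functions, and hence the hazard rates, of the two groups are the same.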
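The p-values can be checked directly from the printed test statistics, since with two groups the statistic is compared against a chi-squared distribution with one degree of freedom. This is a small sketch, assuming the rounded statistics 22.1 and 28 from the survdiff output; the variable names are illustrative.

```r
# Test statistics as printed by survdiff(); one degree of freedom (two groups)
chisq_logrank <- 22.1
chisq_rho1    <- 28

# Upper-tail probability of the chi-squared distribution
p_logrank <- pchisq(chisq_logrank, df = 1, lower.tail = FALSE)
p_rho1    <- pchisq(chisq_rho1,    df = 1, lower.tail = FALSE)

print(signif(p_logrank, 1))
print(signif(p_rho1, 1))
```

Rounded to one significant digit, the results should agree with the p-values in the output (3e-06 and 1e-07), up to the rounding of the printed statistics.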
Conclusion: Comparing the two genders, the data indicate that males find a job significantly faster than females: their median duration of unemployment is 8 months compared with 14 months for females, and both tests reject equality of the two survival curves.