A quick analysis
So, let's poke at the data a bit. I don't have much time, so this is going to be really rough.
> install.packages("data.table")
> library(data.table)
> system.time(dt<-fread("./data/nkf_SE.csv"))
   user  system elapsed 
  1.380   0.324   2.816 
# The most recent data is not loaded, so exclude it.
# Note: src is not referenced again; the filters below start from dt, so this date cut does not affect the output shown.
> src<-dt[V3<20150721,]
# Narrow down to Tokyo Racecourse.
> tkysrc<-dt[V6==5,]
# Narrow further to the horse number, final finishing position and last-3-furlong time columns,
# and at the same time recode the finishing position: top two = 1, everything else = 0.
> umaban_3f<-tkysrc[,list(V11,ifelse(tkysrc$V41<3,1,0),V59)]
> head(umaban_3f)
   V11 V2 V59
1:   1  1 383
2:   2  0 391
3:   3  0 381
4:   4  0 395
5:   5  0 395
6:   6  0 392
> str(umaban_3f)
Classes ‘data.table’ and 'data.frame':	43775 obs. of  3 variables:
 $ V11: int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2 : num  1 0 0 0 0 0 0 0 0 1 ...
 $ V59: int  383 391 381 395 395 392 387 389 418 379 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> summary(umaban_3f)
      V11              V2              V59     
 Min.   : 1.00   Min.   :0.0000   Min.   :  0  
 1st Qu.: 4.00   1st Qu.:0.0000   1st Qu.:347  
 Median : 8.00   Median :0.0000   Median :363  
 Mean   : 8.11   Mean   :0.1628   Mean   :357  
 3rd Qu.:12.00   3rd Qu.:0.0000   3rd Qu.:380  
 Max.   :18.00   Max.   :1.0000   Max.   :999  
# The check shows V59 contains 999, so filter those rows out.
> umaban_3f<-umaban_3f[V59<999,]
> summary(umaban_3f)
      V11               V2              V59       
 Min.   : 1.000   Min.   :0.0000   Min.   :  0.0  
 1st Qu.: 4.000   1st Qu.:0.0000   1st Qu.:347.0  
 Median : 8.000   Median :0.0000   Median :363.0  
 Mean   : 8.108   Mean   :0.1547   Mean   :350.8  
 3rd Qu.:12.000   3rd Qu.:0.0000   3rd Qu.:379.0  
 Max.   :18.000   Max.   :1.0000   Max.   :954.0  
data.table is nice. Preprocessing is done for now, so on to the actual analysis.
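As an aside, the filter-then-label steps above can also be written against named columns, which makes the 0/1 label a bit easier to read. This is only a minimal sketch on made-up rows: the column names `umaban`, `rank` and `last3f` are hypothetical stand-ins for V11, V41 and V59, and treating a finishing position of 0 as "no result" is my assumption, not something confirmed by the data above.

```r
library(data.table)

# Made-up rows for illustration only (not the real race data).
# umaban / rank / last3f are hypothetical names standing in for V11 / V41 / V59.
toy <- data.table(
  umaban = c(1L, 2L, 3L, 4L, 5L, 6L),
  rank   = c(1L, 8L, 2L, 0L, 13L, 3L),
  last3f = c(383L, 391L, 381L, 999L, 0L, 379L)
)

# Drop the sentinel times 0 and 999 before doing anything else
clean <- toy[last3f > 0 & last3f < 999]

# Add the 0/1 label by reference; rank >= 1 keeps a rank of 0
# (assumed here to mean "no result") out of the top-two group
clean[, top2 := as.integer(rank >= 1 & rank <= 2)]

print(clean)
```

Requiring `rank >= 1` is a defensive choice: a plain `rank < 3` would also label a rank of 0 as a top-two finish.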
> res<-glm(umaban_3f$V2~.,umaban_3f[,list(V11,V59)],family=binomial)
> summary(res)

Call:
glm(formula = umaban_3f$V2 ~ ., family = binomial, data = umaban_3f[, 
    list(V11, V59)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2346  -0.5516  -0.5109  -0.4668   2.1937  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.2670386  0.0607073  20.871  < 2e-16 ***
V11         -0.0091667  0.0031009  -2.956  0.00312 ** 
V59         -0.0085181  0.0001617 -52.684  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37357  on 43356  degrees of freedom
Residual deviance: 34282  on 43354  degrees of freedom
AIC: 34288

Number of Fisher Scoring iterations: 4

> xp<-seq(min(umaban_3f[,V11]),max(umaban_3f[,V11]),0.1)
> yp<-seq(min(umaban_3f[,V59]),max(umaban_3f[,V59]),0.1)
> grid<-expand.grid(V11=xp,V59=yp)
> head(grid)
  V11 V59
1 1.0   0
2 1.1   0
3 1.2   0
4 1.3   0
5 1.4   0
6 1.5   0
> length(xp)
[1] 171
> Z<-predict(res,grid,type="response")
> plot(umaban_3f[,V11],umaban_3f[,V59],col=ifelse(umaban_3f[,V2],"red","blue"))
> contour(xp,yp,matrix(Z,length(xp)),add=T,levels=0.5,col="green")
Oops... well, I didn't split the races by distance, so of course the split is driven by the time. Still, I need to exclude the zeros.
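One way to see where that green 0.5 contour sits without looking at the plot: for a logistic regression, the predicted probability is exactly 0.5 where the linear predictor is zero, so the boundary is a straight line in the (V11, V59) plane. The sketch below just plugs in the coefficients printed by summary(res) above; the helper name `boundary_v59` is made up for illustration.

```r
# Coefficients copied from the summary(res) output above
b0    <-  1.2670386
b_v11 <- -0.0091667
b_v59 <- -0.0085181

# p = 0.5 exactly where b0 + b_v11 * V11 + b_v59 * V59 = 0, so solve for V59.
# boundary_v59 is a hypothetical helper name, not part of any package.
boundary_v59 <- function(v11) -(b0 + b_v11 * v11) / b_v59

# Boundary at the smallest and largest horse numbers in the data
boundary_v59(c(1, 18))

# Sanity check: the predicted probability on the boundary is 0.5
plogis(b0 + b_v11 * 1 + b_v59 * boundary_v59(1))
```

With these coefficients the boundary comes out at roughly 148 for horse number 1 and roughly 129 for horse number 18, which is far below the bulk of the V59 values (first quartile 347). So the contour mostly separates the rows with V59 = 0 from everything else, which fits the remark about needing to exclude the zeros.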
So, let's try again on short distances only.
> umaban_3f_short<-umaban_3f[V59<200,]
> umaban_3f_short<-umaban_3f_short[V59>0,]
> summary(umaban_3f_short)
      V11              V2              V59       
 Min.   : 1.00   Min.   :0.0000   Min.   :130.0  
 1st Qu.: 4.00   1st Qu.:0.0000   1st Qu.:135.0  
 Median : 7.00   Median :0.0000   Median :137.0  
 Mean   : 7.33   Mean   :0.1562   Mean   :137.5  
 3rd Qu.:11.00   3rd Qu.:0.0000   3rd Qu.:139.0  
 Max.   :14.00   Max.   :1.0000   Max.   :155.0  
> res2<-glm(umaban_3f_short$V2~.,umaban_3f_short[,list(V11,V59)],family=binomial)
> summary(res2)

Call:
glm(formula = umaban_3f_short$V2 ~ ., family = binomial, data = umaban_3f_short[, 
    list(V11, V59)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2915  -0.6271  -0.4517  -0.2113   2.8943  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 46.57868    4.78751   9.729   <2e-16 ***
V11         -0.03023    0.01937  -1.560    0.119    
V59         -0.35216    0.03527  -9.985   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1253.9  on 1446  degrees of freedom
Residual deviance: 1110.4  on 1444  degrees of freedom
AIC: 1116.4

Number of Fisher Scoring iterations: 5

> xp<-seq(min(umaban_3f_short[,V11]),max(umaban_3f_short[,V11]),0.1)
> yp<-seq(min(umaban_3f_short[,V59]),max(umaban_3f_short[,V59]),0.1)
> grid<-expand.grid(V11=xp,V59=yp)
> length(xp)
[1] 131
> Z<-predict(res2,grid,type="response")
> plot(umaban_3f_short[,V11],umaban_3f_short[,V59],col=ifelse(umaban_3f_short[,V2],"red","blue"))
> contour(xp,yp,matrix(Z,length(xp)),add=T,levels=0.5,col="green")
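Hmm, this is no good at all. Maybe including the horse number was the mistake in the first place... Still, the horses in the middle gates that run a fast last 3F but don't place could be interesting to look at one by one. So next, a plain lm.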
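On the question of whether the horse number belongs in the model at all: one rough way to check is to fit the model with and without it and compare the two. The sketch below does that on synthetic data only, so it says nothing about the real races. The column names `umaban`, `last3f` and `top2` are hypothetical stand-ins, and the simulated label depends only on the time by construction.

```r
set.seed(1)
n <- 1000

# Synthetic stand-in data, not the real race data
toy <- data.frame(
  umaban = sample(1:14, n, replace = TRUE),
  last3f = sample(130:155, n, replace = TRUE)
)
# By construction the label depends only on last3f, not on umaban
toy$top2 <- rbinom(n, 1, plogis(46 - 0.35 * toy$last3f))

# Nested models: time only vs. time plus horse number
m_time <- glm(top2 ~ last3f, data = toy, family = binomial)
m_both <- glm(top2 ~ last3f + umaban, data = toy, family = binomial)

# Lower AIC is better; the chi-squared test asks whether adding
# umaban reduces the deviance by more than chance would
print(AIC(m_time, m_both))
cmp <- anova(m_time, m_both, test = "Chisq")
print(cmp)
```

Writing the formula with column names and a `data` argument, instead of `umaban_3f_short$V2 ~ .`, also keeps the coefficient names short and lets `predict()` match new data by column name.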
> rank_3f<-tkysrc[,list(V41,V59)]
> head(rank_3f)
   V41 V59
1:   1 383
2:   8 391
3:   9 381
4:  10 395
5:  13 395
6:   6 392
> rank_3f<-rank_3f[V59<200,]
> rank_3f<-rank_3f[V59>0,]
> head(rank_3f[,V59])
[1] 137 136 143 139 137 140
> length(rank_3f[,V59])
[1] 1447
> res3<-lm(rank_3f[,V41]~rank_3f[,V59])
> summary(res3)

Call:
lm(formula = rank_3f[, V41] ~ rank_3f[, V59])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5462 -2.5187  0.0017  2.3922  8.4813 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -72.1541     3.4389  -20.98   <2e-16 ***
rank_3f[, V59]    0.5754     0.0250   23.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.239 on 1445 degrees of freedom
Multiple R-squared:  0.2682,	Adjusted R-squared:  0.2677 
F-statistic: 529.6 on 1 and 1445 DF,  p-value: < 2.2e-16

> plot(rank_3f[,V59],rank_3f[,V41])
> abline(a=-72.1541,b=0.5754,col="red",lty=2)
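Ah, hmm, yeah. Of course it comes out like this. I mean, it does show that a faster last-3F time tends to translate straight into a better finish, but next week I really should sit down and do this properly.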
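For the next round, a slightly tidier way to do the same lm step might look like the sketch below. It uses a formula with column names plus a `data` argument, so that `predict()` can be fed new times, and it passes the fitted model straight to `abline()` instead of copying the coefficients by hand. The data here is synthetic, and `rank` and `last3f` are hypothetical names standing in for V41 and V59.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data, loosely shaped like the fit above (not real results)
toy <- data.frame(last3f = sample(130:155, n, replace = TRUE))
toy$rank <- pmin(pmax(round(-72 + 0.575 * toy$last3f + rnorm(n, sd = 3)), 1), 16)

# Formula with column names, so the model knows which variable is which
fit <- lm(rank ~ last3f, data = toy)

# Predicted finishing position for two new last-3F times
new_times <- data.frame(last3f = c(135, 145))
pred <- predict(fit, newdata = new_times)
print(pred)

# Only draw when run interactively; abline() accepts the lm object directly
if (interactive()) {
  plot(toy$last3f, toy$rank)
  abline(fit, col = "red", lty = 2)
}
```

With a formula like `rank_3f[,V41] ~ rank_3f[,V59]`, the terms are expressions rather than column names, so `predict()` has no column in `newdata` to match against.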