A Quick Analysis

So, let's poke at the data a bit. I'm short on time, so this is going to be very rough.

> install.packages("data.table")
> library(data.table)
> system.time(dt<-fread("./data/nkf_SE.csv"))
      user     system    elapsed
     1.380      0.324      2.816 
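# The newest data isn't loaded yet, so exclude it.
> src<-dt[V3<20150721,]
# Narrow down to Tokyo Racecourse.
# (Note: this line filters dt, not src, so the date filter above is not applied to anything below.)
> tkysrc<-dt[V6==5,]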

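# Narrow further to the horse number, final finishing position, and last-3-furlong time columns, recoding the position as 1 for a top-two finish and 0 otherwise.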
> umaban_3f<-tkysrc[,list(V11,ifelse(tkysrc$V41<3,1,0),V59)]
> head(umaban_3f)
   V11 V2 V59
1:   1  1 383
2:   2  0 391
3:   3  0 381
4:   4  0 395
5:   5  0 395
6:   6  0 392

> str(umaban_3f)
Classes ‘data.table’ and 'data.frame':	43775 obs. of  3 variables:
 $ V11: int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2 : num  1 0 0 0 0 0 0 0 0 1 ...
 $ V59: int  383 391 381 395 395 392 387 389 418 379 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> summary(umaban_3f)
      V11              V2              V59     
 Min.   : 1.00   Min.   :0.0000   Min.   :  0  
 1st Qu.: 4.00   1st Qu.:0.0000   1st Qu.:347  
 Median : 8.00   Median :0.0000   Median :363  
 Mean   : 8.11   Mean   :0.1628   Mean   :357  
 3rd Qu.:12.00   3rd Qu.:0.0000   3rd Qu.:380  
 Max.   :18.00   Max.   :1.0000   Max.   :999  

# On inspection, V59 contains 999 values, so exclude those rows.
> umaban_3f<-umaban_3f[V59<999,]
> summary(umaban_3f)
      V11               V2              V59       
 Min.   : 1.000   Min.   :0.0000   Min.   :  0.0  
 1st Qu.: 4.000   1st Qu.:0.0000   1st Qu.:347.0  
 Median : 8.000   Median :0.0000   Median :363.0  
 Mean   : 8.108   Mean   :0.1547   Mean   :350.8  
 3rd Qu.:12.000   3rd Qu.:0.0000   3rd Qu.:379.0  
 Max.   :18.000   Max.   :1.0000   Max.   :954.0  

data.table is nice. Preprocessing is done for now, so on to the actual analysis.
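For reference, the preprocessing steps above could be sketched on a toy table roughly as follows. The values are made up, the new column names (`umaban`, `top2`, `last3f`) are only illustrative, and the column meanings are simply the ones used in this post. Chaining the filters means each step works on the result of the previous one, and the `V59 > 0` condition is folded in here as an assumption that zeros are missing values.

```r
library(data.table)

# Toy stand-in for the race table (made-up values).
# V3 = date, V6 = track code, V11 = horse number,
# V41 = finishing position, V59 = last-3-furlong time.
toy <- data.table(
  V3  = c(20150101, 20150101, 20150101, 20150801, 20150102, 20150102),
  V6  = c(5, 5, 5, 5, 6, 5),
  V11 = c(1, 2, 3, 4, 1, 5),
  V41 = c(1, 8, 9, 2, 1, 3),
  V59 = c(383, 391, 999, 381, 370, 0)
)

# Chain the filters: date cutoff, then track, then drop the 0 and 999 times.
clean <- toy[V3 < 20150721][V6 == 5][V59 > 0 & V59 < 999]

# Build the modelling table with explicit column names.
model_dt <- clean[, .(umaban = V11,
                      top2   = as.integer(V41 < 3),
                      last3f = V59)]
print(model_dt)
```

Naming the columns up front avoids the auto-generated `V2` name that the `list(...)` call produces in the transcript.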

> res<-glm(umaban_3f$V2~.,umaban_3f[,list(V11,V59)],family=binomial)
> summary(res)

Call:
glm(formula = umaban_3f$V2 ~ ., family = binomial, data = umaban_3f[, 
    list(V11, V59)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2346  -0.5516  -0.5109  -0.4668   2.1937  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.2670386  0.0607073  20.871  < 2e-16 ***
V11         -0.0091667  0.0031009  -2.956  0.00312 ** 
V59         -0.0085181  0.0001617 -52.684  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37357  on 43356  degrees of freedom
Residual deviance: 34282  on 43354  degrees of freedom
AIC: 34288

Number of Fisher Scoring iterations: 4
>
> xp<-seq(min(umaban_3f[,V11]),max(umaban_3f[,V11]),0.1)
> yp<-seq(min(umaban_3f[,V59]),max(umaban_3f[,V59]),0.1)
> grid<-expand.grid(V11=xp,V59=yp)
> head(grid)
  V11 V59
1 1.0   0
2 1.1   0
3 1.2   0
4 1.3   0
5 1.4   0
6 1.5   0
> length(xp)
[1] 171
> Z<-predict(res,grid,type="response")
> plot(umaban_3f[,V11],umaban_3f[,V59],col=ifelse(umaban_3f[,V2],"red","blue"))
> contour(xp,yp,matrix(Z,length(xp)),add=T,levels=0.5,col="green")

[Figure: last-3-furlong time (V59) against horse number (V11); red = top-two finish, blue = otherwise; p = 0.5 contour in green]
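A note on why `matrix(Z,length(xp))` lines up with `contour(xp,yp,...)`: `expand.grid` varies its first argument fastest and `matrix` fills column by column, so row i, column j of the matrix is the prediction at (xp[i], yp[j]). Here is a minimal sketch on simulated data that checks this. The data, the coefficients used to simulate it, and the column names are all made up for illustration.

```r
set.seed(42)

# Simulated stand-in data (not the real race data).
n <- 500
sim <- data.frame(umaban = sample(1:14, n, replace = TRUE),
                  last3f = round(rnorm(n, 137, 4)))
sim$top2 <- rbinom(n, 1, plogis(48 - 0.35 * sim$last3f))

# Same kind of model as in the post, written with a formula and data argument.
fit <- glm(top2 ~ umaban + last3f, data = sim, family = binomial)

# Prediction grid; expand.grid varies the first variable fastest.
xp <- seq(min(sim$umaban), max(sim$umaban), by = 0.5)
yp <- seq(min(sim$last3f), max(sim$last3f), by = 0.5)
grid <- expand.grid(umaban = xp, last3f = yp)
z <- matrix(predict(fit, newdata = grid, type = "response"),
            nrow = length(xp))

# Check one cell against a direct prediction at (xp[i], yp[j]).
i <- 3; j <- 5
single <- predict(fit,
                  newdata = data.frame(umaban = xp[i], last3f = yp[j]),
                  type = "response")
print(all.equal(z[i, j], unname(single)))
```

If the grid were built with the variables in the other order, the matrix would need transposing before being passed to `contour`.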

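Oops. Well, I didn't split by distance, so of course the times separate into clusters. I do need to exclude the zeros, though.
So, let's try again on the short-distance races only.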
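To see where that green line sits, note that the p = 0.5 contour of a logistic regression is exactly where the linear predictor is zero. Plugging in the coefficients reported by `summary(res)` above gives a straight line in the (V11, V59) plane. This is just arithmetic on the printed estimates, and `boundary_v59` is a helper name made up for the sketch.

```r
# Coefficients copied from summary(res) above.
b0   <- 1.2670386    # (Intercept)
b_11 <- -0.0091667   # V11
b_59 <- -0.0085181   # V59

# p = 0.5  <=>  b0 + b_11 * V11 + b_59 * V59 = 0, solved for V59.
boundary_v59 <- function(v11) -(b0 + b_11 * v11) / b_59

v11 <- c(1, 9, 18)
print(round(boundary_v59(v11), 1))

# Sanity check: the fitted probability on the boundary is 0.5.
p <- plogis(b0 + b_11 * v11 + b_59 * boundary_v59(v11))
print(p)
```

The boundary runs from roughly 148 at horse number 1 down to roughly 129 at horse number 18, far below the bulk of the times around 350, so the model is mostly separating the small low-V59 group from everything else.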

> umaban_3f_short<-umaban_3f[V59<200,]
> umaban_3f_short<-umaban_3f_short[V59>0,]
> summary(umaban_3f_short)
      V11              V2              V59       
 Min.   : 1.00   Min.   :0.0000   Min.   :130.0  
 1st Qu.: 4.00   1st Qu.:0.0000   1st Qu.:135.0  
 Median : 7.00   Median :0.0000   Median :137.0  
 Mean   : 7.33   Mean   :0.1562   Mean   :137.5  
 3rd Qu.:11.00   3rd Qu.:0.0000   3rd Qu.:139.0  
 Max.   :14.00   Max.   :1.0000   Max.   :155.0  
>
> res2<-glm(umaban_3f_short$V2~.,umaban_3f_short[,list(V11,V59)],family=binomial)
> summary(res2)

Call:
glm(formula = umaban_3f_short$V2 ~ ., family = binomial, data = umaban_3f_short[, 
    list(V11, V59)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2915  -0.6271  -0.4517  -0.2113   2.8943  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 46.57868    4.78751   9.729   <2e-16 ***
V11         -0.03023    0.01937  -1.560    0.119    
V59         -0.35216    0.03527  -9.985   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1253.9  on 1446  degrees of freedom
Residual deviance: 1110.4  on 1444  degrees of freedom
AIC: 1116.4

Number of Fisher Scoring iterations: 5

>
> xp<-seq(min(umaban_3f_short[,V11]),max(umaban_3f_short[,V11]),0.1)
> yp<-seq(min(umaban_3f_short[,V59]),max(umaban_3f_short[,V59]),0.1)
> grid<-expand.grid(V11=xp,V59=yp)
> length(xp)
[1] 131
> Z<-predict(res2,grid,type="response")
> plot(umaban_3f_short[,V11],umaban_3f_short[,V59],col=ifelse(umaban_3f_short[,V2],"red","blue"))
> contour(xp,yp,matrix(Z,length(xp)),add=T,levels=0.5,col="green")

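[Figure: short-distance subset, last-3-furlong time (V59) against horse number (V11); red = top-two finish, blue = otherwise; p = 0.5 contour in green]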

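Hmm, this isn't working at all. Maybe including the horse number was a mistake in the first place. Still, the horses in the middle gates that ran a fast last 3F but didn't finish in the top two might be interesting to look at one by one. So next, a plain lm.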
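One way to read the `summary(res2)` table is to turn the estimates into odds ratios. The sketch below uses only the printed estimates and standard errors, with Wald-type 95% intervals (a normal approximation, so treat the intervals as rough). `wald_or` is a helper name made up here.

```r
# Estimates and standard errors copied from summary(res2) above.
est <- c(V11 = -0.03023, V59 = -0.35216)
se  <- c(V11 =  0.01937, V59 =  0.03527)

# Odds ratio per one-unit increase, with a Wald 95% interval.
wald_or <- function(b, s) {
  z <- qnorm(0.975)
  c(or = exp(b), lower = exp(b - z * s), upper = exp(b + z * s))
}

or_table <- t(mapply(wald_or, est, se))
print(round(or_table, 3))
```

For V59 the odds ratio comes out around 0.70 per unit, so each extra unit of last-3-furlong time cuts the odds of a top-two finish by about 30%. For V11 the interval includes 1, which matches the p-value of 0.119: in this subset the horse number adds little once V59 is in the model.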

> rank_3f<-tkysrc[,list(V41,V59)]
> head(rank_3f)
   V41 V59
1:   1 383
2:   8 391
3:   9 381
4:  10 395
5:  13 395
6:   6 392
> rank_3f<-rank_3f[V59<200,]
> rank_3f<-rank_3f[V59>0,]
> head(rank_3f[,V59])
[1] 137 136 143 139 137 140
> length(rank_3f[,V59])
[1] 1447
> res3<-lm(rank_3f[,V41]~rank_3f[,V59])
> summary(res3)

Call:
lm(formula = rank_3f[, V41] ~ rank_3f[, V59])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5462 -2.5187  0.0017  2.3922  8.4813 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -72.1541     3.4389  -20.98   <2e-16 ***
rank_3f[, V59]   0.5754     0.0250   23.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.239 on 1445 degrees of freedom
Multiple R-squared:  0.2682,	Adjusted R-squared:  0.2677 
F-statistic: 529.6 on 1 and 1445 DF,  p-value: < 2.2e-16

> plot(rank_3f[,V59],rank_3f[,V41])
> abline(res3,col="red",lty=2)

[Figure: finishing position (V41) against last-3-furlong time (V59), with the fitted regression line dashed in red]

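Ah, right. Of course it comes out like this. It does show that a faster last-3-furlong time tends to go with a better finishing position, but next week I'll sit down and do this properly.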
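As a starting point for next time, here is a minimal sketch of the same regression written with a formula and a `data` argument, which gives clean coefficient names and lets `predict` work on new values. The data is simulated and the column names are made up. Since finishing position is an ordinal value, a Spearman rank correlation is included as a simple cross-check.

```r
set.seed(1)

# Simulated stand-in data (not the real race data).
n <- 200
sim <- data.frame(last3f = round(rnorm(n, 137, 3)))
raw_rank <- round(7 + 0.6 * (sim$last3f - 137) + rnorm(n, 0, 3))
sim$rank <- pmin(pmax(raw_rank, 1), 14)   # clamp to positions 1..14

# Formula plus data argument instead of rank_3f[, V41] ~ rank_3f[, V59].
fit <- lm(rank ~ last3f, data = sim)
print(coef(fit))

# Predicted finishing position for a few last-3-furlong times.
pred <- predict(fit, newdata = data.frame(last3f = c(133, 137, 141)))
print(round(pred, 2))

# Rank-based association as a cross-check for the ordinal response.
rho <- cor(sim$last3f, sim$rank, method = "spearman")
print(rho)
```

With a model fitted this way, `abline(fit)` draws the line directly from the model object, so there is no need to copy the rounded coefficients by hand.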