分析の練習

R環境が整ったところで、そろそろ実際の分析の練習をしてみようと思います。
Rにはすでにいろんなデータが入っているので、それらを使ってやってみようと思います。
まずは、どんなデータが入っているのか確認。

> data()
Data sets in package ‘datasets’:

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BJsales.lead (BJsales)
                        Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock
                        Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly pine trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
PlantGrowth             Results from an Experiment on Plant Growth
Puromycin               Reaction Velocity of an Enzymatic Reaction
Seatbelts               Road Casualties in Great Britain 1969-84
Theoph                  Pharmacokinetics of Theophylline
Titanic                 Survival of passengers on the Titanic
ToothGrowth             The Effect of Vitamin C on Tooth Growth in
                        Guinea Pigs
UCBAdmissions           Student Admissions at UC Berkeley
UKDriverDeaths          Road Casualties in Great Britain 1969-84
UKgas                   UK Quarterly Gas Consumption
USAccDeaths             Accidental Deaths in the US 1973-1978
USArrests               Violent Crime Rates by US State
USJudgeRatings          Lawyers' Ratings of State Judges in the US
                        Superior Court
USPersonalExpenditure   Personal Expenditure Data
UScitiesD               Distances Between European Cities and Between
                        US Cities
VADeaths                Death Rates in Virginia (1940)
WWWusage                Internet Usage per Minute
WorldPhones             The World's Telephones
ability.cov             Ability and Intelligence Tests
airmiles                Passenger Miles on Commercial US Airlines,
                        1937-1960
airquality              New York Air Quality Measurements
anscombe                Anscombe's Quartet of 'Identical' Simple Linear
                        Regressions
attenu                  The Joyner-Boore Attenuation Data
attitude                The Chatterjee-Price Attitude Data
austres                 Quarterly Time Series of the Number of
                        Australian Residents
beaver1 (beavers)       Body Temperature Series of Two Beavers
beaver2 (beavers)       Body Temperature Series of Two Beavers
cars                    Speed and Stopping Distances of Cars
chickwts                Chicken Weights by Feed Type
co2                     Mauna Loa Atmospheric CO2 Concentration
crimtab                 Student's 3000 Criminals Data
discoveries             Yearly Numbers of Important Discoveries
esoph                   Smoking, Alcohol and (O)esophageal Cancer
euro                    Conversion Rates of Euro Currencies
euro.cross (euro)       Conversion Rates of Euro Currencies
eurodist                Distances Between European Cities and Between
                        US Cities
faithful                Old Faithful Geyser Data
fdeaths (UKLungDeaths)
                        Monthly Deaths from Lung Diseases in the UK
freeny                  Freeny's Revenue Data
freeny.x (freeny)       Freeny's Revenue Data
freeny.y (freeny)       Freeny's Revenue Data
infert                  Infertility after Spontaneous and Induced
                        Abortion
iris                    Edgar Anderson's Iris Data
iris3                   Edgar Anderson's Iris Data
islands                 Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
                        Monthly Deaths from Lung Diseases in the UK
lh                      Luteinizing Hormone in Blood Samples
longley                 Longley's Economic Regression Data
lynx                    Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
                        Monthly Deaths from Lung Diseases in the UK
morley                  Michelson Speed of Light Data
mtcars                  Motor Trend Car Road Tests
nhtemp                  Average Yearly Temperatures in New Haven
nottem                  Average Monthly Temperatures at Nottingham,
                        1920-1939
npk                     Classical N, P, K Factorial Experiment
occupationalStatus      Occupational Status of Fathers and their Sons
precip                  Annual Precipitation in US Cities
presidents              Quarterly Approval Ratings of US Presidents
pressure                Vapor Pressure of Mercury as a Function of
                        Temperature
quakes                  Locations of Earthquakes off Fiji
randu                   Random Numbers from Congruential Generator
                        RANDU
rivers                  Lengths of Major North American Rivers
rock                    Measurements on Petroleum Rock Samples
sleep                   Student's Sleep Data
stack.loss (stackloss)
                        Brownlee's Stack Loss Plant Data
stack.x (stackloss)     Brownlee's Stack Loss Plant Data
stackloss               Brownlee's Stack Loss Plant Data
state.abb (state)       US State Facts and Figures
state.area (state)      US State Facts and Figures
state.center (state)    US State Facts and Figures
state.division (state)
                        US State Facts and Figures
state.name (state)      US State Facts and Figures
state.region (state)    US State Facts and Figures
state.x77 (state)       US State Facts and Figures
sunspot.month           Monthly Sunspot Data, from 1749 to "Present"
sunspot.year            Yearly Sunspot Data, 1700-1988
sunspots                Monthly Sunspot Numbers, 1749-1983
swiss                   Swiss Fertility and Socioeconomic Indicators
                        (1888) Data
treering                Yearly Treering Data, -6000-1979
trees                   Girth, Height and Volume for Black Cherry Trees
uspop                   Populations Recorded by the US Census
volcano                 Topographic Information on Auckland's Maunga
                        Whau Volcano
warpbreaks              The Number of Breaks in Yarn during Weaving
women                   Average Heights and Weights for American Women

いやまぁたくさんはいってますね。

ということで、今回はよく使われているairqualityを使いたいと思います。

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

head()はそのデータの最初の方をチラ見する関数ですね。str()ってのはstructure、データの構造を見るやつらしいです。ふむふむ。

で、このデータを使っていろいろやってきます。
まず、Rすげーってなるよしなに関数からいきます。

  • plot()
> plot(airquality)

f:id:funizou:20150802112422p:plain:w400

おおお!これがplot()の真価・・・すげーそれっぽい・・・これだけ見てもおもしろいですね〜

  • summary()
> summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  

おおーー、もうこれでいいじゃんって感じですね。

でだ、早速、分析っぽいことをやってみようと思う。まずは単回帰分析。
はて、、という感じですが、今回は、オゾン量?と気温の相関を調べます。
なんか、回帰分析ってのは、「1つの目的変数を1つの説明変数で予測する手法」だそうで、ぶっちゃけ簡単に言うと、
目的変数(オゾン量)=定数1×説明変数(気温)+変数2
って形で表す感じで、過去のデータから線形グラフを求めて、未来のデータを予測するというものだそうです。
で、説明変数が複数になると「重回帰分析」と呼ぶそう。


-P値
ちなみに、定数のことを「偏回帰係数」と言います。ただ、偏回帰係数はそれぞれの変数が目的変数に
どの程度影響を与えるか、ってのを直接表すものじゃない(cmをmにすると変わるし)ってことで、
じゃ、影響度合いを調べる指標ないのよ?ってことで、「P値」があります。詳細は省略しますが、
だいたい5%未満であれば、その係数は統計的に意味があると言えるそうです。
-マルチコ(多重共線性)
重回帰分析の場合、説明変数同士が高い相関関係にある場合、分析結果が不安定になるそうです。
これをマルチコリニアリティ(multicollinearity)と言います。なので、重回帰分析をする場合は、
各説明変数同士の相関関係を調べてからやったほうがいいみたいですね。具体的には、説明変数同士の
相関係数が0.5以上なら注意、0.7以上ならNGだそうです。


ということで、前振りが長くなりましたが、いざ分析。
オゾン量を気温で説明します。ハイ。

> cor(airquality$Ozone,airquality$Temp,use="complete.obs")
[1] 0.6983603
> result <- lm(Ozone~Temp,data=airquality)
> summary(result)

Call:
lm(formula = Ozone ~ Temp, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.729 -17.409  -0.587  11.306 118.271 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -146.9955    18.2872  -8.038 9.37e-13 ***
Temp           2.4287     0.2331  10.418  < 2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.71 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.4877,	Adjusted R-squared:  0.4832 
F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

まずは、cor()でオゾン量と気温の相関を求めています(use="complete.obs"はNAの時の扱い。この場合は、NAのある項目は排除する設定。)
で、相関が0.7くらいなのでそこそこの正の相関があるようです。つまり、分析する価値あり。
で、lmってのが回帰分析の関数らしいです。
これで、分析した結果をsummaryします。
結果の真ん中あたりのCoefficients:に偏回帰係数(Estimate)が出てます。つまり、
目的変数(オゾン量)=2.429×説明変数(気温)-146.995
ってことですね。
(ちなみに、Coefficientsの後ろの方の*が多いほど、有意らしいです。MAX星3つなので、気温変数は三ツ星獲得です!
で、これで線形のグラフがかけますので、plotしてみます。

> plot(b,a)
> abline(a=-146,b=2.4,col="red",lty=2)

f:id:funizou:20150802112531p:plain:w400

いや〜素敵ですね笑
なんかそれっぽいじゃないっすか!