分析の練習
R環境が整ったところで、そろそろ実際の分析の練習をしてみようと思います。
Rにはすでにいろんなデータが入っているので、それらを使ってやってみようと思います。
まずは、どんなデータが入っているのか確認。
> data() Data sets in package ‘datasets’: AirPassengers Monthly Airline Passenger Numbers 1949-1960 BJsales Sales Data with Leading Indicator BJsales.lead (BJsales) Sales Data with Leading Indicator BOD Biochemical Oxygen Demand CO2 Carbon Dioxide Uptake in Grass Plants ChickWeight Weight versus age of chicks on different diets DNase Elisa assay of DNase EuStockMarkets Daily Closing Prices of Major European Stock Indices, 1991-1998 Formaldehyde Determination of Formaldehyde HairEyeColor Hair and Eye Color of Statistics Students Harman23.cor Harman Example 2.3 Harman74.cor Harman Example 7.4 Indometh Pharmacokinetics of Indomethacin InsectSprays Effectiveness of Insect Sprays JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share LakeHuron Level of Lake Huron 1875-1972 LifeCycleSavings Intercountry Life-Cycle Savings Data Loblolly Growth of Loblolly pine trees Nile Flow of the River Nile Orange Growth of Orange Trees OrchardSprays Potency of Orchard Sprays PlantGrowth Results from an Experiment on Plant Growth Puromycin Reaction Velocity of an Enzymatic Reaction Seatbelts Road Casualties in Great Britain 1969-84 Theoph Pharmacokinetics of Theophylline Titanic Survival of passengers on the Titanic ToothGrowth The Effect of Vitamin C on Tooth Growth in Guinea Pigs UCBAdmissions Student Admissions at UC Berkeley UKDriverDeaths Road Casualties in Great Britain 1969-84 UKgas UK Quarterly Gas Consumption USAccDeaths Accidental Deaths in the US 1973-1978 USArrests Violent Crime Rates by US State USJudgeRatings Lawyers' Ratings of State Judges in the US Superior Court USPersonalExpenditure Personal Expenditure Data UScitiesD Distances Between European Cities and Between US Cities VADeaths Death Rates in Virginia (1940) WWWusage Internet Usage per Minute WorldPhones The World's Telephones ability.cov Ability and Intelligence Tests airmiles Passenger Miles on Commercial US Airlines, 1937-1960 airquality New York Air Quality Measurements anscombe Anscombe's Quartet of 'Identical' Simple Linear Regressions attenu The Joyner-Boore Attenuation Data attitude The Chatterjee-Price Attitude Data austres Quarterly Time Series of the Number of Australian Residents beaver1 (beavers) Body Temperature Series of Two Beavers beaver2 (beavers) Body Temperature Series of Two Beavers cars Speed and Stopping Distances of Cars chickwts Chicken Weights by Feed Type co2 Mauna Loa Atmospheric CO2 Concentration crimtab Student's 3000 Criminals Data discoveries Yearly Numbers of Important Discoveries esoph Smoking, Alcohol and (O)esophageal Cancer euro Conversion Rates of Euro Currencies euro.cross (euro) Conversion Rates of Euro Currencies eurodist Distances Between European Cities and Between US Cities faithful Old Faithful Geyser Data fdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK freeny Freeny's Revenue Data freeny.x (freeny) Freeny's Revenue Data freeny.y (freeny) Freeny's Revenue Data infert Infertility after Spontaneous and Induced Abortion iris Edgar Anderson's Iris Data iris3 Edgar Anderson's Iris Data islands Areas of the World's Major Landmasses ldeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK lh Luteinizing Hormone in Blood Samples longley Longley's Economic Regression Data lynx Annual Canadian Lynx trappings 1821-1934 mdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK morley Michelson Speed of Light Data mtcars Motor Trend Car Road Tests nhtemp Average Yearly Temperatures in New Haven nottem Average Monthly Temperatures at Nottingham, 1920-1939 npk Classical N, P, K Factorial Experiment occupationalStatus Occupational Status of Fathers and their Sons precip Annual Precipitation in US Cities presidents Quarterly Approval Ratings of US Presidents pressure Vapor Pressure of Mercury as a Function of Temperature quakes Locations of Earthquakes off Fiji randu Random Numbers from Congruential Generator RANDU rivers Lengths of Major North American Rivers rock Measurements on Petroleum Rock Samples sleep Student's Sleep Data stack.loss (stackloss) Brownlee's Stack Loss Plant Data stack.x (stackloss) Brownlee's Stack Loss Plant Data stackloss Brownlee's Stack Loss Plant Data state.abb (state) US State Facts and Figures state.area (state) US State Facts and Figures state.center (state) US State Facts and Figures state.division (state) US State Facts and Figures state.name (state) US State Facts and Figures state.region (state) US State Facts and Figures state.x77 (state) US State Facts and Figures sunspot.month Monthly Sunspot Data, from 1749 to "Present" sunspot.year Yearly Sunspot Data, 1700-1988 sunspots Monthly Sunspot Numbers, 1749-1983 swiss Swiss Fertility and Socioeconomic Indicators (1888) Data treering Yearly Treering Data, -6000-1979 trees Girth, Height and Volume for Black Cherry Trees uspop Populations Recorded by the US Census volcano Topographic Information on Auckland's Maunga Whau Volcano warpbreaks The Number of Breaks in Yarn during Weaving women Average Heights and Weights for American Women
いやまぁたくさんはいってますね。
ということで、今回はよく使われているairquality
を使いたいと思います。
> head(airquality) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 > str(airquality) 'data.frame': 153 obs. of 6 variables: $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ... $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ... $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ... $ Temp : int 67 72 74 62 56 66 65 59 61 69 ... $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
head()
はそのデータの最初の方をチラ見する関数ですね。str()
ってのはstructure、データの構造を見るやつらしいです。ふむふむ。
で、このデータを使っていろいろやってきます。
まず、Rすげーってなるよしなに関数からいきます。
- plot()
> plot(airquality)
おおお!これがplot()
の真価・・・すげーそれっぽい・・・これだけ見てもおもしろいですね〜
- summary()
> summary(airquality) Ozone Solar.R Wind Temp Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 NA's :37 NA's :7 Month Day Min. :5.000 Min. : 1.0 1st Qu.:6.000 1st Qu.: 8.0 Median :7.000 Median :16.0 Mean :6.993 Mean :15.8 3rd Qu.:8.000 3rd Qu.:23.0 Max. :9.000 Max. :31.0
おおーー、もうこれでいいじゃんって感じですね。
でだ、早速、分析っぽいことをやってみようと思う。まずは単回帰分析。
はて、、という感じですが、今回は、オゾン量?と気温の相関を調べます。
なんか、回帰分析ってのは、「1つの目的変数を1つの説明変数で予測する手法」だそうで、ぶっちゃけ簡単に言うと、
目的変数(オゾン量)=定数1×説明変数(気温)+変数2
って形で表す感じで、過去のデータから線形グラフを求めて、未来のデータを予測するというものだそうです。
で、説明変数が複数になると「重回帰分析」と呼ぶそう。
-P値 ちなみに、定数のことを「偏回帰係数」と言います。ただ、偏回帰係数はそれぞれの変数が目的変数に どの程度影響を与えるか、ってのを直接表すものじゃない(cmをmにすると変わるし)ってことで、 じゃ、影響度合いを調べる指標ないのよ?ってことで、「P値」があります。詳細は省略しますが、 だいたい5%未満であれば、その係数は統計的に意味があると言えるそうです。 -マルチコ(多重共線性) 重回帰分析の場合、説明変数同士が高い相関関係にある場合、分析結果が不安定になるそうです。 これをマルチコリニアリティ(multicollinearity)と言います。なので、重回帰分析をする場合は、 各説明変数同士の相関関係を調べてからやったほうがいいみたいですね。具体的には、説明変数同士の 相関係数が0.5以上なら注意、0.7以上ならNGだそうです。
ということで、前振りが長くなりましたが、いざ分析。
オゾン量を気温で説明します。ハイ。
> cor(airquality$Ozone,airquality$Temp,use="complete.obs") [1] 0.6983603 > result <- lm(Ozone~Temp,data=airquality) > summary(result) Call: lm(formula = Ozone ~ Temp, data = airquality) Residuals: Min 1Q Median 3Q Max -40.729 -17.409 -0.587 11.306 118.271 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -146.9955 18.2872 -8.038 9.37e-13 *** Temp 2.4287 0.2331 10.418 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 23.71 on 114 degrees of freedom (37 observations deleted due to missingness) Multiple R-squared: 0.4877, Adjusted R-squared: 0.4832 F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16
まずは、cor()
でオゾン量と気温の相関を求めています(use="complete.obs"
はNAの時の扱い。この場合は、NAのある項目は排除する設定。)
で、相関が0.7くらいなのでそこそこの正の相関があるようです。つまり、分析する価値あり。
で、lmってのが回帰分析の関数らしいです。
これで、分析した結果をsummaryします。
結果の真ん中あたりのCoefficients:
に偏回帰係数(Estimate)が出てます。つまり、
目的変数(オゾン量)=2.429×説明変数(気温)-146.995
ってことですね。
(ちなみに、Coefficientsの後ろの方の*が多いほど、有意らしいです。MAX星3つなので、気温変数は三ツ星獲得です!
で、これで線形のグラフがかけますので、plotしてみます。
> plot(b,a) > abline(a=-146,b=2.4,col="red",lty=2)
いや〜素敵ですね笑
なんかそれっぽいじゃないっすか!