In this post we analyse a dataset acquired by the National Institute of Diabetes and Digestive and Kidney Diseases in 1990. It consists of records of 768 women, all at least 21 years old, who may or may not have diabetes. The women belong to the Pima Indian tribe of Arizona, who live along the Gila and Salt rivers.
The data set consists of variables such as blood pressure, glucose level, insulin level, number of pregnancies, skin thickness, body mass index and outcome (positive/negative).
Age and Diabetes
In the above plot, the factor level 1 denotes the onset of diabetes. We see that the median ages of the two categories differ: the median age of the women with diabetes in this data set is much higher. This could be attributed to reduced physical activity as we get older. Diet can also play a large role here.
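The comparison of medians can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `records` list and the function name `median_age_by_outcome` are made up for the example and are not taken from the dataset or from the original code.

```python
from statistics import median

# Toy (age, outcome) pairs: illustrative values only, NOT taken from the dataset.
# outcome 1 = onset of diabetes, outcome 0 = no diabetes.
records = [(22, 0), (25, 0), (31, 0), (45, 0), (29, 1), (38, 1), (41, 1), (52, 1)]


def median_age_by_outcome(rows):
    """Return {outcome: median age} for an iterable of (age, outcome) pairs."""
    groups = {}
    for age, outcome in rows:
        groups.setdefault(outcome, []).append(age)
    return {outcome: median(ages) for outcome, ages in groups.items()}


medians = median_age_by_outcome(records)
print(medians)
```

A box plot shows the same quantity as the line in the middle of each box, so a table like this is a quick way to check what the plot suggests.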
Skin Thickness and Blood Pressure
There seems to be no apparent pattern between skin thickness and blood pressure. Let’s take a closer look at this plot. There are individuals with a skin thickness of 0! This is not possible. These are most likely data collection errors that were never rectified.
What would happen if we removed these points?
With those points removed, we see a very weak correlation between skin thickness and blood pressure. In the above plot, we have removed outliers from both columns to better see what relationship might exist. The relationship isn’t linear here.
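One simple way to drop the impossible readings is to keep only rows where both measurements are strictly positive. The sketch below assumes that a value of 0 stands for a missing measurement; the `pairs` list and the helper `drop_zero_readings` are hypothetical, and this shows only the zero filter, not whatever outlier rule was used for the plot.

```python
# Toy (skin_thickness, blood_pressure) pairs: illustrative values only.
pairs = [(0, 72), (35, 66), (29, 0), (23, 64), (0, 0), (32, 88)]


def drop_zero_readings(rows):
    """Keep only rows in which every measurement is strictly positive."""
    return [row for row in rows if all(value > 0 for value in row)]


clean = drop_zero_readings(pairs)
print(f"kept {len(clean)} of {len(pairs)} rows")
```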
BMI and Skin Thickness
The above plot suggests that the relationship might be linear. The points have been coloured and shaped so that we can tell the positive results from the negative ones: the triangles are positive.
A simple correlation calculation tells us that the linear relationship between BMI and skin thickness is moderately strong: the correlation coefficient is higher than 0.5 (0.631 to be exact).
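For reference, the Pearson correlation coefficient can be computed by hand as below. This is a sketch with made-up `bmi` and `skin` values, so it will not reproduce the 0.631 reported above; it only shows what the number measures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values: illustrative only, NOT taken from the dataset.
bmi = [22.0, 26.5, 30.1, 33.6, 38.2]
skin = [15.0, 22.0, 21.0, 35.0, 33.0]

r = pearson_r(bmi, skin)
print(f"r = {r:.3f}")
```

The coefficient always lies between -1 and 1. Values near 1 mean the points sit close to an upward-sloping straight line.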
When the individual has diabetes, SkinThickness can be modelled as follows:
The model when the diabetes result is negative is:
The above models tell us that for every 1 unit increase in the BMI of an individual who has diabetes, skin thickness increases by about 0.8089 units. The increase is about 0.93297 units for those who do not have diabetes. The difference could be due to other factors such as cholesterol levels, exercise etc.
The above models can be visualized by the following plot.
We see from the above plot that the regression line fitted to the individuals without diabetes has a slightly higher slope.
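The per-group fits described in this section can be sketched as one ordinary least squares line per outcome. The `rows` below are toy values and `fit_line` is a hypothetical helper, so the slopes it prints are not the 0.8089 and 0.93297 quoted above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy (bmi, skin_thickness, outcome) rows: illustrative values only.
rows = [
    (24.0, 18.0, 0), (28.0, 22.0, 0), (32.0, 25.0, 0), (36.0, 30.0, 0),
    (27.0, 24.0, 1), (31.0, 27.0, 1), (35.0, 30.0, 1), (40.0, 34.0, 1),
]

# Fit one line per outcome group (0 = negative, 1 = positive).
fits = {}
for outcome in (0, 1):
    xs = [bmi for bmi, _, o in rows if o == outcome]
    ys = [skin for _, skin, o in rows if o == outcome]
    fits[outcome] = fit_line(xs, ys)

for outcome, (intercept, slope) in fits.items():
    print(f"outcome {outcome}: skin = {intercept:.2f} + {slope:.3f} * bmi")
```

The slope is the quantity interpreted in the text: the expected change in skin thickness for a 1 unit increase in BMI within that group.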
Glucose and Diabetes
In the above plot, we find that the distribution of glucose levels differs between individuals who have diabetes and those who do not. The mean glucose level tends to be higher for those who have diabetes. This is true for the sample of 768 that we have. But is it true for the population? To answer this we do a one-sided t-test, which relies on the following conditions:
- The distributions are skewed, but the sample size is greater than 30
- The individuals are independent of each other
- Both the groups are independent of each other
- The sample size is less than 10% of the population
Let’s formulate a hypothesis to assess the difference in mean glucose levels between the positive and negative groups. H0: the mean difference is 0. HA: the mean difference is greater than 0.
The p-value for this hypothesis test is 3.390826e-33. This is far below any conventional significance level, so we reject the null hypothesis in favour of the alternative that the mean difference is greater than zero. Hence, the population with diabetes has a higher mean glucose level than the population without diabetes.
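The mechanics of such a test can be sketched with the standard library alone. The example below computes a Welch (unequal variance) t statistic and then approximates the one-sided p-value with the standard normal distribution instead of the t distribution. That approximation is an assumption of this sketch and is only reasonable for large samples. The toy samples are kept small just to show the steps, and the function name `welch_one_sided` is made up here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_one_sided(sample_a, sample_b):
    """Test H0: mean(a) - mean(b) = 0 against HA: mean(a) - mean(b) > 0.

    Returns (t statistic, approximate one-sided p-value). The p-value uses
    the standard normal in place of the t distribution, which is only a
    reasonable shortcut when both samples are large.
    """
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t_stat = (mean(sample_a) - mean(sample_b)) / standard_error
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_value


# Toy glucose readings: illustrative values only, NOT taken from the dataset.
positive = [150, 160, 140, 170, 155, 165]
negative = [100, 110, 105, 120, 95, 115]

t_stat, p_value = welch_one_sided(positive, negative)
print(f"t = {t_stat:.2f}, approximate one-sided p = {p_value:.3g}")
```

A statistics package that provides the t distribution would give an exact p-value for small samples; the decision rule is the same either way: a small p-value is evidence against H0.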
Age and Pregnancies
We see that most individuals were young adults and adults. The oldest individual in this set is 81 years old. The highest average number of pregnancies is found among women aged 51. The above table displays only ages that appear more than 5 times.
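A table like the one described here can be built by grouping the records by age, averaging the number of pregnancies, and keeping only ages that occur more than 5 times. The sketch below uses made-up rows and a hypothetical helper, `mean_pregnancies_by_age`.

```python
from collections import defaultdict


def mean_pregnancies_by_age(rows, min_occurrences=5):
    """Average pregnancies per age, for ages seen more than min_occurrences times."""
    by_age = defaultdict(list)
    for age, pregnancies in rows:
        by_age[age].append(pregnancies)
    return {
        age: sum(values) / len(values)
        for age, values in sorted(by_age.items())
        if len(values) > min_occurrences
    }


# Toy (age, pregnancies) rows: illustrative values only.
rows = [(21, 1)] * 4 + [(21, 2)] * 2 + [(30, 3)] * 6 + [(51, 8)] * 3

table = mean_pregnancies_by_age(rows)
for age, average in table.items():
    print(f"age {age}: {average:.2f} pregnancies on average")
```

In this toy input, age 51 appears only 3 times and is therefore left out, which shows why the filter matters: an average over a handful of women is not very informative.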
Logistic regression is a regression model in which the dependent variable is categorical. In this case, it’s the onset of diabetes in individuals. The output of the model we built is a probability. Taking the threshold as 0.5, we get an accuracy of about 79%.
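To make the idea concrete, here is a minimal sketch of a logistic regression with a single predictor, fitted by plain batch gradient descent. It is a simplification and not the model from this post: the single standardised feature, the toy data, the learning rate and the number of epochs are all assumptions made for illustration.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to b0 and b1.
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Toy standardised glucose values and outcomes: illustrative only.
xs = [-1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 1.0, 1.6]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
probabilities = [sigmoid(b0 + b1 * x) for x in xs]

# Turn probabilities into class labels with a 0.5 threshold.
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, accuracy = {accuracy:.2f}")
```

The model itself only produces probabilities. The class labels, and therefore the accuracy, depend on the threshold applied afterwards.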
Changing the threshold
We achieve an accuracy of 82% in predicting diabetes by using 0.51 as the threshold.
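The effect of moving the threshold can be explored by scoring the same predicted probabilities against a range of cut-offs. The sketch below uses made-up probabilities and labels, so the accuracies it prints are not the 79% and 82% reported in this post; `accuracy_at` is a hypothetical helper.

```python
def accuracy_at(probabilities, labels, threshold):
    """Share of correct predictions when p >= threshold is classed as positive."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    return sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)


# Toy predicted probabilities and true outcomes: illustrative values only.
probabilities = [0.10, 0.35, 0.48, 0.505, 0.52, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1]

# Candidate thresholds from 0.30 to 0.70 in steps of 0.01.
thresholds = [round(0.30 + 0.01 * i, 2) for i in range(41)]

# max() returns the first threshold that reaches the highest accuracy.
best = max(thresholds, key=lambda t: accuracy_at(probabilities, labels, t))
print(f"accuracy at 0.50: {accuracy_at(probabilities, labels, 0.50):.3f}")
print(f"best threshold: {best}, accuracy: {accuracy_at(probabilities, labels, best):.3f}")
```

When the threshold is chosen on the same data used to measure accuracy, the improvement can be partly due to chance, so it is worth checking the chosen value on held-out observations.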
We have been able to build a model that predicts diabetes from the given variables with a fair amount of accuracy. Assuming the conditions mentioned hold, this data set also provides strong evidence that the mean glucose level of individuals with diabetes is higher than that of individuals without diabetes.
The code for this article can be found here