

Bellabeat is a manufacturer of health tracking devices that is looking to become a more dominant player in this space. They offer different devices that monitor activity data such as sleep, stress, walking activity, and calories burned. They have provided a data set to analyze in order to reveal opportunities for potential business growth.

  • Business Task

    Identify potential opportunities for growth and make recommendations for the Bellabeat marketing strategy based on third-party data, trends, and research.
  • Data Source

    Source Used: FitBit Fitness Tracker Data (CC0: Public Domain), a Kaggle dataset made available through Mobius, covering 04-12-2016 to 05-12-2016.
    Original data set provided as 18 csv files.

  • Stakeholders

    • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
    • Sando Mur: Mathematician and Bellabeat’s cofounder
    • Bellabeat Analytics Team

1 Install prerequisite software.

rm(list = ls()) 

# Install prereq software if needed.
# install.packages(c("plyr","ggpubr","ggrepel","RColorBrewer","plotly","waffle","scales","viridis","janitor","skimr","lubridate",

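#   Load the libraries used by the code below
library(tidyverse)     # read_csv, dplyr verbs, tidyr::separate, ggplot2
library(lubridate)     # ymd, parse_date_time, seconds_to_period
library(RColorBrewer)  # brewer.pal
library(writexl)       # write_xlsx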

## 1.0 Import databases from csv files

daily_activity <- read_csv("./data/dailyActivity_merged.csv")
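
The section works with several tables, but only one import is shown. As a minimal sketch, assuming the remaining csv files sit in the same folder, the imports could be driven from one named vector. The second file name and the toy contents below are assumptions for illustration, and the files are written to a temporary folder so the example runs on its own.

```r
# Sketch: read several csv files into one named list.
# File names and values are stand-ins, not the real data.
data_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(data_dir, showWarnings = FALSE)

files <- c(
  daily_activity = "dailyActivity_merged.csv",
  daily_sleep    = "sleepDay_merged.csv"   # assumed file name
)

# Write two tiny stand-in files so the example is self-contained.
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(1000, 2500)),
          file.path(data_dir, files[["daily_activity"]]), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalMinutesAsleep = c(400, 380)),
          file.path(data_dir, files[["daily_sleep"]]), row.names = FALSE)

# Read every file; the list keeps the names given in `files`.
tables <- lapply(files, function(f) read.csv(file.path(data_dir, f)))

names(tables)
nrow(tables$daily_activity)
```

Keeping the tables in a named list makes it easy to apply the same check to all of them; individual data frames can still be pulled out with `tables$daily_sleep`.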
2 Preview Data

2.1 Cleaning and Formatting Data - Show unique items

2.2 Check for Duplicates.

### Shows 3 duplicates.
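
The duplicate check itself is not shown above. One possible way to do it, sketched here on made-up toy tables rather than the real data, is to count fully repeated rows in each table with base R's `duplicated()`.

```r
# Toy stand-ins for two of the tables; the values are made up.
tables <- list(
  daily_sleep  = data.frame(Id = c(1, 1, 2),
                            TotalMinutesAsleep = c(400, 400, 380)),
  hourly_steps = data.frame(Id = c(1, 2, 3),
                            StepTotal = c(120, 0, 45))
)

# duplicated() flags every row that repeats an earlier row exactly;
# summing the flags gives the number of duplicate rows per table.
dup_counts <- sapply(tables, function(df) sum(duplicated(df)))
dup_counts
```

In this toy example only `daily_sleep` contains a repeated row.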

2.3 Remove duplicates and N/A

daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()

daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()

hourly_calories <- hourly_calories %>%
  distinct() %>%
  drop_na()

hourly_intensities <- hourly_intensities %>%
  distinct() %>%
  drop_na()

hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()

2.4 Verify the cleaning process above and Check for duplicates and NA

####  Check for N/A.
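
A minimal sketch of this verification step, using a made-up toy table and base R only: count missing values per column, clean the table, then confirm that no duplicates or missing values remain.

```r
# Toy stand-in for one of the tables; the values are made up.
sleep_demo <- data.frame(
  Id = c(1, 2, 2, 3),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 380, 380, NA)
)

# Missing values per column before cleaning.
na_before <- colSums(is.na(sleep_demo))

# Base R counterpart of removing rows with missing values and duplicates.
sleep_clean <- unique(sleep_demo[complete.cases(sleep_demo), ])

# Both checks should now come back as zero.
remaining_duplicates <- sum(duplicated(sleep_clean))
remaining_na <- sum(is.na(sleep_clean))
```

Checking per column with `colSums(is.na(...))` also shows where the missing values sit, which a single total would hide.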

2.5 Format time

hourly_calories$ActivityHour=as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
hourly_intensities$ActivityHour=as.POSIXct(hourly_intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
daily_sleep$SleepDay=as.POSIXct(daily_sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
daily_activity$ActivityDate=as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y",tz = Sys.timezone())
###obtaining Day of Week from date
daily_activity <- daily_activity %>% 
  mutate(Day = format(ymd(ActivityDate), format = '%a'))

2.6 Verify time was adjusted to POSIXct POSIXt
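
The check for this step is not shown. A small sketch of what it could look like, on two made-up timestamps in the same text format as the hourly files: parse them, then inspect the class and look for values that failed to parse. The fixed `tz = "UTC"` is an assumption for the example; the code above uses `Sys.timezone()`.

```r
# Two made-up timestamps in the "month/day/year hour:minute:second AM/PM" format.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

class(parsed)    # "POSIXct" "POSIXt"
anyNA(parsed)    # TRUE would mean some strings did not match the format
```

Both `%p` (AM/PM) and `%a` (abbreviated day name) depend on the R session's locale, so in a non-English locale the parse can return `NA` and the day labels may not match the 'Mon' to 'Sun' factor levels used later.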

2.7 Merging Tables

#merging hourly data frames (steps,calories,intensities)----
hourlies_df <- hourly_steps %>% 
  left_join(hourly_calories, by = c("Id", "ActivityHour")) %>% 
  left_join(hourly_intensities, by = c("Id", "ActivityHour")) %>% 
  separate(ActivityHour, sep = " ", into = c("date","time")) %>% 
  mutate(day = format(ymd(date), format = '%a')) %>% 
  mutate(time = format(parse_date_time(as.character(time), "HMS"), format = "%H:%M")) %>% 
  mutate(date = as.POSIXct(date))

3 Export new Data Frames to Excel (backup)

write_xlsx(hourlies_df, "E:\\Websites\\R-Projects\\new_data\\hourlies_df.xlsx")
write_xlsx(daily_activity, "E:\\Websites\\R-Projects\\new_data\\daily_activity.xlsx")
write_xlsx(daily_sleep, "E:\\Websites\\R-Projects\\new_data\\daily_sleep.xlsx")
write_xlsx(hourly_calories, "E:\\Websites\\R-Projects\\new_data\\hourly_calories.xlsx")
write_xlsx(hourly_intensities, "E:\\Websites\\R-Projects\\new_data\\hourly_intensities.xlsx")
write_xlsx(hourly_steps, "E:\\Websites\\R-Projects\\new_data\\hourly_steps.xlsx")

3.1 Classify how often the trackers are used: Low, Moderate, High

daily_use2 <- daily_activity %>%
  filter(TotalSteps > 200) %>%               # keep only days with more than 200 steps
  group_by(Id) %>%
  dplyr::summarize(ActivityDate = n()) %>%   # number of remaining days per user
  mutate(Usage = case_when(
    ActivityDate >= 1 & ActivityDate <= 15 ~ "Low Use",
    ActivityDate >= 16 & ActivityDate <= 22 ~ "Moderate Use", 
    ActivityDate >= 23 & ActivityDate <= 31 ~ "High Use")) %>% 
  mutate(Usage = factor(Usage, levels = c('Low Use','Moderate Use','High Use'))) %>% 
  rename(days_used = ActivityDate)
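
To make the grouping rule concrete, here is a small sketch on made-up data. It swaps `case_when()` for base R's `cut()`, which applies the same three bands (1-15, 16-22, 23-31 days); the user ids and step counts are invented for illustration.

```r
# Made-up activity log: three users with 10, 20 and 31 recorded days.
activity_demo <- data.frame(
  Id = rep(c(101, 102, 103), times = c(10, 20, 31)),
  TotalSteps = 5000
)
activity_demo$TotalSteps[1:2] <- 50   # two near-empty days for user 101

# Keep only days with more than 200 steps, then count days per user.
worn <- activity_demo[activity_demo$TotalSteps > 200, ]
days_used <- table(worn$Id)

# cut() uses right-closed intervals: (0,15], (15,22], (22,31].
usage_demo <- data.frame(
  Id = as.numeric(names(days_used)),
  days_used = as.integer(days_used),
  Usage = cut(as.integer(days_used),
              breaks = c(0, 15, 22, 31),
              labels = c("Low Use", "Moderate Use", "High Use"))
)
usage_demo
```

User 101 ends up with 8 counted days because the two low-step days are filtered out before counting, which is why the step threshold matters for the group a user lands in.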

4 Analyzing the Data

daily_use <- daily_activity %>% 
  left_join(daily_use2, by = 'Id') %>%
  group_by(Usage) %>% 
  summarise(participants = n_distinct(Id)) %>% 
  mutate(perc = participants/sum(participants)) %>% 
  arrange(perc) %>% 
  mutate(perc = scales::percent(perc))

ggplot(daily_use,aes(fill=Usage ,y = participants, x="")) +
  geom_bar(stat="identity", width=2, color="white") +
  coord_polar("y", start=0)+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5,vjust= -5, size = 20, face = "bold")) +
  geom_text(aes(label = perc, x=1.2),position = position_stack(vjust = 0.5))+
  labs(title="Tracker use Percentage", tag = "figure 1")+
  guides(fill = guide_legend(title = "Usage Amount"))

  options(repr.plot.width = 2, repr.plot.height = 1)

4.1 As expected from how the groups are defined, the high use group wore their trackers on the most days and the low use group on the fewest.

4.2 Creating step data

#data manipulation to add Usage Types to 'daily_activity' df
daily_activity_usage <- daily_activity %>% 
  left_join(daily_use2, by = 'Id') %>% 
  mutate(day = format(ymd(ActivityDate), format = '%a')) %>% 
  mutate(total_minutes_worn = SedentaryMinutes+LightlyActiveMinutes+
           FairlyActiveMinutes+VeryActiveMinutes) %>% 
  mutate(total_hours = seconds_to_period(total_minutes_worn * 60))
#data for steps 
steps_hour <- daily_activity_usage %>% 
  group_by(day) %>%   
  summarise(mean_steps = round(mean(TotalSteps))) %>%
 mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
### * plot for avg steps by day 
ggplot(steps_hour, aes(x = day, y= mean_steps, fill = mean_steps)) +
    geom_col(color="darkblue", size = 0.1) +  
 scale_fill_gradientn(limits=c(0,9000), breaks=seq(0,9000, by = 1500), colors = brewer.pal(11,"Spectral")) + 
  scale_y_continuous(limits=c(0,9000), breaks=seq(0, 9000, by = 1500))+ 
labs(title= ("Average Steps"), tag = "figure 2", subtitle = ('By Day'), x="" , y="Steps")+
theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(legend.position = "top")+
guides(fill = guide_colourbar(barwidth = 12))

options(repr.plot.width = 10, repr.plot.height = 8)

5 Observations 01:

  • The highest step day is Saturday, followed by Tuesday and Monday, with steps trailing off over the rest of the weekdays.
  • Unsurprisingly, Sunday is the lowest, a rest day.
### Average steps by Group

stepsbygroup <- daily_activity_usage %>% 
group_by(day,Usage) %>%
dplyr::select(Usage, TotalSteps, day) %>%
    mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
stepsbygroup %>%
  ggplot(aes(x= Usage, y= TotalSteps, fill= Usage)) +
    geom_boxplot() +
    scale_y_continuous(limits=c(0,38000), breaks=seq(0,38000, by = 4000))+
    # the earlier ggtitle()/xlab() calls were overridden by labs(), so only labs() is kept
    labs(title= ("Average Steps by Group"), tag = "figure 3",x="average per week" , y="Steps")+
    theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(legend.position = "top")

options(repr.plot.width = 5, repr.plot.height = 6)

6 Average Calories by Day and Usage Group

caloriesbp <- daily_activity_usage %>% 
group_by(day,Usage) %>%
dplyr::select(Usage, Calories, day) %>%
    mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
caloriesbp %>%
  ggplot(aes(x=day , y= Calories, fill= Usage)) +
    geom_boxplot() +
    scale_y_continuous(limits=c(0,3000), breaks=seq(0,3000, by = 400))+
    # the earlier ggtitle()/xlab() calls were overridden by labs(), so only labs() is kept
    labs(title= ("Average Calories"), tag = "figure 4", subtitle = ('By Usage Group'), x="Day" , y="Calories")+
    theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(angle = 90, size=12, hjust= 0, vjust = 0.3))+
    theme(axis.title.x = element_text(margin = margin(t = 18, r = 0, b = 0, l = 3)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))
### Observations 02: Average Calories Per Group

  • Average calories burned for the low use group was around 2100.
  • The high use group averaged the most calories, which would be expected.
  • The moderate group was the least consistent, which brought its average below that of the low use group, apparently because some individuals were unmotivated on some days. Some type of motivational alert may be helpful for them.
  • The day of the week didn’t show any particular pattern between the groups.
  • The low use group exercised the most on Tuesday, the moderate use group on Saturday, and the high use group on Tuesday.
  • The high use group at least doubled the use of the low use group.
  • What stands out is that the moderate group had the most inconsistent users, which brought down the group’s totals. Again, an inactivity alert could be used to try to motivate the inconsistent users.
###   Average hourly steps by usage group (reuses hourlies_df built in section 2.7)
stephr <- left_join(hourlies_df, daily_use2, by = 'Id') %>% 
  mutate(day = factor(day, levels = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun'))) %>% 
  group_by(Usage, time, day) %>% 
  summarize(steps = round(mean(StepTotal),2), .groups = 'drop')
## --  Make time as hour with only 2 characters in new hour column  --
stephr$hour <- as.numeric(substr(stephr$time,1,2))
sapply(stephr, class)
###   Grouped
stephr %>%
  ggplot(aes(x = day, y = steps, fill = Usage))+   # `steps` is the summarised column; StepTotal no longer exists here
    geom_bar(stat = "identity", position = "dodge")+
    labs(title = "The Relationship between Usage, Steps and Day of the week.", tag = "figure 5")


7.0.1 Relationship between Usage, Steps and Day of the week.

  • There is no significant difference between the days of the week in any of the user groups.
  • The high use group has more than double the steps of the low use group.
  • That seems to be an anomaly more than anything else, as it was only by a small amount.
  • The moderate group varied the most, from around 3750 steps to around 7600 steps.
###    Lollipop Workout Chart

intensity1 <- hourlies_df %>% 
    filter(TotalIntensity > 0) %>%
  group_by(day) %>%   
  summarise(mean_intensity = round(mean(TotalIntensity)),
            std_mean_intensity = round(sd(TotalIntensity))) %>%
 mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
ggplot(intensity1,aes(x=day, y=mean_intensity, fill = mean_intensity, group = 1)) +
  geom_segment( aes(x=day, xend=day,y=0, yend=mean_intensity)) +
  scale_y_continuous(limits=c(0,25), breaks=seq(0,26, by = 2))+
  geom_point( size=5, color="red", fill=alpha("orange", 0.3), alpha=0.7, shape=21, stroke=2)+
  labs(title ="",tag = "figure 6")+
xlab("Days of the week")+
ylab("Workout Intensity")+
ggtitle("Work out intensity")

### Workout Intensity: Grouped workout chart

### Join data frames hourlies_df and daily_use2 to make group_intensity

group_intensity <- hourlies_df %>%
  left_join(daily_use2, by = 'Id') %>% 
  mutate(day = factor(day,level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun'))) %>% 
  filter(TotalIntensity > 0) %>%
  group_by(Usage, day) %>%
  summarise(intensity = round(mean(TotalIntensity), 1), .groups = 'keep')
ggplot(group_intensity, aes(x = day, y= intensity, fill = intensity)) +
    geom_col(color="pink", size = 0.1)+
    scale_fill_gradientn(limits=c(0,25), breaks=seq(0,25, by = 5), colours = brewer.pal(5, "YlOrRd")) + 
    scale_y_continuous(limits=c(0,25), breaks=seq(0,25, by = 5))+
    labs(title= ("Average Intensity"),tag = "Figure 7", subtitle = ('By Days, Groups'), x="Days" , y="Intensity")+
    theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(legend.position = "top")+
guides(fill = guide_colourbar(barwidth = 12))+
  coord_flip()

  • Saturdays seem to be intense for all groups, with Monday being the next most intense workout day in all groups.
  • The high intensity group was fairly consistent overall.
  • Nothing is unusual in these plots, as they show what would be expected for each group segment.

7.0.2 Another Plot for Minutes Worn

# Time worn

7.1 Sleep vs distance covered

8 Analysis and Suggestions:

8.0.1 * There is a relationship showing that users who cover over 7.5 miles sleep less than users who cover less distance.

8.0.2 * The sweet spot seems to be around 6-7 miles, enough exercise before it starts to affect sleep.
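
The code behind the sleep and distance comparison is not shown. The sketch below illustrates one way such a comparison could be set up, joining daily activity to daily sleep by user and date. All values are made up, the column names `TotalDistance` and `TotalMinutesAsleep` are assumed, and the 7.5 threshold is taken from the text with the distance unit assumed to match.

```r
# Made-up daily activity and sleep records for two users.
activity_demo <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalDistance = c(5.0, 8.0, 6.5, 9.0)
)
sleep_demo <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(450, 390, 480, 360)
)

# Join on user and date; only days present in both tables are kept.
joined <- merge(activity_demo, sleep_demo,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "SleepDay"))
joined$hours_asleep <- joined$TotalMinutesAsleep / 60

# Split days at the 7.5 threshold and compare mean sleep in each band.
joined$distance_band <- ifelse(joined$TotalDistance > 7.5, "over 7.5", "7.5 or less")
avg_sleep <- tapply(joined$hours_asleep, joined$distance_band, mean)
avg_sleep
```

With only 30 or so users, a comparison like this shows a pattern worth investigating rather than evidence that distance causes shorter sleep.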

9 This Johns Hopkins sleep study suggests adults between 18 and 64 years of age need between 7 and 9 hours of sleep per day.

Johns Hopkins Sleep Study

Final Analysis

This study is limited to a small set of users (30), so it is not an ideal basis for a case study. Even so, there should be enough data to reveal small patterns and correlations that support educated decisions to help the stakeholders of Bellabeat.

From this data, we can see that there may be an opportunity for the marketing team to increase sales by making some small changes to the devices and marketing. The last chart (figure 9) shows a health benefit from getting between 7 and 9 hours of sleep. It also shows that covering more than 8 miles has an inverse relationship to sleep.

Combining that with the other data, most notably the amount of time users wear this type of device (highest among the moderate users, figure 8), there seems to be an opportunity to market towards a moderate user instead of a hard-core athlete.

Moderate users seem to wear the devices longer and do not tend to be the “cross trainer” type of athlete, but rather someone who would want such a device to improve a moderate or sedentary lifestyle.

There appears to be an opportunity to market towards improving overall health, improving sleep, and monitoring daily activities for a gradual overall improvement in health.

One option is to add software that alerts users when they become sedentary, or even a scoring system where users can compete daily with themselves.

The data seems to suggest the person interested in wearing such a device for a long period of time (figure 8) would be a more “middle of the road” type of person looking to make small changes.

This looks like a rewarding opportunity to serve a type of user that seems to have been overlooked in the marketplace of activity trackers.


