*

Summary


Bellabeat is a manufacturer of health tracking devices that is looking to become a more dominant player in this space. They offer #different devices that monitor data activity such as sleep, stress, walking activity, and calories burned. They have provided a data #set to analyze in order to reveal opportunities to explore for potential business growth.

  • Business Task


    Identify potential opportunities for growth and make recommendations for the Bellabeat marketing strategy based on third party data, trends and research.
  • Data Source


    Source Used: FitBit Fitness Tracker Data (CC0: Public Domain Kaggle dataset made available through Mobius, 04-12-2016, - 05-12-2016.
    Original data set provided as 18 csv files.


  • Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder
  • Bellabeat Analytics Team

1 Install prerequisite software.

rm(list = ls()) 

# Install prereq software if needed.
# install.packages(c("plyr","ggpubr","ggrepel","RColorBrewer","plotly","waffle","scales","viridis","janitor","skimr","lubridate",
#"tidyverse","readr","dplyr","glmnet","plyr","writexl","kableExtra","highcharter"))

#   Load Library

#install.packages("extrafont")
library(extrafont)
## Registering fonts with R
font_import(pattern = "lmroman*")
## Importing fonts may take a few minutes, depending on the number of fonts and the speed of the system.
## Continue? [y/n]
## Exiting.
loadfonts(device = "win")
par(family = "LM Roman 10")

gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 488014 26.1    1071408 57.3   645834 34.5
## Vcells 857961  6.6    8388608 64.0  2324874 17.8
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)   # For transforming numbers in percentage
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(RColorBrewer)  # Pallete colors
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(RColorBrewer)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
## 1.0 Import databases from csv files

daily_activity <- read_csv("./data/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_calories <- read_csv("./data/hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_intensities <- read_csv("./data/hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("./data/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("./data/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

2 Preview Data

head(daily_activity,3)
## # A tibble: 3 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(daily_sleep,3)
## # A tibble: 3 x 5
##           Id SleepDay           TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##        <dbl> <chr>                         <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:0~                1              327            346
## 2 1503960366 4/13/2016 12:00:0~                2              384            407
## 3 1503960366 4/15/2016 12:00:0~                1              412            442
head(hourly_calories,3)
## # A tibble: 3 x 3
##           Id ActivityHour          Calories
##        <dbl> <chr>                    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366 4/12/2016 1:00:00 AM        61
## 3 1503960366 4/12/2016 2:00:00 AM        59
head(hourly_intensities,3)
## # A tibble: 3 x 4
##           Id ActivityHour          TotalIntensity AverageIntensity
##        <dbl> <chr>                          <dbl>            <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM             20            0.333
## 2 1503960366 4/12/2016 1:00:00 AM               8            0.133
## 3 1503960366 4/12/2016 2:00:00 AM               7            0.117
head(hourly_steps,3)
## # A tibble: 3 x 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151

2.1 Cleaning and Formatting Data - Show unique items

unique(hourly_calories)
## # A tibble: 22,099 x 3
##            Id ActivityHour          Calories
##         <dbl> <chr>                    <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM       81
##  2 1503960366 4/12/2016 1:00:00 AM        61
##  3 1503960366 4/12/2016 2:00:00 AM        59
##  4 1503960366 4/12/2016 3:00:00 AM        47
##  5 1503960366 4/12/2016 4:00:00 AM        48
##  6 1503960366 4/12/2016 5:00:00 AM        48
##  7 1503960366 4/12/2016 6:00:00 AM        48
##  8 1503960366 4/12/2016 7:00:00 AM        47
##  9 1503960366 4/12/2016 8:00:00 AM        68
## 10 1503960366 4/12/2016 9:00:00 AM       141
## # ... with 22,089 more rows
unique(hourly_steps)
## # A tibble: 22,099 x 3
##            Id ActivityHour          StepTotal
##         <dbl> <chr>                     <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM       373
##  2 1503960366 4/12/2016 1:00:00 AM        160
##  3 1503960366 4/12/2016 2:00:00 AM        151
##  4 1503960366 4/12/2016 3:00:00 AM          0
##  5 1503960366 4/12/2016 4:00:00 AM          0
##  6 1503960366 4/12/2016 5:00:00 AM          0
##  7 1503960366 4/12/2016 6:00:00 AM          0
##  8 1503960366 4/12/2016 7:00:00 AM          0
##  9 1503960366 4/12/2016 8:00:00 AM        250
## 10 1503960366 4/12/2016 9:00:00 AM       1864
## # ... with 22,089 more rows
unique(hourly_intensities)
## # A tibble: 22,099 x 4
##            Id ActivityHour          TotalIntensity AverageIntensity
##         <dbl> <chr>                          <dbl>            <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM             20            0.333
##  2 1503960366 4/12/2016 1:00:00 AM               8            0.133
##  3 1503960366 4/12/2016 2:00:00 AM               7            0.117
##  4 1503960366 4/12/2016 3:00:00 AM               0            0    
##  5 1503960366 4/12/2016 4:00:00 AM               0            0    
##  6 1503960366 4/12/2016 5:00:00 AM               0            0    
##  7 1503960366 4/12/2016 6:00:00 AM               0            0    
##  8 1503960366 4/12/2016 7:00:00 AM               0            0    
##  9 1503960366 4/12/2016 8:00:00 AM              13            0.217
## 10 1503960366 4/12/2016 9:00:00 AM              30            0.5  
## # ... with 22,089 more rows
unique(daily_sleep)
## # A tibble: 410 x 5
##            Id SleepDay          TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##         <dbl> <chr>                        <dbl>            <dbl>          <dbl>
##  1 1503960366 4/12/2016 12:00:~                1              327            346
##  2 1503960366 4/13/2016 12:00:~                2              384            407
##  3 1503960366 4/15/2016 12:00:~                1              412            442
##  4 1503960366 4/16/2016 12:00:~                2              340            367
##  5 1503960366 4/17/2016 12:00:~                1              700            712
##  6 1503960366 4/19/2016 12:00:~                1              304            320
##  7 1503960366 4/20/2016 12:00:~                1              360            377
##  8 1503960366 4/21/2016 12:00:~                1              325            364
##  9 1503960366 4/23/2016 12:00:~                1              361            384
## 10 1503960366 4/24/2016 12:00:~                1              430            449
## # ... with 400 more rows
unique(daily_activity)
## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

2.2 Check for Duplicates.

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_calories))
## [1] 0
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
### Shows 3 duplicates.

2.3 Remove duplicates and N/A

daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()

daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()

hourly_calories <- hourly_calories %>%
  distinct() %>%
  drop_na()

hourly_intensities <- hourly_intensities %>%
  distinct() %>%
  drop_na()

hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()

2.4 Verify the cleaning process above and Check for duplicates and NA

sum(duplicated(hourly_calories))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 0
sum(duplicated(daily_activity))
## [1] 0
####  Check for N/A.

sum(is.na(hourly_calories))
## [1] 0
sum(is.na(hourly_steps))
## [1] 0
sum(is.na(hourly_intensities))
## [1] 0
sum(is.na(daily_sleep))
## [1] 0
sum(is.na(daily_activity))
## [1] 0

2.5 Format time

##      ---  formatting date and time for Hourly items.  ----

hourly_calories$ActivityHour=as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
##hourly_steps
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
##hourly_intensities
hourly_intensities$ActivityHour=as.POSIXct(hourly_intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
##sleep_day 
daily_sleep$SleepDay=as.POSIXct(daily_sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p",tz = Sys.timezone())
##daily_activity 
daily_activity$ActivityDate=as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y",tz = Sys.timezone())
###obtaining Day of Week from date
daily_activity <- daily_activity %>% 
  mutate(Day = format(ymd(ActivityDate), format = '%a'))

2.6 Verify time was adjusted to POSIXct POSIXt

class(hourly_calories$ActivityHour)
## [1] "POSIXct" "POSIXt"
class(hourly_intensities$ActivityHour)
## [1] "POSIXct" "POSIXt"
class(hourly_steps$ActivityHour)
## [1] "POSIXct" "POSIXt"
class(daily_sleep$SleepDay)
## [1] "POSIXct" "POSIXt"
class(daily_activity$ActivityDate)
## [1] "POSIXct" "POSIXt"

2.7 Merging Tables

#merging hourly data frames (steps,calories,intensities)----
gc()
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1237489 66.1    1967202 105.1  1967202 105.1
## Vcells 2358604 18.0    8388608  64.0  5118317  39.1
hourlies_df <- hourly_steps %>% 
  left_join(hourly_calories, by = c("Id", "ActivityHour")) %>% 
  left_join(hourly_intensities, by = c("Id", "ActivityHour")) %>% 
  separate(ActivityHour, sep = " ", into = c("date","time")) %>% 
  mutate(day = format(ymd(date), format = '%a')) %>% 
  mutate(time = format(parse_date_time(as.character(time), "HMS"), format = "%H:%M")) %>% 
  mutate(date = as.POSIXct(date))
head(hourlies_df,3)
## # A tibble: 3 x 8
##           Id date                time  StepTotal Calories TotalIntensity
##        <dbl> <dttm>              <chr>     <dbl>    <dbl>          <dbl>
## 1 1503960366 2016-04-12 00:00:00 00:00       373       81             20
## 2 1503960366 2016-04-12 00:00:00 01:00       160       61              8
## 3 1503960366 2016-04-12 00:00:00 02:00       151       59              7
## # ... with 2 more variables: AverageIntensity <dbl>, day <chr>

3 Export new Data Frames to Excel (backup)

library(writexl)
write_xlsx(hourlies_df, "E:\\Websites\\R-Projects\\new_data\\hourlies_df.xlsx")
write_xlsx(daily_activity, "E:\\Websites\\R-Projects\\new_data\\daily_activity.xlsx")
write_xlsx(daily_sleep, "E:\\Websites\\R-Projects\\new_data\\daily_sleep.xlsx")
write_xlsx(hourly_calories, "E:\\Websites\\R-Projects\\new_data\\hourly_calories.xlsx")
write_xlsx(hourly_intensities, "E:\\Websites\\R-Projects\\new_data\\hourly_intensities.xlsx")
write_xlsx(hourly_steps, "E:\\Websites\\R-Projects\\new_data\\hourly_steps.xlsx")

3.1 Find the amount of time the watches are used. Low, MEDIUM, HIGH

daily_use2 <- daily_activity %>%
  filter(TotalSteps >200 ) %>% 
  group_by(Id) %>%
  dplyr::summarize(ActivityDate=sum(n())) %>%
  mutate(Usage = case_when(
    ActivityDate >= 1 & ActivityDate <= 15 ~ "Low Use",
    ActivityDate >= 16 & ActivityDate <= 22 ~ "Moderate Use", 
    ActivityDate >= 23 & ActivityDate <= 31 ~ "High Use")) %>% 
  mutate(Usage = factor(Usage, level = c('Low Use','Moderate Use','High Use'))) %>% 
  rename(days_used = ActivityDate) %>% 
  group_by(Usage)
head(daily_use2,8)
## # A tibble: 8 x 3
## # Groups:   Usage [3]
##           Id days_used Usage       
##        <dbl>     <int> <fct>       
## 1 1503960366        30 High Use    
## 2 1624580081        31 High Use    
## 3 1644430081        30 High Use    
## 4 1844505072        17 Moderate Use
## 5 1927972279        15 Low Use     
## 6 2022484408        31 High Use    
## 7 2026352035        31 High Use    
## 8 2320127002        31 High Use

4 Analizing the Data.

daily_use <- daily_activity %>% 
  left_join(daily_use2, by = 'Id') %>%
  group_by(Usage) %>% 
  summarise(participants = n_distinct(Id)) %>% 
  mutate(perc = participants/sum(participants)) %>% 
  arrange(perc) %>% 
  mutate(perc = scales::percent(perc))

ggplot(daily_use,aes(fill=Usage ,y = participants, x="")) +
  geom_bar(stat="identity", width=2, color="white") +
  coord_polar("y", start=0)+
  scale_fill_brewer(palette='Set1')+
  theme_void()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5,vjust= -5, size = 20, face = "bold")) +
  geom_text(aes(label = perc, x=1.2),position = position_stack(vjust = 0.5))+
  labs(title="Tracker use Percentage", tag = "figure 1")+
  guides(fill = guide_legend(title = "Usage Amount"))

  options(repr.plot.width = 2, repr.plot.height = 1)

4.1 For the high use group, they were used the most, and the low use group, used the least as expected.

4.2 Creating step data

#data manipulation to add Usage Types to 'daily_activity' df
daily_activity_usage <- daily_activity %>% 
  left_join(daily_use2, by = 'Id') %>% 
  mutate(day = format(ymd(ActivityDate), format = '%a')) %>% 
  mutate(total_minutes_worn = SedentaryMinutes+LightlyActiveMinutes+
           FairlyActiveMinutes+VeryActiveMinutes) %>% 
  mutate(total_hours = seconds_to_period(total_minutes_worn * 60))
head(daily_activity_usage,6)
## # A tibble: 6 x 21
##           Id ActivityDate        TotalSteps TotalDistance TrackerDistance
##        <dbl> <dttm>                   <dbl>         <dbl>           <dbl>
## 1 1503960366 2016-04-12 00:00:00      13162          8.5             8.5 
## 2 1503960366 2016-04-13 00:00:00      10735          6.97            6.97
## 3 1503960366 2016-04-14 00:00:00      10460          6.74            6.74
## 4 1503960366 2016-04-15 00:00:00       9762          6.28            6.28
## 5 1503960366 2016-04-16 00:00:00      12669          8.16            8.16
## 6 1503960366 2016-04-17 00:00:00       9705          6.48            6.48
## # ... with 16 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## #   Day <chr>, days_used <int>, Usage <fct>, day <chr>,
## #   total_minutes_worn <dbl>, total_hours <Period>
#data for steps 
steps_hour <- daily_activity_usage %>% 
  group_by(day) %>%   
  summarise(mean_steps = round(mean(TotalSteps))) %>%
 mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
head(steps_hour)
## # A tibble: 6 x 2
##   day   mean_steps
##   <fct>      <dbl>
## 1 Fri         7448
## 2 Mon         7781
## 3 Sat         8153
## 4 Sun         6933
## 5 Thu         7406
## 6 Tue         8125
### * plot for avg steps by day 
ggplot(steps_hour, aes(x = day, y= mean_steps, fill = mean_steps)) +
    geom_col(color="darkblue", size = 0.1) +  
 scale_fill_gradientn(limits=c(0,9000), breaks=seq(0,9000, by = 1500), colors = brewer.pal(11,"Spectral")) + 
  scale_y_continuous(limits=c(0,9000), breaks=seq(0, 9000, by = 1500))+ 
labs(title= ("Average Steps"), tag = "figure 2", subtitle = ('By Day'), x="" , y="Steps")+
theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(legend.position = "top")+
    theme(legend.title=element_text(size=12))+
    theme(legend.text=element_text(size=8))+
guides(fill = guide_colourbar(barwidth = 12))

options(repr.plot.width = 10, repr.plot.height = 8)

5 Observations 01:

  • The highest step days are Saturday, followed by Tuesday and Monday. Trailing off over the rest of the weekdays.
  • Unsurprisingly Sunday is the lowest, a rest day.
### Average steps by Group

stepsbygroup <- daily_activity_usage %>% 
group_by(day,Usage) %>%
dplyr::select(Usage, TotalSteps, day) %>%
    mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
head(stepsbygroup,8)
## # A tibble: 8 x 3
## # Groups:   day, Usage [7]
##   Usage    TotalSteps day  
##   <fct>         <dbl> <fct>
## 1 High Use      13162 Tue  
## 2 High Use      10735 Wed  
## 3 High Use      10460 Thu  
## 4 High Use       9762 Fri  
## 5 High Use      12669 Sat  
## 6 High Use       9705 Sun  
## 7 High Use      13019 Mon  
## 8 High Use      15506 Tue
stepsbygroup %>%
  ggplot(aes(x= Usage, y= TotalSteps, fill= Usage)) +
    geom_boxplot() +
    scale_y_continuous(limits=c(0,38000), breaks=seq(0,38000, by = 4000))+
    theme(legend.position="none",
      plot.title = element_text(size=11)    ) +
    ggtitle("A boxplot with jitter") +
    xlab("") +
    labs(title= ("Average Steps by Group"), tag = "figure 3",x="average per week" , y="Steps")+
theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())+
    theme(legend.position = "top")+
    theme(legend.title=element_text(size=12))+
    theme(legend.text=element_text(size=8))+
facet_grid(~Usage)

options(repr.plot.width = 5, repr.plot.height = 6)

6 Average Steps -(week) by Group

caloriesbp <- daily_activity_usage %>% 
group_by(day,Usage) %>%
dplyr::select(Usage, Calories, day) %>%
    mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
head(caloriesbp)
## # A tibble: 6 x 3
## # Groups:   day, Usage [6]
##   Usage    Calories day  
##   <fct>       <dbl> <fct>
## 1 High Use     1985 Tue  
## 2 High Use     1797 Wed  
## 3 High Use     1776 Thu  
## 4 High Use     1745 Fri  
## 5 High Use     1863 Sat  
## 6 High Use     1728 Sun
  caloriesbp %>%
  ggplot(aes(x=day , y= Calories, fill= Usage)) +
   geom_boxplot() +
   scale_y_continuous(limits=c(0,3000), breaks=seq(0,3000, by = 400))+
   theme(legend.position ="element_text(angle = 90))", 
   plot.title = element_text(size=11)) +
  
    ggtitle("A boxplot with jitter") +
    xlab("") +
    labs(title= ("Average Calories"), tag = "figure 4", subtitle = ('By Usage Group'), x="Day" , y="Calories")+
theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(angle = 90, size=12, hjust= 0, vjust = 0.3))+
    theme(axis.title.x = element_text(margin = margin(t = 18, r = 0, b = 0, l = 3)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+   
    theme(legend.title=element_text(size=8))+
    theme(legend.text=element_text(size=8))+
facet_grid(~Usage)
## Warning: Removed 153 rows containing non-finite values (stat_boxplot).

# Observation 02 - Average Calories Per Group ### Observations 02:

  • Average calories burned for the low use group was around 2100.
  • Tracking calories, the High use group averaged the most calories, which would be expected.
  • The moderate group were the least consistent, which brought their average down below the low use group due to
  •  some individuals being unmotivated on some days.  Maybe some type of motivations alert may be helpful for them.
  • The Day of the week didn’t show any particular pattern between the groups.
  • The low use group exercised the most on Tuesday, the medium use group was on Saturday, and Tuesday was the most active day
  •  in the high use group.
  • The high use group at least doubled the use of the low use group.
  • The thing the sticks out is that the Moderate group had the most inconsistent users in it’s group, that brought down the rest of the group totals.
    Again, maybe an inactive alert can be used to try and motivate the inconsistent users.
###   Merging hourly data frames (steps,calories,intensities)
hourlies_df <- hourly_steps %>% 
  left_join(hourly_calories, by = c("Id", "ActivityHour")) %>% 
  left_join(hourly_intensities, by = c("Id", "ActivityHour")) %>% 
  separate(ActivityHour, sep = " ", into = c("date","time")) %>% 
  mutate(day = format(ymd(date), format = '%a')) %>% 
  mutate(time = format(parse_date_time(as.character(time), "HMS"), format = "%H:%M")) %>% 
  mutate(date = as.POSIXct(date))
head(hourlies_df,3)
## # A tibble: 3 x 8
##           Id date                time  StepTotal Calories TotalIntensity
##        <dbl> <dttm>              <chr>     <dbl>    <dbl>          <dbl>
## 1 1503960366 2016-04-12 00:00:00 00:00       373       81             20
## 2 1503960366 2016-04-12 00:00:00 01:00       160       61              8
## 3 1503960366 2016-04-12 00:00:00 02:00       151       59              7
## # ... with 2 more variables: AverageIntensity <dbl>, day <chr>
stephr <-  left_join(hourlies_df,daily_use2, by = 'Id','Usage')
  
  stephr%>%
  mutate(day = factor(day,level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun'))) %>% 
  group_by(Usage, time, day) %>% 
  summarize(steps = round(mean(StepTotal),2))
## `summarise()` has grouped output by 'Usage', 'time'. You can override using the `.groups` argument.
## # A tibble: 504 x 4
## # Groups:   Usage, time [72]
##    Usage   time  day   steps
##    <fct>   <chr> <fct> <dbl>
##  1 Low Use 00:00 Mon    0   
##  2 Low Use 00:00 Tue    1.64
##  3 Low Use 00:00 Wed    0.36
##  4 Low Use 00:00 Thu    5.64
##  5 Low Use 00:00 Fri    6   
##  6 Low Use 00:00 Sat    0   
##  7 Low Use 00:00 Sun    9.88
##  8 Low Use 01:00 Mon    0   
##  9 Low Use 01:00 Tue    0.64
## 10 Low Use 01:00 Wed   33   
## # ... with 494 more rows
## --  Make time as hour with only 2 characters in new hour column  --
stephr$hour =substr(stephr$time,1,2)
stephr$hour <- as.numeric(stephr$hour)
sapply(stephr, class)
## $Id
## [1] "numeric"
## 
## $date
## [1] "POSIXct" "POSIXt" 
## 
## $time
## [1] "character"
## 
## $StepTotal
## [1] "numeric"
## 
## $Calories
## [1] "numeric"
## 
## $TotalIntensity
## [1] "numeric"
## 
## $AverageIntensity
## [1] "numeric"
## 
## $day
## [1] "character"
## 
## $days_used
## [1] "integer"
## 
## $Usage
## [1] "factor"
## 
## $hour
## [1] "numeric"
###   Grouped
stephr %>%
ggplot(aes(x = day, y = StepTotal, fill = Usage))+
       geom_bar(stat = "identity", position = "dodge")+
      labs(title ="",tag = "figure 5")+
  ggtitle("The Relationship between Usage, Steps and Day of the week.")

7 OBSERVATION 03:

7.0.1 Relationship between Usage, Steps and Day of the week.

  • There is not significant difference between the day of the week in any of the user groups.
  • The High use group has more then double the low usage group.
  • That seems to be an anomaly more than anything else, as it was only by a small amount.
  • The moderate group varied the most from around 3750 steps to around 7600 steps. ###
###    Lollipop Workout Chart

hrbrthemes::import_roboto_condensed()
## You will likely need to install these fonts on your system as well.
## 
## You can find them in [C:/R_library/hrbrthemes/fonts/roboto-condensed]
intensity1 <- hourlies_df %>% 
    filter(TotalIntensity > 0) %>%
  group_by(day) %>%   
  summarise(mean_intensity = round(mean(TotalIntensity)),
            std_mean_intensity = round(sd(TotalIntensity))) %>%
 mutate(day = factor(day, level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun')))
head(intensity1)
## # A tibble: 6 x 3
##   day   mean_intensity std_mean_intensity
##   <fct>          <dbl>              <dbl>
## 1 Fri               20                 23
## 2 Mon               21                 26
## 3 Sat               22                 25
## 4 Sun               21                 25
## 5 Thu               20                 23
## 6 Tue               20                 24
ggplot(intensity1,aes(x=day, y=mean_intensity, fill = mean_intensity, group = 1)) +
  geom_segment( aes(x=day, xend=day,y=0, yend=mean_intensity)) +
  
  scale_y_continuous(limits=c(0,25), breaks=seq(0,26, by = 2))+
  
  geom_point( size=5, color="red", fill=alpha("orange", 0.3), alpha=0.7, shape=21, stroke=2)+
  labs(title ="",tag = "figure 6")+
xlab("Days of the week")+
ylab("Workout Intensity")+
ggtitle("Work out intensity")

# Workout Intensity ### Grouped workout chart

### Join database hourlies_df and Daily_use2 to make group_intensity

group_intensity <- hourlies_df %>%
  left_join(daily_use2, by = 'Id') %>% 
  mutate(day = factor(day,level = c('Mon', 'Tue', 'Wed','Thu', 'Fri', 'Sat', 'Sun'))) %>% 
  filter(TotalIntensity > 0) %>%
  group_by(Usage, day) %>%
  summarise(intensity = round(mean(TotalIntensity, .groups = 'keep'),1))
## `summarise()` has grouped output by 'Usage'. You can override using the `.groups` argument.
ggplot(group_intensity, aes(x = day, y= intensity, fill = intensity)) +
    geom_col(color="pink", size = 0.1)+
    scale_fill_gradientn(limits=c(0,25), breaks=seq(0,25, by = 5), colours = brewer.pal(5, "YlOrRd")) + 
    scale_y_continuous(limits=c(0,25), breaks=seq(0,25, by = 5))+
    labs(title= ("Average Intensity"),tag = "Figure 7", subtitle = ('By Days, Groups'), x="Days" , y="Intensity")+
    theme(plot.title=element_text(size = 16,hjust = 0))+
    theme(plot.subtitle=element_text(size = 14,hjust = 0))+
    theme(axis.text.y=element_text(size=14)) +
    theme(axis.text.x=element_text(size=14,hjust= 0.5))+
    theme(axis.title.x = element_text(margin = margin(t = 14, r = 0, b = 0, l = 0)))+
    theme(axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
    theme(legend.position = "top")+
    theme(legend.title=element_text(size=12))+
    theme(legend.text=element_text(size=8))+
guides(fill = guide_colourbar(barwidth = 12))+
  coord_flip() +
facet_grid(~Usage)

### Saturdays seem to be intense for all groups, with Monday being the next insense workout in all groups. ### The high intensity group were pretty consistent over all. Nothing that unusualy with these plots, as they ### showed what would be expected for each group segment.

7.0.2 Another Plot for Minutes Worn

# Time worn

## Warning: Removed 1 rows containing missing values (position_stack).

## In this graph, we can see the users’ average minutes of sleep vs average distance traveled. A majority of these users sleep between 320 to 530 minutes (5.33 to 8.83 hours)

7.1 Sleep vs distance covered

8 Analysis and Suggestions:

8.0.1 * There is a relationship that shows users who do over 7.5 miles, sleep less than users who do less distance.

8.0.2 * The sweet spot seems to be around 6-7 miles to obtain enough exerciser before it starts to affect sleep.

9 This John Hopkins sleep study suggests adults between 18-64 years of Age need between 7-9 hours of sleep per day.

John Hopkins Sleep Study

Final Analysis

This study is limited to a small set of user (30), as such, it would not be an ideal situation for a case study. In any case, there should be enough data to show small patterns that we may be able to show a co-relationship that we can make educated decisions to help the stakeholders of BellaBeat.

From this data, we can see that there may be an opportunity for the marketing team to increase sales by making some small changes to the devies and marketing. The last chart (figure 9) shows a health benefit from getting between 7-9 hours of sleep. It also shows that any increase of over 8 miles in steps has an inverse relationship to sleep.

With that data combined, with the other data, and most notably the amount of time the users wear this type of device (Highest with the moderate users)(Figure 8), it seems there would be an opportunity to market towards a moderate user, instead of a hard core athlete.

Moderate users seem to wear the devices longer, do not tend to be the “cross trainer” type of athlete, but someone who would want such a device to improve a moderate or sedentary life style.

There would appear to be a marketing opportunity to market towards improving overall health, improve sleep, and to monitor daily activities for a gradual overall improvement in health.

Maybe add software to alert the user when they would become sedentary, or even make a scoring system where a user can compete daily with themselves.

The data seems to suggest the person interested in wearing such a device for a long period of time (figure 8) would be a more “middle of the road” type of person looking to make small changes.

It sounds like an opportunity that could be rewarding to this type of user that seems to have been overlooked in the marketplace of activity trackers.

```

More about this website.