library(stringr)
library(readr)
library(tidyverse)
library(lubridate)

I have been curious about what makes an Instagram post interesting, so I am looking at a larger dataset of images tagged with #generativeart. Some of this is just data discovery, but there may be a correlation between the tags a post uses and the number of likes it receives.

# Extract hashtags: a "#" followed by any run of non-whitespace characters
patt <- regex("#\\S+")
genart <- read_csv("~/InstaCrawlR/table-generativeart-2020-10-11 13:35:33.csv")
genart_db <- genart %>%
  select(ID, Likes, Owner, Date, Text) %>%
  mutate(hashtags = str_extract_all(Text, patt))
# One row per (post, hashtag) pair, plus date parts for later grouping
genart_db_table <- genart_db %>%
  unnest(cols = hashtags) %>%
  mutate(Year = year(Date), Month = month(Date), DayOfWeek = wday(Date),
         Day = day(Date), Hour = hour(Date))
genart_db_table %>% head()
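
To make the reshaping step concrete, here is a minimal sketch on a made-up two-post table. The `toy` data and its captions are assumptions for illustration, not rows from the crawl. It shows how `str_extract_all()` builds a list-column of tags and how `unnest()` turns that into one row per hashtag.

```r
library(dplyr)
library(tibble)
library(stringr)
library(tidyr)

# Hypothetical toy data standing in for the real crawl output
toy <- tibble(
  ID    = c(1, 2),
  Likes = c(10, 30),
  Text  = c("new piece #generativeart #codeart", "plotter day #generativeart")
)

toy_tags <- toy %>%
  mutate(hashtags = str_extract_all(Text, "#\\S+")) %>%  # list-column of tags per post
  unnest(cols = hashtags)                                 # one row per (post, hashtag)

toy_tags
```

Each post's `Likes` value is repeated on every one of its hashtag rows, so a post with many tags contributes to the average of each of them.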

That generates a rather large dataset, with one row for every hashtag on every post:

genart_db_table %>% count()

Instead of having too muc

exclude_tags <- c("#fishart", "#artphotohraphy", "#marinephotography", "#underwaterscenes",
                  "#sharkattack", "#акула", "#вебпанк", "#сос", "#seaphotography",
                  "fishart", "#sharks", "#enhancedvitimins")
# Average likes per hashtag, keeping only tags used more than 50 times
genart_db_hashtag_mean <- genart_db_table %>%
  group_by(hashtags) %>%
  summarize(AverageLikes = mean(Likes), Count = n()) %>%
  filter(!hashtags %in% exclude_tags, Count > 50) %>%
  arrange(-AverageLikes)
# Top 30 hashtags by average likes
genart_db_hashtag_mean %>%
  head(30) %>%
  ggplot(aes(reorder(hashtags, AverageLikes), AverageLikes)) +
  geom_col() +
  coord_flip() +
  labs(title = "Hashtags with the Highest Average Likes in Generative Art Posts", x = "Hashtag")

# Average likes against usage count for mid-range hashtags
genart_db_hashtag_mean %>%
  filter(AverageLikes < 200, Count > 200, Count < 40000) %>%
  ggplot(aes(AverageLikes, Count, label = hashtags)) +
  geom_point() +
  geom_text(check_overlap = TRUE, angle = 45, hjust = 0)
  

This leads to some very interesting issues of certain tags that may need to be removed from the set due to their

genart_db_table %>%
  group_by(hashtags) %>%
  summarize(Count = n()) %>%
  arrange(-Count) %>%
  head(30) %>%
  ggplot(aes(reorder(hashtags, Count), Count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Common Hashtags in Generative Art Posts", x = "Hashtag")

So it appears that #generativeart is the most common tag here, which makes sense given that the dataset was collected from posts carrying that tag.

# Number of #generativeart posts by hour of day
genart_db_table %>%
  filter(hashtags %in% c("#generativeart")) %>%
  group_by(hashtags, Hour) %>%
  summarize(Count = n()) %>%
  ggplot(aes(Hour, Count)) +
  geom_col() +
  labs(title = "#generativeart Posts by Hour of Day", x = "Hour")

But we might want to see if there is a difference at a more granular level.

# Same counts, split into one facet row per day of the week
genart_db_table %>%
  filter(hashtags %in% c("#generativeart")) %>%
  group_by(hashtags, Hour, DayOfWeek) %>%
  summarize(Count = n()) %>%
  ggplot(aes(Hour, Count)) +
  geom_col() +
  facet_grid(DayOfWeek ~ .) +
  labs(title = "#generativeart Posts by Hour and Day of Week", x = "Hour")

There is no real difference when looking at this level of detail.
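
The facet rows in the plot above are numbered 1 to 7 because `wday()` returns a number by default, with 1 meaning Sunday unless the `lubridate.week.start` option has been changed. A small sketch of the date helpers used here, with the timestamp from the CSV filename serving purely as an example value:

```r
library(lubridate)

# Example timestamp only (borrowed from the CSV filename, not from a post)
d <- ymd_hms("2020-10-11 13:35:33")

wday(d)                # numeric day of week; 1 = Sunday by default
wday(d, label = TRUE)  # ordered factor of day names, easier to read in facets
hour(d)                # hour of day in the timestamp's time zone (UTC here)
```

Using `DayOfWeek = wday(Date, label = TRUE)` in the `mutate()` call would label the facets by name. That is a suggestion, not something the analysis above does.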

top_tags <- c("#generativeart", "#digitalart", "#creativecoding", "#generative", "#codeart", "#abstractart")
genart_db_table %>%
  filter(hashtags %in% top_tags) %>%
  group_by(hashtags, Hour) %>% summarize(Count = n()) %>%
  mutate(PercentTotal = Count / sum(Count)) %>%  # share of each tag's posts per hour
  # Divide by the number of tags so the stacked bars sum to 1 across all hours
  ggplot(aes(Hour, PercentTotal / length(top_tags), fill = hashtags)) +
  geom_col() +
  labs(title = "Share of Posts by Hour for Selected Hashtags (%)", x = "Hour")
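
The `PercentTotal` step relies on a detail of `summarize()`: it drops only the last grouping variable, so the result is still grouped by `hashtags` and `sum(Count)` is a per-tag total rather than a grand total. Here is a minimal sketch on made-up data (the `toy` table is hypothetical) that makes the grouping explicit with `.groups = "drop_last"`:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per (post, hashtag) pair
toy <- tibble(
  hashtags = c("#a", "#a", "#a", "#b", "#b"),
  Hour     = c(9, 9, 18, 9, 18)
)

shares <- toy %>%
  group_by(hashtags, Hour) %>%
  summarize(Count = n(), .groups = "drop_last") %>%  # still grouped by hashtags
  mutate(PercentTotal = Count / sum(Count)) %>%      # share within each hashtag
  ungroup()

shares
```

Because the shares are computed within each tag, they sum to 1 per tag. This makes the hourly profiles of a very common tag and a rare one directly comparable.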

Now we want to make sure we don’t just take that to mean that generativeart need to

TODO

  • Add a graph of the