# Ideas/Questions for future code

> Brainstorming here :P
* Main database for crawled content
* YouTube is mostly non-relational (except channels ↔ videos)
* Users can change videos (title, etc.),
  so support for multiple crawls is needed
  (both storage options are sketched after this list)
  * In one document as an array?
    Like `{videoid: xxx, crawls: []…`
    * Pros: Easy history queries
    * Cons: (Title) indices might be harder to maintain
  * Or as separate documents?
    Like `{videoid: xxx, crawldate: …`
    * Pros: Race conditions are less likely
    * Cons: Duplicates more likely?
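
A minimal sketch of the two document shapes as Python dicts; field names other than `videoid`/`crawldate` (title, views, the concrete dates) are illustrative assumptions, not a fixed schema:

```python
# Option 1: one document per video, crawl history embedded as an array
video_with_crawls = {
    "videoid": "xxx",
    "crawls": [
        {"crawldate": "2020-01-01", "title": "Old title", "views": 100},
        {"crawldate": "2020-02-01", "title": "New title", "views": 250},
    ],
}

# Option 2: one document per crawl; history is reassembled by querying
# every document that shares the same videoid
single_crawl_document = {
    "videoid": "xxx",
    "crawldate": "2020-02-01",
    "title": "New title",
    "views": 250,
}
```
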
* Avoiding duplicates?
  * If the user hasn't changed video metadata,
    crawling it again is a waste of disk space
* Rescan score: Should a video be rescanned?
  (one possible heuristic is sketched after this list)
  * Viral videos should be crawled more often
  * New videos shouldn't be instantly crawled again
  * Very old videos are unlikely to change
  * Maybe focus on views per week
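
One way a rescan score could combine those points; the exact weighting below (views per week × staleness, damped for very old videos) is an assumption, not a settled formula:

```python
import math
from datetime import datetime, timezone

def rescan_score(views: int, published: datetime, last_crawl: datetime) -> float:
    """Higher score = rescan sooner. Expects timezone-aware datetimes."""
    now = datetime.now(timezone.utc)
    age_weeks = max((now - published).days / 7, 1)
    views_per_week = views / age_weeks                 # viral videos score higher
    staleness_days = (now - last_crawl).days           # freshly crawled videos score lower
    old_video_damping = 1 / math.log2(age_weeks + 2)   # very old videos decay slowly
    return views_per_week * staleness_days * old_video_damping
```
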
* Hashing data from crawls to detect changes?
  (sketched below)
  * Invalidates old data on API upgrade
  * Could be used as an index though
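
A sketch of change detection via hashing, assuming the crawl is a plain dict holding only the fields worth comparing; canonical JSON + SHA-256 is just one possible choice:

```python
import hashlib
import json

def crawl_hash(crawl: dict) -> str:
    """Hash the crawled fields so an unchanged video can be skipped.

    Any change to this canonical form, e.g. an API upgrade adding fields,
    changes every hash and invalidates the old ones.
    """
    canonical = json.dumps(crawl, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Store the latest hash with the video; if a new crawl produces the same
# hash, don't write a duplicate document.
```
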
* Fast-changing data
  (a possible split is sketched after this list)
  * like views/comments/subscribers per day
  * vs. more persistent data: title/description/video formats
  * Are they worth crawling?
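
A rough sketch of splitting one crawl into persistent metadata and a small time-series point, so the fast-changing counters can be kept (or dropped) independently; all field names are assumptions:

```python
from datetime import datetime, timezone

crawl = {
    "videoid": "xxx",
    "title": "Some title",
    "description": "...",
    "formats": ["1080p", "720p"],
    "views": 1234,
    "comments": 56,
}

# Persistent part: changes rarely, worth indexing
metadata = {k: crawl[k] for k in ("videoid", "title", "description", "formats")}

# Fast-changing part: one small point per crawl date
stats_point = {
    "videoid": crawl["videoid"],
    "date": datetime.now(timezone.utc).isoformat(),
    "views": crawl["views"],
    "comments": crawl["comments"],
}
```
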
* Additional data, like subtitles and annotations
  * Needs separate crawls
  * Not as important as the main data
* Discover bots
  * Find and push new video IDs to the queue
    (a minimal push sketch follows below)
  * Monitor channels for new content
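
A minimal push sketch using redis-py; the key names `seen_ids` and `crawl_queue` are hypothetical:

```python
import redis  # redis-py client; a running Redis instance is assumed

r = redis.Redis()

def push_discovered(video_ids: list[str]) -> None:
    """Queue newly discovered video IDs exactly once."""
    for vid in video_ids:
        # SADD returns 1 only when the ID was not in the "seen" set yet,
        # so already-known IDs are never queued a second time.
        if r.sadd("seen_ids", vid):
            r.rpush("crawl_queue", vid)
```
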
* Maintainer bots
  * Occasionally look at the database and
    push backups/freezes to drive
  * Decide which old video IDs to re-add to the queue
    (e.g. using the rescan score above)
* Worker bots
  * Get jobs from the Redis queue and crawl YT
  * Remove processed entries from the queue
    (see the queue-state sketch further below)
* A Redis queue lists video IDs that have been discovered, but not crawled
  * Discover bots push IDs if they find new ones
  * Implement queue priority?
  * Maintainer bots push IDs if they likely need rescans
* States of queued items
  (a claim/requeue sketch follows this list)
  1. _Queued:_ Processing required
     (no worker bot has picked it up yet)
  2. _Assigned:_ A worker claimed the ID and is processing it.
     If the worker doesn't mark the ID as done in time,
     it gets tagged as _Queued_ again
     (while assigned it should be hidden from other workers)
  3. _Done:_ Worker submitted the crawl to the database
     (can be deleted from the queue)
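
A sketch of those three states on top of Redis, assuming redis-py, hypothetical key names, and a 5-minute claim timeout: the _Queued_ list plus an _Assigned_ sorted set scored by deadline gives the requeue-on-timeout behaviour.

```python
import time
import redis  # redis-py; key names and the timeout are illustrative

r = redis.Redis()
CLAIM_SECONDS = 300  # how long a worker may hold an ID before requeueing

def claim_id():
    """Queued -> Assigned: take one ID off the list and record a deadline.

    (Not atomic as written; a real worker might wrap this in a Lua script
    or MULTI/EXEC so an ID isn't lost if it crashes between the two calls.)
    """
    vid = r.lpop("crawl_queue")
    if vid is None:
        return None
    # While in this sorted set the ID is hidden from other workers.
    r.zadd("assigned", {vid: time.time() + CLAIM_SECONDS})
    return vid.decode()

def mark_done(video_id: str) -> None:
    """Assigned -> Done: drop the ID once the crawl is in the database."""
    r.zrem("assigned", video_id)

def requeue_expired() -> None:
    """Assigned -> Queued: put back IDs whose workers missed the deadline."""
    for vid in r.zrangebyscore("assigned", 0, time.time()):
        if r.zrem("assigned", vid):  # only one caller wins the race
            r.rpush("crawl_queue", vid)
```
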
* Single point of failure
  * Potentially needs a ton of RAM:
    800 M IDs at 100 bytes per entry = 80 GB
  * Shuts down the entire crawl system on failure
  * Persistence: a crash can lose all discovered IDs
* Alternative implementations