# Ideas/Questions for future code

> Brainstorming here :P
* Main database for crawled content
* YouTube is mostly non-relational (except channels ↔ videos)
* Users can change videos (title, etc.),
  so support for multiple crawls is needed
  (both storage options are sketched after this list)
  * In one document as an array?
    Like `{videoid: xxx, crawls: []…`
    * Pros: Easy history queries
    * Cons: (Title) indices might be harder to maintain
  * Or as separate documents?
    Like `{videoid: xxx, crawldate: …`
    * Pros: Race conditions are less likely
    * Cons: Duplicates more likely?
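
A minimal sketch of the two document shapes as Python dicts; field names other than `videoid`/`crawldate` (title, views, the concrete dates) are illustrative assumptions, not a fixed schema:

```python
# Option 1: one document per video, crawl history embedded as an array
video_with_crawls = {
    "videoid": "xxx",
    "crawls": [
        {"crawldate": "2020-01-01", "title": "Old title", "views": 100},
        {"crawldate": "2020-02-01", "title": "New title", "views": 250},
    ],
}

# Option 2: one document per crawl; history is reassembled by querying
# every document that shares the same videoid
single_crawl_document = {
    "videoid": "xxx",
    "crawldate": "2020-02-01",
    "title": "New title",
    "views": 250,
}
```
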
* Avoiding duplicates?
  * If the user hasn't changed video metadata,
    crawling it again is a waste of disk space
* Rescan score: Should a video be rescanned?
  (one possible heuristic is sketched after this list)
  * Viral videos should be crawled more often
  * New videos shouldn't be instantly crawled again
  * Very old videos are unlikely to change
  * Maybe focus on views per week
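
One way a rescan score could combine those points; the exact weighting below (views per week × staleness, damped for very old videos) is an assumption, not a settled formula:

```python
import math
from datetime import datetime, timezone

def rescan_score(views: int, published: datetime, last_crawl: datetime) -> float:
    """Higher score = rescan sooner. Expects timezone-aware datetimes."""
    now = datetime.now(timezone.utc)
    age_weeks = max((now - published).days / 7, 1)
    views_per_week = views / age_weeks                 # viral videos score higher
    staleness_days = (now - last_crawl).days           # freshly crawled videos score lower
    old_video_damping = 1 / math.log2(age_weeks + 2)   # very old videos decay slowly
    return views_per_week * staleness_days * old_video_damping
```
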
* Hashing data from crawls to detect changes?
  (sketched below)
  * Invalidates old data on API upgrade
  * Could be used as an index though
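
A sketch of change detection via hashing, assuming the crawl is a plain dict holding only the fields worth comparing; canonical JSON + SHA-256 is just one possible choice:

```python
import hashlib
import json

def crawl_hash(crawl: dict) -> str:
    """Hash the crawled fields so an unchanged video can be skipped.

    Any change to this canonical form, e.g. an API upgrade adding fields,
    changes every hash and invalidates the old ones.
    """
    canonical = json.dumps(crawl, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Store the latest hash with the video; if a new crawl produces the same
# hash, don't write a duplicate document.
```
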
* Fast-changing data
  (a possible split is sketched after this list)
  * like views/comments/subscribers per day
  * vs. more persistent data: title/description/video formats
  * Are they worth crawling?
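
A rough sketch of splitting one crawl into persistent metadata and a small time-series point, so the fast-changing counters can be kept (or dropped) independently; all field names are assumptions:

```python
from datetime import datetime, timezone

crawl = {
    "videoid": "xxx",
    "title": "Some title",
    "description": "...",
    "formats": ["1080p", "720p"],
    "views": 1234,
    "comments": 56,
}

# Persistent part: changes rarely, worth indexing
metadata = {k: crawl[k] for k in ("videoid", "title", "description", "formats")}

# Fast-changing part: one small point per crawl date
stats_point = {
    "videoid": crawl["videoid"],
    "date": datetime.now(timezone.utc).isoformat(),
    "views": crawl["views"],
    "comments": crawl["comments"],
}
```
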
* Additional data, like subtitles and annotations
  * Needs separate crawls
  * Not as important as the main data
* Discover bots
  * Find and push new video IDs to the queue
    (a minimal push sketch follows below)
  * Monitor channels for new content
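
A minimal push sketch using redis-py; the key names `seen_ids` and `crawl_queue` are hypothetical:

```python
import redis  # redis-py client; a running Redis instance is assumed

r = redis.Redis()

def push_discovered(video_ids: list[str]) -> None:
    """Queue newly discovered video IDs exactly once."""
    for vid in video_ids:
        # SADD returns 1 only when the ID was not in the "seen" set yet,
        # so already-known IDs are never queued a second time.
        if r.sadd("seen_ids", vid):
            r.rpush("crawl_queue", vid)
```
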
* Maintainer bots
  * Occasionally look at the database and
    push backups/freezes to drive
  * Decide which old video IDs to re-add to the queue
    (e.g. using the rescan score above)
* Worker bots
  * Get jobs from the Redis queue and crawl YT
  * Remove processed entries from the queue
    (see the queue-state sketch further below)
* A Redis queue lists video IDs that have been discovered, but not crawled
  * Discover bots push IDs if they find new ones
  * Implement queue priority?
  * Maintainer bots push IDs if they likely need rescans
* States of queued items
  (a claim/requeue sketch follows this list)
  1. _Queued:_ Processing required
     (no worker bot has picked it up yet)
  2. _Assigned:_ A worker claimed the ID and is processing it.
     If the worker doesn't mark the ID as done in time,
     it gets tagged as _Queued_ again
     (while assigned it should be hidden from other workers)
  3. _Done:_ Worker submitted the crawl to the database
     (can be deleted from the queue)
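
A sketch of those three states on top of Redis, assuming redis-py, hypothetical key names, and a 5-minute claim timeout: the _Queued_ list plus an _Assigned_ sorted set scored by deadline gives the requeue-on-timeout behaviour.

```python
import time
import redis  # redis-py; key names and the timeout are illustrative

r = redis.Redis()
CLAIM_SECONDS = 300  # how long a worker may hold an ID before requeueing

def claim_id():
    """Queued -> Assigned: take one ID off the list and record a deadline.

    (Not atomic as written; a real worker might wrap this in a Lua script
    or MULTI/EXEC so an ID isn't lost if it crashes between the two calls.)
    """
    vid = r.lpop("crawl_queue")
    if vid is None:
        return None
    # While in this sorted set the ID is hidden from other workers.
    r.zadd("assigned", {vid: time.time() + CLAIM_SECONDS})
    return vid.decode()

def mark_done(video_id: str) -> None:
    """Assigned -> Done: drop the ID once the crawl is in the database."""
    r.zrem("assigned", video_id)

def requeue_expired() -> None:
    """Assigned -> Queued: put back IDs whose workers missed the deadline."""
    for vid in r.zrangebyscore("assigned", 0, time.time()):
        if r.zrem("assigned", vid):  # only one caller wins the race
            r.rpush("crawl_queue", vid)
```
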
* Single point of failure
  * Potentially needs a ton of RAM:
    800 M IDs at 100 bytes per entry = 80 GB
  * Shuts down the entire crawl system on failure
  * Persistence: a crash can lose all discovered IDs
* Alternative implementations