Friday, April 8, 2011

script overview

1) run_daily_timelines.sh
-hive_daily_timelines.sql

2) run_daily_trends.sh
-hive_daily_trends.sql
-outputs daily_trends.txt

3) create dropindex procedure

4) run_daily_load.sh (grabs people/companies)
INPUT: /mnt/pages.txt, /mnt/daily_timelines (for two days, they have 1012222 records)
a. verifies if the new_pages table exists
b. checks the last date from daily_timelines
c. if new_pages staging table exists, load data; if not, backup tables to new_pages
TODO rename "new_*" tables to "staging_*"
d. fetch people & companies data
TODO find the script that actually builds these files
e. create dropindex procedure
TODO should be moved to top of script
f. load "new tables" via load_history.sql
.truncates people/companies tables and loads recent people/company data
.drops/recreates people/companies index
.truncates staging tables (new_pages, new_daily_timelines)
..in dev, those tables have no data anyway???
.here is where the dropindex procedure is..I yanked it out into a separate create_dropindex.sql
.dropindexes on new_pages (6), new_daily_timelines (2)
.disable primary keys on new_pages, new_daily_timelines
.load /mnt/pages -> new_pages, /mnt/timelines -> new_daily_timelines
.enable primary keys on new_pages, new_daily_timelines
.create indexes on new_pages (6), new_daily_timelines (2)
g. load "trends" table via load_trends.sql
.truncates new_daily_trends table
..in dev, zero data anyway
.load /mnt/daily_trends.txt (zero rows..I believe because we don't have enough days; we need 45)
.(Yes, after loading 45 days, we have trends)
.enable keys on new_daily_trends
h. find max date of the trendsdb
i. load featured_pages
.new_featured_pages not working..probably because the file from S3 hasn't been updated
j. alter new_pages
.add featured
.create index, drop index, create index
k. archive to s3
l. archive to s3 by date
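
Steps a-c above (check for the staging table, then either load or restore backups) can be sketched as the branch below. This is an assumed reconstruction, not the actual daily_load.sh: the count of existing staging tables is passed in as an argument so the logic can be exercised without a database, and the message strings echo the ones seen later in the transcript.

```shell
# Hypothetical sketch of the staging-table check in daily_load.sh.
# $1 = number of staging tables found (the real script would query MySQL).
decide_load_path() {
  if [ "$1" -gt 0 ]; then
    echo "staging tables exist, loading data"
  else
    echo "no staging tables, restoring backup tables to new_*"
  fi
}

# In the real script the count would come from information_schema, e.g.:
# mysql -N -e "SELECT COUNT(*) FROM information_schema.tables
#              WHERE table_schema='trendingtopics_development'
#                AND table_name='new_pages'"
decide_load_path 1
```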

After run of daily_load.sh, row counts per table:

Companies: 1448
Daily Timelines: 100
Daily Trends: 100
Featured Pages: 0
Pages: 100
People: 345603
New Daily Timelines: 1012230
New Daily Trends: 2714791
New Featured Pages: 841
New Pages: 506111
Weekly Trends: 100


New Featured Pages, New Daily Timelines, New Daily Trends, and New Pages have data; we must copy them to live prod via the last step:
m. swap table to go live
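
The go-live swap presumably relies on MySQL's multi-table RENAME TABLE, which renames all pairs in one atomic statement, so the site flips to the new data without a window where a table is missing. A minimal sketch of what rename_new_to_live.sql might emit per table (the backup_ prefix is an assumption; only the new_/live names come from these notes):

```shell
# Hypothetical: emit the swap statement rename_new_to_live.sql would run
# for one table pair (backup_ prefix assumed).
swap_sql() {
  echo "RENAME TABLE $1 TO backup_$1, new_$1 TO $1;"
}

for t in pages daily_timelines daily_trends featured_pages; do
  swap_sql "$t"
done
```

Piping the emitted statements into mysql would perform each swap atomically.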


-execute run_daily_load.sh (grabs people/companies) up until s3cmd
linux-z6tw:/home/sfrase/trendingtopics/lib/scripts # ./daily_load.sh
Enter password:
create index page' at line 1

real 3m56.042s
user 0m0.021s
sys 0m0.457s
create index pages_id_index on new_pages (id);
-- Query OK, 2804203 rows affected (9 min 57.82 sec)

-- create index pages_autocomp_trend_index on new_pages (title(64), monthly_trend);
-- Query OK, 2783939 rows affected (6 min 20.95 sec)
-- Records: 2783939 Duplicates: 0 Warnings: 0

-- for main pagination
create index pages_trend_index on new_pages (monthly_trend);
-- Query OK, 2783939 rows affected (1 min 25.65 sec)
-- Records: 2783939 Duplicates: 0 Warnings: 0

-- for sparklines
create index timeline_pageid_index on new_daily_timelines (page_id);
-- Query OK, 2804057 rows affected (22 min 33.80 sec)
-- Records: 2804057 Duplicates: 0 Warnings: 0
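
The dropindex procedure I yanked into create_dropindex.sql might look roughly like this: a stored procedure that consults information_schema.statistics and drops the index only if it exists, so the load script can call it idempotently. This is a reconstruction under those assumptions, not the actual file.

```shell
# Hypothetical reconstruction of create_dropindex.sql: drop an index only if
# information_schema says it exists (the real file may differ).
dropindex_sql() {
  cat <<'EOF'
DROP PROCEDURE IF EXISTS dropindex;
DELIMITER $$
CREATE PROCEDURE dropindex(IN tbl VARCHAR(64), IN idx VARCHAR(64))
BEGIN
  IF EXISTS (SELECT 1 FROM information_schema.statistics
             WHERE table_schema = DATABASE()
               AND table_name = tbl
               AND index_name = idx) THEN
    SET @s = CONCAT('DROP INDEX ', idx, ' ON ', tbl);
    PREPARE stmt FROM @s;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
  END IF;
END$$
DELIMITER ;
EOF
}
dropindex_sql
```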

-rename_backup_to_new.sql (only if there is no new_pages table)
-load_history.sql
-load_trends.sql (loads /mnt/daily_trends.txt and indexes the table)
linux-z6tw:/home/sodo/trendingtopics/lib/sql # mysql -u root -p --local-infile trendingtopics_development < load_trends.sql
Enter password:
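
The load pattern load_trends.sql follows (truncate staging, disable keys, bulk load, re-enable keys) can be sketched as below. The file path and table name are from these notes; the exact column handling in the real file is assumed.

```shell
# Sketch of the bulk-load pattern in load_trends.sql (column handling assumed):
# disabling keys before LOAD DATA and re-enabling after is much faster than
# maintaining the indexes row by row during the load.
load_trends_sql() {
  cat <<'EOF'
TRUNCATE TABLE new_daily_trends;
ALTER TABLE new_daily_trends DISABLE KEYS;
LOAD DATA LOCAL INFILE '/mnt/daily_trends.txt' INTO TABLE new_daily_trends;
ALTER TABLE new_daily_trends ENABLE KEYS;
EOF
}
load_trends_sql
```

Piping this into mysql -u root -p --local-infile trendingtopics_development reproduces the invocation shown above.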

-load_featured_pages.sql (/mnt/featured_pages created by daily_load.sh)
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # ./daily_load.sh
Enter password:
loading featured pages

-archives trendsdb (again, and by date)
-rename_new_to_live.sql (swaps new tables to go live automatically)

linux-z6tw:/home/sodo/trendingtopics/lib/sql # time mysql -u root -p trendingtopics_development --local-infile < /mnt/app/current/lib/sql/load_featured_pages.sql
Enter password:

real 0m1.831s
user 0m0.012s
sys 0m0.011s


Unknown
run_daily_merge.sh

linux-z6tw:/home/sodo/trendingtopics/lib/scripts # ./daily_load.sh
staging tables exist, loading data
loading history tables
loading trends table
Enter password:
loading featured pages
Traceback (most recent call last):
File "/mnt/app/current/lib/scripts/generate_featured_pages.py", line 14, in
from BeautifulSoup import BeautifulSoup
ImportError: No module named BeautifulSoup

References
http://www.crummy.com/software/BeautifulSoup/
Install BeautifulSoup
sudo python setup.py install
running install
running build
running build_py
running install_lib
copying build/lib/BeautifulSoupTests.py -> /usr/local/lib/python2.7/site-packages
copying build/lib/BeautifulSoup.py -> /usr/local/lib/python2.7/site-packages
byte-compiling /usr/local/lib/python2.7/site-packages/BeautifulSoupTests.py to BeautifulSoupTests.pyc
byte-compiling /usr/local/lib/python2.7/site-packages/BeautifulSoup.py to BeautifulSoup.pyc
running install_egg_info
Writing /usr/local/lib/python2.7/site-packages/BeautifulSoup-3.2.0-py2.7.egg-info

Install MySQLdb python
-need python-setuptools
-then installed the whole shebang

sodo@linux-z6tw:~/Downloads/MySQL-python-1.2.3> sudo python setup.py install
running install
Checking .pth file support in /usr/local/lib64/python2.7/site-packages/
/usr/bin/python -E -c pass
TEST PASSED: /usr/local/lib64/python2.7/site-packages/ appears to support .pth files
running bdist_egg
running egg_info
writing MySQL_python.egg-info/PKG-INFO
writing top-level names to MySQL_python.egg-info/top_level.txt
writing dependency_links to MySQL_python.egg-info/dependency_links.txt
...
Installed /usr/local/lib64/python2.7/site-packages/MySQL_python-1.2.3-py2.7-linux-x86_64.egg
Processing dependencies for MySQL-python==1.2.3
Finished processing dependencies for MySQL-python==1.2.3
