***** Current status

Currently the pipeline runs all the way through, but database.json ends up
with every members[] list empty. Most entries are skipped with "Suspect
title"; some fail with ".pageText not found".

It currently only works on Linux; OS X (or anything else) will need minor
path changes.

You will need a reasonably modern node.js installed: 0.5.9 is too old,
0.8.8 is not.

I needed to add my own DumpRenderTree_resources/missingImage.gif, for some
reason.

For the reasons above, we're currently just using the checked-in
database.json from Feb 2012, but it has some bogus entries. In particular,
the one for UnknownElement would inject irrelevant German text into our
docs, so a hack in apidoc.dart (_mdnTypeNamesToSkip) works around this.

***** Overview

Here's a rough walkthrough of how this works. The ultimate output file is
database.filtered.json. full_run.sh executes all of the scripts in the
correct order. (Illustrative sketches of a few of these steps appear in the
appendix at the end.)

search.js
- read data/domTypes.json
- for each dom type:
  - search for the type's page via www.googleapis.com
  - write search results to output/search/<type>.json
    (a list of search results and urls to pages)

crawl.js
- read data/domTypes.json
- for each dom type:
  - for each result in output/search/<type>.json:
    - try to scrape that cached MDN page from
      webcache.googleusercontent.com
    - write the MDN page to output/crawl/<type><index of result>.html
- write output/crawl/cache.json
  (it maps types -> search result page urls and titles)

extract.sh
- compile extract.dart to js
- run extractRunner.js:
  - read data/domTypes.json
  - read output/crawl/cache.json
  - read data/dartIdl.json
  - for each scraped search result page:
    - create a cleaned-up html page in output/extract/<type><index>.html
      that contains the scraped content plus a script tag that includes
      extract.dart.js
    - create an args file in output/extract/<type><index>.html.json with
      some data on how that file should be processed
    - invoke DumpRenderTree on that file
    - when that returns, parse the console output and add it to
      database.json
    - add any errors to output/errors.json
  - save output/database.json

extract.dart
- xhr output/extract/<type><index>.html.json
- all sorts of shenanigans to actually pull the content out of the html
- build a JSON object with the results
- do a postMessage with that object so extractRunner.js can pull it out

postProcess.dart (run last)
- go through the results for each type looking for the best match
- write output/database.html
- write output/examples.html
- write output/obsolete.html
- write output/database.filtered.json, which contains the best matches

***** Process for updating database.json using these scripts

TODO(eub): write this up when I get the scripts to work all the way
through.
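***** Appendix: illustrative sketches

The sketches below are not the real scripts. They are minimal Dart
illustrations of the moving parts described in the walkthrough; any file
layout, field name, or helper that doesn't appear above is an assumption.

The per-page loop that extractRunner.js drives could look roughly like
this. The real script is node.js; this Dart transliteration only shows
the args-file + DumpRenderTree handshake:

  // Illustrative only: the real extractRunner.js is a node script.
  import 'dart:convert';
  import 'dart:io';

  Future<void> processPage(String type, int index) async {
    final page = 'output/extract/$type$index.html';
    // Args file telling extract.dart how this page should be processed;
    // the {'type': type} payload is an assumed placeholder.
    File('$page.json').writeAsStringSync(jsonEncode({'type': type}));
    // DumpRenderTree loads the page, which runs extract.dart.js; the
    // extraction result comes back in the console output.
    final result = await Process.run('DumpRenderTree', [page]);
    stdout.write(result.stdout); // parsed and merged into database.json
  }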
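On the extract.dart side, the handoff back to extractRunner.js is a
postMessage of the serialized results. A minimal sketch, assuming a
result map with placeholder fields:

  import 'dart:convert';
  import 'dart:html';

  void reportResult(Map<String, dynamic> result) {
    // The harness captures this message and folds it into database.json.
    window.postMessage(jsonEncode(result), '*');
  }

  void main() {
    reportResult({
      'type': 'Element',     // the DOM type this page documents
      'members': <Object>[], // extracted member docs (empty on failure)
      'errors': <String>[],  // extraction problems, if any
    });
  }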
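postProcess.dart's "best match" pass, conceptually: score each scraped
candidate for a type and keep the winner. The entry fields and the scoring
heuristic below are guesses for illustration, not the script's actual
logic:

  import 'dart:convert';
  import 'dart:io';

  Map<String, dynamic>? bestMatch(List<dynamic> candidates, String type) {
    Map<String, dynamic>? best;
    var bestScore = -1;
    for (final c in candidates.cast<Map<String, dynamic>>()) {
      var score = 0;
      // Prefer pages whose title mentions the type (pages that don't are
      // the "Suspect title" skips mentioned in the status section).
      if ((c['title'] as String? ?? '').contains(type)) score += 2;
      // Prefer pages that yielded member documentation.
      score += (c['members'] as List? ?? const []).length;
      if (score > bestScore) {
        bestScore = score;
        best = c;
      }
    }
    return best;
  }

  void main() {
    final db = jsonDecode(File('output/database.json').readAsStringSync())
        as Map<String, dynamic>;
    final filtered = <String, dynamic>{};
    db.forEach((type, candidates) {
      final match = bestMatch(candidates as List, type);
      if (match != null) filtered[type] = match;
    });
    File('output/database.filtered.json')
        .writeAsStringSync(jsonEncode(filtered));
  }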
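And the apidoc.dart workaround from the status section boils down to a
skip list consulted before MDN content is merged into the generated docs.
The set name and the UnknownElement entry come from this README; the
surrounding code is illustrative:

  // Types whose database.json entries are bogus and must not be used.
  const _mdnTypeNamesToSkip = {
    'UnknownElement', // its entry would inject irrelevant German text
  };

  bool useMdnContentFor(String typeName) =>
      !_mdnTypeNamesToSkip.contains(typeName);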