Morning Hubski! Welcome to a Thursday edition.
We're still trying to get the configuration settings for our new "sync" functionality (think SQL replication without all the benefits of having it done for you) pinned down tight. Which essentially means that you have to go to each table, column by column, and determine how a 'key' can be derived from the data inside (it's really tricky! Sometimes you forget exactly what users can change and you're left with a bunch of 'dupes' once your keys go bad). Luckily, I've been doing this kind of thing in the database for over eight months now; I've become pretty good at eyeballing what a good key will be. Duplicates are a man-made construct anyway.
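The sanity check I keep running boils down to: pick the columns you think make a key, group by them, and see if anything repeats. Here's a small sketch of that idea using Python's built-in `sqlite3` (the table and candidate key columns are made up for illustration, not from our actual schema):

```python
import sqlite3

# Hypothetical table for the example -- the real schema is assumed, not shown.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (rig_id INT, taken_at TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, "2014-01-01", 9.5), (1, "2014-01-01", 9.5), (2, "2014-01-02", 3.1)],
)

# Test a candidate key: if any (rig_id, taken_at) pair occurs more than once,
# the derived key is no good and sync will start producing dupes.
dupes = conn.execute(
    """SELECT rig_id, taken_at, COUNT(*) AS n
       FROM readings
       GROUP BY rig_id, taken_at
       HAVING COUNT(*) > 1"""
).fetchall()
print(dupes)  # any rows here mean the candidate key is not actually unique
```

If that query comes back empty across the live data, the candidate key is at least plausible; if not, back to eyeballing columns.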
Bugs have been steadily rolling in. We went live with around 20 rigs in the past four months. After doing my little part for the overall sync configuration, I've been tracking this one bug that is spewing orphaned records into a couple of tables at a pretty high rate. As of right now I suspect a third of the table (530,000-some-odd rows) is actually garbage. I plan to exterminate the bad data and correct the module(s) that are causing it. I think it's just a stored procedure with a couple of joins that are not explicit enough.
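The orphan hunt itself is the classic anti-join pattern: LEFT JOIN the child table to its parent and keep the rows where the parent side comes back NULL. A toy version (table and column names are invented for the example, not our real schema):

```python
import sqlite3

# Made-up parent/child tables to demonstrate the anti-join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE children (id INTEGER PRIMARY KEY, parent_id INT)")
conn.execute("INSERT INTO parents VALUES (1)")
conn.executemany("INSERT INTO children VALUES (?, ?)", [(10, 1), (11, 2), (12, 3)])

# LEFT JOIN + IS NULL surfaces child rows whose parent no longer exists.
orphans = conn.execute(
    """SELECT c.id FROM children c
       LEFT JOIN parents p ON p.id = c.parent_id
       WHERE p.id IS NULL"""
).fetchall()
print(orphans)  # [(11,), (12,)]

# Once eyeballed and verified, the same shape drives the cleanup.
# (NOT IN is fine here because parent_id has no NULLs in this toy data.)
conn.execute("DELETE FROM children WHERE parent_id NOT IN (SELECT id FROM parents)")
remaining = conn.execute("SELECT COUNT(*) FROM children").fetchone()[0]
```

At 530K suspect rows I'd run the SELECT first and stare at it for a while before letting the DELETE anywhere near production.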
On the side, I'm working on a database-driven web scraper that I'm hoping will be pretty dynamic and configurable. The overall idea being that you'd configure the scraper to go to websites, and it will try to scrape out whatever you've defined as the domain model for it. If the website changes, no big deal, just go back and re-evaluate your XPaths, redefine the node commands for it, and voilà, it's back up and running. I'm hoping that I can spin up dynamic RESTful services to serve as an ad hoc API for websites that don't have one. It has a long way to go.
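The core of that "domain model + XPaths" idea can be sketched in a few lines: a config mapping field names to XPath expressions, applied to a page. This toy uses stdlib `xml.etree` (which only supports a small XPath subset; a real scraper would probably want lxml for full expressions), and the page and config here are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical page -- in the real project the HTML comes from a fetch,
# and the XPaths live in database rows keyed by site.
PAGE = """<html><body>
  <div class="item"><span class="name">Widget</span><span class="price">4.99</span></div>
  <div class="item"><span class="name">Gadget</span><span class="price">7.50</span></div>
</body></html>"""

# Domain model -> XPath mapping. If the site layout changes, you only
# re-evaluate these expressions, not the scraper code.
CONFIG = {
    "name": ".//span[@class='name']",
    "price": ".//span[@class='price']",
}

def scrape(html, config):
    """Apply each configured XPath to every item node on the page."""
    root = ET.fromstring(html)
    items = []
    for div in root.iterfind(".//div[@class='item']"):
        items.append({field: div.findtext(xpath) for field, xpath in config.items()})
    return items

rows = scrape(PAGE, CONFIG)
```

Wrap `scrape` behind a small HTTP handler and you've got the ad hoc REST facade for a site with no API, with all the per-site knowledge sitting in configuration rows instead of code.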
What are you working on today?