Morton Fox (mortonfox) wrote,

Car was all done by late this morning, so at lunchtime, I went to pick it up and to return the Toyota Yaris. (The Yaris is a fuel-efficient car but it makes me nervous merging out onto Route 17 because I can't seem to step on the gas pedal hard enough to make it go to highway speed quickly.) That's when I got caught by the detour. For the past six weeks, there has been a sign posted on Lake Street saying that the road will be closed soon. They finally closed it today and, from what I saw, it's to dig up the road to put in some large pipes. So it'll be closed for a while.

Anyway, the detour took me back south to Saddle River and north again via Route 17 to Ramsey. If I'd known, I could've saved 3 miles of driving by just taking Allendale Road. On the plus side, and there always seems to be a plus side, I saw a Green Acres park that I hadn't seen before along the detour. It might be worth checking to see if there is a place to hide a geocache in there. Until Lake Street is reopened, however, it will take me three times as long to go from home to Pathmark or anywhere in Ramsey, so I probably should start going to the Park Ridge area again.

I didn't go out anywhere the past two evenings, so I've been crossing out some of the bigger items on my to-do list. (The one with more work on it than work.) First, I finished processing and uploading to Flickr the HDSA Hoop-a-thon and Shea Stadium Relay for Life mascot gig pictures. I'd been saving these until I was done with all my other photos and those took over a month to get through! This is similar to what happened in December, after which it took most of January to get up to date with the photos.

I also rewrote the stats scripts for the 10 million photos Flickr group. One big problem in the old stats script (which is actually the second version of this script) is the skewing that happens when people add photos to the group. The script steps through the group pool one page at a time. When someone adds photos to the group, that pushes all the existing photos in the group up through the pages, so some of them land on a page the script has already read. The script then skips those photos.
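To make the skewing concrete, here is a minimal simulation. It is only a sketch under assumptions that are mine, not taken from the real script: the pool is a plain Python list with the newest photo first, the script walks from the last page (oldest photos) toward page 1, and the names `get_page` and `crawl_with_additions` are made up for illustration.

```python
def get_page(pool, page, per_page):
    """Return one 1-indexed page of the pool (assumed newest photo first)."""
    start = (page - 1) * per_page
    return pool[start:start + per_page]


def crawl_with_additions(pool, per_page, added_midway):
    """Read every page from last to first, with new photos arriving
    right after the first page fetch. Returns the photo ids seen."""
    pool = list(pool)
    last_page = -(-len(pool) // per_page)  # ceiling division
    seen = []
    for page in range(last_page, 0, -1):
        seen.extend(get_page(pool, page, per_page))
        if page == last_page:
            # Someone adds photos: they go to the front of the pool,
            # which shifts every existing photo toward later pages.
            pool = list(added_midway) + pool
    return seen


# Ten photos with ids 0..9, where 9 is the newest; five photos per page.
original = list(range(9, -1, -1))
seen = crawl_with_additions(original, per_page=5, added_midway=[11, 10])
skipped = sorted(set(original) - set(seen))
print(skipped)
```

In this toy run, the two photos that were at the end of page 1 slide onto page 2 after the script has already read page 2, so the script never sees them.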

The situation is a lot worse overnight. You see, I don't keep the computer on all the time, so the script runs only when I'm using the computer. The script is written in such a way that I can stop it at any time and the next time I start it, it will resume at the page where it was stopped. The problem is that page N today is not the same as page N tomorrow. The 10 million photos group grows rapidly enough that tens of thousands of photos could've been added in the interim.

The solution? Make the script resume at a point based on the "dateadded" field. Every photo posted to the group has this timestamp field showing when it was added to the group. As long as the script goes from oldest to newest photo in the group, it can always keep track of what timestamp it was last looking at and continue from there. How does it know which page to continue from? For now, I implemented that using binary search. (Thank goodness for Wikipedia, so I don't have to drag out my old college textbooks. :) ) So essentially, the script will start by reading a bunch of pages from the group pool to figure out where to begin. Then it will go page by page as usual. (For now, I'm ignoring any skewing that happens while the script is running. I think Flickr does impose a limit to how fast a user can add photos to a group, so it probably won't be as bad as overnight skewing. Also, an interpolation search will probably do a lot better than binary search but I'll leave that for the next version.)
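Here is a rough sketch of how that binary search could look. This is not the actual script. It assumes pages are numbered from 1 with the newest photos on page 1, that each page lists its photos newest first, and that a hypothetical `fetch_page(page)` helper returns the "dateadded" timestamps on that page. A page needs reading only if its newest photo is newer than the saved timestamp, and that condition holds for pages 1 through k and fails for everything after, so the search looks for k.

```python
def find_resume_page(fetch_page, total_pages, last_seen):
    """Return the highest-numbered page that still holds a photo newer
    than last_seen, or 0 if the pool has nothing new."""
    lo, hi = 1, total_pages
    resume = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        newest_on_page = fetch_page(mid)[0]  # first entry is the newest
        if newest_on_page > last_seen:
            resume = mid      # this page has unseen photos; try older pages
            lo = mid + 1
        else:
            hi = mid - 1      # everything here is already counted
    return resume


# Toy pool: timestamps 100 (newest) down to 1 (oldest), ten per page.
PER_PAGE = 10
pool = list(range(100, 0, -1))
fetch_count = 0


def fetch_page(page):
    """Stand-in for a real API call; also counts how many pages were read."""
    global fetch_count
    fetch_count += 1
    start = (page - 1) * PER_PAGE
    return pool[start:start + PER_PAGE]


resume_page = find_resume_page(fetch_page, total_pages=10, last_seen=37)
print(resume_page, fetch_count)
```

The point of the search is that it reads only about log2(total pages) pages before the normal page-by-page run begins, instead of walking the whole pool to find the right spot.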

As a plus, the script can now update its records incrementally. Let's say it has finished a run. A week later, when there are 100,000 or so more photos in the pool, all I need to do is run the script again and because of the "dateadded" binary search technique, it will figure out where to begin in order to read just those 100,000 photos. After the first full run, I may never have to do a full run of the script again, unless there have been lots of people removing their photos from the pool. (That's a secondary skewing problem.)
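A sketch of the incremental read itself, under the same assumptions as before (newest photos on page 1, each page newest first, a hypothetical `fetch_page` helper). The start page is passed in here and would come from whatever search locates the saved timestamp. The function name `read_new_photos` is made up for illustration.

```python
def read_new_photos(fetch_page, start_page, last_seen):
    """Walk from start_page down to page 1, oldest photo first, and
    collect only photos added after last_seen. Returns the new
    timestamps and the updated checkpoint to save for the next run."""
    new_photos = []
    for page in range(start_page, 0, -1):
        # Each page is newest first, so reverse it to go oldest to newest.
        for dateadded in reversed(fetch_page(page)):
            if dateadded > last_seen:
                new_photos.append(dateadded)
                last_seen = dateadded
    return new_photos, last_seen


# Toy pool: timestamps 25 (newest) down to 1 (oldest), ten per page.
PER_PAGE = 10
pool = list(range(25, 0, -1))


def fetch_page(page):
    """Stand-in for a real API call."""
    start = (page - 1) * PER_PAGE
    return pool[start:start + PER_PAGE]


# Suppose the previous run stopped after the photo with timestamp 12,
# which sits on page 2 of this toy pool.
new_photos, checkpoint = read_new_photos(fetch_page, start_page=2, last_seen=12)
print(len(new_photos), checkpoint)
```

Because the checkpoint is a timestamp rather than a page number, photos on the start page that were already counted get filtered out instead of being counted twice.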

I also made a bunch of other changes to improve robustness when dealing with weird Flickr API errors. Also improved modularity and maintainability, not that I expect anyone else to work on this code. :)