Odd issue with favicon.ico
January 16th, 2010
A while ago, I added a feature that tries to show each site’s favicon.ico next to its feed name. Most sites keep this image at a common location, http://www.site.com/favicon.ico. It’s also possible to set another location with a tag in the head of the HTML page. Since I felt this was a nice-to-have rather than something critical, I added the img tags with an onError handler that hides the image if it isn’t in the default location.
These images created a bit of an issue, though. Several sites serve a large 404 page, with ads and everything, if the favicon isn’t in the default location. The ReadPath page would do the correct thing and hide the image, but behind the scenes the browser could end up doing a lot of excessive downloading if the 404 pages were large. Instead of loading a minimal 16 x 16 px image, it could end up loading a huge 404 page. With 20 stories on the page, it’s possible to load a whole lot of things that will never be displayed and that just bog down the browser.
I’ve since changed the behavior so that the favicon images are only loaded after the page itself has finished loading. This should cure the speed issues, but it still leaves the possibility of the browser loading a lot of things that aren’t necessary. I thought about caching a copy of the image and keeping a flag on whether the image existed. This would sidestep all of the performance issues but creates all sorts of other issues, so for the time being I’ll leave it with the delayed load.
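As a rough illustration of the "common location" convention mentioned above, here is a minimal Java sketch that derives the default favicon address from any URL on a site. The class and method names are made up for this example and are not ReadPath code. It ignores the alternative location that a page can declare in its head, and the hiding of missing images still happens in the browser via onError.

```java
import java.net.URI;
import java.util.Optional;

// Sketch only: derives the conventional favicon location for a site.
// FaviconLocator and defaultFaviconUrl are hypothetical names.
class FaviconLocator {
    // Returns scheme://host[:port]/favicon.ico, or empty if the URL is unusable.
    static Optional<String> defaultFaviconUrl(String siteUrl) {
        try {
            URI uri = URI.create(siteUrl);
            if (uri.getScheme() == null || uri.getAuthority() == null) {
                return Optional.empty();
            }
            return Optional.of(uri.getScheme() + "://" + uri.getAuthority() + "/favicon.ico");
        } catch (IllegalArgumentException e) {
            // URI.create rejects strings that are not valid URIs.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultFaviconUrl("http://www.site.com/feeds/rss.xml"));
    }
}
```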
||
Related categories
January 13th, 2010
|
||
Link graph roll back
January 13th, 2010
ReadPath needs a way to keep track of which items link to other items. To accomplish this, there is some code that builds up a link graph between the items. The original implementation was based on Java bdb and worked quite well for a long time. After a while, I had to move it from bdb to MySQL because of some issues with maintainability. The current system is based on an InnoDB database where each URL uses a hash of the URL as a key and has a list of the items that reference it. This system is actually very similar to the reverse index used in a search engine. I’ve contemplated using Lucene as a base, but there were some areas that didn’t match up.
Over the last week I attempted to convert the link graph system from MySQL to HBase. I was having some scalability issues and was constantly having to clean the dataset to remove dead-end nodes with 0 links. Since HBase was working quite well for the content and dictionary systems, I was hopeful that moving from a single MySQL node to an 8 node HBase cluster would give me some room to grow.
I did some initial testing with HBase as a backend and found that, for the processing portion of the system, the 8 node HBase cluster (quad core, 8 GB memory, 500 GB disk) was approximately equivalent to a single MySQL node with the same hardware. The chart below shows a test run of data processing. The line is the number of items processed per minute. As you can see from the chart, the rate starts out fairly high at ~75,000 / minute. As time progresses, though, the rate quickly drops to ~5,000 / minute (equivalent to the rate with the MySQL backend). As I watched the HBase nodes during this run, I could see that the cache hit percentage was rock bottom and that all of the machines had their disk IO maxed out trying to handle the completely random read / write access pattern.
So while there was an initial advantage to using HBase when the dataset fit in memory, that advantage was quickly lost. Even without an advantage for the data processing, I wanted to test the production access pattern, which includes lots of clients reading from the table. Maybe HBase could handle this additional read load better than MySQL. So last week I converted the ReadPath code to use the HBase backend and flipped the switch in production. It quickly became apparent that I was going to have to roll the change back. The chart below shows the average time in milliseconds that it takes to complete a request. The red lines indicate when the conversion was made. As you can see, the average time to complete a request jumped from the 5 ms range to several hundred ms. The next chart shows the number of requests / min that were being made during the same time period. The same amount of work was attempting to go through; it just wasn’t moving as quickly. You can see the large spike after the conversion period where the system, restored to the MySQL backend, is catching up again. I’ve found that HBase works great for certain parts of ReadPath’s infrastructure.
In conclusion, with my current hardware availability, HBase just isn’t a candidate for the link graph. My primary option is to look at partitioning MySQL, splitting the table across servers. This solution will scale linearly with the number of servers. I’ve just found from personal experience at other jobs that it can be a real hassle to get the code working well and to maintain it.
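To make the data layout concrete, here is a minimal in-memory Java sketch of a link graph keyed by a hash of the URL and split across several partitions. It is only an illustration under stated assumptions: the post says "a hash" without naming one, so MD5 is a guess, and the class name, the plain modulo partitioning scheme, and the use of in-memory maps in place of MySQL servers are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: an in-memory stand-in for a link graph keyed by a hash of the
// URL and split across partitions. All names here are hypothetical.
class PartitionedLinkGraph {
    // One map per partition; in a real deployment each would be a separate server.
    private final List<Map<String, List<Long>>> partitions = new ArrayList<>();

    PartitionedLinkGraph(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Hex-encoded MD5 of the URL (MD5 is an assumption, not taken from the post).
    static String urlKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Picks a partition from the key; floorMod keeps the index non-negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Records that the item contentId links to url.
    void addLink(String url, long contentId) {
        String key = urlKey(url);
        partitions.get(partitionFor(key))
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(contentId);
    }

    // Returns all items that reference url, or an empty list if none are known.
    List<Long> referencesTo(String url) {
        String key = urlKey(url);
        return partitions.get(partitionFor(key))
                .getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        PartitionedLinkGraph graph = new PartitionedLinkGraph(4);
        graph.addLink("http://example.com/story", 101L);
        graph.addLink("http://example.com/story", 202L);
        System.out.println(graph.referencesTo("http://example.com/story"));
    }
}
```

One caveat with plain modulo partitioning like this is that changing the number of partitions moves most keys to a different partition, which is part of why this kind of code can be a hassle to maintain.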
||
Hadoop reduce method reuses objects
December 28th, 2009
Over the holiday break I was playing around with creating a Map/Reduce job that would scan through all of the content items and create a link graph. It was a fairly straightforward job. I would scan each content item for all of its hrefs and, for each one, emit a record containing the hash of the URL, the contentId that was linking to it, and whether that content item was the owner of the URL. To encapsulate these items I needed a new LinkRecord object, keyed off the hash of the URL, which was simple to create by just implementing Writable. Then in the reduce method I collected all of the LinkRecords for the URL. I needed to scan the list of LinkRecords several times because I needed to find the oldest content item that claimed to be the owner of the URL. Once I had that, I could distinguish items from the same content feed from items from different feeds. To do this I used a bit of code like:
Then I would iterate through the list and do all of the necessary work. This all seemed to run as expected, but whenever I looked at the results, each link seemed to have n copies of the same value, and different links had different numbers of copies. What appeared to be going on is that the Iterable of values was reusing the object that it exposed in the loop and just changing that object’s fields. So my list ended up holding n references to the same object. To solve this, in the for loop, instead of adding the record to my list, I created a new LinkRecord object, copied the fields from the loop object into it, and then added the new LinkRecord to the list. This allowed my code to function as expected.
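The behavior described above can be reproduced without a Hadoop cluster. The following is a minimal plain-Java sketch, not the original job: ReusingValues is a made-up Iterable that hands back one shared object on every step, imitating what the reduce-side values iterator appeared to do, and this LinkRecord has a single hypothetical field rather than the real Writable's contents.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch only: plain Java that mimics the object reuse, with no Hadoop dependency.
class ReuseDemo {
    // Hypothetical stand-in for the post's LinkRecord; the real fields are not shown there.
    static class LinkRecord {
        long contentId;

        LinkRecord() { }

        // Copy constructor used by the fix.
        LinkRecord(LinkRecord other) {
            this.contentId = other.contentId;
        }
    }

    // Returns the same LinkRecord instance on every step and only overwrites its field.
    static class ReusingValues implements Iterable<LinkRecord> {
        private final long[] ids;

        ReusingValues(long... ids) {
            this.ids = ids;
        }

        @Override
        public Iterator<LinkRecord> iterator() {
            final LinkRecord shared = new LinkRecord();
            return new Iterator<LinkRecord>() {
                private int position = 0;

                @Override
                public boolean hasNext() {
                    return position < ids.length;
                }

                @Override
                public LinkRecord next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    shared.contentId = ids[position++];
                    return shared;
                }
            };
        }
    }

    // Buggy: the list ends up with n references to the one shared object.
    static List<LinkRecord> collectByReference(Iterable<LinkRecord> values) {
        List<LinkRecord> records = new ArrayList<>();
        for (LinkRecord record : values) {
            records.add(record);
        }
        return records;
    }

    // Fixed: each record is copied before it is stored.
    static List<LinkRecord> collectByCopy(Iterable<LinkRecord> values) {
        List<LinkRecord> records = new ArrayList<>();
        for (LinkRecord record : values) {
            records.add(new LinkRecord(record));
        }
        return records;
    }

    public static void main(String[] args) {
        for (LinkRecord r : collectByReference(new ReusingValues(1L, 2L, 3L))) {
            System.out.println("by reference: " + r.contentId);
        }
        for (LinkRecord r : collectByCopy(new ReusingValues(1L, 2L, 3L))) {
            System.out.println("by copy: " + r.contentId);
        }
    }
}
```

With the by-reference version every entry shows the last value seen (3, 3, 3), while the copying version keeps 1, 2, 3.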
||
Hadoop and HBase in production
December 28th, 2009
At first I had focused most of my attention on HBase, seeing it as a way to scale systems beyond a single MySQL instance without having to deal with the headaches of partitioning. However, once things went into production, I discovered that being able to kick off Map/Reduce jobs over parts of the dataset is a huge advantage.
The first two areas to be ported were the content archive and the dictionary. ReadPath still keeps items that are less than 4 weeks old in a large MySQL database, although the plan is to eventually move everything over to HBase. Items that are older than 4 weeks are removed from the primary database and inserted into an 8 node HBase/Hadoop cluster. The content table works very well with HBase’s access patterns and can be fetched at speeds on par with MySQL.
The personalized content scoring features of ReadPath depend on having a good measurement of term frequencies. To support this, there is a dictionary of all of the terms used in the content database along with their frequencies. The initial implementation of the dictionary wasn’t scaling properly, so it was converted to a Map/Reduce job that stores data in HBase. The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content (ReadPath adds ~1,500 new items / minute) to one that can completely rebuild a dictionary from 250 million content items in under 3 hours (this equates to ~1,400,000 items / minute).
One of the main things keeping me from pulling the trigger on porting to HBase was concern about data loss. In my first day of playing with HBase, I had a bad server take out the .META. table, resulting in complete loss of the HBase tables. I pulled that server and haven’t had any data loss since, but I have also made good use of the HBase Exporter Map/Reduce job, which dumps the contents of your tables to HDFS. This can then be easily restored if for some reason the HBase tables become corrupted.
These backup and restore techniques are actually much easier than the standard systems used for MySQL at the scale that ReadPath has gotten to.
Next steps include porting the entire content system over to HBase and looking at using HBase for the link graph system that ReadPath needs to sort items. The link graph is a much more difficult system: the read/write pattern is completely random, which blows away any caching. In preliminary tests, the system ends up being disk bound. Of course, the current system has grown larger than a single MySQL instance can hold and is disk bound as well, so having 8+ disks is better than 1 disk.
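For readers unfamiliar with the term-frequency dictionary mentioned earlier in this post, the idea can be sketched in plain Java as a "map" step that splits each item into terms and a "reduce" step that sums the counts. This is only an illustration and not the actual job: the class name and the tokenization rule (lower-casing and splitting on anything that is not an ASCII letter or digit) are assumptions, and nothing here touches Hadoop or HBase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch only: a single-process version of a term-frequency dictionary build.
// TermDictionarySketch and its tokenization rule are hypothetical.
class TermDictionarySketch {
    // "Map" step: break one content item into lower-cased terms.
    static List<String> terms(String content) {
        return Arrays.stream(content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(term -> !term.isEmpty())
                .collect(Collectors.toList());
    }

    // "Reduce" step: sum the occurrences of each term across all items.
    static Map<String, Long> termFrequencies(List<String> items) {
        Map<String, Long> counts = new TreeMap<>();
        for (String item : items) {
            for (String term : terms(item)) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("HBase and Hadoop", "hadoop, hadoop!");
        System.out.println(termFrequencies(items));
    }
}
```

In a real Map/Reduce job the two steps run on different machines and the framework groups the emitted terms between them; the per-term summing is the part that carries over unchanged.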
||
Search Update
September 19th, 2009
Because things are running so much more smoothly, instead of doing a minor index build every other hour as I used to, ReadPath is now updating the search index every 2 minutes. While not real-time per se, it’s getting awfully close. After hooking up a real-time spidering system like PubSubHubbub or rssCloud, things will be very near to actual real-time.
||
DB Update
September 19th, 2009
So I cleared away some log files and other cruft, which gave me 64 MB to make it till the evening with spidering turned off. Then tonight I put in an additional 1 TB SATA drive that I had intended to use in the Hadoop/HBase cluster. After a quick format of the disk, a copy of the ibdata file over to the new hard drive (which took 30 minutes for 134 GB), and a link from the old ibdata location to the new one, we’re up and running. Everything seems to check out, but let me know if you see any issues. Thanks, Bryan
||
Feedburner Issue
June 14th, 2009
With the increasing number of subscriptions by users of ReadPath, we’ve gone over a limit for Feedburner. This has caused them to block us temporarily. We’re trying to get access restored to Feedburner feeds, but in the meantime you may notice that certain subscriptions are not updating. We thank you for your patience as we try to get this issue resolved.
–Update– The issue seems to have resolved itself. Let us know if you see any further updating issues.
||
Search Updates
January 24th, 2009
There are three types of search on ReadPath. You can search for:
Subscription search has just gotten an update that adds relevant items to each subscription result. So if you search for “Barack Obama inauguration” you get a list of the most relevant subscriptions on the inauguration, along with the top stories that each subscription has on it. News Items search has been updated so that the index updates every 4 hours, and you can now order the results by either relevance or newest first. Real-time updates will be coming soon.
||
Personal News Filtering
October 3rd, 2008
First and foremost, personal recommendations outweigh anything that a computer can foreseeably do. This is why sharing features are very important. Being able to quickly and easily share stories and ideas with your friends and family is a great way to make sure you’re getting the best information.
Many people aren’t comfortable exposing their thoughts and ideas to the world and would rather keep them among a known group of people. To foster discussion among them, ReadPath offers private comments. Now you can discuss topics without having to be concerned about your words living forever in a Google cache visible to the world.
ReadPath also offers personalized filters for each user. As you subscribe to information sources and read stories, ReadPath is able to score stories based on what you like or dislike. This scoring system is specific to each user, so your scoring is not affected by what other users think. If you’re a huge fan of Beethoven, then stories about Beethoven and classical music will rise to the top.
The scoring also allows you to filter out items that fall under a certain threshold. So if you have 3,000 unread items, you’ll be able to see that only 100 items are really of interest to you. You can set ReadPath to show only the “Top Rated” items, which will hide the 2,900 stories that are really just noise. Each user is able to set their own threshold and adjust it as they see fit.
All of these features come together to make sure that you’re getting the most out of your reading time and that you’re free to have the highest quality discussions.
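The threshold filtering described above boils down to keeping only the items whose per-user score is at or above that user's cutoff. Here is a minimal Java sketch of that one step. The ScoredItem type, the 0 to 1 score range, and the sample titles are all invented for illustration, and how ReadPath actually computes scores is not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: hides unread items whose per-user score falls under a threshold.
// ScoredItem and its fields are hypothetical, as is the 0..1 score range.
class TopRatedFilter {
    static class ScoredItem {
        final String title;
        final double score;

        ScoredItem(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    // Keeps items scoring at or above the user's threshold, preserving their order.
    static List<ScoredItem> topRated(List<ScoredItem> unread, double threshold) {
        List<ScoredItem> kept = new ArrayList<>();
        for (ScoredItem item : unread) {
            if (item.score >= threshold) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredItem> unread = Arrays.asList(
                new ScoredItem("Beethoven's late quartets", 0.92),
                new ScoredItem("Celebrity gossip roundup", 0.10),
                new ScoredItem("New classical recordings", 0.75));
        for (ScoredItem item : topRated(unread, 0.5)) {
            System.out.println(item.title);
        }
    }
}
```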
||






I haven’t had a chance to write much about it, but a couple weeks ago, I moved ReadPath’s search from a straight 
