Jonesy's blog feed
A Couple of MySQL Performance Tips
If you’re an advanced MySQL person, you might already know these, in which case, please read anyway, because I still have some questions. On the other hand, if you’re someone who launched an application without a lot of database background, thinking “MySQL Just Works”, you’ll eventually figure out that it doesn’t, and in that case, maybe these tips will be of some use. Note that I’m speaking specifically about InnoDB and MyISAM, since this is where most of my experience is. Feel free to add more to this content in the comment area.
InnoDB vs. MyISAM
Which one to use really depends on the application, how you’re deploying MySQL, your plans for growth, and several other things. The very high-level general rule you’ll see touted on the internet is “lots of reads, use MyISAM; lots of writes, use InnoDB”, but this is really an oversimplification. Know your application, and know your data. If all of your writes are *inserts* (as opposed to updates or deletes), MyISAM allows for concurrent inserts, so if you’re already using MyISAM and 90% of your writes are inserts, it’s not necessarily true that InnoDB will be a big win, even if those inserts make up 50% of the database activity.
In reality, even knowing your application and your data isn’t enough. You also need to know your system, and how MySQL (and its various engines) use your system’s resources. If you’re using MyISAM, and you’re starting to be squeezed for disk space, I would not recommend moving to InnoDB. InnoDB will tend to take up more space on disk for the same database. If you’re squeezed for RAM, I would also not move to InnoDB, because, while clustered indexes are a big win for a lot of application scenarios, they cause data to be stored along with the index, so the index takes up more space in RAM (when it is being cached in RAM).
In short, there are a lot of things to consider before making the final decision. Don’t look to benchmarks for much in the way of help — they’re performed in “lab” environments and do not necessarily model the real world, and almost certainly aren’t modeled after your specific application. That said, reading about benchmarks and what might cause one engine to perform better than another given a certain set of circumstances is a great way to learn, in a generic sort of way, about the engines.
Indexing
Indexes are strongly tied to performance. The wrong indexing strategy can cause straight selects on tables with relatively few rows to take an inordinately long amount of time to complete. The right indexing strategy can help you keep your application ‘up to speed’ even as data grows. But there’s a lot more to the story, and blind navigation through the maze of options when it comes to indexing is likely to result in poorer performance, not better. For example, indexing all of the columns in a table in various schemes all at once is likely to hurt overall performance, but at the same time, depending on the application needs, the size of the table, and the operations that need to be performed on it, there could be an argument for doing just that!
You should know that indexes (at least in MySQL) come in two main flavors: clustered, and non-clustered (there are other attributes like ‘hashed’, etc that can be applied to indexes, but let’s keep it simple for now). MyISAM uses non-clustered indexes. This can be good or bad depending on your needs. InnoDB uses clustered indexes, which can also be good or bad depending on your needs.
Non-clustered indexing generally means that the index consists of a key, and a pointer to the data the key represents. I said “generally” - I don’t know the really low-level details of how MySQL deals with its non-clustered indexes, but everything I’ve read leads me to believe it’s not much different from Sybase and MSSQL, which do essentially the same thing. The result of this setup is that doing a query based on an index is still a two-step operation for the database engine: it has to scan the index for the values in the index, and then grab the pointer to get at the data the key represents. If that data is being grabbed from disk (as opposed to memory), then the disk seeks will fall into the category of “random I/O”. In other words, even though the index values are stored in order, the data on disk probably is not. The disk head has to go running around like a chicken without a head trying to grab all of the data.
Clustered indexes, by comparison, kinda rock. Different products do it differently, but the general idea is that the index and the data are stored together, and in order. The good news here is that all of that random I/O you had to go through for sequential range values of the index goes away, because the data is right there, and in the order dictated by the index. Another big win here which can be really dramatic (in my experience) is if you have an index-covered query (a query that can be completely satisfied by data in the index). This results in virtually no I/O, and extremely fast queries, even on tables with a million rows or more. The price you pay for this benefit, though, can be large, depending on your system configuration: in order to keep all of that data together in the index, more memory is required. Since InnoDB uses clustered indexes, and MyISAM doesn’t, this is what most people cite as the reason for InnoDB’s larger memory footprint. In my experience, I don’t see anything else to attribute it to myself. Thoughts welcome.
Indexes can be tricky, and to some, they look like a black art. While I am a fan of touting proper data schema design, and that data wants to be organized independently of the application(s) it serves, I think that once you get to indexing, it is imperative to understand how the application(s) use the data and interact with the database. There isn’t some generic set of rules for indexing that will result in good performance regardless of the application. You also don’t have data integrity issues to concern yourself with when developing an index strategy. One question that arises often enough to warrant further discussion is “hey, this column is indexed, and I’m querying on that column, so why isn’t the index being used?”
The answer is diversity. If you’re running one of those crazy high performance web 2.0 behemoth web sites, one thing you’ve no doubt tossed around is the idea of sharding your data. This means that, instead of having a table with 400,000,000 rows on one server, you’re going to break up that data along some kind of logical demarcation point in the data to make it smaller so it can be more easily spread across multiple servers. In doing so, you might create 100 tables with 4,000,000 rows apiece. However, a common problem with figuring out how to shard the data deals with “hot spots”. For example, if you run Flickr, and your 400,000,000 row table maps user IDs to the locations of their photos, and you break up the data by user ID (maybe a “user_1000-2000” for users with IDs between 1000 and 2000), then that can cause your tables to contain far less diverse data than you had before, and could potentially cause *worse* performance than you had before. I’ve tested this lots of times, and found that MySQL tends to make the right call in these cases. Perhaps it’s a bit counterintuitive, but if you test it, you’ll find the same thing.
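To make the difference between the two index flavors concrete, here is a minimal Python sketch. It is a toy model under loud assumptions: the `heap`, `nonclustered`, and `clustered` structures and both function names are made up for illustration, and none of this reflects how MySQL actually lays out pages on disk. It only shows why a range lookup through a non-clustered index touches scattered positions, while the clustered version reads neighbours in order.

```python
import bisect

# Toy model only: hypothetical structures, not MySQL's on-disk format.
# heap: rows stored in arrival order (loosely like a MyISAM data file).
heap = [(42, "zoe"), (7, "al"), (19, "kim"), (3, "bo")]

# Non-clustered index: sorted (key, position-in-heap) pairs.
nonclustered = sorted((key, pos) for pos, (key, _) in enumerate(heap))

# Clustered index: the rows themselves, kept in key order.
clustered = sorted(heap)


def range_nonclustered(lo, hi):
    """Walk the index for keys in [lo, hi], then follow each pointer."""
    keys = [k for k, _ in nonclustered]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    positions = [pos for _, pos in nonclustered[start:end]]
    rows = [heap[pos] for pos in positions]  # one "seek" per matching row
    return rows, positions


def range_clustered(lo, hi):
    """Find the start of the range, then read adjacent rows in order."""
    keys = [k for k, _ in clustered]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return clustered[start:end], list(range(start, end))


rows_nc, touched_nc = range_nonclustered(3, 19)
rows_c, touched_c = range_clustered(3, 19)
print(touched_nc)  # scattered heap positions
print(touched_c)   # contiguous positions
```

Both functions return the same rows; the difference is in the positions they had to visit to get them, which is the random versus sequential I/O distinction described above.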
For example, say that user 1000 has 400,000 photos (and therefore, 400,000 rows in the user_1000-2000 table), and the entire table contains a total of 1,000,000 rows. That means that user 1000 makes up 40% of the rows in the table. What should MySQL do? Should it perform 400,000 multi-step “find the index value, get the pointer to the data, go get the data” operations, or should it just perform a single pass over the whole table? At some point there must be a threshold at which performing a table scan becomes more efficient than using the index, and the MyISAM engine seems to set this threshold at around 30-35%. This doesn’t mean you made a huge mistake sharding your data — it just means you can’t assume that a simple index on ‘userID’ that worked in the larger table is going to suffice in the smaller one.
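The threshold idea can be sketched with some back-of-the-envelope arithmetic in Python. This is a made-up cost model, not MySQL’s optimizer: the function names and the `random_cost` and `sequential_cost` weights are assumptions, and the 3-to-1 ratio was picked only because it puts the break-even point near the 30-35% figure mentioned above.

```python
def selectivity(matching_rows, total_rows):
    """Fraction of the table that a given key value matches."""
    return matching_rows / total_rows


def cheaper_plan(matching_rows, total_rows,
                 random_cost=3.0, sequential_cost=1.0):
    """Compare a hypothetical index-lookup cost with a full-scan cost.

    random_cost and sequential_cost are illustrative weights, not
    constants taken from MySQL.
    """
    index_cost = matching_rows * random_cost   # one random read per match
    scan_cost = total_rows * sequential_cost   # one sequential pass
    return "index" if index_cost < scan_cost else "table scan"


# User 1000 from the example: 400,000 of 1,000,000 rows.
print(selectivity(400_000, 1_000_000))
print(cheaper_plan(400_000, 1_000_000))
# A user with far fewer photos in the same table.
print(cheaper_plan(100_000, 1_000_000))
```

With these weights, the break-even point sits at one third of the table, so the 40% user gets a table scan and the 10% user gets the index.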
But what if there just isn’t much diversity to be had? Well, perhaps clustered indexing can help you, then. If you switch engines to InnoDB, it’ll use a clustered index for the primary key index, and depending on what that index consists of, and how that matches up with your queries, you may find a solution there. What I’ve found in my testing is that, presumably due to the fact that data is stored, in order, along with the index, the “table scan” threshold is much higher, because the number of IO operations MySQL has to perform to get at the actual data is lower. If you have index-covered queries that are covered by the primary key index, they should be blazing fast, where in MyISAM you’d be doing a table scan and lots of random I/O.
For the record, and I’m still investigating why this is, I’ve also personally found that secondary indexes in InnoDB seem to be faster than those in MyISAM, though I don’t believe there’s much in the way of an advertised reason why this might be. Input?
Joins, and Denormalization
For some time, I read about how sites like LiveJournal, Flickr, and lots of other sites dealt with scaling MySQL with my head turned sideways. “Denormalize?! Why would you do that?!” Sure enough, though, the call from on high at all of the conferences by all of the speakers seemed to be to denormalize your data to dramatically improve performance. This completely baffled me.
Then I learned how MySQL does joins. There’s no magic. There’s no crazy hashing scheme or merging sequence going on here. It is, as I understand it (I haven’t read the source), a nested loop. After learning this, and twisting some of my own data around and performing literally hundreds, if not thousands of test queries (I really enjoy devising test queries), I cringed, recoiled, popped up and down from the ceiling to the floor a couple of times like that Jekyll and Hyde cartoon, and started (carefully, very carefully) denormalizing the data.
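For anyone who hasn’t seen one, a nested loop join in its most naive form looks something like the Python sketch below. This is the textbook version of the technique, not MySQL’s actual code: the function name, the `users` and `photos` sample data, and the comparison counter are all made up for illustration.

```python
def nested_loop_join(outer, inner, outer_key, inner_key):
    """Textbook nested loop join: compare every outer row to every inner row."""
    joined = []
    comparisons = 0
    for o in outer:
        for i in inner:
            comparisons += 1
            if o[outer_key] == i[inner_key]:
                joined.append({**o, **i})  # merge the two matching rows
    return joined, comparisons


# Hypothetical sample data.
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
photos = [
    {"photo_id": 10, "owner_id": 1},
    {"photo_id": 11, "owner_id": 2},
    {"photo_id": 12, "owner_id": 1},
]

rows, comparisons = nested_loop_join(users, photos, "user_id", "owner_id")
print(len(rows), comparisons)
```

The number of comparisons is the product of the two row counts, which is why each extra join hurts so much on big tables. An index on the inner table’s join column replaces the inner loop with a lookup, which is the usual first thing to try before denormalizing.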
I cannot stress how carefully this needs to be done. It may not be completely obvious which data should be denormalized/duplicated/whatever. Take your time. There are countless references for how to normalize your data, but not a single one that’ll tell you “the right way” to denormalize, because denormalization itself is not considered by any database theorists to be “right”. Ever. In fact, I have read some great theorists, and they will admit that, in practice, there is room for “lack of normalization”, but they just mean that if you only normalize to 3NF (3rd Normal Form), that suits many applications’ needs. They do *NOT* mean “it’s ok to take a decently normalized database and denormalize it”. To them, normalization is a one way street. You get more normalized - never less. These theorists typically do not run highly scalable web sites. They seem to talk mostly in the context of reporting on internal departmental data sets with a predictable and relatively slow growth rate, with relatively small amounts of data. They do not talk about 10GB tables containing tens or hundreds of millions of rows, growing at a rate of 3-500,000 rows per day. For that, there is only anecdotal evidence that solutions work, and tribal war stories about what doesn’t work.
My advice? If you cannot prove that removing a join results in a dramatic improvement in performance, I’d rather perform the join if it means my data is relatively normalized. Denormalization may appear to be something that “the big boys” at those fancy startups are doing, but keep in mind that they’re doing lots of stuff they’d rather not do, and probably wouldn’t do, if they had the option (and if MySQL didn’t have O(n^2), O(n^3), or similar performance with regard to joins).
Do You Have an IO Bottleneck?
This is usually pretty easy to determine if you’re on a UNIX-like system. Most UNIX-like systems come with an ‘iostat’ command, or have one readily available. Different UNIX variants show different ‘iostat’ output, but the basic data is the same, and the number you’re looking for is “iowait” or “%iowait”. On Linux systems, you can run ‘iostat -cx 2’ and that’ll print out, every 2 seconds, the numbers you’re looking for. Basically, %iowait is the percentage of time (over the course of the last 2-second interval) that the CPU had to hang around waiting for I/O to complete so it would have data to work with. Get a read of what this number looks like when there’s nothing special going on. Then take a look at it on a moderately loaded server. Use these numbers to gauge when you might have a problem. For example, if %iowait never gets above 5% on a moderately loaded server, then 25% might raise an eyebrow. I don’t personally like when those numbers go into double-digits, but I’ve seen %iowait on a heavily loaded server get as high as 98%!
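If you’d rather compute the number yourself on Linux, the same kind of figure can be derived from two samples of the aggregate “cpu” line in /proc/stat. The Python sketch below is a rough approximation under stated assumptions: the function names are made up, it is Linux-only, it relies on iowait being the fifth counter on that line, and it is not a reimplementation of what iostat does internally.

```python
def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two samples.

    before/after are the aggregate 'cpu' line from /proc/stat, whose
    counters are: user nice system idle iowait irq softirq steal ...
    """
    a = [int(x) for x in before.split()[1:]]
    b = [int(x) for x in after.split()[1:]]
    deltas = [y - x for x, y in zip(a, b)]
    # Guest time is already counted inside user time, so only the
    # first eight counters are summed.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (the aggregate 'cpu' line)."""
    with open(path) as f:
        return f.readline()


# Two hand-written samples, so the example runs anywhere.
sample_1 = "cpu 100 0 50 800 50 0 0 0 0 0"
sample_2 = "cpu 150 0 70 900 80 0 0 0 0 0"
print(iowait_percent(sample_1, sample_2))
```

On a real box you would call `read_cpu_line()`, sleep for a couple of seconds, call it again, and feed both lines to `iowait_percent()`.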
Ok, time for bed
I find database-related things to be really a lot of fun. Developing interesting queries that do interesting things with data is to me what crossword puzzles are to some people: a fun brain exercise, often done with coffee. Performance tuning at the query level, database server level, and even OS level, satisfies my need to occasionally get into the nitty-gritty details of how things work. I kept this information purposely kind of vague to focus on high-level concepts with little interruption for drawn out examples, but if you’re reading this and have good examples that support or refute anything here, I’m certainly not above being wrong, so please do leave your comments below!
rrdpy - Thanks, Corey!
I have a somewhat unique situation to deal with in terms of monitoring. I need to graph a bunch of historical data mined from web server logs. I can get so far with loghetti, which is coming along and is great for certain things, but there’s a bridge missing between it and something like MRTG. I’m pretty sure that with a custom output plugin and Corey’s rrdpy, I can make it the rest of the way. In fact, I had been poring over the documentation for RRDTool and the various language bindings figuring out which way to go first just as this module was released.
I still may decide to graph the historical data using some generic ‘feed this your data in one big heap and this will chart it’ type of thing, but even that may be possible here, since rrdpy includes rrd_make.py, which may or may not (I haven’t looked yet) support the requisite arguments you need to pass to get historical data to work (I think you need to support a start time, for example).
Ubuntu 8.04 and Python Editors
So I updated one of my laptops to Ubuntu 8.04 pretty much as soon as it was available. I’ve been using my MacBook Pro laptop for everything for probably over a year now, because I grew tired of the hobby that *is* running Linux on a laptop and getting everything to work. I’ll note that I *do* run Linux on every server I maintain that I can think of.
The first test for this laptop was wireless. I bought this laptop (Lenovo T61) specifically because it got rave reviews for its Linux compatibility. I was careful to order the laptop with the proper video and wireless chipsets that had the best support. However, 2 things annoyed me so much that I went back to the MacBook for everything:
- Wireless hung, and hung often, and in a way that it was unrecoverable.
- Lenovo put the Escape key in the worst place they could possibly put it, especially for a Vim user. Changing the key mapping caused issues with other apps, and configuring the key mapping inside .vimrc doesn’t help on the 30 other servers I use it on (ssh’d in from this laptop) :-/
Really, it was the wireless that did it. I work 100% remotely on everything I do. So, 8.04 seems to have fixed the wireless issues. The next thing I wanted to do was check out all of the Python IDE/editors I couldn’t use on the Mac (or, not easily). So I used Synaptic Package Manager to install all of the ones I could find. I’m sorry to say that I personally had problems with most of them:
- DrPython launched fine, but using the file browser to open a file resulted in…. a no-op. I’m sorry, but an editor needs to be able to open a file.
- PyPE failed to launch altogether! It looks like it’s going to open, it spins for 5-10 seconds, and then just disappears. No window is ever shown, but a tab does appear in the bar on the bottom of the screen.
- Pida allows you to choose an external editor, so I chose Vim, and that kinda worked, but I really just want the key bindings, not the whole editor, and there’s no option to use some default built-in editor that has code folding and autocompletion and stuff. It appeared to me to be so close to gvim that I decided to skip it. I tried to stick around and give it a chance by reading the docs, but alas, the only thing under “Help” is “About”. Seems there are still a number of open source developers more concerned with getting credit than getting users.
- Stani’s Python Editor looked pretty nice, but I couldn’t find any easy way to change the syntax coloring, and while there is a manual, you have to donate to get your hands on it. This is nonsense. If you want to sell some kind of advanced documentation, fine, but you can’t expect me to donate to a project that I don’t even know if I want to use yet! “Please pay me so that you can see if this product fits your needs”…. it just doesn’t work that way. What you’ve done is given me a product that is complex enough that you pretty much need a manual just to get started, and then deprived me of that. Why not just give me the manual and a 30 day trial, after which I have to donate? I’d have no problem doing that if I planned on keeping it around. In fact, that’s how plenty of Mac applications work. I’ll pay for software that does what I need, but this game that’s being played is just offensive.
- Eclipse with PyDev: I can use this, but I don’t like it a whole lot. The good news is that there’s an SVN plugin (subclipse), and a plugin for vi keybindings if I want to pay for it (it’s only $20 - not bad if you use it a lot). The interface is a little clunky to me, and there’s no easy “change your syntax color scheme to this” type functionality. If you want a dark background and light colored text, you actually have to go to one place to change the background color, the color of the line numbering area, etc., and then go to another place to change the colors associated with the different elements of your particular language. That’s annoying for two reasons: first, it’ll take forever to get things the way I want, and second, if I installed this on another machine, I couldn’t just move over some kind of theme file and have my settings ready to go (as far as I know).
In the end, it looks like my three favorite editors are still Komodo Editor, JEdit, and Vim. What’s your favorite Python editor for Linux?
Python Magazine April Issue is Out
Hi all,
This month’s Python Magazine has been released, containing a few really great articles, including one about using the Google API and Google Spreadsheets to create a database “in the cloud”. For you scientist types, there’s also an excellent article about BioPython. For XCode users, there’s an article about scripting XCode with Python, and there’s also some in-depth coverage of PyTables, which I thought was really interesting. Have a look, and enjoy!
Spring Means Blooming Flowers… and Ideas
I seem to have found a pattern in my own internal workings. In the fall, I work furiously and get a lot done. Around the time of the winter holidays, I almost always do major personal web site changes and upgrades according to a mental list I’ve compiled over the previous year.
In the spring, I shake off the winter (I’m not a fan of winter), I brew my first batch of beer for the season (which symbolizes the end of winter, because I brew outdoors), and my brain starts to be flooded with new ideas. They range from the simplistic (maybe we should consider replacing windows in the house this year), to the slightly odd (why isn’t there a bluetooth setup that pairs two devices and alerts you if they get out of range, so if my daughter strays too far…), to the really useful (I should really take on that woodworking project to build that bookcase we desperately need), to the GEEKY!
This year I seem to be having a lot of geeky ideas. The difference is that, this year, I finally feel empowered enough to go after some of them. One idea that has come up is building an online brewer’s workshop. I would just build a GUI to do this for myself, but then I’d have to deal with which widget set to use, which platforms to support, and whatever else. Also, the final step in the evolution of a lot of GUIs is webification anyway. So I *think* this might be a job for Python, and I *think* I might try to do this using Django, which is fully supported by my web host (finally - see yesterday’s post)!
Brewing is one of those things that you can make as complex as you care to get. I started brewing with a buddy using a Coleman picnic cooler, a few buckets, and some odds and ends from the kitchen. Now I have a full three keg system, with pumps, plate chillers (small plate heat exchangers), fancy false bottoms, cool valves and tubing, and it involves relatively little manual labor. And that complexity can infect recipe development as well. Hops add bitterness by leaching alpha acids into the wort (the liquid that is not yet beer). Hop utilization calculations can be non-trivial and depend on many other factors in your system. Other characteristics depend heavily on the percent of available sugars you’re able to extract from the grains, and on your ability to keep a mash at a given temperature for a fixed period of time. This is easier to predict if you know, for example, the thermal mass of the vessels involved, and how much heat will be lost when you combine water and grain and stir. There are also proteins at work in the mash which can gum things up enough to make draining the liquid off a chore, so knowing what water/grain ratio to use is also important. And how quickly can you bring wort from boiling down to a temperature more friendly to yeast at the end of the cycle?
That’s a small fraction of the considerations you *could* make when brewing. I didn’t even touch on pH and water characteristics, or yeast attenuation! Needless to say, brewing with any consistency would be a great challenge and take a good bit more preparation without some tool to help you figure out how much water you’ll need, how many ounces of hops for how long, and how much grain you need to mash (and for how long), etc. There are lots of tools to help brewers out with this kind of stuff (ProMash is a popular one). The problem I have is that these tools are mostly commercial, proprietary, platform-specific ventures. I’d like to put one on the web that is at least “good enough”, and free for anyone to use. I’m open source that way (I’m happy to release the source as well).
Another tool I’d love to see is one that would let me manage my consulting business online. If BestPractical’s RT had a good PayPal plugin that would let you charge per ticket or charge for a bundle of so many tickets or something, that’d be a good start, but I’ve mucked with the code for RT (it’s written mostly in Perl), and it wasn’t a pleasant experience. This wouldn’t be a complete solution either, because most of my work is *not* simple support tickets, it’s large projects. For those I’d like people to be able to pay invoices online. There’s lots more I’d like to add on top of that, but that’s the general gist of it, and in the past I’ve been unable to find a really good solution, where “really good” is a completely nebulous term barely defined in my own head.
In addition to those ideas, I registered a couple of domains over the past year, and I hope to do some cool things with them as well if I ever get some time away from work and consulting. Oh yeah - I’ll also continue working on loghetti! Keep an eye out for updates. Maybe some people reading this have similar interests and would like to collaborate. Ciao for now!
A non-degree-holder’s view of hiring decisions
I get a good number of job offers without sending resumes around. I guess my name shows up in enough places, associated with enough buzzwords, that recruiters fire off emails first and read the fine print later. The “fine print” in my case, says that I do not have a college degree.
99.999% of the time, recruiters, and even hiring managers, tell me that my experience more than makes up for any lack of a formal education (one manager said he had seen many less capable MS degree holders). However, there are a few little quirks I’ve found at some larger companies. Mainly, they fall into two categories:
- They just plain don’t hire anyone without a degree
- You can’t get past a certain “tier” of employment without a degree
I’ve worked in business. I grew up in family businesses. I understand that, in certain circumstances, corporations can have legitimate reasons for these stances. Probably the only one I’ve ever actually heard myself that seemed almost reasonable is “insurance”. Some positions in some companies can have a drastic effect on things that directly affect the bottom line of that company, and if the company has insurance to protect them against extremely costly one-time errors (like E&O insurance), the insurance company might give them a better rate if they take steps to decrease the likelihood of such errors… like requiring that employees in these positions have a degree. I think it’s kind of a twisted logic, really. Instead of developing processes and procedures to reduce the likelihood of a problem, they think that hiring someone with a degree by itself will help the issue. Like degree-holders are less prone to errors due to the simple human condition. Odd, that.
Oh, and there’s a third quirk, but not with the corporate policies - with the hiring managers themselves. The quirk is that certain hiring managers, without regard for stated policy, won’t hire someone who doesn’t have a degree, presumably because they fear they might be fired for hiring someone who fails to produce because they don’t have a degree. The other possible reasoning here is that they have the attitude that “I went through it, so why should I give someone a job who hasn’t?”
The *real* problem with these hiring managers, and with corporations who have (non-insurance-related) strict educational requirements of applicants, is that there’s a shortcoming in the business education curricula: they don’t teach the future middle managers of the world how to evaluate an applicant who doesn’t have a traditional, formal education.
This is a guess, of course, since I haven’t been to business school. But aren’t managers unwilling to hire those without formal educations also guessing? I would submit that they are. It’s the same kind of guess, too. It’s a guess based in part (maybe) on experience, and in part based on stereotypes or other preconceptions.
My experience with those who don’t, or won’t, hire non-degree holders is that they think of degree-holders as “more well-rounded”. Assuming the non-degree holder hasn’t resigned themselves to a life of flipping burgers, I don’t think this could be further from the truth. It is, in fact, an old wives’ tale with no basis in fact. We were all told as kids that college would make us more “well-rounded”, and so we all worked to attain this nebulous goal. In reality, a college degree, by itself, is simply not any kind of valuable indicator of “well-roundedness”. Colleges are businesses. They produce college graduates. They do it efficiently, with an eye toward the business end of things more than anything else. If a college graduate is well-rounded, it is as much in spite of their college experience as because of it. Most well-rounded people are probably predisposed to being well-rounded, and had a tendency toward things to help them become well-rounded by the time they arrived on campus.
Besides this somewhat lame view of non-degree holders, another assumption is that non-degree holders do not have *any* education, and so *cannot* be prepared to perform the tasks that a graduate can (allegedly) perform. This argument might hold water with me if I didn’t have some idea already how resumes are typically handled by HR departments. The short story there is that there are tons of resumes that a hiring manager never sees because they’re pre-qualified (read: filtered) on the basis of educational status.
My area of expertise is technology. I don’t have a degree. It would therefore be assumed by many a hiring manager that I have no idea what Big-O notation is, don’t know anything about object delegation or polymorphism, and can’t analyze problems the same way as a college grad. The manager would be wrong on the first two counts, because while I didn’t study in college, I *did* study. But what about that third bit about analyzing problems?
I can tell you that it’s absolutely true that I do not analyze problems the same way as a college grad. What’s a real shame, though, is that a lot of managers would assume that “not the same” means “not as well”. There’s no justification for this assumption. In fact, I would argue that it *has* been the case in the past that having one rogue non-degree holder in a room full of grads can help to avoid “group think”, and help the group turn a problem sideways for another look. It is unfortunate that a degree that is supposed to help people “think outside the box” seems to put everyone in the same exact spot outside of that box, looking at it from the same exact perspective, coming to the same exact conclusion.
Finally, there is a certain class of degree-holder that I think is never a win over hiring a young, hungry rogue like myself. This class of graduate has hung their degree on the wall and decided that they no longer have any obligation to continue to keep up with new developments in their field. They code the same way they’ve always coded, use the same collection of old trusty tools, deal with technology the same way they’ve always dealt with technology, and stood more or less completely still, failing to seek out (much less embrace) new tools, techniques, languages, paradigms for getting things done. How can you possibly think outside the box when your vision of the box is 10 years old and assumes that the box is completely static?
I believe it was Nietzsche (sp?) who wrote that truth is not static (of course, I’m paraphrasing, and I might be thinking of James). If you can see yourself subscribing to that idea at all (it seems counterintuitive at first glance, but deeper thought will probably get you there), then how can a person with a notion of “truth” that is tied to their college experience be any better at figuring out what to do with it than someone who doesn’t have a degree, but is forever seeking out interesting things that come out of an ever-evolving truth?
Anyway, that’s my diatribe for the evening. If you’re a hiring manager with preconceived notions about college degree holders (or not) that come from decades of brain-hammering by graybeards, then cling to that safety blanket all you want, but know that it’s old thinking. Learn to be (gasp!) creative about how you evaluate applicants, and how you build your teams, and how you execute on your visions. Try to find the other box. The one that doesn’t look anything like it did in college.
I’m interested in hearing feedback on these ideas. I’m sure some will take offense. I don’t mean any. I’m certainly not saying that not having a degree is better, or that degree-holders are all complacent or anything like that. I *am* saying that *formal* education *can* be an irrelevant point of comparison, and that relying solely on the existence (or not) of a *formal* education as the basis for hiring one applicant over another is ludicrous.
Also, my blog is subscribed to by various sites, and I decided to publish this to all of the categories, because I think it *could* be interesting to pretty much anyone. If this is spam in your eyes, let me know. If you find a lively discussion about this going on anywhere, I’d be really interested in that as well.
Amazon Adds Static IP and “Availability Zones” to EC2
This is cool. You can now associate a static IP address with your EC2 instances. No more mucking about with 10-minute DNS timeouts or dynamic DNS routines. You can also elect to start certain instances in multiple locations using “Availability Zones”.
These new features will make it a little easier for people to deploy larger web sites and services without quite as much management overhead. There are also some rumblings in the forums that Amazon is actually working on persistent storage for EC2 images, which would pretty much complete the puzzle for most. A good many of the custom scripts and routines people come up with for running on EC2 exist to get around this “limitation”, although, truth be told, a good part of dealing with that is having a good backup plan, which you really should have anyway - EC2 just forces the issue.
Hadoop, EC2, S3, and me
I’m playing with a lot of rather large data sets. I’ve just been informed recently that these data sets are child’s play, because I’ve only been exposed to the outermost layer of the onion. The amount of data I *will* have access to (a nice way of saying “I’ll be required to wrangle and munge”) is many times bigger. Someone read an article about how easy it is to get Hadoop up and running on Amazon’s EC2 service, and next thing you know, there’s an email saying “hey, we can move this data to S3, access it from EC2, run it through that cool Python code you’ve been working with, and distribute the processing through Hadoop! Yay! And it looks pretty straightforward! Get on that!”
Oh joyous day.
I’d like to ask that people who find success with Hadoop+EC2+S3 stop writing documentation that makes this procedure appear to be “straightforward”. It’s not.
One thing that *is* cool, for Python programmers, is that you actually don’t have to write Java to use Hadoop. You can write your map and reduce code in Python and use it just fine.
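To give a feel for it, here is a minimal sketch in the style of Hadoop Streaming, which can run any program that reads lines on stdin and writes tab-separated key/value pairs to stdout. The function names, the sample log lines, and the choice of counting HTTP status codes are made up for illustration, and the `sorted()` call only stands in for the sort the framework does between the map and reduce steps.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit (status_code, 1) for each Apache combined-format log line."""
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a line we recognize; skip it
        # parts[2] holds the status code and byte count, e.g. ' 404 512 '
        fields = parts[2].split()
        if fields:
            yield fields[0], 1


def reduce_pairs(pairs):
    """Reducer: sum the counts per key. Expects input sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


# Hypothetical sample input, used here for a local dry run.
sample = [
    '1.2.3.4 - - [20/Mar/2008:10:00:00 -0400] "GET /a HTTP/1.1" 200 512 "-" "curl"',
    '1.2.3.5 - - [20/Mar/2008:10:00:01 -0400] "GET /b HTTP/1.1" 404 128 "-" "curl"',
    '1.2.3.6 - - [20/Mar/2008:10:00:02 -0400] "GET /c HTTP/1.1" 200 256 "-" "curl"',
]

for code, total in reduce_pairs(sorted(map_lines(sample))):
    print(f"{code}\t{total}")
```

In a real streaming job the mapper and reducer would be two separate scripts reading `sys.stdin`; keeping them as generator functions just makes them easy to try out on a laptop first.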
I’m not blaming Hadoop or EC2 really, because after a full day of banging my head on this I’m still not quite sure which one is at fault. I *did* read a forum post from someone who had a similar problem to the one I wound up with, and it turned out to be a bug in Amazon’s SOAP API, which is used by the Amazon EC2 command line tools. So things just don’t work when that happens. Tip 1: if you have an issue, don’t assume you’re the one who’s not getting something. Bugs appear to be fairly common.
Ok, so tonight I decided “I’ll just skip the whole hadoop thing, and let’s see how loghetti runs on some bigger iron than my macbook pro”. I moved a test log to S3, fired up an EC2 instance, ssh’d right in, and there I am… no data in sight, and no obvious way to get at it. This surprised me, because I thought that S3 and EC2 were much more closely related. After all, Amazon Machine Images (used to fire up said instance) are stored on S3. So where’s my “s3-copy” command? Or better yet, why can’t I just *mount* an s3 volume without having to install a bunch of stuff?
This goes down as one of the most frustrating things I’ve ever had to set up. It kinda reminds me of the time I had to set up a beowulf cluster of about 85 nodes using donated, out-of-warranty PC hardware. I spent what seemed like months just trying to get the thing to boot. Once I got over the hump, it ran like a top, but it was a non-trivial hump.
As of now, it looks like I’ll probably need to actually install my own image. A good number of the available public images are older versions of Linux distros for some reason. Maybe people have orphaned them and gone to greener pastures. Maybe they’re in production and haven’t seen a need to change them in any way. I’ll be registering a clean install image with the stuff I need and trudging onward.
The Power of Open Source
I think my very favorite aspect of the open source development model is that it allows me to practice the philosophies I use in my everyday personal life, and apply them to software development as well. In my teens and early 20’s I read quite a lot of Aristotle and Plato, and a very major philosophy that I took away from all of that reading is “be conscious of your own ignorance”. And so I am.
There are just about a million reasons to start an open source project. In the case of loghetti, I made it a project because I know that there are things that other people know, which I do not know, but would probably like to know or benefit from knowing (we’ll not go into epistemological discussions - I’m just going to use the word “know” in the traditional sense here).
Turns out, just knowing that there’s stuff out there that I don’t know has proven useful. Within hours of launching the Google Code site for the project, Kent Johnson joined the project, changed maybe 5 lines of code in the apachelogs.py module, and according to my testing, that change resulted in a 6x speed increase. If you’re using loghetti from the SVN trunk, it’s gone from being sluggish for anything over 50MB, to being pretty darn quick even up to 250MB, at least for simple queries like --code=404 (which is what I do speed comparisons with). The changes will be in a tarball probably some time next week, for those who don’t want to use svn.
We haven’t even touched threading yet.
Loghetti is now an open source project
I was getting feedback about loghetti, and it was all very useful, and it’s still coming in, and I can’t work full-time on it. At the same time, I’d love for some of the stuff I’ve read about to be implemented, because I certainly could make use of it myself.
So if anyone is interested, you can get loghetti, get more info about loghetti (it’s an apache log filter written in Python), or join the project here.
A GMail option I’d like to see: “Delayed Skip Inbox”
I use GMail extensively. In my main gmail account, I can send mail using a variety of accounts that I’ve authenticated to allow sending from. So, for example, mail from jonesy at pythonmagazine dot com is actually sent from a gmail interface, even though PyMag doesn’t officially use Google Mail for their domain (not that this would be a bad idea. Hrrmmmm…). I also administer a domain using Google Apps for Your Domain (well, two, but only one is what you’d call “production”). GMail has become a big part of my life in recent years.
This is not to say that it’s perfect by any stretch. In using it over the past few years (or however long it’s been around - I’ve been using it pretty much since it existed), I’ve come up with lots of ideas that would make it better, but I haven’t written many of them down, so I figured I would start doing that, starting now.
Here’s today’s “I wish gmail did…” entry:
Time-based filter application!
Oh yeah. Some of you know what I’m talking about already I’m sure. From the user-interface perspective, the change is dead simple. On the screen where you define what to do with messages that match a filter you’ve created, just add an option next to “Skip the Inbox” that says something like “Remove from Inbox after…” and then provide a way for the user to define some time period like “1 day” or “1 week”.
The effect should be that incoming messages that match the filter are placed in the inbox, and are then removed from the inbox after the user-specified time period. The user can still see the message after this time period by clicking on the label, just like you do to view messages that have skipped the inbox. It’s a “delayed skip the inbox”.
The idea here is that labels aren’t perfect, and neither are people. Labels aren’t perfect because labeling alone isn’t really enough to declutter your inbox unless you skip the inbox. That causes problems due to people’s imperfections: they don’t click on all of those filter labels every day, and messages fall through the cracks. There are still some filters that would be good to just have skip the inbox altogether, but for all the clubs and organizations I belong to, having even a 2- or 3-day delay would be great.
Feedback and Boredom Result in 35% Performance Boost for Loghetti
Well, I got some feedback on my last post, and I had some time on my hands tonight, and Python is pretty darn easy to use.
As a result, loghetti is making great strides in becoming a faster log filter. To test the performance in light of the actual changes I’ve made, I’m asking loghetti only to filter on http response code, and I’m only asking for a count of matching lines. I’m only asking for the response code because I happen to know that it will cause loghetti to skip a lot of processing which once was done per-line on every run, but which is now done lazily, on an as-requested basis. So, for example, there’s no reason to go parsing out dates and query strings (two costly operations when you’re dealing with large log files) if the user just wants to query on response codes.
Put another way “Hey, I only want response codes, why should I have to wait around while you process dates and query strings?”
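Here is a minimal sketch of that lazy, as-requested idea, assuming a hypothetical `LogLine` class rather than the real apachelogs.py code. The cheap fields are stored up front, while the date and query string are only parsed the first time something asks for them. It uses `functools.cached_property`, which needs Python 3.8 or later.

```python
from datetime import datetime
from functools import cached_property
from urllib.parse import parse_qs, urlsplit


class LogLine:
    """Hypothetical line object: cheap fields are stored eagerly,
    expensive ones are parsed only on first access."""

    def __init__(self, ip, timestamp, url, code):
        self.ip = ip
        self.code = code
        self._raw_timestamp = timestamp
        self._raw_url = url

    @cached_property
    def date(self):
        # Runs once per line, and only if a date filter touches it.
        return datetime.strptime(self._raw_timestamp, "%d/%b/%Y:%H:%M:%S %z")

    @cached_property
    def urldata(self):
        # Same deal for the query string.
        return parse_qs(urlsplit(self._raw_url).query)


line = LogLine("192.168.1.2", "14/Mar/2008:21:15:02 -0400",
               "/search?zip=10016", "404")
print(line.code)            # no date or query-string parsing happens here
print(line.urldata["zip"])  # parsed now, then cached on the instance
```

A query like --code=404 never touches `date` or `urldata`, so neither parser runs for it.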
So, here’s where I was when this little solo-sprint started:
strudel:loghetti jonesy$ time ./loghetti.py --code=404 --count 1MM.log
Matching lines: 10096
real 5m52.103s
user 5m35.196s
sys 0m3.214s
Almost 6 minutes to process one million lines. For reference, that “1MM.log” file is 246MB in size.
Here’s where I wound up as of about 5 minutes ago:
strudel:loghetti jonesy$ time ./loghetti.py --code=404 --count 1MM.log
Matching lines: 10096
real 3m53.350s
user 3m50.498s
sys 0m1.641s
Hey, looky, there! I even got the same result back. Nice!
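For anyone who wants to check the arithmetic, here is a quick calculation using the two "real" times above. It works out to roughly a third less wall-clock time, or about one and a half times as fast, which is in the neighborhood of the figure in the title. The helper name is just for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert the m/s format printed by `time` into plain seconds."""
    return minutes * 60 + seconds


before = to_seconds(5, 52.103)  # "real" time from the first run
after = to_seconds(3, 53.350)   # "real" time from the second run

time_saved = (before - after) / before  # fraction of wall-clock time removed
speedup = before / after                # how many times faster the new run is

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```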
Ok, so it’s not what you’d call a ‘speed demon’, especially on larger log files. But testing with a 25MB log with 100k lines in it using the same arguments took 25 seconds, and at that point it’s at least usable. I’m actually going to be using it to do offline processing and reporting, and it’ll be on a machine larger than my first-generation Intel MacBook Pro. For that type of thing this works just fine, and it’s easier to run this than to sit around thinking about regular expressions and shell scripts all day.
I’m still not pleased with the performance - especially for simple cases like the one I tested with. I just ran a quick ‘grep | wc -l’ on the file to get the same exact results and it worked in about one half of one second! Sure, I don’t mind trading off *some* performance for the flexibility this gives me, but I still think it can be better.
For now, though, I think I might rather support s’more features, like supporting a comparison operator other than “=”, or specifying ranges of dates and times.
Loghetti Beta - An Apache Log Filter
I’m thinking about just making this an open source project hosted someplace like Google Code or something, because there are folks much smarter than myself who can probably do wonders with the code I’ve put together here. Loghetti takes an Apache combined format access log and a few options as arguments, throws your log lines through a strainer, and leaves you with the bits you actually *want* (kinda like spaghetti, but for logs).
It’s written in Python, and the two dependencies it has are included in the tarball at the bottom. The dependencies are an altered version of Kevin Scott’s apachelogs.py file (I’ve added more granular log line parsing), and Doug Hellmann’s CommandLineApp, which really made creating a CLI application a breeze, since it handles things like autogenerating options, help output, etc automatically without me having to mess with optparse.
So right now, I use it for offline reporting on what’s in my log files, and it’s great for that. I can run, for example:
./loghetti.py --code=500 access.log
And get a listing of the log lines that have an http response code of 500. You can get fancier of course:
./loghetti.py --ip=192.168.1.2 --urldata=foo:bar --month=1 --day=31 --hour=16 access.log
And that’ll return lines where the client IP is 192.168.1.2, with the date specified using the date-related options. The “--urldata” option allows you to filter log lines on the query string part of the URL. So, in the above case, it’ll match if you have something like “&foo=bar” in the query string of the URL.
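As a rough illustration of what a query-string match like foo:bar involves, here is a small sketch. The `matches_urldata` helper is hypothetical and is not loghetti’s own code, and it uses `urllib.parse.parse_qs`, which is where Python 3 keeps the query-string parser.

```python
from urllib.parse import parse_qs, urlsplit


def matches_urldata(url, spec):
    """Return True if the query string in `url` contains the key and value
    named by a 'key:value' spec. Hypothetical helper for illustration."""
    key, _, wanted = spec.partition(":")
    params = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list, since a key can repeat in a URL.
    return wanted in params.get(key, [])


print(matches_urldata("/index.php?page=stories&foo=bar", "foo:bar"))
print(matches_urldata("/index.php?page=stories", "foo:bar"))
```

The first call matches and the second does not, since the second URL has no foo key at all.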
There are tons of features I’d like to support, but before I do, I feel compelled to address its performance on large log files. Once you throw this at a log file greater than about 50MB, it’s not a great real-time troubleshooting tool. I believe I’d be better off ripping some of the parsing out of apachelogs.py and making it conditional (for example, don’t bother parsing out all of that date information if the user hasn’t asked to filter on it).
Anyway, it’s still useful as it is, so let me know your thoughts on this, and if it’s something you have a use for or would like to help out with, I’ll set up a project for it. For now, you can Download Loghetti
What I learned about Python Today - eval()
I was writing some Python yesterday, and came across an issue that I thought was going to send me back to the drawing board.
I was using a module that, given an Apache access log, returns line objects with the fields of the line as attributes of the line object. It was certainly usable as-is, but I wanted more granular parsing of the fields, and if there were query string arguments (like “?foo=bar&page=stories”, etc), I wanted those broken up for easy access later as well.
I created a simple ruleset builder so I could pass arguments to my script and have them become rules that would filter the log and return the interesting bits according to the ruleset. So now we have two objects: the line object, which has attributes like line.ip, and a rule object, which has only three attributes: the attribute of the line you want, the value of that attribute you want to filter on, and a comparison operator, which right now only ever holds an “=”. It’s a work in progress.
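A rule builder along those lines might look something like the sketch below. The `Rule` class, its field names `op` and `value`, and the `build_rules` function are assumptions made for illustration; only `rule.attr` is a name taken from the post.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    attr: str   # attribute of the line object to inspect, e.g. "ip"
    op: str     # comparison operator; only "=" is supported here
    value: str  # value the attribute has to match


def build_rules(args):
    """Turn strings like 'ip=192.168.1.2' into Rule objects."""
    rules = []
    for arg in args:
        # partition splits on the first "=", so values may contain "=" too.
        attr, op, value = arg.partition("=")
        if not op:
            raise ValueError(f"expected attr=value, got {arg!r}")
        rules.append(Rule(attr, op, value))
    return rules


rules = build_rules(["ip=192.168.1.2", "code=404"])
print(rules)
```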
So this means that you can do something like this:
getattr(line, rule.attr)
And if you passed “ip=192.168.1.2” on the command line, then rule.attr will be “ip”, and the above will be treated as “line.ip”, and you’ll get the expected result.
This fails for anything on “line” that isn’t a simple attribute name - anything that has to be parsed as some kind of an expression. Like, say, references to keys and elements of dictionaries. I used cgi.parse_qs to parse my query string so I could access the different keys and values of the query string, and filter my logs using site-specific things like “zip=10016” or something. Of course, cgi.parse_qs returns a dictionary, which I called “urldata”. So, if you want to filter on “line.urldata['zip'][0]”, you should just find a way to assign that to rule.attr, right?
Wrong. getattr doesn’t work that way. It doesn’t look up the dictionary element, it just tags it onto the end of “line” and hopes for the best. It doesn’t evaluate expressions. If you wanted to get an element of a dictionary that is an attribute of “line” using getattr(), you’d have to do this:
getattr(line, rule.attr)['zip'][0]
Where “rule.attr” is just “urldata”.
Well that stinks for my purposes, because I don’t want a given type of argument passed in by the user to cause a special case in my code. I was thinking of alternative models to do all of this, but as usual, Doug had an answer right off the top of his head. His ability to do that sickens me at times. ;-P
The solution was to replace getattr(line, rule.attr) with eval(rule.attr, line.__dict__)
In this case, rule.attr = “urldata['zip'][0]”, but it’s not treated as “just a string”. In the call above, “line.__dict__” is the namespace that eval() uses to look up names while evaluating “urldata['zip'][0]”. The beauty of this is that as long as the names used in rule.attr are defined in line.__dict__, rule.attr can be any expression, and this one line of code will handle it.
That worked wonderfully.
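Here is a small side-by-side of the two approaches, as a sketch. The `line` object is a stand-in built with `SimpleNamespace`, and it uses `urllib.parse.parse_qs`, the Python 3 home of the query-string parser. The namespace is copied before it goes to `eval()`, because `eval()` adds a `__builtins__` entry to whatever dictionary it receives as globals.

```python
from types import SimpleNamespace
from urllib.parse import parse_qs

# Stand-in for a parsed log line; the attribute names mirror the post.
line = SimpleNamespace(ip="192.168.1.2",
                       urldata=parse_qs("zip=10016&page=stories"))

# A plain attribute name works fine with getattr.
print(getattr(line, "ip"))

# getattr treats the whole string as one attribute name, so this fails.
try:
    getattr(line, "urldata['zip'][0]")
except AttributeError as err:
    print("getattr failed:", err)

# eval treats the string as an expression and resolves the names in it
# against the mapping it is given.
print(eval("urldata['zip'][0]", dict(vars(line))))
print(eval("ip", dict(vars(line))))
```

Since `eval()` runs whatever expression it is handed, this fits a command-line tool where the person typing the rule is also the person running the code.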
Taking Virtualization to the Next Level: AppLogic
Every now and then, something comes around that is as useful as it is novel. Of course, the notion of virtualized systems isn’t new. In fact, systems running something like what we now call a “hypervisor” have existed for, literally, decades. But what about that next level of virtualization? What if you could not only run multiple system images on a single piece of hardware, but run an entire infrastructure — load balancers and switches and the like, in addition to your web servers and such — all in a virtual data center?
I’ve been doing that for a few months now. I’m using AppLogic.
I’m happy to say that it’s working quite well. I’m an infrastructure architect, and I work and consult for a few different clients. So far, only one has allowed me to actually migrate the infrastructure to an AppLogic grid, but successes there make me more confident in approaching this as a solution for others. Maybe someday I’ll be able to use AppLogic’s tools to be a completely remote consulting infrastructure architect who can design, lay out, demo, deploy, and support rather complex infrastructures without leaving the comfort of my favorite chair (shown at right, for effect).
It’s completely possible. In fact, the AppLogic deployment I’ve already done was done from that very chair. So let me get to some of the questions I’ve been asked about AppLogic:
What’s the big deal?
It’s more cost-effective to leverage some of AppLogic’s features than it is to buy those features in a hosted environment. If you already have your own colo or physical machine room, depending on what stage of growth you’re in, there may not be a big deal here - but there probably is.
AppLogic has redundancy built-in, for example. You can keep enough resources in reserve on your grid such that (without any other work on your part) the failure of any single component results in that component being immediately brought back up using those resources on another node within the grid. This isn’t something you have to configure - it’s inherent in how the system operates. This kind of redundancy can be costly to build and manage.
AppLogic also makes it more cost-effective to build more complex architectures than you can in a hosted environment. In a hosted environment, you pay some number of thousands of dollars for servers, and nothing but servers. Load balancers are hundreds of dollars more, more storage is extra, and you don’t have much control over the OS install, and sometimes you’re forced to manage the systems through some weird interface. No Bueno(tm).
How long did it take?
I worked the equivalent of “full time” on a project to simultaneously rearchitect and move an infrastructure from a hosted VPS solution that was beginning to struggle, to a much more robust AppLogic-based solution in less than 60 days. This includes the time it took to get to know how the existing architecture and applications worked together, which was probably two or three weeks of that time. This 60-day period also included the Christmas and New Year’s holidays. Actual working days to complete the deployment and get into production? 35 days. Note, also, that I was the sole designer and engineer on the project. I did the design work, and I also executed everything. This also includes the time it took to learn AppLogic.
Is it as easy as they say?
No. It’s not. However, I have to say that the support I’ve received so far has been really top notch, and they deal with learning issues in addition to problems with systems and software. So if I went to AppLogic and told them that I wanted to do “x” but couldn’t find anything about it in the docs (and I’m more than happy to read them if I missed them), they either pointed me to the proper document, or told me how to do it, or told me how AppLogic works so I could figure out my own compromise if need be. They’ve been pretty open about telling me roughly how AppLogic works under the covers, and it has, in some instances, helped me get work done.
What was the hardest part to get used to?
Application reboots — but you can architect your way out of that compromise, thankfully. You see, when you purchase access to a grid that is managed by AppLogic, you get a graphical interface that kinda reminds me of Visio to lay out your architecture. You do this within the context of an “application”. So one application can contain a couple of web servers, a caching server or two, a couple of databases, a monitoring server, a management bastion host of some kind, a load balancer, the works. It’s great, because when you’re done setting it all up, you can clone the entire application to a grid running at a completely different provider to create a completely redundant site in relatively little time. However, the downside to architecting like this is that if you need to make a change that affects how AppLogic initializes the component in question - like you need to allocate more disk or ram to one of the web servers - the *entire application* - all of those components I mentioned a second ago - needs to be rebooted. And it can take upwards of 10 minutes to do this. This is bad.
However, there’s nothing stopping you from setting up multiple applications on the same grid, putting two web servers in one, two in another, and creating a third application to load balance between them. Then if you need to do something that requires an application restart, you’ve minimized the pain, and the load balancer helps to ensure that your users don’t incur any downtime.
Do you have to use a GUI?
Nope. Not to manage a deployed grid, anyway. I don’t know why you *wouldn’t* use the GUI to lay everything out and get a rough architecture into test mode quickly, but I guess technically you don’t have to use a GUI for that either. I have personally not used the GUI since the grid was in production. I use the standard UNIX-guy key-based ssh tunnels for everything. If I need to make a change to a component, I do it using a key-based ssh tunnel to the “grid controller” which is assigned to you when you get your grid.
Do I have to run <some bad distro here>?
You can technically run whatever you want if you want to set it up manually. I don’t know of any restrictions as to Linux distribution. There are some special instructions regarding building custom kernels so that they work with the virtualization at the kernel level that AppLogic uses, but that’s all I’ve run into so far. They also provide some stock “template” components that can save you some headaches getting started. If you want a CentOS 5 MySQL server, it’s already there, and it’s not laden with AppLogic nonsense. It’s all stock as supplied by CentOS. If you want a blank-slate install so you can add only what you need when you need it, that’s already there as well. What I’ve found so far is that the AppLogic builds do a much better job of creating a functional, but minimal, installation than I tend to do myself.
Our VPS uses all kinds of software on our OS and it eats up resources. How’s AppLogic in that regard?
Tell me about it. I was looking at a server that ran like three mail servers, 3 different server management software packages (virtuozzo, cpanel, and one other one), several web-based apps that weren’t even being used, tons of monitoring stuff, and… enough stuff to bog down a “blazing fast” server before any real “stuff” was running on it.
In contrast, AppLogic’s presence on your servers is really not noticeable at all. There is a stock monitoring device that they supply, and it talks to a daemon that runs on every server by default (I think - I haven’t used *every* stock catalog component they supply - yet). I don’t use the monitoring device, so I was able to kill off that daemon (I got confirmation that that was ok to do, by the way). The monitoring device wasn’t particularly useful. It was kind of just pretty pictures showing activity for the last 3 minutes and that was basically it. I need traps, alerts, and some history to help with capacity planning, so I’m rolling my own in that department. You can, of course, install whatever the heck you want on the systems, by the way — I coded a MySQL backup routine in Python (I’ll release that code at some point if anyone cares), installed snmptrapd and stuff on all of the servers, and there’s more to come as well.
What was the biggest problem you had?
I had set up redundant Apache servers with KeepAlives set to “on”, not realizing that one of the load balancers (which I’d treated as a black box up until this point) was running either a version or configuration of the “pound” daemon that didn’t like that at all. I was getting crazy log messages, timeouts, 500 errors, the works. I got on the phone with support and they got the right people involved in just a couple of minutes. It turned out that I fixed the issue myself by just turning off KeepAlives in Apache, but it’s nice to know they’re there.
There’s some debate about using KeepAlives when you’re using a load balancer or in high-performance scenarios in general — that much I knew. I also knew that some reverse proxies puked on KeepAlives. So I turned off KeepAlives, and the problem magically disappeared immediately.
Would you do it again?
Yeah, if the circumstances were right, I definitely would. I have one government client that I don’t think could ever get an approval to use something like this, and I have a smaller client who might consider it overkill, but for a privately held small-to-midsize company looking for a cost-effective way to get some benefits like redundancy and easier scalability into their architecture, this is good stuff.