This week we welcome Ryan Mitchell (@Kludgist) as our PyDev of the Week. Ryan is the author of Web Scraping with Python and Instant Web Scraping with Java. Let’s spend some time getting to know Ryan better.
Can you tell us a little about yourself (hobbies, education, etc.)?
I’m a graduate of Olin College of Engineering, getting my master’s in software engineering from Harvard School of Extension Studies next year (I’ve been working on it half time for the past three years). With a bachelor’s degree in “general engineering,” I bounced around a bit between UX and application design, entrepreneurship, IT, bioengineering, software architecture design, and programming. I think I’ve pretty firmly settled on software engineering at this point, though! I’m currently an SE at a startup in South Boston called LinkeDrive — we get big data from long haul truck engines. It’s pretty awesome.
As far as hobbies go, I’m a volunteer at the Boston Museum of Science in the Tech Studio department, every Sunday afternoon. I love teaching, and it’s a great way to get engaged in the community, while getting to see all the IMAX and planetarium shows you want! I’m also a drag queen and a member of the Sisters of Perpetual Indulgence (a non-profit group of “drag nuns” that raises money mostly for local causes), but that’s a longer story!
Why did you start using Python?
Truthfully, it was because it was required for a college course I was taking at the time! I had internships at Sun (RIP) and Microsoft in high school, so there obviously wasn’t a lot of Python exposure there. During college, I was so busy with engineering courses, I didn’t have a lot of time to pick up anything that wasn’t required. But once I had an opportunity to learn it, I really fell in love with the language, and it pretty quickly became my go-to for non-web projects.
What other programming languages do you know and which is your favorite?
I’ve dabbled in just about every modern programming language over the past 12 years (I even learned FORTRAN 95 as a joke to do a project in, after the professor said we could write it in any language; he changed the policy after that!). I started, like many did, with BASIC, and moved to C, Java, C#, Perl, and then started doing websites with PHP and JavaScript (but we’ve all made mistakes in our lives…). In college, I dabbled in a lot of academic languages, like MATLAB/C code, and, of course, Python.
I try not to play favorites with languages, but I mostly do Java for my day job, and I’m a fan of that language. If I’m working on a machine learning project that involves a lot of math, I’ll probably use Python. If there’s a lot of complex business logic, I’ll probably use Java, but that’s more a bias from the accepted conventions and popular libraries for each language than a fundamental property of the languages themselves.
What projects are you working on now?
Lots of web scraping to promote and support the book, getting ideas for a second edition, writing blog posts, putting together a video series for O’Reilly (filming in October, will probably be released shortly afterwards), in addition to my day job, of course.
I’m also working on a super secret project on the side, that I likely won’t release until next DEF CON (or earlier, if the talk isn’t accepted, but fingers crossed!)
Which Python libraries are your favorite (core or 3rd party)?
Favorite core library: urllib — I write web scrapers, so I have to say that!
3rd party: BeautifulSoup is an obvious choice here. I know it has a lot of competition with all the HTML parsers out there, and of course, Python’s core HTMLParser, but I’ve found BeautifulSoup extremely fast, lightweight, flexible, and easy to use compared to some of the other libraries. Definitely my go-to for HTML parsing.
I’ve been in love with the Python Imaging Library lately. I use it a lot to automate captcha solving and even to do random batch image processing tasks (resizing a folder of images, for example) that would be a pain to do by hand.
Of course, there’s also SciPy, NumPy, and NLTK. My only complaint there is that I don’t know nearly enough machine learning to use them to their fullest extent. However, they’re surprisingly easy to get up and running with, even if you only have a relatively trivial task for them, so I’d really recommend checking one or all of them out if you haven’t yet! I have a pretty hardcore machine learning class starting in September, though, so I’m pretty excited about that and about getting to know these libraries a little better.
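For readers who have not tried the core HTMLParser mentioned above, here is a minimal sketch of a link extractor built only on the standard library. It is an illustration, not code from Ryan or from the book: the names `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are made up for this example, and the HTML is an inline string so the snippet runs offline. In a real scraper the markup would more likely come from `urllib.request.urlopen`, and a library such as BeautifulSoup would usually handle messy pages more comfortably.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and passes
        # attributes as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all anchor tags in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample markup, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/blog">Blog</a>
  <p>Some text with <a href="https://example.com/page">a link</a>.</p>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

Subclassing `HTMLParser` keeps the example dependency-free, at the cost of writing the tag handling by hand.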
What made you decide to write a book about Python?
Well, the book’s not really about Python, it’s about web scraping! I actually wrote a smaller book, with Packt Publishing, a couple years ago — Instant Web Scraping with Java. Web scraping is a subject I love a lot, and that I love teaching. Packt was the one who suggested that I write the book in Java. At that time, I hadn’t used Java in a few years, so that language certainly wouldn’t have been my first choice. I was writing web scrapers in Python at the time for work, so, honestly, there was a lot of research that I needed to do for that.
I approached O’Reilly last year about writing a lengthier web scraping book, and basically told them I could write it in either Python or Java. They, of course, said Python. Ironically, I had switched jobs by then, so I was writing Java during the day. I’ve never written a book in the same language I was using for my day job at the time, which makes it really confusing when you’re deciding whether or not you want to add a semicolon at the end of the line!
Is there anything else you’d like to say?
I’m honored to be PyDev of the week! Anyone can feel free to reach out to me @kludgist. I love having Twitter discussions about Python and web scraping. Also, I have a fledgling blog going at http://pythonscraping.com/blog where I write about random web scraping/Python thoughts, and welcome feedback for the Web Scraping with Python second edition! Thanks!
Thank You!