
Caktus Consulting Group: Wagtail: 2 Steps for Adding Pages Outside of the CMS


My first Caktus project went live late in the summer of 2015. It's a community portal for users of an SMS-based product called RapidPro. The portal was built in the Wagtail CMS framework which has a lovely, intuitive admin interface and excellent documentation for developers and content editors. The code for our Wagtail-based project is all open sourced on GitHub.

For this community portal, we needed to allow users to create blog pages on our front-facing site without giving those same users any level of access to the actual CMS. We also didn't want outside users to have to learn a new CMS just to submit content.

We wanted a simple, one-stop form that guided users through entering their content and thanked them for submitting. After these outside users requested pages be published on the site, CMS content editors could then view, edit, and publish the pages through the Wagtail CMS.

Here's how we accomplished this in two steps. Karen Tracey and I both worked on this project and a lot of this code was guided by her Django wisdom.

Step 1: Use the RoutablePageMixin for our form page and thank you page

Now for a little background information on Wagtail. The Wagtail CMS framework allows you to create a model for each type of page on your site. For example, you might have one model for a blog page and another model for a blog index page that lists out your blog pages and allows you to search through blog pages. Each page model automatically connects to one template, based on a naming convention. For example, if your model is called BlogIndexPage, you would need to also have a template called blog_index_page.html, so that Wagtail knows how to find the related template. You don't have to write any views to use Wagtail out of the box.
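To make the naming convention concrete, here is a minimal sketch (not the project's actual model; the body field is just for illustration) of a page model that Wagtail would pair with a template named blog_page.html:

from wagtail.wagtailcore.models import Page
from wagtail.wagtailcore.fields import RichTextField

class BlogPage(Page):
    # Wagtail derives the template name blog_page.html from the class name
    body = RichTextField(blank=True)  # hypothetical field for illustration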

However, in our case, we wanted users to submit a BlogPage entry which would be a child of a BlogIndexPage. Therefore, we wanted our BlogIndexPage model to route to itself, to a submission page, and to a "thank you" page.

RapidPro blog workflow

This is where Wagtail's RoutablePageMixin came into play. Here's the relevant code from our BlogIndexPage model that routes the user from the list page to the submission page, then to the thank you page.

In models.py:

from django.template.response import TemplateResponse

from wagtail.wagtailcore.models import Page
from wagtail.wagtailcore.fields import RichTextField
from wagtail.contrib.wagtailroutablepage.models import RoutablePageMixin, route


class BlogIndexPage(RoutablePageMixin, Page):
    intro = RichTextField(blank=True)
    submit_info = RichTextField(blank=True)
    thanks_info = RichTextField(blank=True)

    @route(r'^$')
    def base(self, request):
        return TemplateResponse(
            request,
            self.get_template(request),
            self.get_context(request)
        )

    @route(r'^submit-blog/$')
    def submit(self, request):
        from .views import submit_blog
        return submit_blog(request, self)

    @route(r'^submit-thank-you/$')
    def thanks(self, request):
        return TemplateResponse(
            request,
            'portal_pages/thank_you.html',
            {"thanks_info": self.thanks_info}
        )

The base() method points us to the blog index page itself. Once we added the RoutablePageMixin, we had to explicitly define this method to pass the request, template, and context to the related template. If we weren't using this mixin, Wagtail would just route to the correct template based on the naming convention I described earlier.

The submit() method routes to our blog submission view. We decided to use the URL string "submit-blog/" but we could have called it anything. We have a view method submit_blog() defined in our views.py file that does the work of actually adding the page to the CMS.

The thanks() method routes to the thank you page (thank_you.html) and passes in content editable via the CMS in the variable thanks_info as defined in the BlogIndexPage model.

Step 2: Creating the form and view method to save the user-generated information

Here's the slightly trickier part, because we didn't find any documentation on adding pages to Wagtail programmatically. We found some of this code by digging deeper through the Wagtail repo, and the test files were especially helpful. Here are the relevant parts of our code.

In forms.py, we added a Django ModelForm.

class BlogForm(forms.ModelForm):
    class Meta:
        model = BlogPage
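Depending on the Django version in use, a ModelForm may also require an explicit fields (or exclude) declaration in Meta. A hedged sketch, with a hypothetical import path and field list:

from django import forms
from .models import BlogPage   # import path assumed

class BlogForm(forms.ModelForm):
    class Meta:
        model = BlogPage
        fields = ['title', 'body']   # hypothetical list of user-editable fields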

In views.py, we created a view method called submit_blog() that does a number of things.

  1. Imports the BlogForm form into the context of the page.
  2. Upon submission/post, saves the BlogForm with commit=False, so that it is not saved to the database, yet.
  3. Creates a slug based on the title the user entered with slugify(). This would normally be auto-generated and editable in the Wagtail CMS.
  4. Adds the unsaved BlogPage as a child to the BlogIndexPage (we passed in the reference to the index page in our routable submit() view method).
  5. Saves the page with the unpublish() command which both saves the uncommitted data to our CMS and marks it as a Draft for review.
  6. Saves the revision of the page so that we can later notify the Wagtail admins that a new page is waiting for their review with save_revision(submitted_for_moderation=True)
  7. Finally, this sends out email notifications to all the Wagtail admins with send_notification(blog.get_latest_revision().id, 'submitted', None). The None parameter in this function means do not exclude any Wagtail moderators.
def submit_blog(request, blog_index):
    form = BlogForm(data=request.POST or None, label_suffix='')
    if request.method == 'POST' and form.is_valid():
        blog_page = form.save(commit=False)
        blog_page.slug = slugify(blog_page.title)
        blog = blog_index.add_child(instance=blog_page)
        if blog:
            blog.unpublish()
            # Submit page for moderation. This requires first saving a revision.
            blog.save_revision(submitted_for_moderation=True)
            # Then send the notification to all Wagtail moderators.
            send_notification(blog.get_latest_revision().id, 'submitted', None)
        return HttpResponseRedirect(blog_index.url + blog_index.reverse_subpage('thanks'))
    context = {
        'form': form,
        'blog_index': blog_index,
    }
    return render(request, 'portal_pages/blog_page_add.html', context)

Final Thoughts and Some Screenshots

Blog submission page

Front-end website for user submission of blog content.

Wagtail is very straightforward to use; we plan to use it on future projects. If you want to get started with Wagtail, the documentation is very thorough and well written. I also highly recommend downloading the open sourced demo site and getting that rolling in order to see how it's all hooked together.


Mike Driscoll: PyDev of the Week: Mike Bayer


This week we welcome Mike Bayer (@zzzeek) as our PyDev of the Week. Mike is the creator of the popular SQLAlchemy project. He has a fun Python blog and contributes to many Python projects. I’ve seen Mike present tutorials on SQLAlchemy at PyCon and he regularly does talks there as well (here’s an example from 2014). Let’s spend a few moments getting to know him better!

Can you tell us a little about yourself (hobbies, education, etc):

I grew up on Long Island, the suburbs east of New York City, during the 1970s and 80s. I had my first exposure to computers in 1980 at age 12, and like everyone else who got into “home computers” at that time I spent lots of time with the Basic programming language. Eventually I chose an Atari 800 as my platform in the mid-80s, and I even managed to do some rudimentary assembly language stuff with it. In high school I managed to get online in early pre-internet dialup forms like bulletin boards and Compuserve, and I later learned Pascal there as well. In college I majored in computer engineering for part of the time, where I learned more procedural programming and data structure theory as well as languages like Modula III and apparently a little bit of Lisp, based on looking at my old notes recently. But in my college years I really didn’t want to be involved with computers at all, so I eventually majored in music at Berklee College of Music, which I also dropped out of to just move to the city and be a drummer. In the city, my high typing speed and computer skills led me into office temp work doing word processing, which even at the lowest level paid far more than any starving musician could make, so to this day I hardly ever get to do drumming. In a lot of those jobs I ultimately ended up writing code to replace lots of the repetitive work they gave to temps, and by that time the “internet” suddenly wanted to hire everyone everywhere who could write just five lines of code, so it was a natural move into the enormous late-90’s internet bubble in NYC working for agencies.

Why did you start using Python?

I was doing lots of Perl at Major League Baseball, and after having spent many years doing OO work in Java I really longed for a scripting language that had good OO features. Python always seemed like it, but I couldn’t get past the whitespace thing. At MLB, we rolled out part of a CMS solution by giving access to a CVS client called WinCVS, and we needed to do some scripting for it; the scripting language was Python. It forced me to work with Python long enough (20 minutes) to realize the whitespace thing was great, and the rest was history.

What other programming languages do you know and which is your favorite?

Python is definitely my all-time favorite. I’ve also worked a lot with Java, mostly with earlier versions, and I’ve also spent many years with Perl and of course Javascript, as it’s unavoidable. Those are the ones I have multiple years of professional experience with. Outside of those I’ve done a fair amount of C once in a while, a very small amount of C++, as well as work with all the old languages in college and early jobs like Fortran, Rexx, Pascal, Modula III, Scheme, 6809/68000 assembly, etc., none of which I’d have any idea how to use today.

What projects are you working on now?

I have this super awesome job at Red Hat, which is the first job I’ve ever had where the thing that we’re producing is not “a website”. I work with the Openstack product, and the challenge I’ve been facing is figuring out how to get Openstack, which depending on architecture can spin up literally hundreds of Python processes, to handle sometimes thousands of connections to a database, which in our case is typically a multi-master MySQL variant called Galera, in such a way that we aren’t wasting resources, aren’t running out of connections, and will never go down if various database nodes or services suddenly become unavailable. It all has to be packaged so that it works that way automatically when a customer installs Red Hat’s Openstack distribution for the first time.

As always, I’m also working on keeping SQLAlchemy moving forward, right now for the 1.1 release.  Architecturally, documentation-wise, and feature-wise.

Which Python libraries are your favorite (core or 3rd party)?

The libraries I’m using most and have made deep investments in include mock, argparse, py.test and sphinx. I have a deep appreciation for gevent as well. I also encourage people to use the curses library more; its style is necessarily a little old-timey, but rich and colorful console applications are awesome.


What inspired you to create SQLAlchemy?

I’ve always wanted to have an “ultimate programming platform”, and for years it seemed like I wanted to have a set of great libraries to use in Java, back when Java was new and there wasn’t a lot around for it, but Java lost its shine soon enough.  When I got into Python, it seemed a great place to finally build this platform, and getting basic web framework stuff going was my first goal, and then relational database access.   I had already written a dozen relational database facades before so by the time I decided to do SQLAlchemy I had a lot of ideas for what it should have.  There were patterns that I had worked out in various jobs that I wanted to present as a first class library, so that I would no longer have to keep writing database access layers at every new job.


Can you tell us what you’ve learned while working with the Python community to create packages like SQLAlchemy?

I’m amazed that even though I had something like ten years of on-the-job experience when I started SQLAlchemy, I was still completely unsophisticated in every way; despite having been a tech manager at some jobs, I still had no idea how to interact with the community, other open source developers, other open source projects, or users, or how to write real tests, how to write and maintain good APIs, anything. I had to figure it all out within my experience in the Python community, starting out with this terrible template engine called Myghty where I figured out all kinds of mistakes not to make again, and then with the first couple of years of SQLAlchemy versions where I also racked up a deep catalog of mistakes not to make again, both in the realm of the code being published as well as in interactions with others.

The wisdom that saved SQLAlchemy from the fallout of this learning curve was that I had always intended for it to take ten years by itself to be really rock solid, which is why it took that much time for me to feel OK calling it 1.0.  By not overselling it before it was appropriate and just waiting for it to mature very deeply and slowly and to gain a following in an organic way, the project didn’t become as much of a target of derision and hasty decision-making as it definitely would have if say, people were given “SQLAlchemy 0.3” mugs at conferences in 2007.  That version was outrageously bad.  Things are much better now.

Thanks so much for doing the interview!

Doug Hellmann: glob — Filename Pattern Matching — PyMOTW 3

Investing using Python: 15 years of forex and CFDs tick data to MongoDB using Python. Part Two

In the first part we downloaded all 20,479 (as of the time of writing) zipped data files. I didn't even know there were so many. It took almost an entire day at approx. 2-4 Mbps. Now we should process those 20 GB and try to load them into the database, where we'll be able to play with it […]
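As a rough illustration of that loading step (not the author's actual code; the database, collection and field names are made up), inserting parsed ticks into MongoDB with pymongo could look roughly like this:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
ticks = client.forex.eurusd               # hypothetical database and collection names

# each record would come from parsing one row of the unzipped tick data files
record = {'timestamp': '2015-01-05 00:00:01.341', 'bid': 1.19503, 'ask': 1.19522}
ticks.insert_one(record)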

Kushal Das: retask 1.0 is out


Retask is a super simple task queue written in Python. It uses Redis as the backend, and works with both Python 2 and Python 3. The last official release was 0.4, back in 2013. The code base is very stable and we only received a few queries about adding new features.

So, after more than two years, I have made a new release, 1.0, marking it as super stable for use in production. Currently it is being used in various projects within Fedora Infrastructure, and I have also heard about internal usage in different companies. I started writing this module because I was looking for something super simple to distribute jobs across different computers/processes.

You can install it using pip (updated rpm packages are coming soon).

$ pip install retask

Below is an example of queueing some data using Python dictionaries.

from retask import Task
from retask import Queue
queue = Queue('example')
info1 = {'user':'kushal', 'url':'http://kushaldas.in'}
info2 = {'user':'fedora planet', 'url':'http://fedoraplanet.org/'}
task1 = Task(info1)
task2 = Task(info2)
queue.connect()
queue.enqueue(task1)
queue.enqueue(task2)
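On the consuming side, a worker can pull those tasks back off the queue. Here is a minimal sketch based on how I understand retask's Queue/Task API (check the documentation for the exact details):

from retask import Queue

queue = Queue('example')
queue.connect()

while queue.length > 0:
    task = queue.dequeue()        # returns a Task object (or None if the queue is empty)
    if task is None:
        break
    print(task.data)              # the dictionary we enqueued earlier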

Go ahead, have a look into it. If you have any queries, feel free to ping me on IRC, or leave a comment here.

بايثون العربي: Getting to Know the Scapy Library


Welcome, everyone, to this very modest guide to the Scapy library. It is aimed at beginners who have had no contact with this library whatsoever, so there is no need to worry about the guide being difficult.
In this first part of the guide we will do our best to introduce the library in simple terms: who is behind it, its history, and its fields of use.
The primary audience of this guide is network and information security engineers, but that does not mean other readers are excluded; on the contrary, everyone is welcome.

Why Scapy?
Scapy is a powerful interactive library in its field, which is handling network packets. It can forge and decode packets for a wide range of different protocols, send them on the network or capture them, match requests with replies, and much more. It can handle most classic network tasks such as scanning, tracerouting, probing, attacks and network discovery, and it can do roughly 85% of what programs like nmap and tcpdump do, along with many different attack techniques.
The library can also carry out specific operations that most programs cannot, such as sending invalid packets, injecting 802.11 frames, and decoding VoIP on WEP-encrypted channels.
Sounds fun, doesn't it? That's why I decided to write a simple, beginner-friendly guide to this library, to serve as a reference for me and for visitors of the site who are interested in developing programs with this wonderful library.
Before we start, here are the main points we will cover in this guide:

* Installing the Scapy library (plus some other helper tools)
* Creating a packet
* Sending and receiving packets
* The basic Scapy commands
* Capturing packets passing on the network and analyzing them
* Assorted examples
I hope you are ready to start installing Scapy and creating some packets.

Installing Scapy

I will only cover how to install Scapy on Ubuntu. To see how to install it on other systems, please visit the following page:
http://www.secdev.org/projects/scapy/doc/installation.html

Note: if you are running the Kali Linux distribution, you will find this library already installed.
You may also want to install wireshark so that we can analyze the packets we create (this guide does not cover wireshark).

The basic requirements for running Scapy are:
* Python 2.5 installed
* The Scapy library downloaded and installed
* (Optional) some extra programs and helper tools installed to enable a few nice features
* Scapy run with root privileges
Installing Python 2.5
The official Scapy documentation recommends Python 2.5 for running Scapy 2.x. I work with Python 2.7 and have had no problems. To find out which Python version is installed on your system, run the following command:


pyarab@ubuntu:~$ python -V
Python 2.7.3

If your version is older than 2.5, run the following command to install Python:

pyarab@ubuntu:~$ sudo apt-get install python

Downloading and installing Scapy
Once you have Python, you need to get Scapy. There are two ways to do that, but I will describe the one I used:

pyarab@ubuntu:~$ sudo apt-get install python-scapy

Installing extra programs for special features (optional)
After installing Scapy you can verify that the installation succeeded with the following command:

pyarab@ubuntu:~$ sudo scapy
INFO: Can't import python gnuplot wrapper . Won't be able to plot.
INFO: Can't import PyX. Won't be able to use psdump() or pdfdump().
WARNING: No route found for IPv6 destination :: (no default route?)
Welcome to Scapy (2.2.0)
To exit out of Scapy, just type:
>>> exit()

You may notice that as soon as you start Scapy, some error messages appear about missing components. Scapy can do many nice things, such as drawing plots and 3D graphs, but we need some extra packages to get all of those features:

pyarab@ubuntu:~$ sudo apt-get install tcpdump graphviz imagemagick python-gnuplot python-crypto python-pyx

Running Scapy with root privileges
This is an easy step, and you have probably already done it:

~$ sudo scapy

And the result will be:

WARNING: No route found for IPv6 destination :: (no default route?)
Welcome to Scapy (2.2.0)
>>>

At this point we are done installing and setting up Scapy, and we will now move on to something more important. If you ran into any problems during this step, please let us know.

Creating a Packet

It's time to get serious, so let's play around a little and create our first packet. You may be asking: shouldn't we get to know Scapy first? You may be right, but I believe learning comes from doing and experimenting, so I will try to explain each example on its own, and whenever we come across something new we will talk about it.

So the first packet will be an ICMP packet carrying the famous "Hello World" message.

Note: the IP address used in this lesson belongs to my local network, so you should adjust it to suit your own network to make sure the packet reaches the right machine.


Welcome to Scapy (2.2.0)
>>> send(IP(dst="192.168.1.100")/ICMP()/"HelloWorld")
.
Sent 1 packets.
>>>

Now let's explain the previous command:

  1. send: here we tell Scapy that we are sending a single packet
  2. IP: the type of packet we want; in this case an IP packet
  3. (dst="192.168.1.100"): the destination address, which in our case is the router
  4. /ICMP(): creates an ICMP packet with Scapy's default values

We can see the packet we sent, and the reply to it, using tcpdump:


pyarab@ubuntu:~$ sudo tcpdump -i wlan0 -nnvvXSs 0 -c2 icmp

tcpdump: listening on wlan0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:02:21.822237 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto ICMP (1), length 50)
192.168.1.100 > 192.168.1.1: ICMP echo request, id 0, seq 0, length 30
0x0000: 4500 0032 0001 0000 4001 f714 c0a8 0164 E..2....@......d
0x0010: c0a8 0101 0800 5834 0000 0000 5468 6973 ......X4....This
0x0020: 2069 7320 616e 2049 434d 5020 7061 636b .is.an.ICMP.pack
0x0030: 6574 et
18:02:21.823074 IP (tos 0x0, ttl 254, id 49099, offset 0, flags [none], proto ICMP (1), length 50)
192.168.1.1 > 192.168.1.100: ICMP echo reply, id 0, seq 0, length 30
0x0000: 4500 0032 bfcb 0000 fe01 7949 c0a8 0101 E..2......yI....
0x0010: c0a8 0164 0000 6034 0000 0000 5468 6973 ...d..`4....This
0x0020: 2069 7320 616e 2049 434d 5020 7061 636b .is.an.ICMP.pack
0x0030: 6574 et
2 packets captured
2 packets received by filter
0 packets dropped by kernel

From the tcpdump output we can see that the packet was sent from the address 192.168.1.100 towards the address 192.168.1.1; it is an ICMP request containing the data 'Hello world'.

Sending and Receiving Packets

Now we will send and receive some packets using Scapy.

There are three main functions for this task:

  • sr(): sends packets and receives the answers that come back from the destination, while also keeping the unanswered packets; it works at layer 3.
  • sr1(): returns only the first packet that was answered and does not return the unanswered ones; it also works at layer 3.

Note: sr() and sr1() must be used at layer 3 of the OSI model (IP, ARP, etc.).

  • srp(): does the same job but at layer 2 of the OSI model (see the sketch just below).
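As an aside, a classic layer-2 use of srp() is a quick ARP scan of the local network. This is a generic sketch, not one of the original examples; adjust the subnet to your own LAN:

>>> ans, unans = srp(Ether(dst="ff:ff:ff:ff:ff:ff")/ARP(pdst="192.168.1.0/24"), timeout=2)
>>> for snd, rcv in ans:
...     print rcv.sprintf("%Ether.src% - %ARP.psrc%")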

An example of sr1()

We will send a TCP packet to port 80:

>>> h=sr1(IP(dst="192.168.1.100")/TCP(dport=80))
Begin emission:
.Finished to send 1 packets.
*
Received 2 packets, got 1 answers, remaining 0 packets
  1. 192.168.1.100 is the destination address
  2. a TCP packet aimed at port 80
  3. we received two packets back, one of which was an answer

Now let's find out what the reply packet contains:

>>> h.show()
###[ IP ]###
version= 4L
ihl= 5L
tos= 0x0
len= 44
id= 5055
flags=
frag= 0L
ttl= 254
proto= tcp
chksum= 0x2557
src= 192.168.1.1
dst= 192.168.1.100
\options\
###[ TCP ]###
sport= http
dport= ftp_data
seq= 4081483776L
ack= 1
dataofs= 6L
reserved= 0L
flags= SA
window= 2800
chksum= 0x9600
urgptr= 0
options= [('MSS', 1400)]
###[ Padding ]###
load= 'A\x10'


Calling show() on the variable h displays the structure of the packet we received in response to the one we sent. As you can see, it is an IP packet whose source and destination addresses are 192.168.1.1 and 192.168.1.100 respectively.

The remote host replied from port 80 (http), while our side received the reply on port 20 (ftp_data).

We can also see that the SYN (S) and ACK (A) flags are set in the reply; flags=SA means that port 80 is open.
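Building on that observation, here is a small sketch (not from the original article) of how one could check whether a port is open programmatically with sr1():

>>> p = sr1(IP(dst="192.168.1.100")/TCP(dport=80, flags="S"), timeout=2)
>>> if p and p.haslayer(TCP) and p[TCP].flags == 0x12:   # 0x12 is SYN+ACK
...     print "port 80 is open"
... else:
...     print "port 80 is closed, filtered, or the host did not answer"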

An example of sr()

We will send a TCP packet to port 80:

>>> ans,unans = sr(IP(dst="192.168.1.1")/TCP(dport=80))
Begin emission:
....Finished to send 1 packets.
*
Received 5 packets, got 1 answers, remaining 0 packets

Five packets were received, and we got one answer.

As we mentioned before, the sr() function returns both the packets that were answered and those that were not; we can store them in the two variables ans and unans.
Okay, let's take a look at the information the reply contains:

>>> ans
<Results: TCP:1 UDP:0 ICMP:0 Other:0>
>>> ans.summary()
IP / TCP 192.168.1.102:ftp_data > 192.168.1.1:http S ==> IP / TCP 192.168.1.1:http > 192.168.1.102:ftp_data SA / Padding

Typing the name of the first variable, ans, gives us the results for the answered packets; among other things, it tells us that the reply contains one TCP packet that was received. For more detail we use summary().

Note: the show() function does not work on the result of sr().

Setting a timeout for packets

The timeout is used to specify the number of seconds to wait for a reply. Imagine a scenario where we send a TCP packet to port 80 of a machine that is not connected to the network and we do not set a timeout; in that case we would just keep waiting. To avoid this situation we set a timeout.
Send a TCP packet to port 80 of an unreachable machine, first without specifying a timeout and then with timeout=10:

>>> ans,unans = sr(IP(dst="10.0.0.99")/TCP(dport=80))
Begin emission:
WARNING: Mac address to reach destination not found. Using broadcast.
Finished to send 1 packets.
...........................................................^C
Received 6 packets, got 0 answers, remaining 1 packets

>>> ans,unans = sr(IP(dst="10.0.0.99")/TCP(dport=80),timeout=10)
Begin emission:
.....WARNING: Mac address to reach destination not found. Using broadcast.
Finished to send 1 packets.
...
Received 8 packets, got 0 answers, remaining 1 packets

Using Scapy inside a Python program

We can also use the Scapy library inside Python programs, not just from the interactive shell, by importing the library in the program:


#! /usr/bin/python

from scapy.all import *

ans,unans = sr(IP(dst="10.0.0.1")/TCP(dport=80))
ans.summary()

That is the end of this guide; other lessons about this library will follow. I hope you found it useful.

Glyph Lefkowitz


Monads are simple to understand.

You can just think of them like a fleet of mysterious inverted pyramids ominously hovering over a landscape dotted with the tombs of ancient and terrible gods. Tombs from which they may awake at any moment if they are “evaluated”.

The IO loop is then the malevolent personification of the force of entropy, causing every action we take to push the universe further into the depths of uncontrolled chaos.

Simple!

Continuum Analytics News: Continuum Analytics CEO Travis Oliphant and CMO & VP of Product Michele Chambers to Speak at Gartner Business Intelligence & Analytics Summit

Posted Tuesday, February 16, 2016

AUSTIN, TX - February 16, 2016 - Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today announced Continuum Analytics CEO Travis Oliphant and CMO & VP of Product Michele Chambers will speak at the Gartner Business Intelligence & Analytics Summit, taking place March 14-16, 2016 in Grapevine, Texas.

"Open Data Science is simultaneously breathing new life into the analytics market, while giving data science teams the ability to peacefully co­exist with legacy environments," said Travis Oliphant. "From business analysts and data scientists to IT professionals, Open Data Science is helping to foster an all­inclusive environment that brings together team members and technologies that help organizations easily and effectively create powerful high impact data science and analytics solutions."

WHAT: 'Why Open Data Science Matters' - Open Data Science is eating the world. Why? Unlike proprietary vendors, Open Data Science is an inclusive movement that continuously incorporates innovation from open source tools of data science - data, analytics, & computation - to drive game-changing value.

WHO: Speakers include:

  • Travis Oliphant, Continuum Analytics CEO & Co-Founder
  • Michele Chambers, Continuum Analytics CMO & VP of Product

WHEN: Tuesday, March 15, 2016, 10:45 AM - 11:30 AM

WHERE: Gaylord Texan Hotel & Convention Center, 1501 Gaylord Trail, Grapevine, Texas, United States, 76051

Gartner Business Intelligence & Analytics Summit 2016 is the premier gathering of Business Intelligence (BI) and analytics leaders and the place to discover the latest research and transformative insight designed to help you drive maximum business value from your own BI and analytics programs. Continuum Analytics is a platinum sponsor and will be exhibiting at booth #421. If you are interested in joining the #AnacondaCrew and winning a FREE full conference pass to the Summit, click here to enter. If you are a customer, prospect, analyst or journalist and interested in meeting with Continuum Analytics at the event, please reach out to continuum@treblepr.com.

About Continuum Analytics

Anaconda is the modern open source analytics platform powered by Python. Continuum Analytics is the driving force behind Anaconda. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations.

To learn more about Continuum Analytics, visit www.continuum.io.

###

Media Contact

Treble

Aaron DeLucia

(512) 960-8222

continuum@treblepr.com


Continuum Analytics News: Continuum Analytics Announces Anaconda Enterprise Notebooks

Posted Tuesday, February 16, 2016

Continuum Analytics, Supporter of Jupyter, the Leading Open Source Notebook, Supercharges Notebooks for Enterprise Data Science Teams

AUSTIN, TX - February 16, 2016 - Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today announced several updates to the Anaconda platform that allow enterprise data science teams to collaborate easily. With the addition of Anaconda Enterprise Notebooks, enterprises now have the ability to tap the benefits of Jupyter (formerly known as IPython) Notebooks in a governed environment. The Anaconda platform includes AnacondaXL, which allows data science teams to integrate the power of Anaconda directly into Microsoft Excel and have the benefits of Anaconda at their fingertips.

"Modern Open Data Science is a team sport -­ it requires communication, collaboration and sharing. Without Enterprise Notebooks, your data science team is like a football team without a playbook," said Peter Wang, Continuum Analytics CTO and co­-founder. "Anaconda Enterprise Notebooks gives your data science team the plays to work together and with the greater open data science community."

Jupyter Notebooks have become a powerful way for users to share a data narrative with teams and the broader open data science community as lightweight applications. Notebooks encapsulate code, comments and visualization all in one place, but up until now have had limitations in the enterprise. With Anaconda Enterprise Notebooks, Continuum Analytics provides collaborative locking, version control, notebook differencing and searching needed to operate in the enterprise.

With the introduction of AnacondaXL, Anaconda can now easily integrate into the most widely adopted business intelligence tool on the planet - Microsoft Excel. AnacondaXL provides data scientists with access to popular Python packages including scikit-learn and pandas for machine learning, enabling predictive analytics and data transformations. This empowers businesses to leverage skills-ready business analysts who come out of university with Python skills. AnacondaXL brings powerful visualizations to Microsoft Excel with Bokeh, breaking through the standard Excel charting to embed rich, contextually relevant visualizations into any spreadsheet.

Anaconda Enterprise Notebooks are available immediately and AnacondaXL will be available in March 2016. AnacondaXL and Anaconda Enterprise Notebooks will be showcased at the Gartner Business Intelligence & Analytics Summit, March 14-16, 2016 in Grapevine, Texas.

About Continuum Analytics:

Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, architects, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more about Continuum Analytics, visit www.continuum.io.

###

Media Contact

Treble

Aaron DeLucia

(512) 960-8222

continuum@treblepr.com

End Point: img.bi, a secret encrypted image sharing service tool


After a fairly good experience with dnote installed on our own servers as an encrypted note-sharing service, my team decided that it would be nice to have a similar service for images.

We found a nice project called img.bi that is based on NodeJS, Python, Redis and a lot of client-side JavaScript.

The system is divided into two components: the HTML/JS frontend and a Python FastCGI API.

Unfortunately the documentation is still in a very early stage; it lacks a meaningful structure and a lot of needed information.

Here's an overview of the steps we followed to set up img.bi on our own server behind nginx.

First of all, we wanted to have as much as possible running as, and confined to, a regular user, which is always a good idea with such young and potentially vulnerable tools. We chose to use the imgbi user.

Then, since we wanted to keep the root user's environment (and system state) as clean as possible, we also decided to use pyenv. To be conservative, we chose the latest stable Python 2.7 release, 2.7.10.

git clone https://github.com/yyuu/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
exec $SHELL -l
pyenv install -l  | grep 2\\.7
pyenv install 2.7.10
pyenv global 2.7.10
pyenv version
which python
python --version

In order to use img.bi, we also needed NodeJS and following the same approach we chose to use nvm and install the latest NodeJS stable version:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.25.4/install.sh | bash
nvm install stable
nvm list
nvm use stable
nvm alias default stable
node --version

As a short note on the bad practice of blindly running:

curl -o- https://some_obscure_link_or_not | bash

We want to add that we do not endorse this practice, as it's dangerous and exposes your system to many security risks. On the other hand, it's true that cloning the source via Git and blindly compiling/installing it is not much safer, so it always comes down to how much you trust the peer review on the project you're about to use. And at least with an https URL you should be talking to the destination you intend, whereas an http URL is much more dangerous.

Furthermore, going through the entire Python and NodeJS installation as a regular user is beyond the scope of this post; the steps proposed here assume that you're doing everything as the regular user, except where specifically stated otherwise.

Anyway, after that we updated pip and then installed all the needed Python modules:

pip install --upgrade pip
pip install redis m2crypto web.py bcrypt pysha3 zbase62 pyutil flup

Then it's time to clone the actual img.bi code from the GitHub repo, install a few missing dependencies and then use the bower and npm .json files to add the desired packages:

git clone https://github.com/imgbi/img.bi.git
cd img.bi/
npm install -g bower grunt grunt-cli grunt-multiresize
npm install -g grunt-webfont --save-dev
npm install
bower install

We also faced an issue which made Grunt fail to start correctly: Grunt was complaining about an "undefined property" called "prototype". If you happen to have the same problem, just type:

cd node_modules/grunt-connect-proxy/node_modules/http-proxy
npm install eventemitter3@0.1.6
cd -

That will install the eventemitter3 NodeJS package locally to the grunt-connect-proxy module, so as to overcome the compatibility issue which in turn causes the error mentioned above.

Use your favourite editor to change the file config.json, which contains all the configuration needed for your setup. In particular, our host is not exposed on the I2P or Tor network, so we "visually" disabled those options.

# lines starting with "+" need to replace the ones starting with "-"
-  "name": "img.bi",
+  "name": "img.bi - End Point image sharing service",

-  "maxSize": "3145728",
+  "maxSize": "32145728",

-  "clearnet": "https://img.bi",
+  "clearnet": "https://imgbi.example",

-  "i2p": "http://imgbi.i2p",
+  "i2p": "http://NOTAVAILABLE.i2p",

-  "tor": "http://imgbifwwqoixh7te.onion",
+  "tor": "http://NOTAVAILABLE.onion",

Save and close the file. At this point you should be able to run "grunt" to build the project, but if it fails on the multiresize task, just run

grunt --force

to ignore the warnings.

That's about everything you need for the frontend part, so it's now time to take care of the API.

cd
git clone https://github.com/imgbi/img.bi-api.git
cd /home/imgbi/img.bi-api/

You now need to edit the two Python files which are the core of the API.

# edit code.py expired.py
-upload_dir = '/home/img.bi/img.bi-files'
+upload_dir = '/home/imgbi/img.bi-files'

Verify that you don't get any Python import errors (due to missing modules or anything else) by running the code.py file directly.

./code.py

If that works okay, just create a symlink in the build directory so that the files created by the API are available to the frontend:

ln -s /home/imgbi/img.bi-files /home/imgbi/img.bi/build/download

And then it's time to spawn the actual Python daemon:

spawn-fcgi -f /home/imgbi/img.bi-api/code.py -a 127.0.0.1 -p 1234

The expired.py file is used by a cronjob which periodically checks whether there is any image/content that should be removed because its time has expired. First let's call the script directly, and if there's no error, let's create the crontab:

python /home/imgbi/img.bi-api/expired.py

crontab -e

@reboot spawn-fcgi -f /home/imgbi/img.bi-api/code.py -a 127.0.0.1 -p 1234
30 4 * * * python /home/imgbi/img.bi-api/expired.py

It's now time to install nginx and Redis (if you haven't done so already), and then configure them. For Redis you can just follow the usual simple, basic installation and that will be just fine. The same is true for nginx, but we'll add our configuration/vhost file content here as an example (/etc/nginx/sites-enabled/imgbi.example.conf) for anyone who may need it:

upstream imgbi-fastcgi {
  server 127.0.0.1:1234;
}

server {
  listen 80;
  listen [::]:80;
  server_name imgbi.example;
  access_log /var/log/nginx/sites/imgbi.example/access.log;
  error_log /var/log/nginx/sites/imgbi.example/error.log;
  rewrite ^ https://imgbi.example/ permanent;
}

server {
  listen 443 ssl spdy;
  listen [::]:443 ssl spdy;
  server_name  imgbi.example;
  access_log /var/log/nginx/sites/imgbi.example/access.log;
  error_log /var/log/nginx/sites/imgbi.example/error.log;

  client_max_body_size 4G;

  include include/ssl-wildcard-example.inc;

  add_header Strict-Transport-Security max-age=31536000;
  add_header X-Frame-Options SAMEORIGIN;
  add_header X-Content-Type-Options nosniff;
  add_header X-XSS-Protection "1; mode=block";

  location / {
    root /home/imgbi/img.bi/build;
  }

  location /api {
    fastcgi_param QUERY_STRING $query_string;
    fastcgi_param REQUEST_METHOD $request_method;
    fastcgi_param CONTENT_TYPE $content_type;
    fastcgi_param CONTENT_LENGTH $content_length;

    fastcgi_param SCRIPT_NAME "";
    fastcgi_param PATH_INFO $uri;
    fastcgi_param REQUEST_URI $request_uri;
    fastcgi_param DOCUMENT_URI $document_uri;
    fastcgi_param DOCUMENT_ROOT $document_root;
    fastcgi_param SERVER_PROTOCOL $server_protocol;

    fastcgi_param GATEWAY_INTERFACE CGI/1.1;
    fastcgi_param SERVER_SOFTWARE nginx/$nginx_version;

    fastcgi_param REMOTE_ADDR $remote_addr;
    fastcgi_param REMOTE_PORT $remote_port;
    fastcgi_param SERVER_ADDR $server_addr;
    fastcgi_param SERVER_PORT $server_port;
    fastcgi_param SERVER_NAME $server_name;
    fastcgi_param HTTPS on;

    fastcgi_pass imgbi-fastcgi;
    fastcgi_keep_conn on;
  }
}

Well, that should be enough to get you started and at least have all the components in place. Enjoy your secure image sharing now.

End Point: Testing Django Applications


This post summarizes some observations and guidelines originating from introducing the pytest unit testing framework into our CMS (Content Management System) component of the Liquid Galaxy. Our Django-based CMS allows users to define scenes, presentations and assets (StreetView, Earth tours, panos, etc) to be displayed on the Liquid Galaxy.

The purpose of this blog post is to capture my Django and testing study points, summarize useful resource links as well as to itemize some guidelines for implementing tests for newcomers to the project. It also provides a comparison between Python's standard unittest library and the aforementioned pytest. Its focus is on Django database interaction.

Versions of software packages used


The experiments were done on Ubuntu Linux 14.04:

Testing Django Applications

We probably don't need to talk much about the importance of testing. Writing tests along with the application code has become standard over the years. Surely, developers may fall into the trap of their own prejudice when creating testing conditions, which can still result in faulty software, but the likelihood of buggy software is certainly higher for code that has no QA measures. If the code works and is untested, it works by accident, as they say. As a rule of thumb, unit tests should be very brief testing items that seldom interact with any external services such as the database. Integration tests, on the other hand, often communicate with external components.

This post will heavily reference an example minimal Django application written for the purpose of experimenting with Django testing. Its README file contains some setup and requirements notes. Also, I am not going to list (m)any code snippets here but rather reference the functional application and its test suite. Hence the points below are more or less assorted small topics and observations. In order to benefit from this post, it will be helpful to follow the README and interact with (that is, run the tests of) the demo django-testing application.

Basic Django unittest versus pytest basic examples

This pair of test modules shows the differences between Django TestCase (unittest) and pytest-django (pytest) frameworks.
  • test_unittest_style.py

    The base Django TestCase class derives along this tree:

        django.test.TestCase
            django.test.TransactionTestCase
                django.test.SimpleTestCase
                    unittest.TestCase
    
    Django adds (among other things) database handling on top of the Python standard unittest library; the documentation is here.
  • test_pytest_style.py

    This is a pytest-style implementation of the same tests; the pytest-django plug-in adds, among other features, Django database handling support.

The advantage of unittest is that it comes with the Python installation - it’s a standard library. That means that one does not have to install anything to write tests, unlike pytest, which is a third-party library and needs to be installed separately. While the absence of an additional installation is certainly a plus, it's dubious whether being part of the Python distribution is a benefit. I seem to recall Guido van Rossum saying during EuroPython 2010 that the best thing for pytest is not being part of the Python standard set of libraries, since its lively development and evolution would be slowed down by the inclusion.

There are very good talks and articles summarizing the advantages of pytest. For me personally, the reporting of error context is supreme. No boilerplate (no inheritance), plain Python asserts instead of many assert* methods, and flexibility (test functions as well as classes) are other big plus points.
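To illustrate the difference in flavour, here is a generic sketch (not taken from the demo application) of the same trivial check written both ways:

import unittest

class CalculatorTest(unittest.TestCase):
    # unittest style: a class deriving from TestCase and assert* methods
    def test_addition(self):
        self.assertEqual(2 + 2, 4)

# pytest style: a plain function and a plain Python assert
def test_addition():
    assert 2 + 2 == 4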

As the comment in the test_unittest_style.py file says, this particular unittest-based test module can be run by both Django manage.py (which boils down to unittest lookup discovery on a lower layer) or by py.test (pytest).

It should also be noted that pytest's flexibility can bite back if something gets overlooked.

Django database interaction unittest versus pytest (advanced examples)

  • test_unittest_advanced.py

    Since this post concentrates on pytest, and since it's the choice for our LG CMS project (naturally :-), this unittest example just shows how the test (fresh) database is determined and how Django migrations are run at each test suite execution. Just as described in the Django documentation: "If your tests rely on database access such as creating or querying models, be sure to create your test classes as subclasses of django.test.TestCase rather than unittest.TestCase." That is true for database interaction, but not completely true when using pytest. And: "Using unittest.TestCase avoids the cost of running each test in a transaction and flushing the database, but if your tests interact with the database their behavior will vary based on the order that the test runner executes them. This can lead to unit tests that pass when run in isolation but fail when run in a suite." django.test.TestCase, however, ensures that each test runs inside a transaction to provide isolation. The transaction is rolled back once the test case is over.

  • test_pytest_advanced.py

    This file represents the actual core of the test experiments for this blog / demo app and shows various pytest features and approaches typical for this framework as well as Django (pytest-django that is) specifics.

Django pytest notes (advanced example)

Much like the unittest documentation, pytest-django recommends avoiding database interaction in unit tests and concentrating only on the logic, which should be designed in such a fashion that it can be tested without a database.

  • The test database name is prefixed with "test_" (just like in the unittest example); the base value is taken from the database section of settings.py. As a matter of fact, it's possible to run the test suite after dropping the main database: the test suite interacts only with "test_" + DATABASE_NAME.
  • Migrations are executed before any database interaction is carried out (similarly to the unittest example).
  • Database interaction is marked by the Python decorator @pytest.mark.django_db at the method or class level (or on a stand-alone function); a minimal sketch of such a test case follows this list. It's in fact the first occurrence of this marker which triggers the database setup (its creation and the handling of migrations). Again, analogously to unittest (django.test.TestCase), each test case is wrapped in a database transaction which puts the database back into the state it was in prior to the test case. The "test_" + DATABASE_NAME database itself is dropped once the test suite run is over; it is not dropped if the --db-reuse option is used. The production DATABASE_NAME remains untouched during the test suite run (more about this below).
  • pytest_djangodb_only.py - setup_method - run this module separately and the data created in setup_method end up NOT in the "test_" + DATABASE_NAME database but in the standard one (as configured in settings.py, i.e. likely the production database)! Also, this data won't be rolled back. When run separately, this test module will pass (but the production database would still be tainted). It may or may not fail on the second and subsequent runs, depending on whether it creates any unique data. When run within the whole test suite, the database call from the setup_method will fail despite the presence of the class-level django_db marker. This has been very important to realize. Recommendation: do not include database interaction in the special pytest methods (such as setup_method or teardown_method, etc.); only include database interaction in the test case methods.
  • The error message "Failed: Database access not allowed, use the "django_db" mark to enable" was seen on a database error in a method which actually had the marker, so this output is not to be 100% trusted.
  • Data model factories are discussed separately below.
  • Lastly, the test module shows a Django test Client instance and a call to an HTTP resource.
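As promised above, here is a minimal sketch of a pytest-django test case using the django_db marker (it uses Django's built-in User model rather than the demo application's models):

import pytest
from django.contrib.auth.models import User

@pytest.mark.django_db
def test_create_user():
    # runs inside a transaction against the "test_" + DATABASE_NAME database
    User.objects.create_user(username="alice", password="secret")
    assert User.objects.filter(username="alice").count() == 1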

pytest setup_method

While the fundamental differences between unittest and pytest were discussed above, there is something to be said about the Django-specific differences between the two. The database-related behaviour of unittest's setUp method differs from that of pytest's setup_method. setUp is included in the transaction, and database interactions are rolled back once the test case is over; setup_method is not included in the transaction. Moreover, interacting with the database from setup_method results in faulty behaviour, which differs depending on whether the test module is run on its own or as part of the whole test suite.

The bottom line is: do not include database interaction in setup_method. This setUp versus setup_method behaviour was already shown in the basic examples, and more description and demonstration of it is in the file pytest_djangodb_only.py. This actually revealed the fact that using the django_db fixture is not supported in the special pytest methods and that the aforementioned error message is misleading (more references here and here).

When running the whole test suite, this file won't be collected (its name lacks the "test_" string); it needs to be renamed to be included in the test suite run.

JSON data fixtures versus factories (pytest advanced example)

The traditional way of interacting with some test data was to perform the following steps:
  • have data loaded in the database
  • python manage.py dumpdata
  • the produced JSON file is dragged along the application test code
  • call_command("loaddata", fixture_json_file_name) happens at each test suite run

  • The load is expensive, and the JSON dump file is hard to maintain manually if the originally dumped data and the current needs diverge (the file contains integer primary key values, etc.). Although even the recent Django testing documentation mentions the usage of JSON data fixtures, the approach is considered discouraged; the recommended way to achieve the goal is to load the data in migrations or to use model data factories.

This talk, for example, compares both approaches and comes out in favour of the factory_boy library. A quote from the article: "Factory Boy is a Python port of a popular Ruby project called Factory Girl. It provides a declarative syntax for how new instances should be created. ... Using fixtures for complex data structures in your tests is fraught with peril. They are hard to maintain and they make your tests slow. Creating model instances as they are needed is a cleaner way to write your tests which will make them faster and more maintainable."

The file test_pytest_advanced.py demonstrates interaction with the factories defined in the module factories.py, showing the basic, very easy-to-use features.

Despite its ease of use, factory_boy is a powerful library, capable of modeling Django's ORM many-to-many relationships, among other features.
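For illustration, a minimal sketch of such a factory (using Django's built-in User model rather than the demo application's factories.py):

import factory
from django.contrib.auth.models import User

class UserFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = User

    username = factory.Sequence(lambda n: "user%d" % n)
    email = factory.LazyAttribute(lambda obj: "%s@example.com" % obj.username)

# in a test case (with database access enabled):
#     user = UserFactory()          # creates and saves a User instance
#     user2 = UserFactory.build()   # builds an unsaved instance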

Additional useful links

Conclusion

You should now have a good idea about the differences between testing with unittest and with pytest in the Django environment. The emphasis has been put on pytest (pytest-django, that is) and some recommended approaches. The demo application django-testing provides functional test cases demonstrating the behaviour and features discussed. The articles and talks listed in this post were extremely helpful and instrumental in gaining expertise in the area and in introducing a rigorous testing approach into the production application.

Any discrepancy between the behaviour described above and what you see on your own setup may originate from different software versions. In any case, if anything is not clear enough, please let me know in the comments.

End Point: The Portal project - Jenkins Continuous Integration summary


This post describes some of our experiences at End Point in designing and working on comprehensive QA/CI facilities for a new system which is closely related to the Liquid Galaxy.

Due to the design of the system, the full deployment cycle can be rather lengthy and presents us with extra reasons for investing heavily in unit test development. Because of the very active ongoing development on the system we benefit greatly from running the tests in an automated fashion on the Jenkins CI (Continuous Integration) server.

Our Project's CI Anatomy

Our Jenkins CI service defines 10+ job types (a.k.a. Jenkins projects) that cover our system. These job types differ as far as source code branches are concerned, as well as by combinations of the types of target environments the project builds are executed on.

The skeleton of a Jenkins project is what one finds under the Configure section on the Jenkins service webpage. The source code repository and branch are defined here. Each of our Jenkins projects also fetches a few more source code repositories during the build pre-execution phase. The environment variables are defined in a flat text file:

Another configuration file is in the JSON format and defines variables for the test suite itself. Furthermore, we have a preparation phase bash script and then a second bash script which eventually executes the test suite. Factoring out all degrees of freedom into two pairs of externally managed (by Chef) concise files allows for pure and simple Jenkins job build definition:

It is quite possible to have all the variables and the content of the bash scripts laid out directly in the corresponding text fields of the Jenkins configuration. We used to have that. It's actually a terrible practice, and the above desire for purity comes from the tedious and clumsy experience that changing a variable (e.g. a URL or the like) in 10+ job types involves an unbearable amount of mouse clicking through the Jenkins service webpage. When performing some level of debugging of the CI environment (like when setting up the ROS stack which the project depends on), one is in for repetitive strain injury.

In essence, keeping knowledge about job types on the Jenkins server itself to a minimum and having it managed externally serves us well and is efficient. Another step forward would be managing everything (the entire job type definition) with Chef. We have yet to experiment with the already existing Chef community cookbooks for Jenkins.

The tests themselves are implemented in Python using the pytest unit testing framework. The test cases depend on Selenium - the web automation framework. Python drives the browser through Selenium according to testing scenarios, sometimes rather complex ones. The Selenium framework provides handles by which the browser is controlled - this includes entering user data, clicking buttons, etc.

We use Selenium in two modes:
• local mode: Selenium drives a browser running on the Jenkins CI machine itself, locally. The browser runs in the Xvfb environment. In this case everything runs on the Jenkins master machine.
• remote mode: the remote driver connects to a browser running on a remote machine (node A, B) and drives the browser there, as described in the diagram below. The test cases are run on a Jenkins slave machine located on a private network. The only difference between browser A and B is that they load their different respective Chrome extensions.

The usual unit testing assertions are made on the state or values of HTML elements in the web page.
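To give an idea of the shape of such a test case, here is a generic sketch (made-up URL and assertion, using a local Chrome driver rather than our actual remote setup):

import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By

@pytest.fixture
def browser():
    driver = webdriver.Chrome()   # local mode; the remote mode would use webdriver.Remote
    yield driver
    driver.quit()

def test_heading_is_shown(browser):
    browser.get("https://example.com/")                  # made-up URL
    heading = browser.find_element(By.TAG_NAME, "h1")
    assert "Example" in heading.text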

    Custom dashboard

    Our Jenkins server runs builds of 10+ various job types. The builds of each type are executed periodically and the builds are also triggered by git pushes as well as by git pull requests. As a result, we get a significant number of builds on daily basis.

    While Jenkins CI is extensible with very many plugins available out there, enabling and configuring a plugin gets cumbersome as the number of job types to configure rises. This is just to explain my personal aversion to experimenting with plugins on Jenkins for our project.

    The Jenkins service webpage itself does not offer a simple, concise, single-page view aggregated across a number of job types. Natively, there is just the per-job-type trends page at $JOB_URL/buildTimeTrend (see below).

    A view which immediately tells whether there is an infrastructure problem (such as loss of connectivity), or conveys straight away that everything passes on Jenkins, seems to be missing. Such a view or feature is even more important in an environment suffering from occasional transient issues. Basically, we wanted a combination of JENKINS/Dashboard+View and JENKINS/Project+Statistics+Plugin, yet a lot simpler (see below).
    So yes, we coded up our own wheel, circular just according to our liking and thus developed the jenkins-watcher application.

    jenkins-watcher

    The application is freely available from this repository, deploys on the Google App Engine platform and so utilizes certain platform features like Datastore, Cron jobs, TaskQueue and Access Control. A single configuration file contains mainly Jenkins CI server access credentials and job type names we are interested in. The above repository merely provides a template of this (secret) config file. AngularJS is used on the frontend and a smashing Jenkins API Python library is used to communicate from Python to the Jenkins CI server through its REST API. See below the result view it provides, the screenshot is cropped to show only 5 job types and their builds within the last 24 hours:

    Colour coding in green (passed), red (failed) and grey (aborted) shows the build status and is in fact just standard Jenkins colour coding. Each table row corresponds to one build and shows the build ID, the build timestamp (start of the build), the build duration, and the number of test cases which passed (P), failed (F), were skipped (S), or suffered from errors (E). The last item in the row is a direct link to the build console output, very handy for immediate inspection. In my experience, this is enough for a Jenkins babysitter’s swift daily checks. This is nothing fancy: no cool stats, graphs or plots. It is just a brief, useful overview.

    The application also performs periodic checks and aborts builds which take too long (yes, a Jenkins plugin with this functionality exists as well).
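
    As a rough illustration of such a periodic check, here is a minimal sketch using the jenkinsapi library; the server URL, credentials, job names and the two-hour threshold are assumptions made for the example, not the actual jenkins-watcher code:

    from datetime import datetime, timedelta

    from jenkinsapi.jenkins import Jenkins

    MAX_BUILD_AGE = timedelta(hours=2)                      # assumed threshold
    JOB_NAMES = ["portal-master", "portal-pull-requests"]   # hypothetical job types

    server = Jenkins("https://jenkins.example.com",
                     username="watcher", password="api-token")

    for job_name in JOB_NAMES:
        job = server[job_name]
        # in practice only the most recent builds would need checking
        for build_id in job.get_build_ids():
            build = job.get_build(build_id)
            if not build.is_running():
                continue
            # get_timestamp() returns a timezone-aware datetime of the build start
            started = build.get_timestamp()
            if datetime.now(started.tzinfo) - started > MAX_BUILD_AGE:
                build.stop()  # abort the stuck build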

    For example, at a glance it’s obvious that the following failed builds suffer from some kind of transient infrastructure problems: no tests were run, nothing failed, the builds were marked as failure since some command in either their prep or build scripts failed:

    Or let’s take a look at another situation proving how simple visualisation can sometimes be very useful and immediately hint-providing. We observed a test case, interestingly only on just one particular job type, which sometimes ended up with a “Connection refused” error between the Selenium driver and the web browser (in the remote mode):

    Only after seeing the failures visualized did the pattern strike us. We immediately got the idea that something was rotten in the state of Denmark shortly after midnight: from that point on, the previously mysterious issue boiled down to an erroneous cronjob command. The killall command was killing everything and not just what it was supposed to (bug filed here):

    killall --older-than 2h -r chromedriver

    Once we replaced the cronjob with a more complex but functional solution, this time without the killall command, so that running builds no longer had chromedriver pulled out from under them, the mysterious error disappeared.

    Summary, conclusion

    Jenkins CI proved very useful for our Portal project in general. Keeping its configuration minimal and managing it externally has worked most efficiently for us. The custom jenkins-watcher application provides a useful, aggregated, dashboard-like view. It is easily configurable and not in any way dependent on the underlying project - take it for free, configure it a bit, and push it as your own Google App Engine project. The visualisation can sometimes be a useful debugging tool in itself.

    Continuum Analytics News: Calling C Libraries from Numba Using CFFI

    Posted Tuesday, February 16, 2016

    This post was originally published as a Jupyter Notebook on nbviewer by Joshua Adelman and is reposted here with his permission.

    TL;DR - The python CFFI library provides an easy and efficient way to call C code from within a function jitted (just-in-time compiled) by Numba. This makes it simple to produce fast code with functionality that is not yet available directly in Numba. As a simple demonstration, I wrap several statistical functions from the Rmath library.

    Background and Motivation

    A large fraction of the code that I write has a performance requirement attached to it. Either I'm churning through a large amount of data in an analysis pipeline, or it is part of a real-time system and needs to complete a specific calculation in a constrained amount of time. Sometimes I can rely on numpy, pandas or existing Python packages that wrap C or Fortran code under the covers to get sufficient performance. Often times though, I'm dealing with algorithms that are difficult to implement efficiently using these tools.

    Since I started coding primarily in Python ~6 years ago, in those instances I'd typically reach for Cython, either to wrap something I or others wrote in C/C++/Fortran, or to provide sufficient type information to my code so that Cython could generate a performant C extension that I could call from Python. Although Cython has been a pretty rock solid solution for me, the amount of boilerplate often required and some of the strange semantics of mixing Python and low-level C code often feel less than ideal. I also collaborate with people who know Python, but don't have backgrounds in C and/or haven't had enough experience with Cython to understand how it all fits together.

    More and more frequently, I find myself using Numba in instances that I had traditionally used Cython. In short, through a simple decorator mechanism, Numba converts a subset of Python code into efficient machine code using LLVM. It uses type inference so you don't have to specify the type of every variable in a function like you do in Cython to generate fast code. This subset primarily deals with numerical code operating on scalars or Numpy arrays, but that covers 95% of the cases where I need efficient code so it does not feel that limiting. That said, the most common mistake I see people making with Numba is trying to use it as a general Python compiler and then being confused/disappointed when it doesn't speed up their code. The library has matured incredibly over the last 6-12 months to the point where at work we have it deployed in a couple of critical pieces of production code. When I first seriously prototyped it maybe a year and a half ago, it was super buggy and missing a number of key features (e.g. caching of jitted functions, memory management of numpy arrays, etc). But now it feels stable and I rarely run into problems, although I've written a very extensive unit test suite for every bit of code that it touches.

    One of the limitations that I do encounter semi-regularly though is when I need some specialized function that is available in Numpy or Scipy, but that function has not been re-implemented in the Numba core library so it can be called in the so-called "nopython" mode. Basically this means that if you want to call one of these functions, you have to go through Numba's object mode, which typically cannot generate nearly as efficient code.

    While there is a proposal under development that should allow external libraries to define an interface that makes them usable in nopython mode, it is not complete and will then require adoption within the larger Scipy/PyData communities. I'm looking forward to that day, but currently you have to choose a different option. The first is to re-implement a function yourself using Numba. This is often possible for functionality that is small and limited in scope, but for anything non-trivial this approach can rapidly become untenable.
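
    As a hypothetical illustration of that first option (not taken from the original notebook), re-implementing a small numerical helper in nopython mode might look like this:

    import numba as nb
    import numpy as np

    @nb.jit(nopython=True)
    def logsumexp(x):
        """Numerically stable log(sum(exp(x))) for a 1d array, usable in nopython mode."""
        m = x[0]
        for i in range(x.shape[0]):
            if x[i] > m:
                m = x[i]
        total = 0.0
        for i in range(x.shape[0]):
            total += np.exp(x[i] - m)
        return m + np.log(total)

    print(logsumexp(np.array([0.1, 1.5, -2.0])))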

    In the remainder of this notebook, I'm going to describe a second technique that involves using CFFI to call external C code directly from within Numba jitted code. This turns out to be a really great solution if the functionality you want has already been written either in C or a language with a C interface. It is mentioned in the Numba docs, but there aren't any examples that I have seen, and looking at the tests only helped a little.

    I had not used CFFI before integrating it with Numba for a recent project. I had largely overlooked it for two reasons: (1) Cython covered the basic usecase of exposing external C code to python and I was already very comfortable with Cython, and (2) I had the (incorrect) impression that CFFI was mostly useful in the PyPy ecosystem. Since PyPy is a non-starter for all of my projects, I largely just ignored its existence. I'm thankfully correcting that mistake now.

    Rmath. It's not just for R

    Every once in a while I fire up R, usually through rpy2, to do something that I can't do using Statsmodel or Scikit-Learn. But for the most part I live squarely in the Python world, and my experience with R is rudimentary. So it wasn't totally surprising that I only recently discovered that the math library that underpins R, Rmath, can be built in a standalone mode without invoking R at all. In fact, the Julia programming language uses Rmath for its probability distributions library and maintains a fork of the package called Rmath-julia.

    Discovering Rmath over the summer led to the following tweet (apologies for the Jupyter Notebook input cell) and a horrific amalgamation of code that worked, but was pretty difficult to maintain and extend:

    display(HTML(''''''))

    As I began to introduce more and more Numba into various code bases at work, I recently decided to revisit this particular bit and see if I could re-implement the whole thing using Numba + CFFI + Rmath. This would cut out the C code that I wrote, the Cython wrapper that involved a bunch of boilerplate strewn across multiple .pyx and .pxd files, and hopefully would make the code easier to extend in the future by people who didn't know C or Cython, but could write some Python and apply the appropriate Numba jit decorator.

    So to begin with, I vendorized the whole Rmath-julia library into our project under externals/Rmath-julia. I'll do the same here in this example. Now the fun begins...

    Building the Rmath library using CFFI

    Since we are going to use what cffi calls the "API-level, out-of-line" mode, we need to define a build script (build_rmath.py) that we will use to compile the Rmath source and produce an importable extension module. The notebook "cell magic" %%file will write the contents of the cell below to a file.

    %%file build_rmath.py
    
    import glob
    import os
    import platform
    
    from cffi import FFI
    
    
    include_dirs = [os.path.join('externals', 'Rmath-julia', 'src'),
                    os.path.join('externals', 'Rmath-julia', 'include')]
    
    rmath_src = glob.glob(os.path.join('externals', 'Rmath-julia', 'src', '*.c'))
    
    # Take out dSFMT dependant files; Just use the basic rng
    rmath_src = [f for f in rmath_src if ('librandom.c' not in f) and ('randmtzig.c' not in f)]
    
    extra_compile_args = ['-DMATHLIB_STANDALONE']
    if platform.system() == 'Windows':
        extra_compile_args.append('-std=c99')
    
    ffi = FFI()
    ffi.set_source('_rmath_ffi', '#include <Rmath.h>',
            include_dirs=include_dirs,
            sources=rmath_src,
            libraries=[],
            extra_compile_args=extra_compile_args)
    
    # This is an incomplete list of the available functions in Rmath
    # but these are sufficient for our example purposes and gives a sense of
    # the types of functions we can get
    ffi.cdef('''\
    // Normal Distribution
    double dnorm(double, double, double, int);
    double pnorm(double, double, double, int, int);
    
    // Uniform Distribution
    double dunif(double, double, double, int);
    double punif(double, double, double, int, int);
    
    // Gamma Distribution
    double dgamma(double, double, double, int);
    double pgamma(double, double, double, int, int);
    ''')
    
    if __name__ == '__main__':
        # Normally set verbose to `True`, but silence output
        # for reduced notebook noise
        ffi.compile(verbose=False)
    Overwriting build_rmath.py  

    Then we simply run the script as below assuming we have a properly configured C compiler on our system. For larger projects integration with setuptools is supported. The exclamation point tells the notebook to execute the following command in a system shell.

    !python build_rmath.py  

    We should now have an extension module named _rmath_ffi that gives us access to the functions whose prototypes we enumerated in the ffi.cdef(...).
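
    As a side note on the setuptools integration mentioned above, a minimal sketch of a setup.py for this kind of out-of-line cffi module might look like the following (the package name is made up for the example):

    from setuptools import setup

    setup(
        name='rmath-wrappers',                   # hypothetical package name
        setup_requires=['cffi>=1.0.0'],
        cffi_modules=['build_rmath.py:ffi'],     # points at the FFI instance named `ffi`
        install_requires=['cffi>=1.0.0'],
    )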

    An example of replicating scipy.stats with Numba

    Now that we have built our module wrapping Rmath using cffi, we can write Numba jit-able versions of scipy stats functions that we can call without additional overhead

    import numpy as np
    import numba as nb
    
    import scipy.stats
    
    # Import our Rmath module
    import _rmath_ffi

    Now we can define a number of shorter aliases to the Rmath functions for use in the python namespace

    dnorm = _rmath_ffi.lib.dnorm
    pnorm = _rmath_ffi.lib.pnorm
    
    dunif = _rmath_ffi.lib.dunif
    punif = _rmath_ffi.lib.punif
    
    dgamma = _rmath_ffi.lib.dgamma
    pgamma = _rmath_ffi.lib.pgamma

    In order for us to use these methods from within a function that we've jit-ed with Numba, we need to import cffi_support and register the module:

    from numba import cffi_support
    
    cffi_support.register_module(_rmath_ffi)

    I'll start off by writing a function that is equivalent to scipy.stats.norm.cdf using two different styles. In the first, pnorm_nb, I'll make the assumption that I'm going to be working on a 1d array, and in the second, I'll use numba.vectorize to lift that constraint and create a universal function that can operate on arbitrary dimensional numpy arrays:

    @nb.jit(nopython=True)
    def pnorm_nb(x):
        y = np.empty_like(x)
        for k in xrange(x.shape[0]):
            y[k] = pnorm(x[k], 0.0, 1.0, 1, 0)
            
        return y
    
    @nb.vectorize(nopython=True)
    def pnorm_nb_vec(x):
        return pnorm(x, 0.0, 1.0, 1, 0)
    x = np.random.normal(size=(100,))
    
    y1 = scipy.stats.norm.cdf(x)
    y2 = pnorm_nb(x)
    y3 = pnorm_nb_vec(x)
    
    # Check that they all give the same results
    print np.allclose(y1, y2)
    print np.allclose(y1, y3)
    True  
    True

    And now let's do the same calculation for 2D data, demonstrating that the vectorized form of the Numba function automatically creates the appropriate universal function for the given dimensionality of the inputs:

    x = np.random.normal(size=(100,100))
    
    y1 = scipy.stats.norm.cdf(x)
    y2 = pnorm_nb_vec(x)
    
    # Check that they all give the same results
    print np.allclose(y1, y2)
    True  

    Timing the scipy and numba versions:

    %timeit scipy.stats.norm.cdf(x)
    %timeit pnorm_nb_vec(x)
    1000 loops, best of 3: 618 µs per loop
    1000 loops, best of 3: 336 µs per loop

    We can see that our Numba version is almost 2x faster than the scipy version, with the added bonus that it can be called from within other Numba-ized methods without going through the python object layer, which can be quite slow.

    Just for kicks, let's also try to take advantage of multiple cores using the target argument:

    @nb.vectorize([nb.float64(nb.float64),], nopython=True, target='parallel')
    def pnorm_nb_vec_parallel(x):
        return pnorm(x, 0.0, 1.0, 1, 0)
    y3 = pnorm_nb_vec_parallel(x)
    print np.allclose(y1, y3)
    print
    
    %timeit pnorm_nb_vec_parallel(x) 
    True
    The slowest run took 17.27 times longer than the fastest. This could mean that an intermediate result is being cached
    1000 loops, best of 3: 131 µs per loop

    So on my laptop with 4 physical cores, we get a nice additional speed-up over the serial numba version and the scipy.stats function.

    Finally, I'm going to programmatically wrap all of the Rmath functions I exposed and compare them to the equivalent scipy functions.

    from collections import OrderedDict
    
    func_map = OrderedDict([
        ('norm_pdf', (scipy.stats.norm.pdf, dnorm)),
        ('norm_cdf', (scipy.stats.norm.cdf, pnorm)),
        ('unif_pdf', (scipy.stats.uniform.pdf, dunif)),
        ('unif_cdf', (scipy.stats.uniform.cdf, punif)),
        ('gamma_pdf', (scipy.stats.gamma.pdf, dgamma)),
        ('gamma_cdf', (scipy.stats.gamma.cdf, pgamma)),
        ])
    
    def generate_function(name, rmath_func):
        if 'norm' in name or 'unif' in name:
            def cdf_func(x):
                return rmath_func(x, 0.0, 1.0, 1, 0)
    
            def pdf_func(x):
                return rmath_func(x, 0.0, 1.0, 0)
        
        elif 'gamma' in name:
            def cdf_func(x, shape):
                return rmath_func(x, shape, 1.0, 1, 0)
    
            def pdf_func(x, shape):
                return rmath_func(x, shape, 1.0, 0)
            
        
        if 'cdf' in name:
            return cdf_func
        elif 'pdf' in name:
            return pdf_func
    
    x = np.random.normal(size=(100,100))
        
    for name, (scipy_func, rmath_func) in func_map.iteritems():
        nb_func = nb.vectorize(nopython=True)(generate_function(name, rmath_func))
        
        
        print name
        if 'norm' in name or 'unif' in name:
            y1 = scipy_func(x)
            y2 = nb_func(x)
            print 'allclose: ', np.allclose(y1, y2)
            
            print 'scipy timing:'
            %timeit scipy_func(x)
            print 'numba timing:'
            %timeit nb_func(x)
        elif 'gamma' in name:
            shape = 1.0
            y1 = scipy_func(x, shape)
            y2 = nb_func(x, shape)
            print 'allclose: ', np.allclose(y1, y2)
            
            print 'scipy timing:'
            %timeit scipy_func(x, shape)
            print 'numba timing:'
            %timeit nb_func(x, shape)
        print 
    norm_pdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 545 µs per loop
    numba timing:
    1000 loops, best of 3: 212 µs per loop
    
    norm_cdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 634 µs per loop
    numba timing:
    1000 loops, best of 3: 328 µs per loop
    
    unif_pdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 436 µs per loop
    numba timing:
    10000 loops, best of 3: 68.4 µs per loop
    
    unif_cdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 495 µs per loop
    numba timing:
    10000 loops, best of 3: 128 µs per loop
    
    gamma_pdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 616 µs per loop
    numba timing:
    1000 loops, best of 3: 277 µs per loop
    
    gamma_cdf
    allclose:  True
    scipy timing:
    1000 loops, best of 3: 1.13 ms per loop
    numba timing:
    1000 loops, best of 3: 1.11 ms per loop 

    Conclusion

    To wrap up, CFFI + Numba provides a powerful and surprisingly simple way to generate fast python code and extend the currently limited repertoire of functionality that is baked into Numba. Pairing this approach with Rmath specifically has been particularly useful in my own work.

    Appendix

    For completeness, I'll use Sebastian Raschka's watermark package to specify the libraries and hardware used to run these examples:

    %install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py  
    %load_ext watermark
    %watermark -vm -p numpy,scipy,numba,cffi  
    CPython 2.7.11
    IPython 4.0.3
    
    numpy 1.10.2
    scipy 0.16.1
    numba 0.23.1
    cffi 1.5.0
    
    compiler   : GCC 4.2.1 (Apple Inc. build 5577)
    system     : Darwin
    release    : 13.4.0
    machine    : x86_64
    processor  : i386
    CPU cores  : 8
    interpreter: 64bit

    Talk Python to Me: #46 Python in Movies and Entertainment

    What did you experience the last time you watched a movie in a theater? Were you captivated by fast-paced action and special effects? Deeply moved by the characters that came to life during those two hours when the outside world just melted away? Yeah, movies are still magical.

    What was likely not top of mind was all the work that went into that movie, from the editing of the audio and video, the special effects, renderings, and coordination of maybe 100's of creative professionals. It turns out that Python plays a key role in coordinating all of this production work and that's what this episode is all about.

    Join me as I talk with Rob Blau from Autodesk about Python in the movies and entertainment business.

    Links from the show:

    Autodesk: autodesk.com
    Maya (3D animation): autodesk.com/products/maya
    Rob Blau: linkedin.com/in/robblau
    Michael's course, Python Jumpstart by Building 10 Apps: talkpython.fm/course
    Pyjion by Microsoft: github.com/Microsoft/Pyjion
    IronPython: ironpython.net

    S. Lott: SQL Hegemony and Document Databases

    A surpassingly strange question is this: "How do I get the data out of MongoDB into a spreadsheet?"

    The variation is "How can we load the MongoDB data into a relational database?"

    I'm always perplexed by this question. It has a subtext that I find baffling. The subtext is this "all databases are relational, right?"

    In order to ask the question, one has to be laboring under the assumption that the only difference between MongoDB and a relational database is the clever sticker on your laptop. Mongo folks have a little green Mango leaf. Postgres has a blue/gray elephant.

    This assumption is remarkably hard to overcome.

    THEM: "How can we move this mongo data into a spreadsheet?"
    ME: "What?"
    THEM: "You know. Get a bulk CSV extract."
    ME: "Of complex, nested documents?"
    THEM: "Nested documents?"
    ME: "Mongo database documents include arrays and -- well -- subdocuments. They're not in first normal form. They don't fit the spreadsheet data model."
    THEM: "Whatever. Every database has a bulk unload into CSV. How do you do that in Mongo?"
    ME: "You can't represent a mongo document in rows and columns."
    THEM: (Thumping desk for emphasis.) "Relational Theory is explicit. ALL DATA CAN BE REDUCED TO ROWS AND COLUMNS!"
    ME: "Right. Through a process of normalization. The Mongo data you're looking at isn't normalized. You'd have to normalize it into a relational table model. Then you could write a customized extract focused on that relational model."
    THEM: "That's absurd."

    At this point, all we can do is give them the minimal pymongo MongoClient code block. Hands-on queries seem to be the only way to make progress.

    from pymongo import MongoClient
    from pprint import pprint

    with MongoClient("mongodb://somehost:27017") as mongo:
        collection = mongo.database.collection
        for document in collection.find():
            pprint(document)

    Explanations seem to wind up in a weird circular pattern where they keep repeating their relational assumptions. Not much seems to work: diagrams, hand-waving, links to tutorials are all implicitly rejected because they don't confirm SQL bias.

    A few days later they call asking how they are supposed to work with a document that has complex nested fields inside it.

    This could be the beginning of wisdom. Or it could be the beginning of a lengthy reiteration of SQL Hegemony talking points and desk thumping.

    THEM: "The document has an array of values."
    ME: "Correct."
    THEM: "What's that mean?"
    ME: "It means there are multiple occurrences of the child object within each parent object."
    THEM: "I can see that. What does it mean?"
    ME: (Rising inflection.) "The parent is associated with multiple instances of the child."
    THEM: "Don't patronize me! Stop using mongo mumbo-jumbo. Just a simple explanation is all I want."
    ME: "One Parent. Many Children."
    THEM: "That's stupid. One-to-many absolutely requires a foreign key. The children don't even have keys. Mongo must have hidden keys somewhere. How can I see the keys on the children in this so-called 'array' structure? How can expose the underlying implementation?"

    The best I can do is show them an approach to normalizing some of the data in their collection.

    from pymongo import MongoClient
    from pprint import pprint

    with MongoClient("mongodb://your_host:27017") as mongo:
        collection = mongo.your_database.your_collection
        for document in collection.find():
            for child in document['child_array']:
                print(document['parent_field'], child['child_field'])

    This leads to endless confusion when some documents lack a particular field. The Python document.get('field') is an elegant way to handle optional fields. I like to warn them that they should not rely on this. Sometimes document['field'] is appropriate because the field really is mandatory. If it's missing, there are serious problems. Of course, the simple get() method doesn't work for optional nested documents. For this, we need document.get('field', {}). And for optional arrays, we can use document.get('field', []).
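
    A minimal sketch of what that looks like in practice, with made-up field names, might be:

    # Hypothetical document; the field names are illustrative only.
    document = {
        "name": "example",
        "address": {"city": "Rome"},
        # note: no "nickname" field and no "tags" array in this document
    }

    name = document["name"]                  # truly mandatory: fail loudly if missing
    nickname = document.get("nickname")      # optional scalar: None if absent
    address = document.get("address", {})    # optional subdocument: empty dict if absent
    tags = document.get("tags", [])          # optional array: empty list if absent

    print(name, nickname, address.get("city"), len(tags))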

    Interestingly we sometimes have confusion over {} for document and [] for array. I chalk that up to folks who are too used to very wordy SQL and Java. I save the questions for my next book on Python.

    At some point, the "optional" items may be more significant than this. Perhaps an if statement is required to handle business rules that are reflected as different document structures in a single collection.

    This leads to yet more desk-thumping. It's accompanied by the laughable claim that a "real" database doesn't rely on if statements to distinguish variant subentities that are persisted in a single table. The presence of SQL ifnull() functions, case expressions, and application code with if statements apparently doesn't exist. Or -- when it is pointed out -- isn't the same thing as writing an if statement to handle variant document subentities in a Mongo database.

    It appears to take about two weeks to successfully challenge entrenched relational assumptions. Even then, we have to go over some of the basics of optional fields and arrays more than once.

    PyCharm: Faster debugger in PyCharm 5.1


    The first EAP for PyCharm 5.1 was released last week, with lots of enticing features. Here’s one that jumps out for long-time users: “Debugger performance improvements”. Let’s take a look at this in a little more depth, then provide some of the back story.

    Imagine we’re debugging a semi-large code base with PyCharm 5.0.4. If we set a breakpoint and time how long it takes to get there, it measures out as 12 seconds:

    [screenshot: time to reach the breakpoint in PyCharm 5.0.4]

    In PyCharm 5.1, the debugger performance has improved, especially for large code bases. Let’s give the same scenario a try:

    [screenshot: time to reach the breakpoint in PyCharm 5.1]

    Almost 3.5 seconds of improvement on OS X.

    Wouldn’t it be great to get a couple more seconds? And now comes the real story: the new debugger has some performance improvements implemented in Cython. On OS X and Linux we need to install them manually, as the debugger console message helpfully tells us:

    [screenshot: debugger console message about the Cython speedups]

    Windows users get these Python speedups pre-bundled. Let’s install them on OS X by copying the text above after “Debugger speedups using Cython not found. Run…”:

    [screenshot: installing the Cython speedups]

    That is, we’re running:

    env27/bin/python /Applications/PyCharm\ 5.1\ EAP.app/Contents/helpers/pydev/setup_cython.py build_ext --inplace

    …using the Python in this project’s virtual environment, just to make sure we have the correct Python version. Once more, let’s measure the time to get to the breakpoint:

    [screenshot: time to reach the breakpoint with the Cython speedups installed]

    5.7 seconds! That’s a meaningful difference. As stated by the developers, the debugger is 40% faster in the general case and over 130% faster when you have the Cython modules compiled and installed. Again, if you are on Windows, you don’t need to do anything — you will get these improvements automatically for Python 2.7, Python 3.4, and Python 3.5 interpreters.

    On OS X and Linux, you need to do a one-time compilation using any Python on your system matching the Python version you need, as shown in the screenshot above. For example, if you use Python 2.7 and Python 3.4, you need to run the Cython speedups with an interpreter matching those two versions. Each time you run it, a compiled speedup will be saved in your PyCharm application, for that Python version. For example, on my system, this was created:

    /Applications/PyCharm 5.1 EAP.app/Contents/helpers/pydev/build/lib.macosx-10.11-x86_64-2.7/_pydevd_bundle/pydevd_cython.so

    Note the “-2.7” in the directory name. Also, note that you don’t have to do this with your virtual environment’s Python, but it certainly makes sense to do so, as you’ll be sure to match the version. You don’t have to do this once for every Python 2.7 virtual environment, as these speedups aren’t stored in the virtual environment. They are stored inside PyCharm’s pydevd helper.

    Now, on to the backstory. As explained in the blog post announcing the 5.1 EAP, JetBrains joined efforts with PyDev, helping sponsor the work on pydevd which is shared by the two projects. Both projects require a sophisticated debugger backend and previously merged their work. This performance improvement is another step forward in the collaboration.
    If you debug a large code base, give this EAP a try with the Cython speedups and post a comment letting us know your results. We expect performance improvements to be higher for larger code bases.

    PyCharm Team
    The Drive to Develop

    Holger Peters: An Interesting Fact About The Python Garbage Collector


    While Python prides itself on being a simple, straightforward programming language, and being explicit is pointed out as a core value, one can of course still discover interpreter specifics and implementation details that one did not expect when working at the surface. These days I learned more about a peculiar property of the Python garbage collector that I would like to share.

    Let's start by introducing the problem quickly. Python manages its objects primarily by reference counting. I.e. each object stores how many times it is referenced from other places, and this reference count is updated over the runtime of the program. If the reference count drops to zero, the object cannot be reached by the Python code anymore, and the memory can be freed/reused by the interpreter.
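
    A quick way to observe reference counting in CPython (my illustration, not from the original post) is sys.getrefcount, which reports one more than the "real" count because the argument passed to it temporarily holds an extra reference:

    import sys

    value = object()
    print(sys.getrefcount(value))   # typically 2: `value` plus the temporary argument reference

    alias = value
    print(sys.getrefcount(value))   # one more reference now exists

    del alias
    print(sys.getrefcount(value))   # back down again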

    An optional method __del__ is called by the Python interpreter when the object is about to be destroyed. This allows us to do some cleanup, for example closing database connections. __del__ rarely has to be defined; for our example we will use it to illustrate when the disposal of an object happens:

    >>> class A(object):
    ...     def __del__(self):
    ...         print("no reference to {}".format(self))
    ...
    >>> a = A()
    >>> b = a
    >>> c = a

    The situation in memory resembles this schematic:

    ┌────┐
    │ a  │────────────┐
    └────┘            ▼
    ┌────┐    ┌───────────────┐
    │ b  │───▶│A() refcount=3 │
    └────┘    └───────────────┘
    ┌────┐            ▲
    │ c  │────────────┘
    └────┘
    

    Now we let the variables a, b, and c point to None instead of the instance A():

    >>> a = None
    >>> b = None
    >>> c = None
    no reference to <__main__.A object at 0x102ace9d0>

    Changing the situation to:

    ┌────┐    ┌────┐
    │ a  │─┬─▶│None│
    └────┘ │  └────┘
    ┌────┐ │  ┌───────────────┐
    │ b  │─┤  │A() refcount=0 │
    └────┘ │  └───────────────┘
    ┌────┐ │
    │ c  │─┘
    └────┘
    

    After we have overwritten the last reference (c) to our instance of A, the object is destroyed, which triggers a call to __del__ just before really destroying the object.

    Cyclic References

    However, there are instances where the reference count cannot simply go down to zero; this is the case with cyclic references:

    ┌────┐
            │ a  │
            └────┘
               │
               ▼
       ┌───────────────┐
    ┌──│A() refcount=2 │◀─┐
    │  └───────────────┘  │
    │  ┌───────────────┐  │
    └─▶│B() refcount=1 │──┘
       └───────────────┘
    

    Setting a to None, we will still have refcounts of >= 1. For these cases, Python employs a garbage collector, some code that traverses memory and applies more complicated heuristics to discover unused objects. We can use the gc module to manually trigger a garbage collection run.

    >>> a = A()
    >>> b = A()
    >>> a.other = b
    >>> b.other = a
    >>> a = None
    >>> b = None
    >>> import gc
    >>> gc.collect()
    11

    However, since A implements __del__, Python refuses to clean them up, arguing that it cannot tell which __del__ method to call first. Instead of doing the wrong thing (invoking them in the wrong sequence), Python decides to rather do nothing -- avoiding undefined behaviour, but introducing a potential memory leak.

    In fact, Python will not clean up any objects in the cycle, which can leave a much larger group of objects polluting memory (see https://docs.python.org/2/library/gc.html#gc.garbage). We can inspect the list of objects which could not be garbage collected:

    >>> gc.garbage
    [<__main__.A object at 0x102ace9d0>, <__main__.A object at 0x102aceb10>]

    Finally, if you remove the __del__ method from the class, you would not find these objects in gc.garbage, as Python would just dispose of them.

    Python 3

    As it turns out, from Python 3.4 on, the issue I wrote about does not exist anymore. __del__ methods do not impede garbage collection any more, so gc.garbage will only be filled for other reasons. For details, you can read PEP 442 and the Python docs.

    Considering the current state of Python 3.4 adoption, most Python code bases still have to be careful about when to use __del__.

    Anarcat: My free software activities, february 2016


    Debian Long Term Support (LTS)

    This is my third month working on Debian LTS, started by Raphael Hertzog at Freexian. This was my first month doing frontdesk duty, and I did a good bunch of triage. I also performed one upload and reviewed a few security issues.

    Frontdesk duties

    I spent some time trying to get familiar with the frontdesk duty. I still need to document a bit of what I learned, which did involve asking around for parts of the process. The following issues were triaged:

    • roundcube in squeeze was happily not vulnerable to CVE-2015-8794 and CVE-2015-8793, as the code affected was not present. roundcube is also not shipped with jessie but the backport is vulnerable
    • the php-openid vulnerability was actually just a code sample, a bug report comment clarified all of CVE-2016-2049
    • ffmpeg issues were closed, as it is not supported in squeeze
    • libxml2 was marked as needing work (CVE-2016-2073)
    • asterisk was triaged for all distros before i found out it is also unsupported in squeeze (CVEs coming up, AST-2016-001, AST-2016-001, AST-2016-001)
    • libebml and libmatroska were marked as unsupported, although an upload of debian-security-support will be necessary to complete that work (bug #814557 filed: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814557)

    Uploads and reviews

    I only ended up doing one upload, of the chrony package (CVE-2016-1567), thanks to the maintainer who provided the patch.

    I tried my best to sort through the issues with tiff (CVE-2015-8668 and CVE-2015-7554), which didn't have obvious fixes available. OpenSUSE seems to have patches, but it is really hard to find them through their issue trackers, which were timing out on me. Hopefully someone else can pick that one up.

    I also tried and failed to reproduce the cpio issue (CVE-2016-2037), which, at the time, didn't have a patch for a fix. This ended up being solved and Santiago took up the upload.

    I finally spent some time trying to untangle the mess that is libraw, or more precisely, all the packages that embed dcraw code instead of linking against libraw. Now I really feel even more strongly for Debian policy section 4.13, which states that Debian packages should not ship with copies of other packages' code. It was really hard to figure out which packages were vulnerable, especially because it was hard to figure out which versions of libraw/dcraw were actually vulnerable to the bug, and also just to figure out which packages were copying code from libraw. I wish I had found out about secure-testing/data/embedded-code-copies earlier... Still, it was interesting to get familiar with codesearch.debian.net to try to find copies of the vulnerable code, which was not working so well. Kudos to darktable 2.0 for getting rid of their embedded copy of libraw, by the way - it made it completely not vulnerable to the issue, with the versions in stretch and sid not having the code at all and older versions having non-vulnerable copies of the code.

    Issues with VMs again

    I still had problems running a squeeze VM - not wanting to use virtualbox because of the overhead, I got lost for a bit trying to use libvirt and KVM. A bunch of issues crept up: using virt-manager would just fail on startup with an error saying interface mtu value is improper, which is a very unhelpful error message (what is a proper value??) - and, for the record, the MTU on eth0 and wlan0 is the fairly standard 1500, while lo is at 65536 bytes, nothing unusual there as far as I know.

    Then the next problem was actually running a VM - I still somewhat expected to be able to boot off a chroot, something I should definitely forget about, it seems (boot loader missing? not sure). I ended up calling virt-install with the live ISO image I was previously using:

    virt-install --virt-type kvm --name squeeze-amd64 --memory 512 --cdrom ~/iso/Debian/cdimage.debian.org_mirror_cdimage_archive_6.0.10_live_amd64_iso_hybrid_debian_live_6.0.10_amd64_gnome_desktop.iso --disk size=4 --os-variant debiansqueeze
    

    At least now I have an installed squeeze VM, something I didn't get to do in Virtualbox (mostly because I didn't want to wait through the install, because it was so slow).

    Finally, I still have trouble getting a command-line console on the VM: somehow, running virsh console squeeze-amd64 doesn't give me a login terminal, and worse, it actually freezes the terminal that I can otherwise get through virt-viewer squeeze-amd64, which definitely sounds like a bug.

    I documented a bit more of that setup in the Debian wiki KVM page so hopefully this will be useful for others.

    Other free software work

    I continued my work on improving timetracking with ledger in my ledger-timetracking git repository, which now got a place on the new plaintextaccounting.org website, which acts as a portal for ledger-like software projects and documentation.

    Darktable 2.0

    I had the pleasure of trying the new Darktable 2.0 release, which only recently entered Debian. I built a backport for jessie, which works beautifully: much faster thumbnail rendering, no dropping of history when switching views... The new features are great, but I also appreciate how they are being very conservative in their approach.

    Darktable is great software: I may have trouble approaching the results others are having with lightroom and snapseed, but those are proprietary software that I can't use anyways. I also suspect that I just don't have enough of a clue of what I'm doing to get the results I need in Darktable. Maybe with hand-holding, one day, I will surpass the results I get with the JPEGs from my Canon camera. Until then, I turned off RAW exports in my camera to try and control the explosion of disk use I saw since I got that camera:

    41M     2004
    363M    2005
    937M    2006
    2,2G    2007
    894M    2008
    800M    2009
    1,8G    2010
    1,4G    2011
    9,8G    2012
    31G     2013
    26G     2014
    9,8G    2015
    

    The drop in 2015 is mostly due to me taking less pictures in the last year, for some reason...

    Markdown mode hacks

    I ended up writing some elisp for the markdown mode. It seems I am always writing links like [text](link) which seems more natural at first, but then the formatting looks messier, as paragraph wrapping is all off because of the long URLs. So I always ended up converting those links, which was a painful series of keystrokes.

    So I made a macro, and while I'm at it, why not rewrite it as a lisp function. Twice.

    Then I was told by the markdown-mode.el developers that they had already fixed that (in the 2.1 version, not in Debian jessie) and that the C-c C-a r key binding actually recognized existing links and conveniently converted them.

    I documented my adventures in bug #94, but it seems I wrote this code for nothing else than re-learning Emacs lisp, which was actually quite fun.

    More emacs hacking

    Another thing I always wasted time doing by hand is "rename file and buffer". Often, you visit a file but it's named wrong. My most common case is a .txt file that I rename to .mdwn.

    I would then have to do:

    M-x rename-file <ret> newfile
    M-x rename-buffer <ret> newfile
    C-x C-s <ret> newfile
    

    Really annoying.

    Turns out that set-visited-file-name actually does most of the job, but doesn't actually rename the file, which is really silly. So I wrote this small function instead:

    (defun rename-file-and-buffer (newfname)
      "combine rename-file and rename-buffer
    
    set-visited-file-name does most of the job, but unfortunately
    doesn't actually rename the file. rename-file does that, but
    doesn't rename the buffer. rename-buffer only renames the buffer,
    which is pretty pointless.
    
    only operates on current buffer because set-visited-file-name
    also does so and we don't bother doing excursions around.
    "
      (interactive "GRename file and bufer: ")
      (let ((oldfname (buffer-file-name)))
        (set-visited-file-name newfname nil t)
        (rename-file oldfname newfname)
        )
      )
    

    Not bound to any key, really trivial, but doing this without that function is really non-trivial, especially since set-visited-file-name needs special arguments to not mark the file as modified.

    IRC packages updates

    I updated the Sopel IRC bot package to the latest release, 6.3.0. They have finally switched to Requests, but apart from that, no change was necessary. I am glad to finally see SNI support working everywhere in the bot!

    I also updated the Charybdis IRC server package to the latest 3.5.0 stable release. This release is great news, as I was able to remove 5 of the 7 patches I was dragging along in the Debian package. The previous Charybdis stable release was over 3 years old, as 3.4.2 was released in (December) 2012!

    I spent a good chunk of time making the package reproducible. I filed a bug upstream and eventually made a patch to make it possible to hardcode a build timestamp, which seems to have been the only detectable change in the reproducible build infrastructure. Charybdis had also been failing to build from source (FTBFS) in sid for a while, and the upload should fix that as well. Unfortunately, Charybdis still doesn't build with hardening flags - but hopefully a future update of the package should fix that. It is probably because CFLAGS are not passed around properly.

    There's really interesting stuff going on in the IRC world. Even though IRC is one of the oldest protocols still in operation (1988, even before the Web, but after SMTP and the even more venerable FTP), it is still being actively developed, with the IRCv3 working group drafting multiple extensions to the IRC protocol defined in RFC 1459.

    For example, IRCv3.3 includes a Strict Transport Security extension, which tries to ensure users use encrypted channels as much as possible, through warnings and STARTTLS support. Charybdis goes even further by proposing a reversal of the +S ("secure channel") flag, where all channels are secure by default and you need to deliberately mark a channel as insecure with the +U flag if you actually want to allow users on a clear-text connection to join the channel. A transition mechanism is also proposed.

    Miscellaneous bug reports

    En vrac...

    I fell face-first in this amazing game that is endless-sky. I made a small pull request on the documentation, a bug report and a feature request.

    I forwarded a bug report, originally filed against monkeysign, to the pyqrencode maintainers.

    I filed a bug against tails-installer, which just entered Debian, covering mostly usability issues.

    I discovered the fim image viewer, which re-entered Debian recently. It seemed perfect to adjust my photos-import workflow, so I added it to my script, to be able to review photos prior to importing them into Darktable and git-annex.

    Continuum Analytics News: Making Python on Hadoop Easier with Anaconda and Cloudera

    Posted Wednesday, February 17, 2016

    Python has become an increasingly popular tool for data analysis, including data processing, feature engineering, machine learning, and visualization. Data scientists and data engineers enjoy Python’s rich numerical and analytical libraries — such as NumPy, pandas, and scikit-learn — and have long wanted to apply them to large datasets stored in Hadoop clusters. While Apache Spark, through PySpark, has made data in Hadoop clusters more accessible to Python users, actually using these libraries on a Hadoop cluster remains challenging. In particular, setting up a full-featured and modern Python environment on a cluster can be challenging, error-prone, and time-consuming.

    Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution and installation of popular Python packages and their dependencies.

    Anaconda dramatically simplifies installation and management of popular Python packages and their dependencies, and this new parcel makes it easy for CDH users to deploy Anaconda across a Hadoop cluster for use in PySpark, Hadoop Streaming, and other contexts where Python is available and useful.

    The newly available Anaconda parcel:

    • includes 300+ of the most popular Python packages
    • simplifies the installation of Anaconda across a CDH cluster
    • will be updated with each new Anaconda release

    The rest of this blog post will show you how to install and configure the Anaconda parcel, as well as provide an example of training a scikit-learn model on a single node and then using the model to make predictions on data in a cluster.

    Installing the Anaconda Parcel

    1. From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.

    2. Click the Edit Settings button on the top right of the Parcels page.

     

    3. Click the plus symbol in the Remote Parcel Repository URLs section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

    4. Click the Save Changes button at the top of the page.

    5. Click the Parcels indicator in the top navigation bar to return to the list of available parcels, where you should see the latest version of the Anaconda parcel that is available.

    6. Click the Download button to the right of the Anaconda parcel listing.

    7. After the parcel is downloaded, click the Distribute button to distribute the parcel to all of the cluster nodes.

    8. After the parcel is distributed, click the Activate button to activate the parcel on all of the cluster nodes, which will prompt with a confirmation dialog.

    9. After the parcel is activated, Anaconda is now available on all of the cluster nodes.

    These instructions are current as of the day of publication. Up to date instructions will be maintained in Anaconda’s documentation.

    To make Spark aware that we want to use the installed parcels as the Python runtime environment on the cluster, we need to set the PYSPARK_PYTHON environment variable. Spark determines which Python interpreter to use by checking the value of the PYSPARK_PYTHON environment variable on the driver node. With the default configuration for Cloudera Manager and parcels, Anaconda will be installed to /opt/cloudera/parcels/Anaconda, but if the parcel directory for Cloudera Manager has been changed, you will need to change the below instructions to ${YOUR_PARCEL_DIR}/Anaconda/bin/python.

    To choose which Python to use on a per-application basis, we can specify it on the same line as our spark-submit command. This would look like:

    $ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py
    

    You can also use Anaconda by default in Spark applications while still allowing users to override the value if they wish. To do this, you will need to follow the instructions for Advanced Configuration Snippets and add the following lines to Spark’s configuration:

    if [ -z "${PYSPARK_PYTHON}" ]; then
       export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
    fi

    Now with Anaconda on your CDH cluster, there’s no need to manually install, manage, and provision Python packages on your Hadoop cluster.

    Anaconda in Action

    A commonly needed workflow for a Python using data scientist is to:

    1. Train a scikit-learn model on a single node
    2. Save the results to disk
    3. Apply the trained model using PySpark to generate predictions on a larger dataset.

    Let’s take a classic machine learning classification problem as an example of what having complex Python dependencies from Anaconda installed on CDH cluster allows us to do.

    The MNIST dataset is a canonical machine learning classification problem that involves recognizing handwritten digits, where each row of the dataset consists of a representation of one handwritten digit from 0 to 9. The training data we will use is the original MNIST dataset (60,000 rows). The prediction will be done with the MNIST8M dataset (8,000,000 rows). Both of these datasets are available from the libsvm datasets website. This dataset is used as a standard test for various machine learning algorithms. More information, including benchmarks, can be found on the MNIST Dataset website.

    To train the model on a single node we will use scikit-learn and then we can save the model to a file with pickle:

    with open('mnist.scale', 'r') as f:
        train = f.read()
    
    with open('mnist.scale.t', 'r') as f:
        test = f.read()
    
    import numpy as np
    
    def parse(data):
        lines = data.split('\n')
        lines = filter(lambda x: x.strip() != '', lines)
        nlines = len(lines)
        X = np.zeros((nlines, 784) , dtype=float)
        Y = np.zeros((nlines, ) , dtype=float)
    
        for n, line in enumerate(lines):
            line = line.strip()
            if line != '':
                parts = line.strip().split(' ')
                for pair in parts[1:]:
                    pos, val = pair.strip().split(':')
                    pos, val = int(pos), float(val)
                    X[n, pos] = float(val)
                Y[n] = parts[0]
        return X, Y
    
    X_train, Y_train = parse(train)
    X_test, Y_test = parse(test)
    
    from sklearn import svm, metrics
    
    classifier = svm.SVC(gamma=0.001)
    classifier.fit(X_train, Y_train)
    predicted = classifier.predict(X_test)
    
    print metrics.classification_report(Y_test, predicted)
    
    import pickle
    with open('classifier.pickle', 'w') as f:
        pickle.dump(classifier, f)

    With the classifier now trained, it can be saved to disk and then copied to HDFS.

    Next, we configure and create a SparkContext to run in yarn-client mode:

    from pyspark import SparkConf
    from pyspark import SparkContext
    
    conf = SparkConf()
    conf.setMaster('yarn-client')
    conf.setAppName('sklearn-predict')
    sc = SparkContext(conf=conf)

    To load the MNIST8M data from HDFS into an RDD:

    input_data = sc.textFile('hdfs:///tmp/mnist8m.scale')

    We’ll do some preprocessing on this dataset to convert the text to a NumPy array, which will serve as input for the scikit-learn classifier. We’ve installed Anaconda on every cluster node, so both NumPy and scikit-learn are available to the Spark worker processes.

    def clean(line):
        """
        Read the mnist8m file format and return a numpy array
        """
        import numpy as np
        X = np.zeros((1, 784) , dtype=float)
        parts = line.strip().split(' ')
        for pair in parts[1:]:
            pos, val = pair.strip().split(':')
            pos, val = int(pos), float(val)
            if pos < X.shape[1]:  # guard against out-of-range feature indices
                X[0, pos] = float(val)
        return X
    inputs = input_data.map(clean)

    To load the trained scikit-learn classifier from disk:

    from sklearn.externals import joblib
    classifier = joblib.load('classifier.pickle')

    In order to apply the trained model to data in a large file in HDFS, we need the trained model available in memory on the executors. To move the classifier from the driver node to all of the Spark workers, we can use the SparkContext.broadcast function:

    broadcastVar = sc.broadcast(classifier)

    This broadcast variable is then available to us in the executors. This means we can use the variable in logic that needs to be executed on the cluster, e.g. inside of map or flatMap functions. It is simple to apply the trained model and save the output to a file:

    def apply_classifier(input_array):
        label = broadcastVar.value.predict(input_array)
        return label
    predictions = inputs.map(apply_classifier)
    predictions.saveAsTextFile('hdfs:///tmp/predictions')

    To submit this code as a script we add the environment variable declaration at the beginning and then the usual spark-submit command:

    PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_job.py

    Getting started with Anaconda on your CDH cluster is easy with the newly available parcel. Be sure to check out the Anaconda parcel documentation for more details.

    Continuum Analytics News: Continuum Analytics Announces Anaconda for Cloudera

    Posted Wednesday, February 17, 2016

    Anaconda Parcel Brings Open Source Anaconda to Hadoop to Power Data Science Analytics

    AUSTIN, TX - February 17, 2016 - Continuum Analytics, the creator and driving force behind Anaconda, the leading modern open source analytics platform powered by Python, today announced the release of Anaconda for Cloudera. This new solution makes it easy to use Anaconda within a Cloudera-managed Hadoop cluster to power data science analytics.

    Previously, Cloudera users had to manually install a complete Python data science stack on a Hadoop cluster and manage runtime dependencies themselves. With the Anaconda parcel, they now have a simple and compatible way to install Python on a Hadoop cluster. The Anaconda parcel enables users to easily build and run Python based solutions across a Cloudera cluster and alongside Spark jobs. Now, data scientists using Python and PySpark on a Hadoop cluster can exploit the full power of Anaconda analytic libraries to easily and effectively create powerful, high impact data science solutions.

    “The recent certification of Anaconda with Cloudera Enterprise makes Python much more accessible to customers and allows data scientists to easily scale out their data science solutions and realize benefits faster,” said Tim Stevens, vice president of Corporate and Business Development at Cloudera.

    Continuum worked closely with Cloudera to improve the process of using Python packages for data science and data analysis in a Hadoop cluster with Spark. The Anaconda parcel is installed via Cloudera Manager, which makes it easy to have the most popular open source Python packages available across a Hadoop cluster.

    "Spark has clearly demonstrated that Python is one of the most important technologies in modern Open Data Science. Nearly half of all Spark users are using Python for their data science needs - including data exploration and predictive modeling - in their Hadoop cluster," said Peter Wang, Continuum CTO and co-­founder. "We’re excited about the low level technology advancements in Hadoop, such as Parquet, as well as the pioneering advancements by Cloudera on Impala and Kudu. These advancements have set the foundation for our next generation Hadoop innovations, which extend Python from an interface for data science on Hadoop to a full­-fledged native analytic computational platform for Hadoop."

    About Cloudera

    Cloudera delivers the modern data management and analytics platform built on Apache Hadoop and the latest open source technologies. The world’s leading organizations trust Cloudera to help solve their most challenging business problems with Cloudera Enterprise, the fastest, easiest and most secure data platform available for the modern world. Our customers efficiently capture, store, process and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly and at lower cost than has been possible before. To ensure our customers are successful, we offer comprehensive support, training and professional services. Learn more at http://cloudera.com.

    Connect with Cloudera

    Read our blogs: cloudera.com/engblog and vision.cloudera.com
    Follow us on Twitter: twitter.com/cloudera
    Visit us on Facebook: facebook.com/cloudera
    Join the Cloudera Community: cloudera.com/community

    Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition, Cloudera Navigator Optimizer and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trademarks of their respective owners.

    About Continuum Analytics

    Continuum Analytics is the creator and driving force behind Anaconda, the leading, modern open source analytics platform powered by Python. We put superpowers into the hands of people who are changing the world.

    With more than 2.25M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda does this by helping everyone in the data science team discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their open data science environments without any hassles to harness the power of the latest open source analytic and technology innovations.

    Our community loves Anaconda because it empowers the entire data science team – data scientists, developers, DevOps, architects, and business analysts – to connect the dots in their data and accelerate the time-to-value that is required in today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

    Continuum Analytics' founders and developers have created or contribute to some of the most popular open data science technologies, including NumPy, SciPy, Matplotlib, pandas, Jupyter/IPython, Bokeh, Numba and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

    To learn more about Continuum Analytics, visit www.continuum.io.

    ###

    Media Contacts:

    Treble for Continuum Analytics

    Aaron DeLucia

    (512) 960-8222

    continuum@treblepr.com

    Deborah Wiltshire

    Cloudera

    press@cloudera.com

    +1 (650) 644-3900
