
PSF GSoC students blogs: Google Summer of Code with Nuitka 5th Weekly Check-in


1. What did you do this week?
This week, I continued to work on my script that automates nuitka-wheel pytest testing. Details can be found in my pull request: https://github.com/Nuitka/Nuitka/pull/440

The script automates the manual process of comparing the pytest results of a Nuitka-compiled wheel built with `python setup.py bdist_nuitka` against the pytest results of an uncompiled wheel built with `python setup.py bdist_wheel`, for the most popular PyPI packages. This testing ensures that Nuitka is building the wheel correctly: if the pytests pass/fail in the same way, Nuitka built the wheel properly; if the results differ, something is wrong. Virtualenv is used to create a clean environment with no outside pollution.
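The core idea can be sketched roughly like this (a simplified, hypothetical outline, not the actual script from the pull request; the function names and virtualenv paths are illustrative):

import subprocess

def run_pytest(python, package_dir):
    # Run pytest for the package with the given interpreter and return its exit code.
    return subprocess.run([python, "-m", "pytest", package_dir]).returncode

def builds_match(package_dir, nuitka_venv_python, plain_venv_python):
    # Each virtualenv is assumed to have the corresponding wheel installed:
    # one built with bdist_nuitka, the other with bdist_wheel.
    nuitka_result = run_pytest(nuitka_venv_python, package_dir)
    plain_result = run_pytest(plain_venv_python, package_dir)
    # Identical pass/fail behaviour means Nuitka built the wheel properly.
    return nuitka_result == plain_result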

Testing has been improved and extended to many more PyPI packages.


2. What is coming up next?
Perfect the script in preparation for merging and work on documentation.


3. Did you get stuck anywhere?
Some PyPI packages are special and require special handling during automation. I skipped these for the sake of speed, but I will need to get back to them in the future.


Yasoob Khalid: Python mind-teaser: Make the function return True


Hi everyone! 👋 I was browsing /r/python and came across this post:

The challenge was simple: provide an input such that adding 1 to it on either side of the + operator produces two different objects, but adding 2 to it produces the same object on both sides.

Solution 1: Custom class

The way I personally thought to solve this challenge was this:

def check(x):
    if x+1 is 1+x:
        return False
    if x+2 is not 2+x:
        return False
    return True

class Test(int):
    def __add__(self, v):
        if v == 1:
            return 0
        else:
            return v

print(check(Test()))
# output: True

Let me explain how this works. In Python when we use the + operator Python calls a different dunder method depending on which side of the operator our object is. If our object is on the left side of the operator then __add__ will be called, if it is on the right side then __radd__ will be called.

Our Test object will return 0 for Test() + 1 and 1 for 1 + Test(). The trick is that we are overloading only one dunder method and leaving the other untouched, which gets us past the first if condition. Take a second look and you will see that it gets us past the second check as well: we simply return the input whenever it is not 1, so Test() + 2 and 2 + Test() both evaluate to 2, the very same object.

However, after reading the comments, I found another solution which did not require a custom class.

Solution 2: A unique integer

User /u/SethGecko11 came up with this absurdly short answer:

def check(x):
    if x+1 is 1+x:
        return False
    if x+2 is not 2+x:
        return False
    return True

print(check(-7))
# output: True

Only -7 works. Any other number will not return True. If you are confused as to why this works then you aren’t alone. I had to read the comments to figure out the reasoning.

So apparently, in Python (CPython, specifically), integers from -5 to 256 are pre-allocated. When any operation produces a result in that range, you get the pre-allocated object back. Because these are effectively singletons, the is operator returns True. If the result falls outside this range, you get a new instance instead.
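You can see the cache in action with a couple of additions performed at runtime (this is a CPython implementation detail, so the exact behaviour may vary between interpreters):

x = 100
print(x + 1 is 1 + x)   # True: 101 falls inside the pre-allocated range (-5 to 256)

y = 1000
print(y + 1 is 1 + y)   # usually False: 1001 is created fresh for each addition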

The memory requirement for pre-allocating these integers is not that high but apparently the performance gains are huge.

So with -7 as input, x + 1 gives -6, which falls outside the cache and produces a new instance, while x + 2 gives -5, which is the cached object. The same trick doesn't work at the upper bound (256) precisely because of the way the if statements are ordered. 255 would work as an answer if the check function were implemented like this:

def check(x):
    if x+1 is not 1+x:
        return False
    if x+2 is 2+x:
        return False
    return True

I hope you learned something new in this article. I don’t think you would ever have to use this in any code-base ever but it is a really good mind-teaser which can catch even seasoned Python developers off-guard.

Happy programming! I will see you in the next article 😊

PSF GSoC students blogs: Packaging your Panda3D game for iOS


Hi everyone,

I'd like to quickly detail how you can help test out the iOS port using your own game. I am making available a wheel that contains all of the files required to develop a Panda app on iOS. To build your game for iOS, there is a new command, `make_xcodeproj`, which comes as an addition to the recently released deploy-ng system. There is no formal documentation available yet, but you should be able to work out how it works from the source code, located in direct/dist/commands.py. It is best to get a build of your app working with the build_apps command first, since make_xcodeproj piggy-backs off of it to generate an Xcode project.
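As a rough sketch of the expected workflow (this is my reading of the post, not official documentation; it assumes your project already has a deploy-ng setup.py in the current directory):

import subprocess

# Build the app with deploy-ng first; make_xcodeproj piggy-backs off its output.
subprocess.run(["python", "setup.py", "build_apps"], check=True)

# Then generate the Xcode project (see direct/dist/commands.py for the options).
subprocess.run(["python", "setup.py", "make_xcodeproj"], check=True)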

You can download the wheel here.

See you later!

IslandT: Use Pandas Data Frame to display market data


In the previous article, we used the Blockchain API to display the Bitcoin vs. major world currencies exchange rate in our application. In this article, we will use the Pandas DataFrame object to present that data in a neat table. I have already introduced the Pandas DataFrame object in a previous chapter, so I won't go through it again in this post. Let us get straight to business.

We will no longer use the sell_buy string to display the currency data from Blockchain as before; instead, we will construct the data frame object directly to display the related data.

BTC vs World Currencies table

The index of this data frame table will be the world currencies symbol and the column will be the bitcoin symbol.

Import Pandas module.

import pandas as pd

Construct the data frame object within the get exchange rate function.

# print the 15 min price for every bitcoin/currency
currency_exchange_rate = []
currency_index = []

for k in ticker:
    #sell_buy += "BTC:" + str(k) + " " + str(ticker[k].p15min) + "\n"
    currency_exchange_rate.append(ticker[k].p15min)
    currency_index.append(str(k))

# construct the pandas data frame object
d = {'BTC': currency_exchange_rate}
df = pd.DataFrame(data=d, index=currency_index)

text_widget.delete('1.0', END)  # clear the previous text first
s.set(df)
text_widget.insert(INSERT, s.get())  # populate the text widget with new exchange rate data

We have commented out the sell_buy string because we will use the data frame object to display the market data instead.
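To see the table layout on its own, here is a standalone sketch of the same idea with made-up rates instead of the live Blockchain ticker:

import pandas as pd

rates = {"USD": 11000.50, "EUR": 9800.30, "JPY": 1200000.00}  # dummy 15 min prices
df = pd.DataFrame({"BTC": list(rates.values())}, index=list(rates.keys()))
print(df)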

I have started a new Python channel; come and join the discussion through this link.

Test and Code: 82: pytest - favorite features since 3.0 - Anthony Sottile


Anthony Sottile is a pytest core contributor, as well as a maintainer and contributor to
many other projects. In this episode, Anthony shares some of the super cool features of pytest that have been added since he started using it.

We also discuss Anthony's move from user to contributor, and how others can help with the pytest project.

Special Guest: Anthony Sottile.

Sponsored By:

  • Azure Pipelines: Many organizations and open source projects are using Azure Pipelines already. Get started for free at azure.com/pipelines

Support Test & Code - Python Testing & Development: https://www.patreon.com/testpodcast

Links:

  • pytest documentation: https://pytest.org/en/latest/
  • pytest Changelog: http://doc.pytest.org/en/latest/changelog.html
  • pytest API Reference: http://doc.pytest.org/en/latest/reference.html
  • sponsor pytest: https://docs.pytest.org/en/latest/sponsor.html
  • getting started contributing to pytest: http://doc.pytest.org/en/latest/contributing.html
  • the book: Python Testing with pytest (the fastest way to learn pytest): https://amzn.to/2QnzvUv

Python Software Foundation: PyPI now supports uploading via API token

We're further increasing the security of the Python Package Index with another new beta feature: scoped API tokens for package upload. This is thanks to a grant from the Open Technology Fund, coordinated by the Packaging Working Group of the Python Software Foundation.

Over the last few months, we've added two-factor authentication (2FA) login security methods. We added Time-based One-Time Password (TOTP) support in late May and physical security device support in mid-June. Now, over 1600 users have started using physical security devices or TOTP applications to better secure their accounts. And over the past week, over 7.8% of logins to PyPI.org have been protected by 2FA, up from 3% in the month of June.

[Image: PyPI interface for adding an API token for package upload, with a text field for the token name and a dropdown to choose the token scope]
Now, we have another improvement: you can use API tokens to upload packages to PyPI and Test PyPI! And we've designed the token to be a drop-in replacement for the username and password you already use (warning: this is a beta feature that we need your help to test).

How it works: Go to your PyPI account settings and select "Add API token". When you create an API token, you choose its scope: you can create a token that can upload to all the projects you maintain or own, or you can limit its scope to just one project.
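For example, with twine the token simply replaces your username and password: the username becomes the literal string __token__ and the password is the token value itself. A minimal sketch (the token value below is a placeholder):

import glob
import os
import subprocess

env = dict(os.environ,
           TWINE_USERNAME="__token__",                       # literal username for token auth
           TWINE_PASSWORD="pypi-<paste-your-token-here>")    # the token copied from PyPI
subprocess.run(["twine", "upload", *glob.glob("dist/*")], env=env, check=True)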


[Image: PyPI API token management interface, showing each token's name, scope, creation time, and last use, with options to view its unique ID or revoke it]
The token management screen shows you when each of your tokens was created and last used. You can revoke one token without revoking the others, and without having to change your password on PyPI and in configuration files.

Uploading with an API token is currently optional but encouraged; in the future, PyPI will set and enforce a policy requiring users with two-factor authentication enabled to use API tokens to upload (rather than just their password sans second factor). Watch our announcement mailing list for future details.

[Image: Immediately after creating the API token, PyPI gives the user one chance to copy it; the token is a long string that is only displayed once]

Why: These API tokens can only be used to upload packages to PyPI, and not to log in more generally. This makes it safer to automate package upload and store the credential in the cloud, since a thief who copies the token won't also gain the ability to delete the project, delete old releases, or add or remove collaborators. And, since the token is a long character string (with 32 bytes of entropy and a service identifier) that PyPI has securely generated on the server side, we vastly reduce the potential for credential reuse on other sites and for a bad actor to guess the token.


Help us test: Please try this out! This is a beta feature and we expect that users will find minor issues over the next few weeks; we ask for your bug reports. If you find any potential security vulnerabilities, please follow our published security policy. (Please don't report security issues in Warehouse via GitHub, IRC, or mailing lists. Instead, please directly email security@python.org.) If you find an issue that is not a security vulnerability, please report it via GitHub.

We'd particularly like testing from:
  • Organizations that automate uploads using continuous integration
  • People who save PyPI credentials in a .pypirc file
  • Windows users
  • People on mobile devices
  • People on very slow connections
  • Organizations where users share an auth token within a group
  • Projects with 4+ maintainers or owners
  • People who usually block cookies and JavaScript
  • People who maintain 20+ projects
  • People who created their PyPI account 6+ years ago
What's next for PyPI: Next, we'll move on to working on an advanced audit trail of sensitive user actions, plus improvements to accessibility and localization for PyPI (some of which have already started). More details are in our progress reports on Discourse.

Thanks to the Open Technology Fund for funding this work. And please sign up for the PyPI Announcement Mailing List for future updates.

Python Insider: PyPI now supports uploading via API token

We're further increasing the security of the Python Package Index with another new beta feature: scoped API tokens for package upload. This is thanks to a grant from the Open Technology Fund, coordinated by the Packaging Working Group of the Python Software Foundation.

Over the last few months, we've added two-factor authentication (2FA) login security methods. We added Time-based One-Time Password (TOTP) support in late May and physical security device support in mid-June. Now, over 1600 users have started using physical security devices or TOTP applications to better secure their accounts. And over the past week, over 7.8% of logins to PyPI.org have been protected by 2FA, up from 3% in the month of June.

Now, we have another improvement: you can use API tokens to upload packages to PyPI and Test PyPI! And we've designed the token to be a drop-in replacement for the username and password you already use (warning: this is a beta feature that we need your help to test).

[Image: PyPI interface for adding an API token for package upload, with a text field for the token name and a dropdown to choose the token scope]
How it works: Go to your PyPI account settings and select "Add API token". When you create an API token, you choose its scope: you can create a token that can upload to all the projects you maintain or own, or you can limit its scope to just one project.


The token management screen shows you when each of your tokens was created and last used. You can revoke one token without revoking the others, and without having to change your password on PyPI and in configuration files.

[Image: PyPI API token management interface, showing each token's name, scope, creation time, and last use, with options to view its unique ID or revoke it]

Uploading with an API token is currently optional but encouraged; in the future, PyPI will set and enforce a policy requiring users with two-factor authentication enabled to use API tokens to upload (rather than just their password sans second factor). Watch our announcement mailing list for future details.

[Image: Immediately after creating the API token, PyPI gives the user one chance to copy it; the token is a long string that is only displayed once]

Why: These API tokens can only be used to upload packages to PyPI, and not to log in more generally. This makes it safer to automate package upload and store the credential in the cloud, since a thief who copies the token won't also gain the ability to delete the project, delete old releases, or add or remove collaborators. And, since the token is a long character string (with 32 bytes of entropy and a service identifier) that PyPI has securely generated on the server side, we vastly reduce the potential for credential reuse on other sites and for a bad actor to guess the token.


Help us test: Please try this out! This is a beta feature and we expect that users will find minor issues over the next few weeks; we ask for your bug reports. If you find any potential security vulnerabilities, please follow our published security policy. (Please don't report security issues in Warehouse via GitHub, IRC, or mailing lists. Instead, please directly email security@python.org.) If you find an issue that is not a security vulnerability, please report it via GitHub.

We'd particularly like testing from:
  • Organizations that automate uploads using continuous integration
  • People who save PyPI credentials in a .pypirc file
  • Windows users
  • People on mobile devices
  • People on very slow connections
  • Organizations where users share an auth token within a group
  • Projects with 4+ maintainers or owners
  • People who usually block cookies and JavaScript
  • People who maintain 20+ projects
  • People who created their PyPI account 6+ years ago
What's next for PyPI: Next, we'll move on to working on an advanced audit trail of sensitive user actions, plus improvements to accessibility and localization for PyPI (some of which have already started). More details are in our progress reports on Discourse.

Thanks to the Open Technology Fund for funding this work. And please sign up for the PyPI Announcement Mailing List for future updates.

Written by Sumana Harihareswara, published initially to https://pyfound.blogspot.com/2019/07/pypi-now-supports-uploading-via-api.html

Real Python: First Steps With PySpark and Big Data Processing


It’s becoming more common to face situations where the amount of data is simply too big to handle on a single machine. Luckily, technologies such as Apache Spark, Hadoop, and others have been developed to solve this exact problem. The power of those systems can be tapped into directly from Python using PySpark!

Efficiently handling datasets of gigabytes and more is well within the reach of any Python developer, whether you’re a data scientist, a web developer, or anything in between.

In this tutorial, you’ll learn:

  • What Python concepts can be applied to Big Data
  • How to use Apache Spark and PySpark
  • How to write basic PySpark programs
  • How to run PySpark programs on small datasets locally
  • Where to go next for taking your PySpark skills to a distributed system

Free Bonus: Click here to get access to a chapter from Python Tricks: The Book that shows you Python's best practices with simple examples you can apply instantly to write more beautiful + Pythonic code.

Big Data Concepts in Python

Despite its popularity as just a scripting language, Python exposes several programming paradigms like array-oriented programming, object-oriented programming, asynchronous programming, and many others. One paradigm that is of particular interest for aspiring Big Data professionals is functional programming.

Functional programming is a common paradigm when you are dealing with Big Data. Writing in a functional manner makes for embarrassingly parallel code. This means it’s easier to take your code and have it run on several CPUs or even entirely different machines. You can work around the physical memory and CPU restrictions of a single workstation by running on multiple systems at once.

This is the power of the PySpark ecosystem, allowing you to take functional code and automatically distribute it across an entire cluster of computers.

Luckily for Python programmers, many of the core ideas of functional programming are available in Python’s standard library and built-ins. You can learn many of the concepts needed for Big Data processing without ever leaving the comfort of Python.

The core idea of functional programming is that data should be manipulated by functions without maintaining any external state. This means that your code avoids global variables and always returns new data instead of manipulating the data in-place.

Another common idea in functional programming is anonymous functions. Python exposes anonymous functions using the lambda keyword, not to be confused with AWS Lambda functions.

Now that you know some of the terms and concepts, you can explore how those ideas manifest in the Python ecosystem.

Lambda Functions

lambda functions in Python are defined inline and are limited to a single expression. You’ve likely seen lambda functions when using the built-in sorted() function:

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(sorted(x))
['Python', 'awesome!', 'is', 'programming']
>>> print(sorted(x, key=lambda arg: arg.lower()))
['awesome!', 'is', 'programming', 'Python']

The key parameter to sorted is called for each item in the iterable. This makes the sorting case-insensitive by changing all the strings to lowercase before the sorting takes place.

This is a common use-case for lambda functions, small anonymous functions that maintain no external state.

Other common functional programming functions exist in Python as well, such as filter(), map(), and reduce(). All these functions can make use of lambda functions or standard functions defined with def in a similar manner.

filter(), map(), and reduce()

The built-in filter(), map(), and reduce() functions are all common in functional programming. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.

It’s important to understand these functions in a core Python context. Then, you’ll be able to translate that knowledge into PySpark programs and the Spark API.

filter() filters items out of an iterable based on a condition, typically expressed as a lambda function:

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(filter(lambda arg: len(arg) < 8, x)))
['Python', 'is']

filter() takes an iterable, calls the lambda function on each item, and returns the items where the lambda returned True.

Note: Calling list() is required because filter() returns an iterator rather than a list. filter() only gives you the values as you loop over them. list() forces all the items into memory at once instead of having to use a loop.

You can imagine using filter() to replace a common for loop pattern like the following:

def is_less_than_8_characters(item):
    return len(item) < 8

x = ['Python', 'programming', 'is', 'awesome!']
results = []

for item in x:
    if is_less_than_8_characters(item):
        results.append(item)

print(results)

This code collects all the strings that have less than 8 characters. The code is more verbose than the filter() example, but it performs the same function with the same results.

Another less obvious benefit of filter() is that it returns an iterable. This means filter() doesn’t require that your computer have enough memory to hold all the items in the iterable at once. This is increasingly important with Big Data sets that can quickly grow to several gigabytes in size.

map() is similar to filter() in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The new iterable that map() returns will always have the same number of elements as the original iterable, which was not the case with filter():

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(map(lambda arg: arg.upper(), x)))
['PYTHON', 'PROGRAMMING', 'IS', 'AWESOME!']

map() automatically calls the lambda function on all the items, effectively replacing a for loop like the following:

results = []
x = ['Python', 'programming', 'is', 'awesome!']

for item in x:
    results.append(item.upper())

print(results)

The for loop has the same result as the map() example, which collects all items in their upper-case form. However, as with the filter() example, map() returns an iterable, which again makes it possible to process large sets of data that are too big to fit entirely in memory.

Finally, the last of the functional trio in the Python standard library is reduce(). As with filter() and map(), reduce() applies a function to elements in an iterable.

Again, the function being applied can be a standard Python function created with the def keyword or a lambda function.

However, reduce() doesn’t return a new iterable. Instead, reduce() uses the function called to reduce the iterable to a single value:

>>> from functools import reduce
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(reduce(lambda val1, val2: val1 + val2, x))
Pythonprogrammingisawesome!

This code combines all the items in the iterable, from left to right, into a single item. There is no call to list() here because reduce() already returns a single item.

Note: Python 3.x moved the built-in reduce() function into the functools module.

lambda, map(), filter(), and reduce() are concepts that exist in many languages and can be used in regular Python programs. Soon, you’ll see these concepts extend to the PySpark API to process large amounts of data.

Sets

Sets are another common piece of functionality that exists in standard Python and is widely useful in Big Data processing. Sets are very similar to lists except that they do not have any ordering and cannot contain duplicate values. You can think of a set as similar to the keys in a Python dict.
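A quick illustration in plain Python:

words = ['python', 'spark', 'python', 'big', 'data', 'big']
print(set(words))             # duplicates collapse; ordering is not preserved
print('spark' in set(words))  # fast membership test: True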

Hello World in PySpark

As in any good programming tutorial, you’ll want to get started with a Hello World example. Below is the PySpark equivalent:

import pyspark

sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

Don’t worry about all the details yet. The main idea is to keep in mind that a PySpark program isn’t much different from a regular Python program.

Note: This program will likely raise an Exception on your system if you don't have PySpark installed yet (you'll see how to install it later) or don't have the specified copyright file.

You’ll learn all the details of this program soon, but take a good look. The program counts the total number of lines and the number of lines that have the word python in a file named copyright.

Remember, a PySpark program isn’t that much different from a regular Python program, but the execution model can be very different from a regular Python program, especially if you’re running on a cluster.

There can be a lot of things happening behind the scenes that distribute the processing across multiple nodes if you’re on a cluster. However, for now, think of the program as a Python program that uses the PySpark library.

Now that you’ve seen some common functional concepts that exist in Python as well as a simple PySpark program, it’s time to dive deeper into Spark and PySpark.

What Is Spark?

Apache Spark is made up of several components, so describing it can be difficult. At its core, Spark is a generic engine for processing large amounts of data.

Spark is written in Scala and runs on the JVM. Spark has built-in components for processing streaming data, machine learning, graph processing, and even interacting with data via SQL.

In this guide, you’ll only learn about the core Spark components for processing Big Data. However, all the other components such as machine learning, SQL, and so on are all available to Python projects via PySpark too.

What Is PySpark?

Spark is implemented in Scala, a language that runs on the JVM, so how can you access all that functionality via Python?

PySpark is the answer.

The current version of PySpark is 2.4.3 and works with Python 2.7, 3.3, and above.

You can think of PySpark as a Python-based wrapper on top of the Scala API. This means you have two sets of documentation to refer to:

  1. PySpark API documentation
  2. Spark Scala API documentation

The PySpark API docs have examples, but often you’ll want to refer to the Scala documentation and translate the code into Python syntax for your PySpark programs. Luckily, Scala is a very readable function-based programming language.

PySpark communicates with the Spark Scala-based API via the Py4J library. Py4J isn’t specific to PySpark or Spark. Py4J allows any Python program to talk to JVM-based code.

There are two reasons that PySpark is based on the functional paradigm:

  1. Spark’s native language, Scala, is functional-based.
  2. Functional code is much easier to parallelize.

Another way to think of PySpark is a library that allows processing large amounts of data on a single machine or a cluster of machines.

In a Python context, think of PySpark as a way to handle parallel processing without the need for the threading or multiprocessing modules. All of the complicated communication and synchronization between threads, processes, and even different CPUs is handled by Spark.

PySpark API and Data Structures

To interact with PySpark, you create specialized data structures called Resilient Distributed Datasets (RDDs).

RDDs hide all the complexity of transforming and distributing your data automatically: if you're running on a cluster, a scheduler spreads the work across multiple nodes for you.

To better understand PySpark’s API and data structures, recall the Hello World program mentioned previously:

import pyspark

sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

The entry-point of any PySpark program is a SparkContext object. This object allows you to connect to a Spark cluster and create RDDs. The local[*] string is a special string denoting that you’re using a local cluster, which is another way of saying you’re running in single-machine mode. The * tells Spark to create as many worker threads as logical cores on your machine.

Creating a SparkContext can be more involved when you’re using a cluster. To connect to a Spark cluster, you might need to handle authentication and a few other pieces of information specific to your cluster. You can set up those details similarly to the following:

conf = pyspark.SparkConf()
conf.setMaster('spark://head_node:56887')
conf.set('spark.authenticate', True)
conf.set('spark.authenticate.secret', 'secret-key')
sc = pyspark.SparkContext(conf=conf)

You can start creating RDDs once you have a SparkContext.

You can create RDDs in a number of ways, but one common way is the PySpark parallelize() function. parallelize() can transform some Python data structures like lists and tuples into RDDs, which gives you functionality that makes them fault-tolerant and distributed.

To better understand RDDs, consider another example. The following code creates an iterator of 10,000 elements and then uses parallelize() to distribute that data into 2 partitions:

>>> big_list = range(10000)
>>> rdd = sc.parallelize(big_list, 2)
>>> odds = rdd.filter(lambda x: x % 2 != 0)
>>> odds.take(5)
[1, 3, 5, 7, 9]

parallelize() turns that iterator into a distributed set of numbers and gives you all the capability of Spark’s infrastructure.

Notice that this code uses the RDD’s filter() method instead of Python’s built-in filter(), which you saw earlier. The result is the same, but what’s happening behind the scenes is drastically different. By using the RDD filter() method, that operation occurs in a distributed manner across several CPUs or computers.

Again, imagine this as Spark doing the multiprocessing work for you, all encapsulated in the RDD data structure.

take() is a way to see the contents of your RDD, but only a small subset. take() pulls that subset of data from the distributed system onto a single machine.

take() is important for debugging because inspecting your entire dataset on a single machine may not be possible. RDDs are optimized to be used on Big Data so in a real world scenario a single machine may not have enough RAM to hold your entire dataset.

Note: Spark temporarily prints information to stdout when running examples like this in the shell, which you’ll see how to do soon. Your stdout might temporarily show something like [Stage 0:> (0 + 1) / 1].

The stdout text demonstrates how Spark is splitting up the RDDs and processing your data into multiple stages across different CPUs and machines.

Another way to create RDDs is to read in a file with textFile(), which you’ve seen in previous examples. RDDs are one of the foundational data structures for using PySpark so many of the functions in the API return RDDs.

One of the key distinctions between RDDs and other data structures is that processing is delayed until the result is requested. This is similar to a Python generator. Developers in the Python ecosystem typically use the term lazy evaluation to explain this behavior.

You can stack up multiple transformations on the same RDD without any processing happening. This functionality is possible because Spark maintains a directed acyclic graph of the transformations. The underlying graph is only activated when the final results are requested. In the previous example, no computation took place until you requested the results by calling take().

There are multiple ways to request the results from an RDD. You can explicitly request results to be evaluated and collected to a single cluster node by using collect() on a RDD. You can also implicitly request the results in various ways, one of which was using count() as you saw earlier.
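As a small sketch, reusing the rdd built from range(10000) in the parallelize() example above:

# These transformations only build up the graph; no work happens yet:
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# The action at the end triggers evaluation. collect() brings every element back
# to the driver, so only use it when the result fits in local memory:
result = doubled.collect()
print(len(result))  # 5000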

Note: Be careful when using these methods because they pull the entire dataset into memory, which will not work if the dataset is too big to fit into the RAM of a single machine.

Again, refer to the PySpark API documentation for even more details on all the possible functionality.

Installing PySpark

Typically, you’ll run PySpark programs on a Hadoop cluster, but other cluster deployment options are supported. You can read Spark’s cluster mode overview for more details.

Note: Setting up one of these clusters can be difficult and is outside the scope of this guide. Ideally, your team has some wizard DevOps engineers to help get that working. If not, Hadoop publishes a guide to help you.

In this guide, you’ll see several ways to run PySpark programs on your local machine. This is useful for testing and learning, but you’ll quickly want to take your new programs and run them on a cluster to truly process Big Data.

Sometimes setting up PySpark by itself can be challenging too because of all the required dependencies.

PySpark runs on top of the JVM and requires a lot of underlying Java infrastructure to function. That being said, we live in the age of Docker, which makes experimenting with PySpark much easier.

Even better, the amazing developers behind Jupyter have done all the heavy lifting for you. They publish a Dockerfile that includes all the PySpark dependencies along with Jupyter. So, you can experiment directly in a Jupyter notebook!

Note: Jupyter notebooks have a lot of functionality. Check out Jupyter Notebook: An Introduction for a lot more details on how to use notebooks effectively.

First, you’ll need to install Docker. Take a look at Docker in Action – Fitter, Happier, More Productive if you don’t have Docker setup yet.

Note: The Docker images can be quite large, so make sure you're okay with using up around 5 GB of disk space to use PySpark and Jupyter.

Next, you can run the following command to download and automatically launch a Docker container with a pre-built PySpark single-node setup. This command may take a few minutes because it downloads the images directly from DockerHub along with all the requirements for Spark, PySpark, and Jupyter:

$ docker run -p 8888:8888 jupyter/pyspark-notebook

Once that command stops printing output, you have a running container that has everything you need to test out your PySpark programs in a single-node environment.

To stop your container, type Ctrl+C in the same window you typed the docker run command in.

Now it’s time to finally run some programs!

Running PySpark Programs

There are a number of ways to execute PySpark programs, depending on whether you prefer a command-line or a more visual interface. For a command-line interface, you can use the spark-submit command, the standard Python shell, or the specialized PySpark shell.

First, you’ll see the more visual interface with a Jupyter notebook.

Jupyter Notebook

You can run your program in a Jupyter notebook by running the following command to start the Docker container you previously downloaded (if it’s not already running):

$ docker run -p 8888:8888 jupyter/pyspark-notebook
Executing the command: jupyter notebook
[I 08:04:22.869 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[I 08:04:25.022 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
[I 08:04:25.022 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 08:04:25.027 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 08:04:25.028 NotebookApp] The Jupyter Notebook is running at:
[I 08:04:25.029 NotebookApp] http://(4d5ab7a93902 or 127.0.0.1):8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437
[I 08:04:25.029 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:04:25.037 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://(4d5ab7a93902 or 127.0.0.1):8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437

Now you have a container running with PySpark. Notice that the end of the docker run command output mentions a local URL.

Note: The output from the docker commands will be slightly different on every machine because the tokens, container IDs, and container names are all randomly generated.

You need to use that URL to connect to the Docker container running Jupyter in a web browser. Copy and paste the URL from your output directly into your web browser. Here is an example of the URL you’ll likely see:

http://127.0.0.1:8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437

The URL on your machine will likely differ slightly, but once you connect to it in your browser, you can access a Jupyter notebook environment, which should look similar to this:

Jupyter notebook homepage

From the Jupyter notebook page, you can use the New button on the far right to create a new Python 3 shell. Then you can test out some code, like the Hello World example from before:

import pyspark

sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

Here’s what running that code will look like in the Jupyter notebook:

PySpark Hello World in Jupyter notebook

There is a lot happening behind the scenes here, so it may take a few seconds for your results to display. The answer won’t appear immediately after you click the cell.

Command-Line Interface

The command-line interface offers a variety of ways to submit PySpark programs including the PySpark shell and the spark-submit command. To use these CLI approaches, you’ll first need to connect to the CLI of the system that has PySpark installed.

To connect to the CLI of the Docker setup, you’ll need to start the container like before and then attach to that container. Again, to start the container, you can run the following command:

$ docker run -p 8888:8888 jupyter/pyspark-notebook

Once you have the Docker container running, you need to connect to it via the shell instead of a Jupyter notebook. To do this, run the following command to find the container name:

$ docker container ls
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS                    NAMES
4d5ab7a93902        jupyter/pyspark-notebook   "tini -g -- start-no…"   12 seconds ago      Up 10 seconds       0.0.0.0:8888->8888/tcp   kind_edison

This command will show you all the running containers. Find the CONTAINER ID of the container running the jupyter/pyspark-notebook image and use it to connect to the bash shell inside the container:

$ docker exec -it 4d5ab7a93902 bash
jovyan@4d5ab7a93902:~$

Now you should be connected to a bash prompt inside of the container. You can verify that things are working because the prompt of your shell will change to be something similar to jovyan@4d5ab7a93902, but using the unique ID of your container.

Note: Replace 4d5ab7a93902 with the CONTAINER ID used on your machine.

Cluster

You can use the spark-submit command installed along with Spark to submit PySpark code to a cluster using the command line. This command takes a PySpark or Scala program and executes it on a cluster. This is likely how you’ll execute your real Big Data processing jobs.

Note: The path to these commands depends on where Spark was installed and will likely only work when using the referenced Docker container.

To run the Hello World example (or any PySpark program) with the running Docker container, first access the shell as described above. Once you’re in the container’s shell environment you can create files using the nano text editor.

To create the file in your current folder, simply launch nano with the name of the file you want to create:

$ nano hello_world.py

Type in the contents of the Hello World example and save the file by typing Ctrl+X and following the save prompts:

Example using Nano Text Editor

Finally, you can run the code through Spark with the spark-submit command:

$ /usr/local/spark/bin/spark-submit hello_world.py

This command results in a lot of output by default so it may be difficult to see your program’s output. You can control the log verbosity somewhat inside your PySpark program by changing the level on your SparkContext variable. To do that, put this line near the top of your script:

sc.setLogLevel('WARN')

This will omit some of the output of spark-submit so you can more clearly see the output of your program. However, in a real-world scenario, you’ll want to put any output into a file, database, or some other storage mechanism for easier debugging later.

Luckily, a PySpark program still has access to all of Python’s standard library, so saving your results to a file is not an issue:

import pyspark

sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
python_lines = txt.filter(lambda line: 'python' in line.lower())

with open('results.txt', 'w') as file_obj:
    file_obj.write(f'Number of lines: {txt.count()}\n')
    file_obj.write(f'Number of lines with python: {python_lines.count()}\n')

Now your results are in a separate file called results.txt for easier reference later.

Note: The above code uses f-strings, which were introduced in Python 3.6.

PySpark Shell

Another PySpark-specific way to run your programs is using the shell provided with PySpark itself. Again, using the Docker setup, you can connect to the container’s CLI as described above. Then, you can run the specialized Python shell with the following command:

$ /usr/local/spark/bin/pyspark
Python 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 23:01:00)
SparkSession available as 'spark'.

Now you’re in the Pyspark shell environment inside your Docker container, and you can test out code similar to the Jupyter notebook example:

>>> txt = sc.textFile('file:////usr/share/doc/python/copyright')
>>> print(txt.count())
316

Now you can work in the Pyspark shell just as you would with your normal Python shell.

Note: You didn’t have to create a SparkContext variable in the Pyspark shell example. The PySpark shell automatically creates a variable, sc, to connect you to the Spark engine in single-node mode.

You must create your own SparkContext when submitting real PySpark programs with spark-submit or a Jupyter notebook.

You can also use the standard Python shell to execute your programs as long as PySpark is installed into that Python environment. The Docker container you’ve been using does not have PySpark enabled for the standard Python environment. So, you must use one of the previous methods to use PySpark in the Docker container.

Combining PySpark With Other Tools

As you already saw, PySpark comes with additional libraries to do things like machine learning and SQL-like manipulation of large datasets. However, you can also use other common scientific libraries like NumPy and Pandas.

You must install these in the same environment on each cluster node, and then your program can use them as usual. Then, you’re free to use all the familiar idiomatic Pandas tricks you already know.

Remember: Pandas DataFrames are eagerly evaluated, so all the data will need to fit in memory on a single machine.
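As a small sketch of that workflow, using the python_lines RDD from the earlier Hello World example:

import pandas as pd

# Pull a small, aggregated result out of Spark and hand it to Pandas locally.
line_lengths = python_lines.map(lambda line: len(line)).collect()
local_df = pd.DataFrame({'line_length': line_lengths})
print(local_df.describe())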

Next Steps for Real Big Data Processing

Soon after learning the PySpark basics, you’ll surely want to start analyzing huge amounts of data that likely won’t work when you’re using single-machine mode. Installing and maintaining a Spark cluster is way outside the scope of this guide and is likely a full-time job in itself.

So, it might be time to visit the IT department at your office or look into a hosted Spark cluster solution. One potential hosted solution is Databricks.

Databricks allows you to host your data with Microsoft Azure or AWS and has a free 14-day trial.

After you have a working Spark cluster, you’ll want to get all your data into that cluster for analysis. Spark has a number of ways to import data:

  1. Amazon S3
  2. Apache Hive Data Warehouse
  3. Any database with a JDBC or ODBC interface

You can even read data directly from a Network File System, which is how the previous examples worked.

There’s no shortage of ways to get access to all your data, whether you’re using a hosted solution like Databricks or your own cluster of machines.

Conclusion

PySpark is a good entry-point into Big Data Processing.

In this tutorial, you learned that you don’t have to spend a lot of time learning up-front if you’re familiar with a few functional programming concepts like map(), filter(), and basic Python. In fact, you can use all the Python you already know including familiar tools like NumPy and Pandas directly in your PySpark programs.

You are now able to:

  • Understand built-in Python concepts that apply to Big Data
  • Write basic PySpark programs
  • Run PySpark programs on small datasets with your local machine
  • Explore more capable Big Data solutions like a Spark cluster or another custom, hosted solution

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]


PyCharm: Jupyter, PyCharm and Pizza


Hi there! Have you tried Jupyter Notebooks integration in PyCharm 2019.2? Not yet? Then let me show you what it looks like!

In this blog post, we’re going to explore some data using PyCharm and its Jupyter Notebook integration. First, of course, we’ll need said data. Whenever I need a new dataset to play with, I typically head to Kaggle where I’m sure to find something interesting to toy with. This time a dataset called “Pizza Restaurants and the Pizza They Sell” caught my attention. Who doesn’t love pizza? Let’s analyze these pizza restaurants and try to learn a thing or two from it.

Since this data isn’t a part of any of my existing PyCharm projects, I’ll create a new one.
Make sure to use PyCharm Professional Edition; the Community Edition does not include Jupyter Notebooks integration.

Create new PyCharm project

Tip: When using Jupyter notebooks in the browser, I tend to create multiple temporary notebooks just for experiments. It would be quite tedious to create a PyCharm project for each of them, so instead, you can have a single project for all such experiments.

I like my things organized, so once the project is created, I’ll add some structure to it – a directory for the data where I’ll move the downloaded dataset, and another directory for the notebooks.

Once I create my first pizza.ipynb notebook, PyCharm suggests installing the Jupyter package and provides a link in the upper right corner to do that.

Install Jupyter package

Once the Jupyter package is installed, we’re ready to go!

The first thing that probably 90% of data scientists do in their Jupyter notebooks is type import pandas as pd. At this point, PyCharm will suggest installing pandas in this venv and you can do it with a single click:

Install Pandas

Once we have pandas installed, we can read the data from the csv into a pandas DataFrame:
df = pd.read_csv("../data/Datafiniti_Pizza_Restaurants_and_the_Pizza_They_Sell_May19.csv")

To execute this cell, hit Shift+Enter, or click the green arrow icon in the gutter next to the cell.
When you run a cell for the first time, PyCharm will launch a local Jupyter server to execute the code in it – you don’t need to manually do this from your terminal.

Let’s get to know the data. First, we’ll learn the basic things about this dataset – how many rows does it have? What are the columns? What does the data look like?

First look at the data

I have a suspicion that this data contains information only on restaurants in the US. To confirm this, let’s count the values in the country column:

Count unique values in country column

Yep, the only country present in this dataset is the US, so it's safe to drop the country column altogether. The same goes for menus.currency and priceRangeCurrency: those values too are all the same, USD. I'll also drop menuPageURL, as it doesn't add much value to the analysis, and key, as it duplicates information from other columns (country, state, city, etc.).

Another cleanup that I’ll do here is rename province column into states as it makes more sense in this context, and for better readability, I’ll replace the state acronyms with full names of the states.

Data cleanup
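In code, the cleanup might look roughly like this (the dropped column names come from the dataset description above; the state-name mapping is abbreviated here for illustration):

df = df.drop(columns=['country', 'menus.currency', 'priceRangeCurrency',
                      'menuPageURL', 'key'])
df = df.rename(columns={'province': 'state'})
df['state'] = df['state'].replace({'NY': 'New York', 'CA': 'California'})  # ...and so on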

Once we’re done with cleaning the data, how about we plot it? As humans, we are better at understanding information when it’s presented visually.

First, let’s see what are the most common types of pizza we have in this dataset. Given the theme, it feels appropriate to visualize this as a pie with matplotlib :)

Pizza pie plot

Oops, where's my pie? To have it displayed, I need to add the %matplotlib inline magic command for IPython, and while I'm at it, I'll add another magic command to let IPython know to render the plots appropriately for retina screens.

I could add these lines to the same cell and run it again, but I prefer to have this type of magic commands defined at the very beginning of the notebook.

To navigate to the very beginning of the notebook, you can use Cmd+[ (Ctrl+Alt+Left on Windows). Inserting a new cell is as easy as typing #%% (and if you prefer a shortcut to insert a cell above your current one, there's one! Option+Shift+A on Mac, or Alt+Shift+A on Windows). Now all I need to do is add the magic commands and run all cells below:

Run Below

And voila! Now we know that the most common type of pizza is Cheese Pizza closely followed by White Pizza.

Pie plot
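For reference, a hedged sketch of how such a pie chart can be produced with pandas and matplotlib ('menus.name' is an assumed column name for the pizza type):

import matplotlib.pyplot as plt

top_pizzas = df['menus.name'].value_counts().head(10)
top_pizzas.plot.pie(figsize=(8, 8), autopct='%1.1f%%')
plt.ylabel('')
plt.show()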

What about the restaurants? We have their geographical locations in the dataset, so we can easily see where they are located.

Each restaurant has a unique id and can have multiple entries in the dataset, each entry representing a pizza from that restaurant’s menu. So to plot the restaurants and not the pizza, we’ll need to group the entries by restaurant id.

Unique restaurants
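A rough sketch of that grouping, assuming 'id' is the restaurant identifier column:

restaurants = df.groupby('id', as_index=False).first()
print(len(restaurants), 'unique restaurants')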

Now we can plot them on a map. For geographical plotting, I like to use plotly. Make sure to grab the latest version of it (4.0.0) to have plotly outputs rendered nicely in PyCharm.

Pizza restaurants on a map

What else can we learn from this data? Let’s try something a little more complicated. Let’s see what states have the most pizza restaurants in them. To make this comparison fair, we’ll count the restaurants per capita (per 100 000 residents). You can get the population data for the US and multiple other datasets at https://www.census.gov/.

Pizza restaurants per capita

And the winner is… New York!

One can think of a number of questions we can try to get answered with this dataset, like, what city has the most/least expensive Veggie Pizza? Or what are the most common pizza restaurant chains? If you want to toy with this dataset and answer these or other questions, you can grab it on kaggle and run your own analysis. The notebook used in this blog post is available on GitHub. And if you want to try it with PyCharm, make sure you’re using PyCharm 2019.2 Professional Edition.

Erik Marsja: How to Read and Write JSON Files using Python and Pandas


In this post, we will learn how to read and write JSON files using Python. In the first part, we are going to use the Python package json to create and write a JSON file. In the next part, we are going to use the Pandas read_json method to load JSON files into a Pandas dataframe. Here, we will learn how to read a JSON file locally and from a URL, as well as how to read a nested JSON file using Pandas.

Finally, as a bonus, we will also learn how to manipulate data in Pandas dataframes, rename columns, and plot the data using Seaborn.

What is a JSON File?

JSON, short for JavaScript Object Notation, is a compact, text-based format used to exchange data. The format is common for downloading and storing information from web servers via so-called Web APIs. Because JSON is text-based, we will recognize the structure when opening up a JSON file; it is not so different from Python's structure for a dictionary.

Example JSON file

In the first example, we are going to use the Python module json to create a JSON file. After we've done that, we are going to load the JSON file. In this Python JSON tutorial, we start by creating a dictionary for our data:

import json

data = {"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731", 
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012", 
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Manegement", "IT", "HR", 
                      "Finance", "IT", "Manegement", "IT"],
        "Sex":[ "M", "M", "M", 
              "M", "M", "F", "F", "F"]}

print(data)
Python dictionary

Saving to a JSON file

In Python, the module json enables us to read and write content to and from a JSON file. This module converts the JSON format to Python's internal data structures, so we can work with JSON structures just as we work with Python's own data structures.

Python JSON Example:

In the example code below, we start by importing the json module. After we’ve done that, we open up a new file and use the dump method to write a json file using Python.

import json
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

How to Use Pandas to Load a JSON File

Now, if we are going to work with the data we might want to use Pandas to load the JSON file into a Pandas dataframe. This will enable us to manipulate data, do summary statistics, and data visualization using Pandas built-in methods. Note, we will cover this briefly later in this post also.

Pandas Read Json Example:

In the next example, we are going to use the Pandas read_json method to read the JSON file we wrote earlier (i.e., data.json). It's fairly simple: we start by importing pandas as pd:

import pandas as pd

df = pd.read_json('data.json')

df

The output, when working with Jupyter Notebooks, will look like this:

Data Manipulation using Pandas

Now that we have loaded the JSON file into a Pandas dataframe, we are going to use the inplace argument to modify our dataframe in place. We start by setting the Sub_ID column as the index.

df.set_index('Sub_ID', inplace=True)
df

Pandas JSON to CSV Example

Now that we have loaded a JSON file into a dataframe, we may want to save it in another format. For instance, we may want to save it as a CSV file, and we can do that using the Pandas to_csv method. It may be useful to store the data as CSV if we prefer to browse through it in a text editor or Excel.

In the Pandas JSON to CSV example below, we simply write the dataframe we manipulated above to a CSV file using to_csv.

df.to_csv("data.csv")
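
If we later want to get the data back from the CSV file, the Pandas read_csv method can restore the dataframe. A minimal sketch, assuming data.csv was written as above with Sub_ID as the index:

df_from_csv = pd.read_csv("data.csv", index_col="Sub_ID")
df_from_csv.head()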

Learn more about working with CSV files using Pandas in the  Pandas Read CSV Tutorial

How to Load JSON from a URL

We have now seen how easy it is to create a JSON file, write it to our hard drive using Python, and, finally, how to read it using Pandas. However, as previously mentioned, JSON data is often stored on the web.

Thus, in this section of the Python JSON guide, we are going to learn how to use the Pandas read_json method to read a JSON file from a URL. Most often it’s fairly simple: we just create a string variable pointing to the URL:

url = "https://api.exchangerate-api.com/v4/latest/USD"
df = pd.read_json(url)
df.head()

Load JSON from a URL: Second Example

When loading some data, using Pandas read_json seems to create a dataframe with dictionaries within each cell. One way to deal with these nested dictionaries is to work with the Python module requests. This module also has a method for parsing JSON responses. After we have parsed the JSON, we use the json_normalize method to convert it to a dataframe.

Pandas Dataframe from JSON
import requests
from pandas.io.json import json_normalize

url = "https://think.cs.vt.edu/corgis/json/airlines/airlines.json"
resp = requests.get(url=url)

df = json_normalize(resp.json())
df.head()

As can be seen in the resulting dataframe, the column names are quite long. This is quite impractical when we later create a time series plot using Seaborn. We are therefore going to rename the columns so they become a bit easier to use.

In the code example below, we use the Pandas rename method together with the Python module re. That is, we use a regular expression to remove “statistics.# of” and “statistics.” from the column names. Finally, we also replace dots (“.”) with underscores (“_”) using the str.replace method:

import re

# strip the long prefixes that json_normalize left in the column names
df.rename(columns=lambda x: re.sub("statistics.# of", "", x),
          inplace=True)
df.rename(columns=lambda x: re.sub("statistics.", "", x),
          inplace=True)

# replace the remaining dots with underscores
df.columns = df.columns.str.replace("[.]", "_")
df.head()

 

Time Series Plot from JSON Data using Seaborn

In the last example in this post, we are going to use Seaborn to create a time series plot. The data we loaded from JSON into a dataframe contains data about delayed and cancelled flights. We are going to use Seaborn’s lineplot method to create a time series plot of the number of cancelled flights from 2003 to 2016, grouped by carrier code.

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(10, 7))

# one line per carrier code; disable the confidence interval bands
g = sns.lineplot(x="timeyear", y="flightscancelled", ci=False,
                 hue="carriercode", data=df)

g.set_ylabel("Flights Cancelled", fontsize=20)
g.set_xlabel("Year", fontsize=20)

# place the legend outside the plot area
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Note, we changed the font size as well as the x- and y-axis labels using the set_ylabel and set_xlabel methods. Furthermore, we also moved the legend outside the plot using Matplotlib’s legend method.


Conclusion

In this post we have learned how to write a JSON file from a Python dictionary and how to load that JSON file using Python and Pandas. Furthermore, we have also learned how to use Pandas to load a JSON file from a URL into a dataframe, and how to read a nested JSON file into a dataframe.

Here’s a link to a Jupyter Notebook containing all code examples in this post.

The post How to Read and Write JSON Files using Python and Pandas appeared first on Erik Marsja.

Will Kahn-Greene: crashstats-tools v1.0.1 released! cli for Crash Stats.


What is it?

crashstats-tools is a set of command-line tools for working with Crash Stats (https://crash-stats.mozilla.org/).

crashstats-tools comes with two commands:

  • supersearch: for performing Crash Stats Super Search queries
  • fetch-data: for fetching raw crash, dumps, and processed crash data for specified crash ids

v1.0.1 released!

I extracted two commands we have in the Socorro local dev environment as a separate Python project. This allows anyone to use those two commands without having to set up a Socorro local dev environment.

The audience for this is pretty limited, but I think it'll help significantly for testing analysis tools.

Say I'm working on an analysis tool that looks at crash report minidump files and does some additional analysis on them. I could use the supersearch command to get a list of crash ids to download data for and the fetch-data command to download the requisite data.

$ export CRASHSTATS_API_TOKEN=foo
$ mkdir crashdata
$ supersearch --product=Firefox --num=10 | \
    fetch-data --raw --dumps --no-processed crashdata

Then I can run my tools on the dumps in crashdata/upload_file_minidump/.

Be thoughtful about using data

Make sure to use these tools in compliance with our data policy:

https://crash-stats.mozilla.org/documentation/memory_dump_access/

Where to go for more

See the project on GitHub which includes a README which contains everything about the project including examples of usage, the issue tracker, and the source code:

https://github.com/willkg/crashstats-tools

Let me know whether this helps you!

PSF GSoC students blogs: Coding week #9


What did I do this week?

After a productive discussion with my mentors last week, we agreed to proceed with coding the local scoring algorithm for Binomial MGWR and testing the results from it. After familiarizing myself with the literature on the local scoring procedure, I coded it in the context of local models for MGWR. After multiple iterations, the model is converging and the bandwidth results look as expected. The parameter coefficients have values close to what is expected, though not yet as accurate as needed. There could be issues with the weights used in the model, and some adjustments to the coefficients still need to be figured out.

What is coming up next?

In the coming week I will work on resolving the coefficient value issue discussed above, and I will design and implement a Monte Carlo experiment for the Binomial model, as was done for the Poisson MGWR model.

Did I get stuck anywhere?

The modeling of the binary response variable with MGWR is still not resolved and issues keep coming up, though that is expected in research. I am hoping to resolve these final issues soon and then work further on predictions with GWR and MGWR.

Looking forward to the progress update next week!

PSF GSoC students blogs: Weekly Check-in #8


In the past week, I was working on setting up Hadoop and trying to import data from it. I got my PRs reviewed by my mentor and am working on the changes he suggested.

What did I do this week?

I had initially set up Hadoop on my Ubuntu system, but setting it up in Travis CI would be difficult, so I was exploring other options. The easy way to do this is through Docker, but there is no official Hadoop image for Docker. I was checking out Cloudera's quickstart VM, but when I tried to set it up my laptop started to hang. I will continue to look into other options. Also, my mentor reviewed my HDFS source PR and guided me on how to proceed further.

What is coming up next?

I'll have to work on the Docker setup for the HDFS source. I'll probably have to write a script or a docker-compose file. The MySQL PR had an issue while my mentor was adding a merge test; I will work on that as well. We'll be preparing for a release soon.

Did you get stuck anywhere?

I struggled a bit with the Hadoop setup. My mentor gave me some input on this and hopefully I'll be able to create a Docker setup by next week.

 

Tryton News: Newsletter August 2019


@ced wrote:

Tryton development has now resumed its cruising pace. There are a lot of changes that improve the user experience. A new major feature, secondary units, has landed in the form of four new modules.

Thanks to the Open Source program of KeyCDN, our website and forum are now sped up by delivering static content at global scale. We have also pushed our downloads to KeyCDN, so we encourage you to use downloads-cdn.tryton.org instead of downloads.tryton.org (please check your automated scripts that look for new releases).

Please help translate Tryton into your language. The development sources in the repositories are updated every month, so please don’t forget to check them regularly on https://translate.tryton.org/.

Contents:

Changes For The User

We changed all main editable lists to add new records at the top. This is more efficient for the web client on large lists. Inside One2Many lists, however, we keep adding new records at the bottom.

You can now define which unit of measure is the basis for quantities used in a price list. In standard modules we support the default unit (the original one) and the sale unit.

The column sizing of the web client has been improved. Columns now have a minimal width (depending on their type), and a double scrollbar (top and bottom) is displayed if there is not enough space to show all the columns in the viewport.

Until now, it was possible to cancel a posted supplier invoice but not a posted customer invoice, because in many countries this is not allowed. In order to be more flexible, we added an option on the company to allow cancelling customer invoices.

When renewing a fiscal year, the new sequences now also get their name updated. If the name of the previous fiscal year appears in a sequence name, it is replaced by the new fiscal year’s name. This reduces confusion when listing all the sequences.

The income statement is included in the balance sheet for the Spanish accounting (as is done for other countries). So the running income of the current year is already included before the year closing.

When using the “Enter Timesheet” wizard, we now display the date in the window name (next to the employee name). The date shown is the one selected in the first step of the wizard.

New Modules

Modules to manage secondary unit

They follow the blueprint Uom Conversion Inter category on product and allow defining a different secondary unit and factor on the product for sales and for purchases.
The quantity on sale and purchase lines can be entered using the secondary unit fields (quantity and unit price); the main unit fields are then automatically updated using the product’s factor.
On related documents like the invoice or shipment, the secondary fields are displayed using the factor stored on the sale or purchase.

Changes For The Developer

Since release 5.2 the view parser has been reusable for different view types (e.g. form and list-form). Now we also reuse the form parser for the board views. This reduces the code to maintain and ensures the same behavior for the same tags.

A stock move can be the origin of another stock move. This allows us to keep a link between inventory, incoming and outgoing moves.

We support conversion between different categories of units of measure as long as the user provides a factor/rate between the base units of both categories.

The Docker images of Tryton now have proteus installed. This is useful if you want to run trytond_import_zip on them or launch the tests.

The expand attribute has been changed from a boolean (1 or 0) into an integer. The integer represents the proportion of the available space that the column takes among all expanded columns.

The format_date method on Report can now take an optional format parameter if you don’t want to use the default format of the language.

The web client now updates the states of the wizard buttons and the title like the desktop client does. This closes the behavior gap between both clients a little more.

The Stripe payment module didn’t support the webhook for expired charges. Now it is supported and behaves the same way as the one for failed charges.

There is now an environment variable to set the default logging level when running trytond as a WSGI application.

The countries, subdivisions and currencies are no longer loaded from XML at module installation, but via proteus scripts which use pycountry data: trytond_import_countries and trytond_import_currencies. The translations are also loaded by those scripts.
This reduces the maintenance load of each release and allows users to keep their database up to date without relying on Tryton releases.


Django Weblog: Django security releases issued: 2.2.4, 2.1.11 and 1.11.23


In accordance with our security release policy, the Django team is issuing Django 1.11.23, Django 2.1.11, and Django 2.2.4. These releases address the security issues detailed below. We encourage all users of Django to upgrade as soon as possible.

Thanks Guido Vranken and Sage M. Abdullah for reporting these issues.

CVE-2019-14232: Denial-of-service possibility in django.utils.text.Truncator

If django.utils.text.Truncator's chars() and words() methods were passed the html=True argument, they were extremely slow to evaluate certain inputs due to a catastrophic backtracking vulnerability in a regular expression. The chars() and words() methods are used to implement the truncatechars_html and truncatewords_html template filters, which were thus vulnerable.

The regular expressions used by Truncator have been simplified in order to avoid potential backtracking issues. As a consequence, trailing punctuation may now at times be included in the truncated output.
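
For reference, the affected methods are used roughly like this; a minimal sketch (the sample string and lengths are made up, not taken from the advisory):

from django.utils.text import Truncator

sample = "<p>The quick brown fox jumps over the lazy dog</p>"

# truncate to the first three words while keeping the HTML tags balanced
Truncator(sample).words(3, html=True)

# truncate to at most twelve characters, again HTML-aware
Truncator(sample).chars(12, html=True)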

CVE-2019-14233: Denial-of-service possibility in strip_tags()

Due to the behavior of the underlying HTMLParser, django.utils.html.strip_tags() would be extremely slow to evaluate certain inputs containing large sequences of nested incomplete HTML entities. The strip_tags() method is used to implement the corresponding striptags template filter, which was thus also vulnerable.

strip_tags() now avoids recursive calls to HTMLParser when progress removing tags, but necessarily incomplete HTML entities, stops being made.

Remember that absolutely NO guarantee is provided about the results of strip_tags() being HTML safe. So NEVER mark safe the result of a strip_tags() call without escaping it first, for example with django.utils.html.escape().
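
In other words, if the stripped output ends up in HTML again, escape it explicitly. A minimal sketch (the input string is made up):

from django.utils.html import escape, strip_tags

raw = '<a href="#">Click</a> <b>here</b> &amp; win'

# strip the markup first, then escape whatever text remains
safe_text = escape(strip_tags(raw))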

CVE-2019-14234: SQL injection possibility in key and index lookups for JSONField/HStoreField

Key and index lookups for django.contrib.postgres.fields.JSONField and key lookups for django.contrib.postgres.fields.HStoreField were subject to SQL injection, using a suitably crafted dictionary, with dictionary expansion, as the **kwargs passed to QuerySet.filter().

CVE-2019-14235: Potential memory exhaustion in django.utils.encoding.uri_to_iri()

If passed certain inputs, django.utils.encoding.uri_to_iri could lead to significant memory usage due to excessive recursion when re-percent-encoding invalid UTF-8 octet sequences.

uri_to_iri() now avoids recursion when re-percent-encoding invalid UTF-8 octet sequences.

Affected supported versions

  • Django master development branch
  • Django 2.2 before version 2.2.4
  • Django 2.1 before version 2.1.11
  • Django 1.11 before version 1.11.23

Resolution

Patches to resolve the issues have been applied to Django's master branch and to the 2.2, 2.1, and 1.11 release branches.


The PGP key ID used for this release is Carlton Gibson: E17DF5C82B4F9D00

General notes regarding security reporting

As always, we ask that potential security issues be reported via private email to security@djangoproject.com, and not via Django's Trac instance, Django's GitHub repositories, or the django-developers list. Please see our security policies for further information.


PSF GSoC students blogs: Weekly Checkin #5


1. What did you do this week?

I'm still working on the zip reformulation and, at the same time, on the keyfun parameter of the min and max builtins. To be completely honest, this week has not been that productive because I am studying for an upcoming exam. But I'm hoping to get back on track in the next couple of days.

2. What is coming up next?

I will keep working on both of them until they are ready to be merged.

3. Did you get stuck anywhere?

Yes, I was stuck with the zip reformulation for a while, but with the help of my mentor everything is back on track.

Matt Layman: Add Static Assets to Deployment - Building SaaS #29

In this episode, we pushed CI-built static files to S3, then pulled those files into the Ansible deployment. This is part of the ongoing effort to simplify deployment by moving work to CI. Last time, we processed static files like JavaScript, CSS, and images using webpack on Circle CI. Once the files were processed, I used the tar command to create a tarball (i.e., a .tar.gz file) that contains all the static assets.
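
The packaging step boils down to a command along these lines (the file and directory names are illustrative, not the ones used in the episode):

$ tar -czf static-assets.tar.gz static/   # c=create, z=gzip, f=output file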

Catalin George Festila: Python 3.7.3 : Using the flask - part 004.

The goal of this tutorial is to interact with the database in order to use it with the flask_sqlalchemy Python module. The db.Model class is used to interact with the database. A database table doesn't strictly need a primary key, but if you are using flask-sqlalchemy you need to define one for each table in order to map it. Let's see the database: C:\Python373\my_flask>python Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25
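
As a rough sketch of what such a model can look like (the table and column names here are invented for illustration, they are not the ones from the tutorial):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///test.db'
db = SQLAlchemy(app)

class User(db.Model):
    # flask-sqlalchemy needs a primary key on every mapped table
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))

# create the table (on recent Flask-SQLAlchemy versions, run this inside an app context)
db.create_all()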

Catalin George Festila: Python 3.7.3 : Using the flask - part 005.

In the last tutorial, I used the flask-sqlalchemy Python module. Today I will show you how to use the flask_marshmallow Python module. First, let's take a look at this module; from the official webpage: Flask-Marshmallow is a thin integration layer for Flask (a Python web framework) and marshmallow (an object serialization/deserialization library) that adds additional features to
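
A minimal sketch of how Flask-Marshmallow plugs into a Flask app (the schema and its fields are invented for illustration):

from flask import Flask
from flask_marshmallow import Marshmallow

app = Flask(__name__)
ma = Marshmallow(app)

class UserSchema(ma.Schema):
    class Meta:
        # only these fields are included when serializing
        fields = ("id", "name")

user_schema = UserSchema()
users_schema = UserSchema(many=True)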

IslandT: Use the Blockchain data to populate the combo box


Previously, the cryptocurrency application loaded the world currency text file and then populated the currency combo box based on the currency symbols in that text file. In this article, the program will instead use the currency symbols returned from Blockchain to populate that same combo box.

Populate the currency combo box with currency symbols from Blockchain.

    for k in ticker:
        #sell_buy += "BTC:" + str(k) + " " + str(ticker[k].p15min) + "\n"
        currency_exchange_rate.append((ticker[k].p15min))
        currency_index.append(str(k))
        curr1 += (str(k),) # the tuple used to populate the currency combo box

As you can see, all the currency symbols are kept in a tuple, which is then used to populate the currency combo box.

Below is the entire code used to populate the combo box with the currency symbol.

def get_exchange_rate():  # this method will display the incoming exchange rate data after the api called

    global exchange_rate
    global base_crypto

    # read the currency file
    #c_string = ''
    #with open('currency.txt') as fp:
        #for currency in fp.readlines():
            #c_string += currency[:-1] + ","
    #c_string = c_string[:-1]

    #base_crypto = crypto.get()  # get the desired crypto currency
    #try:
        #url = "https://min-api.cryptocompare.com/data/price"  # url for API call
        #data = {'fsym': base_crypto, 'tsyms': c_string}
        # = requests.get(url, params=data)
        #exchange_rate_s = json.loads(json.dumps(r.json()))

    #except:
        #print("An exception occurred")

    curr1 = tuple()  # the tuple which will be populated by currency
    sell_buy = ''

    #for key, value in exchange_rate_s.items():  # populate exchange rate string and the currency tuple
        #sell_buy += base_crypto + ":" + key + "  " + str(value) + "\n"
        #curr1 += (key,)

    #sell_buy += "Bitcoin : Currency price every 15 minute:" + "\n\n"
    # print the 15 min price for every bitcoin/currency
    currency_exchange_rate = []
    currency_index = []


    for k in ticker:
        #sell_buy += "BTC:" + str(k) + " " + str(ticker[k].p15min) + "\n"
        currency_exchange_rate.append((ticker[k].p15min))
        currency_index.append(str(k))
        curr1 += (str(k),) # the tuple used to populate the currency combo box

    # construct the pandas data frame object
    d = {'BTC': currency_exchange_rate}
    df = pd.DataFrame(data=d, index=currency_index)

    countVar = StringVar()  # use to hold the character count
    text_widget.tag_remove("search", "1.0", "end")  # cleared the hightlighted currency pair

    text_widget.delete('1.0', END)  # clear all those previous text first
    s.set(df)
    text_widget.insert(INSERT, s.get())  # populate the text widget with new exchange rate data

    # highlight the background of the searched currency pair
    pos = text_widget.search('AUD', "1.0", stopindex="end", count=countVar)
    text_widget.tag_configure("search", background="green")
    end_pos = float(pos) + float(0.96)
    text_widget.tag_add("search", pos, str(end_pos))
    pos = float(pos) + 2.0
    text_widget.see(str(pos))

    # fill up combo box of world currency
    based['values'] = curr1
    based.current(0)

    # enable all buttons
    action_search.config(state=NORMAL)
    action_coin_volume.config(state=NORMAL)
    action_coin_market_cap.config(state=NORMAL)
    action_coin_top_exchange.config(state=NORMAL)

Some of the code has already appeared in the previous chapter, and you can read about it in the previous article.

Load the currency symbol from Blockchain
