Channel: Planet Python

Eli Bendersky: Covariance and contravariance in subtyping

Many programming languages support subtyping, a kind of polymorphism that lets us define hierarchical relations on types, with specific types being subtypes of more generic types. For example, a Cat could be a subtype of Mammal, which itself is a subtype of Vertebrate.

Intuitively, functions that accept any Mammal would accept a Cat too. More formally, this is known as the Liskov substitution principle:

Let φ(x) be a property provable about objects x of type T. Then φ(y) should be true for objects y of type S where S is a subtype of T.

A shorter way to say S is a subtype of T is S <: T. The relation <: is also sometimes expressed as ≤, and can be thought of as "is less general than". So Cat <: Mammal and Mammal <: Vertebrate. Naturally, <: is transitive, so Cat <: Vertebrate; it's also reflexive, as T <: T for any type T [1].
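As a rough illustration in Python, the properties of <: can be checked on a small class hierarchy. This is only a sketch: the helper name is_subtype is made up here, and using issubclass as a stand-in for <: is an assumption that holds only for nominal class hierarchies like this one.

```python
# Hypothetical class hierarchy mirroring the example in the text.
class Vertebrate: pass
class Mammal(Vertebrate): pass
class Cat(Mammal): pass

def is_subtype(s: type, t: type) -> bool:
    """Approximate S <: T with Python's nominal subclass check."""
    return issubclass(s, t)

print(is_subtype(Cat, Mammal))      # Cat <: Mammal
print(is_subtype(Cat, Vertebrate))  # follows by transitivity
print(is_subtype(Mammal, Mammal))   # reflexivity: T <: T
print(is_subtype(Mammal, Cat))      # the relation is not symmetric
```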

Kinds of variance in subtyping

Variance refers to how subtyping between composite types (e.g. list of Cats versus list of Mammals) relates to subtyping between their components (e.g. Cats and Mammals). Let's use the general Composite<T> to refer to some composite type with components of type T.

Given types S and T with the relation S <: T, variance is a way to describe the relation between the composite types:

  • Covariant means the ordering of component types is preserved: Composite<S> <: Composite<T>.
  • Contravariant means the ordering is reversed: Composite<T> <: Composite<S>[2].
  • Bivariant means both covariant and contravariant.
  • Invariant means neither covariant nor contravariant.

That's a lot of theory and rules right in the beginning; the following examples should help clarify all of this.

Covariance in return types of overriding methods in C++

In C++, when a subclass method overrides a similarly named method in a superclass, their signatures have to match. There is an important exception to this rule, however. When the original return type is B* or B&, the return type of the overriding function is allowed to be D* or D& respectively, provided that D is a public subclass of B. This rule is important to implement methods like Clone:

struct Mammal {
  virtual ~Mammal() = 0;
  virtual Mammal* Clone() = 0;
};

struct Cat : public Mammal {
  virtual ~Cat() {}
  Cat* Clone() override {
    return new Cat(*this);
  }
};

struct Dog : public Mammal {
  virtual ~Dog() {}
  Dog* Clone() override {
    return new Dog(*this);
  }
};

And we can write functions like the following:

Mammal* DoSomething(Mammal* m) {
  Mammal* cloned = m->Clone();
  // Do something with cloned
  return cloned;
}

No matter what the concrete run-time class of m is, m->Clone() will return the right kind of object.

Armed with our new terminology, we can say that the return type rule for overriding methods is covariant for pointer and reference types. In other words, given Cat <: Mammal we have Cat* <: Mammal*.

Being able to replace Mammal* by Cat* seems like a natural thing to do in C++, but not all typing rules are covariant. Consider this code:

struct MammalClinic {
  virtual void Accept(Mammal* m);
};

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c);
};

Looks legit? We have general MammalClinics that accept all mammals, and more specialized CatClinics that only accept cats. Given a MammalClinic*, we should be able to call Accept and the right one will be invoked at run-time, right? Wrong. CatClinic::Accept does not actually override MammalClinic::Accept; it simply overloads it. If we try to add the override keyword (as we should always do starting with C++11):

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c) override;
};

We'll get:

error: ‘virtual void CatClinic::Accept(Cat*)’ marked ‘override’, but does not override
   virtual void Accept(Cat* c) override;
                ^

This is precisely what the override keyword was created for: to help us find erroneous assumptions about methods overriding other methods. The reality is that parameter types of overriding functions are not covariant for pointer types. They are invariant. In fact, the vast majority of typing rules in C++ are invariant; std::vector<Cat> is not a subtype of std::vector<Mammal>, even though Cat <: Mammal. As the next section demonstrates, there's a good reason for that.

Covariant arrays in Java

Suppose we have PersianCat <: Cat, and some class representing a list of cats. Does it make sense for lists to be covariant? On initial thought, yes. Say we have this (pseudocode) function:

MakeThemMeow(List<Cat> lst) {
    for each cat in lst {
        cat->Meow()
    }
}

Why shouldn't we be able to pass a List<PersianCat> into it? After all, all Persian cats are cats, so they can all meow! As long as lists are immutable, this is actually safe. The problem appears when lists can be modified. The best example of this problem can be demonstrated with actual Java code, since Java arrays are covariant:

class Main {
  public static void main(String[] args) {
    String strings[] = {"house", "daisy"};
    Object objects[] = strings; // covariant

    objects[1] = "cauliflower"; // works fine
    objects[0] = 5;             // throws exception
  }
}

In Java, String <: Object, and since arrays are covariant, it means that String[] <: Object[], which makes the assignment on the line marked with "covariant" type-check successfully. From that point on, objects is an array of Object as far as the compiler is concerned, so assigning anything that's a subclass of Object to its elements is kosher, including integers [3]. Therefore the last line in main throws an exception at run-time:

Exception in thread "main" java.lang.ArrayStoreException: java.lang.Integer
    at Main.main(Main.java:7)

Assigning an integer fails because at run-time it's known that objects is actually an array of strings. Thus, covariance together with mutability makes array types unsound. Note, however, that this is not just a mistake. It's a deliberate historical decision made when Java didn't have generics and polymorphism was still desired; the same problem exists in C#.

Other languages have immutable containers, which can then be made covariant without jeopardizing the soundness of the type system. For example in OCaml lists are immutable and covariant.

Contravariance for function types

Covariance seems like a pretty intuitive concept, but what about contravariance? When does it make sense to reverse the subtyping relation for composite types to get Composite<T> <: Composite<S> for S <: T?

An important use case is function types. Consider a function that takes a Mammal and returns a Mammal; in functional programming the type of this function is commonly referred to as Mammal -> Mammal. Which function types are valid subtypes of this type?

Here's a pseudo-code definition that makes it easier to discuss:

func user(f : Mammal -> Mammal) {
  // do stuff with 'f'
}

Can we call user providing it a function of type Mammal -> Cat as f? Inside its body, user may invoke f and expect its return value to be a Mammal. Since Mammal -> Cat returns cats, that's fine, so this usage is safe. It aligns with our earlier intuition that covariance makes sense for function return types.

Note that passing a Mammal -> Vertebrate function as f does not work, however, because user expects f to return Mammals, but our function may return a Vertebrate that's not a Mammal (maybe a Bird). Therefore, function return types are not contravariant.

But what about function parameters? So far we've been looking at function types that take Mammal - an exact match for the expected signature of f. Can we call user with a function of type Cat -> Mammal? No, because user expects to be able to pass any kind of Mammal into f, not just Cats. So function parameters are not covariant. On the other hand, it should be safe to pass a function of type Vertebrate -> Mammal as f, because it can take any Mammal, and that's what user is going to pass to it. So contravariance makes sense for function parameters.

Most generally, we can say that Vertebrate -> Cat is a subtype of Mammal -> Mammal, because parameter types are contravariant and return types are covariant. A nice quote that can help remember these rules is: be liberal in what you accept and conservative in what you produce.
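The same rule can be sketched with Python type hints. This is an illustration under assumptions rather than anything from the original article: the classes and function names below are invented, and the comments about what a static checker accepts describe the expected behaviour of a tool such as mypy, which is not run here. At run time Python performs no such check.

```python
from typing import Callable

class Vertebrate: pass
class Mammal(Vertebrate): pass
class Cat(Mammal): pass

def user(f: Callable[[Mammal], Mammal]) -> Mammal:
    # 'user' only ever passes a Mammal in and relies on getting a Mammal back.
    return f(Mammal())

def vertebrate_to_cat(v: Vertebrate) -> Cat:
    # Accepts more than required (any Vertebrate), returns something
    # more specific than required (a Cat): Vertebrate -> Cat.
    return Cat()

def cat_to_mammal(c: Cat) -> Mammal:
    # Accepts only Cats: Cat -> Mammal. A static checker is expected to
    # reject user(cat_to_mammal), because 'user' may pass in any Mammal.
    return Mammal()

result = user(vertebrate_to_cat)
print(type(result).__name__)
```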

This is not just theory; if we go back to C++, this is exactly how function types with std::function behave:

#include <functional>

struct Vertebrate {};
struct Mammal : public Vertebrate {};
struct Cat : public Mammal {};

Cat* f1(Vertebrate* v) {
  return nullptr;
}

Vertebrate* f2(Vertebrate* v) {
  return nullptr;
}

Cat* f3(Cat* v) {
  return nullptr;
}

void User(std::function<Mammal*(Mammal*)> f) {
  // do stuff with 'f'
}

int main() {
  User(f1);  // works

  return 0;
}

The invocation User(f1) compiles, because f1 is convertible to the type std::function<Mammal*(Mammal*)>[4]. Had we tried to invoke User(f2) or User(f3), they would fail because neither f2 nor f3 is a proper subtype of std::function<Mammal*(Mammal*)>.

Bivariance

So far we've seen examples of invariance, covariance and contravariance. What about bivariance? Recall, bivariance means that given S <: T, both Composite<S> <: Composite<T> and Composite<T> <: Composite<S> are true. When is this useful? Not often at all, it turns out.

In TypeScript, function parameters are bivariant. The following code compiles correctly but fails at run-time:

function trainDog(d: Dog) { ... }
function cloneAnimal(source: Animal, done: (result: Animal) => void): void { ... }
let c = new Cat();

// Runtime error here occurs because we end up invoking 'trainDog' with a 'Cat'
cloneAnimal(c, trainDog);

Once again, this is not because the TypeScript designers are incompetent. The reason is fairly intricate and explained on this page; the summary is that it's needed to help the type-checker treat functions that don't mutate their arguments as covariant for arrays.

That said, TypeScript 2.6 changes this with a new strictness flag (--strictFunctionTypes) that treats parameters only contravariantly.

Explicit variance specification in Python type-checking

If you had to guess which of the mainstream languages has the most advanced support for variance in its type system, Python probably wouldn't be your first guess, right? I admit it wasn't mine either, because Python is dynamically (duck) typed. But the new type hinting support (described in PEP 484 with more details in PEP 483) is actually fairly advanced.

Here's an example:

from typing import List

class Mammal:
    pass

class Cat(Mammal):
    pass

def count_mammals_list(seq: List[Mammal]) -> int:
    return len(seq)

mlst = [Mammal(), Mammal()]
print(count_mammals_list(mlst))

If we run mypy type-checking on this code, it will succeed. count_mammals_list takes a list of Mammals, and this is what we passed in; so far, so good. However, the following will fail:

clst = [Cat(), Cat()]
print(count_mammals_list(clst))

Because List is not covariant. Python doesn't know whether count_mammals_list will modify the list, so allowing calls with a list of Cats is potentially unsafe.
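To make the danger concrete, here is a small sketch of what could go wrong if a list of Cats were accepted where a list of Mammals is expected. The Dog class and the add_a_dog function are invented for this illustration. Python does not enforce the annotations at run time, so the script runs, and that is exactly how the problem becomes visible.

```python
from typing import List

class Mammal: pass
class Cat(Mammal):
    def meow(self) -> str:
        return "meow"
class Dog(Mammal): pass

def add_a_dog(seq: List[Mammal]) -> None:
    # Perfectly valid for a list of Mammals...
    seq.append(Dog())

cats: List[Cat] = [Cat(), Cat()]
add_a_dog(cats)  # a type checker is expected to flag this; Python still runs it

# ...but now the "list of Cats" contains a Dog, so calling meow() on
# every element would fail with an AttributeError.
has_non_cat = any(not isinstance(c, Cat) for c in cats)
print(has_non_cat)
```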

It turns out that the typing module lets us express the variance of types explicitly. Here's a very minimal "immutable list" implementation that only supports counting elements:

from typing import Generic, Iterable, TypeVar

T_co = TypeVar('T_co', covariant=True)

class ImmutableList(Generic[T_co]):
    def __init__(self, items: Iterable[T_co]) -> None:
        self.lst = list(items)

    def __len__(self) -> int:
        return len(self.lst)

And now if we define:

def count_mammals_ilist(seq: ImmutableList[Mammal]) -> int:
    return len(seq)

We can actually invoke it with an ImmutableList of Cats, and this will pass type checking:

cimmlst = ImmutableList([Cat(), Cat()])
print(count_mammals_ilist(cimmlst))

Similarly, we can support contravariant types, etc. The typing module also provides a number of useful built-ins; for example, it's not really necessary to create an ImmutableList type, as there's already a Sequence type that is covariant.
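For completeness, a contravariant generic might look like the following sketch. The Sink class and feed_cats function are hypothetical and not part of the typing module. The idea is that a write-only container of Mammals can safely be used wherever a write-only container of Cats is expected.

```python
from typing import Generic, List, TypeVar

class Mammal: pass
class Cat(Mammal): pass

T_contra = TypeVar('T_contra', contravariant=True)

class Sink(Generic[T_contra]):
    """Hypothetical write-only container: items go in, none come back out."""
    def __init__(self) -> None:
        self.count = 0

    def send(self, item: T_contra) -> None:
        # T_contra appears only in parameter position, which is what
        # makes declaring it contravariant sound.
        self.count += 1

def feed_cats(sink: Sink[Cat], cats: List[Cat]) -> int:
    for c in cats:
        sink.send(c)
    return sink.count

# A Sink[Mammal] accepts any Mammal, so it can stand in for a Sink[Cat].
mammal_sink: Sink[Mammal] = Sink()
print(feed_cats(mammal_sink, [Cat(), Cat()]))
```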


[1] In most cases <: is also antisymmetric, making it a partial order, but in some cases it isn't; for example, structs with permuted fields can be considered subtypes of each other (in most languages they aren't!) but such subtyping is not antisymmetric.
[2] These terms come from math, and a good rule of thumb to remember how they apply is: co means together, while contra means against. As long as the composite types vary together (in the same direction) as their component types, they are co-variant. When they vary against their component types (in the reverse direction), they are contra-variant.
[3] Strictly speaking, integer literals like 5 are primitives in Java and not objects at all. However, due to autoboxing, this is equivalent to wrapping the 5 in Integer prior to the assignment.
[4] Note that we're using pointer types here. The same example would work with std::function<Mammal(Mammal)> and corresponding f1 taking and returning value types. It's just that in C++ value types are not very useful for polymorphism, so pointer (or reference) values are much more commonly used.

Stack Abuse: Creating a Neural Network from Scratch in Python: Multi-class Classification

This is the third article in the series of articles on "Creating a Neural Network From Scratch in Python".

If you have no prior experience with neural networks, I would suggest you first read Part 1 and Part 2 of the series (linked above). Once you feel comfortable with the concepts explained in those articles, you can come back and continue this article.

Introduction

In the previous article, we saw how we can create a neural network from scratch, which is capable of solving binary classification problems, in Python. A binary classification problem has only two outputs. However, real-world problems are far more complex.

Consider the digit recognition problem, where we use the image of a digit as input and the classifier predicts the corresponding digit. A digit can be any number from 0 to 9. This is a classic example of a multi-class classification problem, where the input may belong to any of 10 possible classes.

In this article, we will see how we can create a simple neural network from scratch in Python, which is capable of solving multi-class classification problems.

Dataset

Let's first briefly take a look at our dataset. Our dataset will have two input features and one of three possible outputs. We will manually create a dataset for this article.

To do so, execute the following script:

import numpy as np  
import matplotlib.pyplot as plt

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])  
mouse_images = np.random.randn(700, 2) + np.array([3, 3])  
dog_images = np.random.randn(700, 2) + np.array([-3, 3])  

In the script above, we start by importing our libraries and then we create three two-dimensional arrays of size 700 x 2. You can think of each row in an array as an image of a particular animal. Each array corresponds to one of the three output classes.

An important point to note here is that if we plot the elements of the cat_images array on a two-dimensional plane, they will be centered around x=0 and y=-3. Similarly, the elements of the mouse_images array will be centered around x=3 and y=3, and finally, the elements of the array dog_images will be centered around x=-3 and y=3. You will see this once we plot our dataset.

Next, we need to vertically join these arrays to create our final dataset. Execute the following script to do so:

feature_set = np.vstack([cat_images, mouse_images, dog_images])  

We created our feature set, and now we need to define corresponding labels for each record in our feature set. The following script does that:

labels = np.array([0]*700 + [1]*700 + [2]*700)  

The above script creates a one-dimensional array of 2100 elements. The first 700 elements have been labeled as 0, the next 700 elements have been labeled as 1 while the last 700 elements have been labeled as 2. This is just our shortcut way of quickly creating the labels for our corresponding data.

For multi-class classification problems, we need to define the output label as a one-hot encoded vector since our output layer will have three nodes and each node will correspond to one output class. We want that when an output is predicted, the value of the corresponding node should be 1 while the remaining nodes should have a value of 0. For that, we need three values for the output label for each record. This is why we convert our output vector into a one-hot encoded vector.

Execute the following script to create the one-hot encoded vector array for our dataset:

one_hot_labels = np.zeros((2100, 3))

for i in range(2100):  
    one_hot_labels[i, labels[i]] = 1

In the above script we create the one_hot_labels array of size 2100 x 3 where each row contains one-hot encoded vector for the corresponding record in the feature set. We then insert 1 in the corresponding column.

If you execute the above script, you will see that the one_hot_labels array has a 1 at index 0 for the first 700 records, at index 1 for the next 700 records, and at index 2 for the last 700 records.
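As a side note, the same encoding can be produced without an explicit loop. The sketch below is an alternative rather than the approach used in this article: it indexes a 3 x 3 identity matrix with the label array and checks the result against the loop-based version. The variable names one_hot_vectorized and one_hot_loop are made up for the comparison.

```python
import numpy as np

labels = np.array([0]*700 + [1]*700 + [2]*700)

# Row i of the 3x3 identity matrix is the one-hot vector for class i,
# so indexing the identity with the label array encodes every record at once.
one_hot_vectorized = np.eye(3)[labels]

# Loop-based version from the text, for comparison.
one_hot_loop = np.zeros((2100, 3))
for i in range(2100):
    one_hot_loop[i, labels[i]] = 1

print(one_hot_vectorized.shape)
print(np.array_equal(one_hot_vectorized, one_hot_loop))
```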

Now let's plot the dataset that we just created. Execute the following script:

plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)  
plt.show()  

Once you execute the above script, you should see the following figure:

Generated dataset

You can clearly see that we have elements belonging to three different classes. Our task will be to develop a neural network capable of classifying data into the aforementioned classes.

Neural Network with Multiple Output Classes

The neural network that we are going to design has the following architecture:

Neural network structure

You can see that our neural network is pretty similar to the one we developed in Part 2 of the series. It has an input layer with 2 input features and a hidden layer with 4 nodes. However, in the output layer, we can see that we have three nodes. This means that our neural network is capable of solving the multi-class classification problem where the number of possible outputs is 3.

Softmax and Cross-Entropy Functions

Before we move on to the code section, let us briefly review the softmax and cross entropy functions, which are respectively the most commonly used activation and loss functions for creating a neural network for multi-class classification.

Softmax Function

From the architecture of our neural network, we can see that we have three nodes in the output layer. We have several options for the activation function at the output layer. One option is to use sigmoid function as we did in the previous articles.

However, there is a more convenient activation function in the form of softmax that takes a vector as input and produces another vector of the same length as output. Since our output contains three nodes, we can consider the output from each node as one element of the input vector. The output will be a vector of the same length whose elements sum to 1. Mathematically, the softmax function can be represented as:

$$ y_i(z) = \frac{e^{z_i}}{ \sum\nolimits_{k=1}^{K}{e^{z_k}} } $$

Here K is the number of elements in the input vector z. The softmax function simply divides the exponential of each input element by the sum of the exponentials of all the input elements. Let's take a look at a simple example of this:

def softmax(A):  
    expA = np.exp(A)
    return expA / expA.sum()

nums = np.array([4, 5, 6])  
print(softmax(nums))  

In the script above we create a softmax function that takes a single vector as input, takes exponents of all the elements in the vector and then divides the resulting numbers individually by the sum of exponents of all the numbers in the input vector.

You can see that the input vector contains elements 4, 5 and 6. In the output, you will see three numbers squashed between 0 and 1 where the sum of the numbers will be equal to 1. The output looks like this:

[0.09003057 0.24472847 0.66524096]

The softmax activation function has two major advantages over other activation functions, particularly for multi-class classification problems. First, it takes a whole vector as input. Second, it produces outputs between 0 and 1 that sum to 1, so they can be read as class probabilities. Remember, in our dataset we have one-hot encoded output labels, which means the target values are either 0 or 1. However, the raw output of the feed-forward process can be greater than 1, so the softmax function is a natural choice at the output layer since it squashes the output between 0 and 1.
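One practical caveat: np.exp overflows for large inputs. A common variant, which is not the version used in this article, subtracts the maximum element before exponentiating. This leaves the result mathematically unchanged, because the common factor cancels in the ratio. The function name stable_softmax is made up for this sketch.

```python
import numpy as np

def stable_softmax(A):
    # Subtracting the maximum does not change the result (the common
    # factor cancels) but keeps np.exp from overflowing.
    shifted = A - np.max(A)
    expA = np.exp(shifted)
    return expA / expA.sum()

small = stable_softmax(np.array([4, 5, 6]))
large = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
print(small)
print(large)
```

Both calls should print the same three probabilities, since the two inputs differ only by a constant shift.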

Cross-Entropy Function

With softmax activation function at the output layer, mean squared error cost function can be used for optimizing the cost as we did in the previous articles. However, for the softmax function, a more convenient cost function exists which is called cross-entropy.

Mathematically, the cross-entropy function looks like this:

$$ H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i} $$

The cross-entropy is simply the sum of the products of the actual probabilities with the negative log of the predicted probabilities. For multi-class classification problems, cross-entropy generally works better as a cost function than mean squared error.
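A small worked example may help. With a one-hot target, only the term for the true class survives in the sum, so the loss reduces to the negative log of the probability assigned to the correct class. The prediction vectors below are made up for illustration.

```python
import numpy as np

def cross_entropy(y, y_hat):
    # y is a one-hot vector, y_hat a vector of predicted probabilities.
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0])                # true class is index 1
confident = np.array([0.1, 0.8, 0.1])  # hypothetical good prediction
unsure = np.array([0.4, 0.3, 0.3])     # hypothetical poor prediction

loss_confident = cross_entropy(y, confident)  # equals -log(0.8)
loss_unsure = cross_entropy(y, unsure)        # equals -log(0.3)
print(loss_confident, loss_unsure)
```

The poorer prediction gets the larger loss, which is the behaviour we want from a cost function.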

Now we have sufficient knowledge to create a neural network that solves multi-class classification problems. Let's see how our neural network will work.

As always, a neural network executes in two steps: Feed-forward and back-propagation.

Feed Forward

The feedforward phase will remain more or less similar to what we saw in the previous article. The only difference is that now we will use the softmax activation function at the output layer rather than sigmoid function.

Remember, for the hidden layer output we will still use the sigmoid function as we did previously. The softmax function will be used only for the output layer activations.

Phase 1

Since we are using two different activation functions for the hidden layer and the output layer, I have divided the feed-forward phase into two sub-phases.

In the first phase, we will see how to calculate output from the hidden layer. For each input record, we have two features "x1" and "x2". To calculate the output values for each node in the hidden layer, we have to multiply the input with the corresponding weights of the hidden layer node for which we are calculating the value. Notice, we are also adding a bias term here. We then pass the dot product through sigmoid activation function to get the final value.

For instance to calculate the final value for the first node in the hidden layer, which is denoted by "ah1", you need to perform the following calculation:

$$ zh1 = x1w1 + x2w2 + b
$$

$$ ah1 = \frac{\mathrm{1} }{\mathrm{1} + e^{-zh1} }
$$

This is the resulting value for the top-most node in the hidden layer. In the same way, you can calculate the values for the 2nd, 3rd, and 4th nodes of the hidden layer.
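As a sketch of this phase for a single record, the four hidden-node values can be computed in one step with a dot product. The feature values, seed and weights below are arbitrary placeholders, not values from the article.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
x = np.array([0.5, -3.2])       # one record: features x1 and x2 (made up)
wh = rng.random((2, 4))         # 2 inputs x 4 hidden nodes
bh = rng.standard_normal(4)     # one bias per hidden node

zh = np.dot(x, wh) + bh         # zh[0] is x1*w + x2*w + b for the first node
ah = sigmoid(zh)                # activations of the four hidden nodes
print(ah.shape)
```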

Phase 2

To calculate the values for the output layer, the values in the hidden layer nodes are treated as inputs. Therefore, to calculate the output, multiply the values of the hidden layer nodes with their corresponding weights and pass the result through an activation function, which will be softmax in this case.

This operation can be mathematically expressed by the following equation:

$$ zo1 = ah1w9 + ah2w10 + ah3w11 + ah4w12
$$

$$ zo2 = ah1w13 + ah2w14 + ah3w15 + ah4w16
$$

$$ zo3 = ah1w17 + ah2w18 + ah3w19 + ah4w20
$$

Here zo1, zo2, and zo3 will form the vector that we will use as input to the softmax function. Let's name this vector "zo".

zo = [zo1, zo2, zo3]  

Now to find the output value ao1, we can use the softmax function as follows:

$$ ao1(zo) = \frac{e^{zo1}}{ \sum\nolimits_{k=1}^{3}{e^{zok}} } $$

Here "ao1" is the output for the top-most node in the output layer. In the same way, you can use the softmax function to calculate the values for ao2 and ao3.

You can see that the feed-forward step for a neural network with multi-class output is pretty similar to the feed-forward step of the neural network for binary classification problems. The only difference is that here we are using softmax function at the output layer rather than the sigmoid function.
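Putting both phases together, the whole feed-forward step for a batch of records might be sketched as follows. The function name feed_forward and the random inputs are assumptions made for this illustration. The bias terms follow the full script later in the article.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(A):
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)  # normalize each row

def feed_forward(X, wh, bh, wo, bo):
    ah = sigmoid(np.dot(X, wh) + bh)   # phase 1: hidden layer
    ao = softmax(np.dot(ah, wo) + bo)  # phase 2: output layer
    return ah, ao

rng = np.random.default_rng(1)         # arbitrary seed
X = rng.standard_normal((5, 2))        # five made-up records
wh, bh = rng.random((2, 4)), rng.standard_normal(4)
wo, bo = rng.random((4, 3)), rng.standard_normal(3)

ah, ao = feed_forward(X, wh, bh, wo, bo)
print(ao.shape)
print(ao.sum(axis=1))                  # each row of probabilities sums to 1
```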

Back-Propagation

The basic idea behind back-propagation remains the same. We have to define a cost function and then optimize that cost function by updating the weights such that the cost is minimized. However, unlike previous articles where we used mean squared error as a cost function, in this article we will instead use cross-entropy function.

Back-propagation is an optimization problem where we have to find the function minima for our cost function.

To find the minima of a function, we can use the gradient descent algorithm. The gradient descent algorithm can be mathematically represented as follows:

$$ repeat \ until \ convergence: \begin{Bmatrix} w_j := w_j - \alpha \frac{\partial }{\partial w_j} J(w_0,w_1 ....... w_n) \end{Bmatrix} ............. (1) $$

The details regarding how the gradient descent algorithm minimizes the cost have already been discussed in the previous article. Here we will just see the mathematical operations that we need to perform.
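As a quick reminder of how the update rule behaves, here is a toy sketch on a one-dimensional cost function. The function gradient_descent, the cost J(w) = (w - 3)^2, the learning rate and the step count are all chosen for illustration only.

```python
def gradient_descent(grad, w0, lr=0.1, steps=200):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # very close to 3, the minimum of J
```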

Our cost function is:

$$ H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i} $$

In our neural network, we have an output vector where each element of the vector corresponds to output from one node in the output layer. The output vector is calculated using the softmax function. If "ao" is the vector of the predicted outputs from all output nodes and "y" is the vector of the actual outputs of the corresponding nodes in the output vector, we have to basically minimize this function:

$$ cost(y, {ao}) = -\sum_i y_i \log {ao_i} $$

Phase 1

In the first phase, we need to update weights w9 up to w20. These are the weights of the output layer nodes.

From the previous article, we know that to minimize the cost function, we have to update weight values such that the cost decreases. To do so, we need to take the derivative of the cost function with respect to each weight. Mathematically we can represent it as:

$$ \frac {dcost}{dwo} = \frac {dcost}{dao} * \frac {dao}{dzo} * \frac {dzo}{dwo} ..... (1) $$

Here "wo" refers to the weights in the output layer.

The first part of the equation can be represented as:

$$ \frac {dcost}{dao} *\ \frac {dao}{dzo} ....... (2) $$

The detailed derivation of cross-entropy loss function with softmax activation function can be found at this link.

Expression (2) evaluates to:

$$ \frac {dcost}{dao} *\ \frac {dao}{dzo} = ao - y ....... (3) $$
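Equation 3 can be sanity-checked numerically. The sketch below compares ao - y with a finite-difference estimate of the derivative of the cost with respect to each element of zo. The input values are made up, and the softmax here subtracts the maximum for numerical stability, which does not change its output.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cost(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0])  # made-up output-layer inputs "zo"
y = np.array([0.0, 1.0, 0.0])   # one-hot target

analytic = softmax(z) - y       # Equation 3: ao - y

# Central finite differences, one component of zo at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cost(z + step, y) - cost(z - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```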

Where "ao" is predicted output while "y" is the actual output.

Finally, we need to find the derivative of "zo" with respect to "wo" from Equation 1. The derivative is simply the outputs coming from the hidden layer as shown below:

$$ \frac {dzo}{dwo} = ah $$

To find new weight values, the values returned by Equation 1 can be simply multiplied with the learning rate and subtracted from the current weight values.

We also need to update the bias "bo" for the output layer. We need to differentiate our cost function with respect to bias to get new bias value as shown below:

$$ \frac {dcost}{dbo} = \frac {dcost}{dao} *\ \frac {dao}{dzo} * \frac {dzo}{dbo} ..... (4) $$

The first part of Equation 4 has already been calculated in Equation 3. Here we only need to find the derivative of "zo" with respect to "bo", which is simply 1. So:

$$ \frac {dcost}{dbo} = ao - y ........... (5) $$

To find new bias values for output layer, the values returned by Equation 5 can be simply multiplied with the learning rate and subtracted from the current bias value.

Phase 2

In this section, we will back-propagate our error to the previous layer and find the new weight values for hidden layer weights i.e. weights w1 to w8.

Let's collectively denote hidden layer weights as "wh". We basically have to differentiate the cost function with respect to "wh".

Mathematically we can use chain rule of differentiation to represent it as:

$$ \frac {dcost}{dwh} = \frac {dcost}{dah} * \frac {dah}{dzh} * \frac {dzh}{dwh} ...... (6) $$

Here again, we will break Equation 6 into individual terms.

The first term, dcost/dah, can be expanded using the chain rule of differentiation as follows:

$$ \frac {dcost}{dah} = \frac {dcost}{dzo} *\ \frac {dzo}{dah} ...... (7) $$

Let's again break the Equation 7 into individual terms. From the Equation 3, we know that:

$$ \frac {dcost}{dao} *\ \frac {dao}{dzo} = \frac {dcost}{dzo} = ao - y ........ (8) $$

Now we need to find dzo/dah from Equation 7, which is equal to the weights of the output layer as shown below:

$$ \frac {dzo}{dah} = wo ...... (9) $$

Now we can find the value of dcost/dah by replacing the values from Equations 8 and 9 in Equation 7.

Coming back to Equation 6, we have yet to find dah/dzh and dzh/dwh.

The first term dah/dzh can be calculated as:

$$ \frac {dah}{dzh} = sigmoid(zh) * (1-sigmoid(zh)) ........ (10) $$

And finally, dzh/dwh is simply the input values:

$$ \frac {dzh}{dwh} = \text{input features} ........ (11) $$

If we replace the values from Equations 7, 10 and 11 in Equation 6, we can get the updated matrix for the hidden layer weights. To find new weight values for the hidden layer weights "wh", the values returned by Equation 6 can be simply multiplied with the learning rate and subtracted from the current hidden layer weight values.

Similarly, the derivative of the cost function with respect to hidden layer bias "bh" can simply be calculated as:

$$ \frac {dcost}{dbh} = \frac {dcost}{dah} * \frac {dah}{dzh} * \frac {dzh}{dbh} ...... (12) $$

Which is simply equal to:

$$ \frac {dcost}{dbh} = \frac {dcost}{dah} * \frac {dah}{dzh} ...... (13) $$

because,

$$ \frac {dzh}{dbh} = 1 $$

To find new bias values for the hidden layer, the values returned by Equation 13 can be simply multiplied with the learning rate and subtracted from the current hidden layer bias values and that's it for the back-propagation.
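The hidden-layer gradient from Equations 6 to 11 can be checked the same way, by comparing it against a finite-difference estimate on a tiny random network. Everything below (the seed, the data, and the helper total_cost) is made up for this check. It is a sketch of a standard gradient check, not part of the training script.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def total_cost(X, Y, wh, bh, wo, bo):
    ah = sigmoid(X @ wh + bh)
    ao = softmax(ah @ wo + bo)
    return -np.sum(Y * np.log(ao))

rng = np.random.default_rng(2)                # arbitrary seed
X = rng.standard_normal((6, 2))               # six made-up records
Y = np.eye(3)[rng.integers(0, 3, size=6)]     # random one-hot labels
wh, bh = rng.random((2, 4)), rng.standard_normal(4)
wo, bo = rng.random((4, 3)), rng.standard_normal(3)

# Analytic gradient for the hidden-layer weights.
ah = sigmoid(X @ wh + bh)
ao = softmax(ah @ wo + bo)
dcost_dah = (ao - Y) @ wo.T                   # Equations 7, 8 and 9
dah_dzh = ah * (1 - ah)                       # Equation 10
dcost_wh = X.T @ (dah_dzh * dcost_dah)        # Equation 11 supplies X

# Numerical gradient by central differences on each entry of wh.
eps = 1e-6
numeric = np.zeros_like(wh)
for i in range(wh.shape[0]):
    for j in range(wh.shape[1]):
        d = np.zeros_like(wh)
        d[i, j] = eps
        numeric[i, j] = (total_cost(X, Y, wh + d, bh, wo, bo)
                         - total_cost(X, Y, wh - d, bh, wo, bo)) / (2 * eps)

print(np.allclose(dcost_wh, numeric, atol=1e-5))
```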

You can see that the feed-forward and back-propagation process is quite similar to the one we saw in our last articles. The only thing we changed is the activation function and cost function.

Code for Neural Networks for Multi-class Classification

We have covered the theory behind the neural network for multi-class classification, and now is the time to put that theory into practice.

Take a look at the following script:

import numpy as np  
import matplotlib.pyplot as plt

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])  
mouse_images = np.random.randn(700, 2) + np.array([3, 3])  
dog_images = np.random.randn(700, 2) + np.array([-3, 3])

feature_set = np.vstack([cat_images, mouse_images, dog_images])

labels = np.array([0]*700 + [1]*700 + [2]*700)

one_hot_labels = np.zeros((2100, 3))

for i in range(2100):  
    one_hot_labels[i, labels[i]] = 1

plt.figure(figsize=(10,7))  
plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)  
plt.show()

def sigmoid(x):  
    return 1/(1+np.exp(-x))

def sigmoid_der(x):  
    return sigmoid(x) *(1-sigmoid (x))

def softmax(A):  
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

instances = feature_set.shape[0]  
attributes = feature_set.shape[1]  
hidden_nodes = 4  
output_labels = 3

wh = np.random.rand(attributes,hidden_nodes)  
bh = np.random.randn(hidden_nodes)

wo = np.random.rand(hidden_nodes,output_labels)  
bo = np.random.randn(output_labels)  
lr = 10e-4

error_cost = []

for epoch in range(50000):  
    ############# Feedforward

    # Phase 1
    zh = np.dot(feature_set, wh) + bh
    ah = sigmoid(zh)

    # Phase 2
    zo = np.dot(ah, wo) + bo
    ao = softmax(zo)

    ########## Back Propagation

    ########## Phase 1

    dcost_dzo = ao - one_hot_labels
    dzo_dwo = ah

    dcost_wo = np.dot(dzo_dwo.T, dcost_dzo)

    dcost_bo = dcost_dzo

    ########## Phase 2

    dzo_dah = wo
    dcost_dah = np.dot(dcost_dzo , dzo_dah.T)
    dah_dzh = sigmoid_der(zh)
    dzh_dwh = feature_set
    dcost_wh = np.dot(dzh_dwh.T, dah_dzh * dcost_dah)

    dcost_bh = dcost_dah * dah_dzh

    # Update Weights ================

    wh -= lr * dcost_wh
    bh -= lr * dcost_bh.sum(axis=0)

    wo -= lr * dcost_wo
    bo -= lr * dcost_bo.sum(axis=0)

    if epoch % 200 == 0:
        loss = np.sum(-one_hot_labels * np.log(ao))
        print('Loss function value: ', loss)
        error_cost.append(loss)

The code is pretty similar to the one we created in the previous article. In the feed-forward section, the only difference is that "ao", which is the final output, is being calculated using the softmax function.

Similarly, in the back-propagation section, to find the new weights for the output layer, the cost function is derived with respect to softmax function rather than the sigmoid function.
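The `ao - one_hot_labels` term in the script is the derivative of the cross-entropy cost with respect to `zo` when the output layer uses softmax. The following is a minimal sketch, not part of the original script, that checks this numerically for a single sample. It uses plain Python instead of NumPy so it runs anywhere, and the names `scores`, `target`, and `numeric_gradient` are made up for the illustration:

```python
import math

def softmax(scores):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    # target is a one-hot list; the cost is -sum(target * log(softmax(scores))).
    return -sum(t * math.log(p) for t, p in zip(target, softmax(scores)))

def numeric_gradient(scores, target, eps=1e-6):
    # Central finite differences, one score at a time.
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[i] += eps
        down[i] -= eps
        grads.append(
            (cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

scores = [0.5, -1.2, 2.0]   # hypothetical pre-activation values for one sample
target = [0, 0, 1]          # one-hot label: the sample belongs to class 2

analytic = [p - t for p, t in zip(softmax(scores), target)]
numeric = numeric_gradient(scores, target)
largest_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(largest_gap)
```

If the analytic expression is right, the printed gap should be tiny compared with the gradient values themselves.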

If you run the above script, you will see that the final error cost will be 0.5. The following figure shows how the cost decreases with the number of epochs.

Cost vs epochs

As you can see, not many epochs are needed to reach our final error cost.

Similarly, if you run the same script with the sigmoid function at the output layer, the minimum error cost that you will achieve after 50000 epochs will be around 1.5, which is greater than the 0.5 achieved with softmax.

Conclusion

Real-world neural networks are capable of solving multi-class classification problems. In this article, we saw how we can create a very simple neural network for multi-class classification, from scratch in Python. This is the final article of the series: "Neural Network from Scratch in Python". In future articles, I will explain how we can create more specialized neural networks such as recurrent neural networks and convolutional neural networks from scratch in Python.

PyCon: PyCon 2019 Launches Financial Aid

The PyCon conference prides itself on being affordable. However, registration is only one of several expenses an attendee must incur, and it’s likely the smallest one. Flying, whether halfway around the world or from a few hundred miles away, is more expensive. Staying in a hotel for a few days is also more expensive. Altogether, the cost of attending a conference can become prohibitively expensive. That’s where PyCon's Financial Aid program comes in. We’re opening applications for Financial Aid today, and we’ll be accepting them through February 12, 2019.

To apply, first set up an account on the site, and then you will be able to fill out the application here or through your dashboard.

For those proposing talks, tutorials, or posters, selecting the “I require a speaker grant if my proposal is accepted” box on your speaker profile serves as your request, so you do not need to fill out the financial aid application. Upon acceptance, we’ll contact the speakers who checked that box to gather the appropriate information. Accepted speakers and presenters are prioritized for travel grants. Additionally, we do not expose grant requests to reviewers while evaluating proposals. The Program Committee evaluates proposals on the basis of their presentation, and later the Financial Aid team comes in and looks at how we can help our speakers.

We offer need-based grants to enable people from across our community to attend PyCon. The criteria for evaluating requests take into account several things, such as whether the applicant is a student, unemployed, or underemployed; their geographic location; and their involvement in both the conference and the greater Python community.

Our process aims to help a large number of people with partial grants, as opposed to covering full expenses for a small number of people. Based on individual need, we craft grant amounts that we hope can turn PyCon from inaccessible to reality. While some direct costs, like those associated with PyCon itself, are discounted or waived, external costs such as travel are handled via reimbursement, where the attendee pays and then submits receipts to be paid back an amount based on their grant. For the full details, see our FAQ at https://us.pycon.org/2019/financial-assistance/faq/ and contact pycon-aid@python.org with further questions.

The Python Software Foundation & PyLadies make Financial Aid possible. This year the Python Software Foundation is providing $110,000 USD towards financial aid and PyLadies will contribute as much as they can based on the contributions they get throughout 2018.

For more information about Financial Aid, see https://us.pycon.org/2019/financial-assistance.


Our Call for Proposals is open! Tutorial presentations are due November 26, while talk, poster, and education summit proposals are due January 3. For more information, see https://us.pycon.org/2019/speaking/.

*Note: Main content is from post written by Brian Curtin for 2018 launch

Real Python: Python, Boto3, and AWS S3: Demystified


Amazon Web Services (AWS) has become a leader in cloud computing. One of its core components is S3, the object storage service offered by AWS. With its impressive availability and durability, it has become the standard way to store videos, images, and data. You can combine S3 with other services to build infinitely scalable applications.

Boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.

If you’ve had some AWS exposure before, have your own AWS account, and want to take your skills to the next level by starting to use AWS services from within your Python code, then keep reading.

By the end of this tutorial, you’ll:

  • Be confident working with buckets and objects directly from your Python scripts
  • Know how to avoid common pitfalls when using Boto3 and S3
  • Understand how to set up your data from the start to avoid performance issues later
  • Learn how to configure your objects to take advantage of S3’s best features

Before exploring Boto3’s characteristics, you will first see how to configure the SDK on your machine. This step will set you up for the rest of the tutorial.


Installation

To install Boto3 on your computer, go to your terminal and run the following:

$ pip install boto3

You’ve got the SDK. But, you won’t be able to use it right now, because it doesn’t know which AWS account it should connect to.

To make it run against your AWS account, you’ll need to provide some valid credentials. If you already have an IAM user that has full permissions to S3, you can use that user’s credentials (their access key and their secret access key) without needing to create a new user. Otherwise, the easiest way to do this is to create a new AWS user and then store the new credentials.

To create a new user, go to your AWS account, then go to Services and select IAM. Then choose Users and click on Add user.

Give the user a name (for example, boto3user). Enable programmatic access. This will ensure that this user will be able to work with any AWS supported SDK or make separate API calls:

add AWS IAM user

To keep things simple, choose the preconfigured AmazonS3FullAccess policy. With this policy, the new user will be able to have full control over S3. Click on Next: Review:

aws s3 IAM user add policy

Select Create user:

aws s3 IAM user finish creation

A new screen will show you the user’s generated credentials. Click on the Download .csv button to make a copy of the credentials. You will need them to complete your setup.

Now that you have your new user, create a new file, ~/.aws/credentials:

$ touch ~/.aws/credentials

Open the file and paste the structure below. Fill in the placeholders with the new user credentials you have downloaded:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Save the file.

Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.

There is one more configuration to set up: the default region that Boto3 should interact with. You can check out the complete table of the supported AWS regions. Choose the region that is closest to you. Copy your preferred region from the Region column. In my case, I am using eu-west-1 (Ireland).

Create a new file, ~/.aws/config:

$ touch ~/.aws/config

Add the following and replace the placeholder with the region you have copied:

[default]
region = YOUR_PREFERRED_REGION

Save your file.
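Both files use the INI format, with one section per profile. If Boto3 later complains about missing credentials or a missing region, a quick sanity check is to parse the text yourself. The sketch below does that with the standard library's configparser on sample strings; `read_profile` is a hypothetical helper for this tutorial, not part of Boto3, and the values are placeholders:

```python
import configparser

# Sample contents mirroring ~/.aws/credentials and ~/.aws/config.
# The values are placeholders, not real keys.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

SAMPLE_CONFIG = """
[default]
region = eu-west-1
"""

def read_profile(text, profile="default"):
    """Parse INI text and return the chosen profile as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[profile])

credentials = read_profile(SAMPLE_CREDENTIALS)
config = read_profile(SAMPLE_CONFIG)
print(sorted(credentials))
print(config["region"])
```

To check your real files instead, you could read their contents with open() and pass the text to the same helper.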

You are now officially set up for the rest of the tutorial.

Next, you will see the different options Boto3 gives you to connect to S3 and other AWS services.

Client Versus Resource

At its core, all that Boto3 does is call AWS APIs on your behalf. For the majority of the AWS services, Boto3 offers two distinct ways of accessing these abstracted APIs:

  • Client: low-level service access
  • Resource: higher-level object-oriented service access

You can use either to interact with S3.

To connect to the low-level client interface, you must use Boto3’s client(). You then pass in the name of the service you want to connect to, in this case, s3:

import boto3
s3_client = boto3.client('s3')

To connect to the high-level interface, you’ll follow a similar approach, but use resource():

import boto3
s3_resource = boto3.resource('s3')

You’ve successfully connected to both versions, but now you might be wondering, “Which one should I use?”

With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you’ll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.

With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.

Understanding how the client and the resource are generated is also important when you’re considering which one to choose:

  • Boto3 generates the client from a JSON service definition file. The client’s methods support every single type of interaction with the target AWS service.
  • Resources, on the other hand, are generated from JSON resource definition files.

Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn’t offered by the resource. Here’s the interesting part: you don’t need to change your code to use the client everywhere. For that operation, you can access the client directly via the resource like so: s3_resource.meta.client.

One such client operation is .generate_presigned_url(), which enables you to give your users access to an object within your bucket for a set period of time, without requiring them to have AWS credentials.
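As a rough sketch of what that looks like, the helper below wraps the client's generate_presigned_url() call for a download link. The function name `make_download_link` and its default lifetime are assumptions made for this example; the client is passed in as an argument, so you can hand it s3_resource.meta.client:

```python
def make_download_link(s3_client, bucket_name, key, expires_in=3600):
    """Return a time-limited GET URL for one object (hypothetical helper)."""
    return s3_client.generate_presigned_url(
        ClientMethod='get_object',
        Params={'Bucket': bucket_name, 'Key': key},
        ExpiresIn=expires_in,  # lifetime of the link, in seconds
    )

# With a real connection you would call, for example:
#   url = make_download_link(s3_resource.meta.client,
#                            'your-bucket-name', 'your-object-key')
```

Anyone holding the returned URL can fetch the object until the link expires, so keep the lifetime as short as your use case allows.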

Common Operations

Now that you know about the differences between clients and resources, let’s start using them to build some new S3 components.

Creating a Bucket

To start off, you need an S3 bucket. To create one programmatically, you must first choose a name for your bucket. Remember that this name must be unique throughout the whole AWS platform, as bucket names are DNS compliant. If you try to create a bucket, but another user has already claimed your desired bucket name, your code will fail. Instead of success, you will see the following error: botocore.errorfactory.BucketAlreadyExists.

You can increase your chance of success when creating your bucket by picking a random name. You can write your own function that does that for you. In this implementation, you’ll see how using the uuid module will help you achieve that. A UUID4’s string representation is 36 characters long (including hyphens), and you can add a prefix to specify what each bucket is for.

Here’s a way you can achieve that:

import uuid

def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 chars long
    return ''.join([bucket_prefix, str(uuid.uuid4())])
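Because the UUID already takes 36 of the 63 available characters, the prefix can be at most 27 characters long. Here is a minimal sketch of a variant that checks this up front. The name `create_checked_bucket_name` is made up for illustration, and the sketch checks only the length, not the other rules S3 applies to bucket names:

```python
import uuid

MAX_BUCKET_NAME_LENGTH = 63
UUID4_TEXT_LENGTH = len(str(uuid.uuid4()))  # 36 characters, hyphens included

def max_prefix_length():
    """Longest prefix that still fits once a UUID4 string is appended."""
    return MAX_BUCKET_NAME_LENGTH - UUID4_TEXT_LENGTH

def create_checked_bucket_name(bucket_prefix):
    # Hypothetical variant of create_bucket_name() that fails early
    # instead of letting S3 reject an over-long name.
    if len(bucket_prefix) > max_prefix_length():
        raise ValueError(
            f'prefix is {len(bucket_prefix)} chars; at most '
            f'{max_prefix_length()} fit alongside the UUID')
    return ''.join([bucket_prefix, str(uuid.uuid4())])

name = create_checked_bucket_name('firstpythonbucket')
print(len(name))  # 17-character prefix + 36-character UUID
```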

You’ve got your bucket name, but now there’s one more thing you need to be aware of: unless your region is us-east-1 (the S3 default), you’ll need to define the region explicitly when you are creating the bucket. Otherwise you will get an IllegalLocationConstraintException.

To exemplify what this means when you’re creating your S3 bucket in a non-US region, take a look at the code below:

s3_resource.create_bucket(Bucket=YOUR_BUCKET_NAME,
                          CreateBucketConfiguration={
                              'LocationConstraint': 'eu-west-1'})

You need to provide both a bucket name and a bucket configuration where you must specify the region, which in my case is eu-west-1.

This isn’t ideal. Imagine that you want to take your code and deploy it to the cloud. Your task will become increasingly difficult because you’ve now hardcoded the region. You could refactor the region and transform it into an environment variable, but then you’d have one more thing to manage.

Luckily, there is a better way to get the region programmatically, by taking advantage of a session object. Boto3 will create the session from your credentials. You just need to take the region and pass it to create_bucket() as its LocationConstraint configuration. Here’s how to do that:

def create_bucket(bucket_prefix, s3_connection):
    session = boto3.session.Session()
    current_region = session.region_name
    bucket_name = create_bucket_name(bucket_prefix)
    bucket_response = s3_connection.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
            'LocationConstraint': current_region})
    print(bucket_name, current_region)
    return bucket_name, bucket_response

The nice part is that this code works no matter where you want to deploy it: locally/EC2/Lambda. Moreover, you don’t need to hardcode your region.

As both the client and the resource create buckets in the same way, you can pass either one as the s3_connection parameter.
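One case to watch for is us-east-1. As far as I know, S3 treats it as the default location and rejects a request that names it explicitly as a LocationConstraint. The sketch below shows one way to build the arguments so that the constraint is left out in that case. `build_create_bucket_kwargs` is a hypothetical helper, not something Boto3 provides:

```python
def build_create_bucket_kwargs(bucket_name, region):
    """Assemble keyword arguments for create_bucket() (hypothetical helper)."""
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is S3's default location, so the constraint is left out there.
    # It is also left out when no region is configured at all.
    if region and region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs

print(build_create_bucket_kwargs('demo-bucket', 'eu-west-1'))
print(build_create_bucket_kwargs('demo-bucket', 'us-east-1'))
```

Inside create_bucket(), the call would then become s3_connection.create_bucket(**build_create_bucket_kwargs(bucket_name, current_region)).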

You’ll now create two buckets. First create one using the client, which gives you back the bucket_response as a dictionary:

>>> first_bucket_name, first_response = create_bucket(
...     bucket_prefix='firstpythonbucket',
...     s3_connection=s3_resource.meta.client)
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304 eu-west-1

>>> first_response
{'ResponseMetadata': {'RequestId': 'E1DCFE71EDE7C1EC', 'HostId': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'x-amz-request-id': 'E1DCFE71EDE7C1EC', 'date': 'Fri, 05 Oct 2018 15:00:00 GMT', 'location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'Location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/'}

Then create a second bucket using the resource, which gives you back a Bucket instance as the bucket_response:

>>> second_bucket_name, second_response = create_bucket(
...     bucket_prefix='secondpythonbucket', s3_connection=s3_resource)
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644 eu-west-1

>>> second_response
s3.Bucket(name='secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644')

You’ve got your buckets. Next, you’ll want to start adding some files to them.

Naming Your Files

You can name your objects by using standard file naming conventions. You can use any valid name. In this article, you’ll look at a more specific case that helps you understand how S3 works under the hood.

If you’re planning on hosting a large number of files in your S3 bucket, there’s something you should keep in mind. If all your file names have a deterministic prefix that gets repeated for every file, such as a timestamp format like “YYYY-MM-DDThh:mm:ss”, then you will soon find that you’re running into performance issues when you’re trying to interact with your bucket.

This will happen because S3 takes the prefix of the file and maps it onto a partition. The more files you add, the more will be assigned to the same partition, and that partition will be very heavy and less responsive.

What can you do to keep that from happening?

The easiest solution is to randomize the file name. You can imagine many different implementations, but in this case, you’ll use the trusted uuid module to help with that. To make the file names easier to read for this tutorial, you’ll take the first six characters of the generated number’s hex representation and concatenate them with your base file name.

The helper function below allows you to pass in the number of bytes you want the file to have, the file name, and a sample content for the file to be repeated to make up the desired file size:

def create_temp_file(size, file_name, file_content):
    random_file_name = ''.join([str(uuid.uuid4().hex[:6]), file_name])
    with open(random_file_name, 'w') as f:
        f.write(str(file_content) * size)
    return random_file_name

Create your first file, which you’ll be using shortly:

first_file_name = create_temp_file(300, 'firstfile.txt', 'f')

By adding randomness to your file names, you can efficiently distribute your data within your S3 bucket.

Creating Bucket and Object Instances

The next step after creating your file is to see how to integrate it into your S3 workflow.

This is where the resource’s classes play an important role, as these abstractions make it easy to work with S3.

By using the resource, you have access to the high-level classes (Bucket and Object). This is how you can create one of each:

first_bucket = s3_resource.Bucket(name=first_bucket_name)
first_object = s3_resource.Object(
    bucket_name=first_bucket_name, key=first_file_name)

The reason you have not seen any errors with creating the first_object variable is that Boto3 doesn’t make calls to AWS to create the reference. The bucket_name and the key are called identifiers, and they are the necessary parameters to create an Object. Any other attribute of an Object, such as its size, is lazily loaded. This means that for Boto3 to get the requested attributes, it has to make calls to AWS.

Understanding Sub-resources

Bucket and Object are sub-resources of one another. Sub-resources are methods that create a new instance of a child resource. The parent’s identifiers get passed to the child resource.

If you have a Bucket variable, you can create an Object directly:

first_object_again = first_bucket.Object(first_file_name)

Or if you have an Object variable, then you can get the Bucket:

first_bucket_again = first_object.Bucket()

Great, you now understand how to generate a Bucket and an Object. Next, you’ll get to upload your newly generated file to S3 using these constructs.

Uploading a File

There are three ways you can upload a file:

  • From an Object instance
  • From a Bucket instance
  • From the client

In each case, you have to provide the Filename, which is the path of the file you want to upload. You’ll now explore the three alternatives. Feel free to pick whichever you like most to upload the first_file_name to S3.

Object Instance Version

You can upload using an Object instance:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    Filename=first_file_name)

Or you can use the first_object instance:

first_object.upload_file(first_file_name)

Bucket Instance Version

Here’s how you can upload using a Bucket instance:

s3_resource.Bucket(first_bucket_name).upload_file(
    Filename=first_file_name, Key=first_file_name)

Client Version

You can also upload using the client:

s3_resource.meta.client.upload_file(
    Filename=first_file_name, Bucket=first_bucket_name,
    Key=first_file_name)

You have successfully uploaded your file to S3 using one of the three available methods. In the upcoming sections, you’ll mainly work with the Object class, as the operations are very similar between the client and the Bucket versions.

Downloading a File

To download a file from S3 locally, you’ll follow similar steps as you did when uploading. But in this case, the Filename parameter will map to your desired local path. This time, it will download the file to the tmp directory:

s3_resource.Object(first_bucket_name, first_file_name).download_file(
    f'/tmp/{first_file_name}')  # Python 3.6+

You’ve successfully downloaded your file from S3. Next, you’ll see how to copy the same file between your S3 buckets using a single API call.

Copying an Object Between Buckets

If you need to copy files from one bucket to another, Boto3 offers you that possibility. In this example, you’ll copy the file from the first bucket to the second, using .copy():

def copy_to_bucket(bucket_from_name, bucket_to_name, file_name):
    copy_source = {
        'Bucket': bucket_from_name,
        'Key': file_name
    }
    s3_resource.Object(bucket_to_name, file_name).copy(copy_source)

copy_to_bucket(first_bucket_name, second_bucket_name, first_file_name)

Note: If you’re aiming to replicate your S3 objects to a bucket in a different region, have a look at Cross Region Replication.

Deleting an Object

Let’s delete the new file from the second bucket by calling .delete() on the equivalent Object instance:

s3_resource.Object(second_bucket_name, first_file_name).delete()

You’ve now seen how to use S3’s core operations. You’re ready to take your knowledge to the next level with more complex characteristics in the upcoming sections.

Advanced Configurations

In this section, you’re going to explore more elaborate S3 features. You’ll see examples of how to use them and the benefits they can bring to your applications.

ACL (Access Control Lists)

Access Control Lists (ACLs) help you manage access to your buckets and the objects within them. They are considered the legacy way of administering permissions to S3. Why should you know about them? If you have to manage access to individual objects, then you would use an Object ACL.

By default, when you upload an object to S3, that object is private. If you want to make this object available to someone else, you can set the object’s ACL to be public at creation time. Here’s how you upload a new file to the bucket and make it accessible to everyone:

second_file_name = create_temp_file(400, 'secondfile.txt', 's')
second_object = s3_resource.Object(first_bucket.name, second_file_name)
second_object.upload_file(second_file_name, ExtraArgs={
                          'ACL': 'public-read'})

You can get the ObjectAcl instance from the Object, as it is one of its sub-resource classes:

second_object_acl = second_object.Acl()

To see who has access to your object, use the grants attribute:

>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}, {'Grantee': {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'}, 'Permission': 'READ'}]

You can make your object private again, without needing to re-upload it:

>>> response = second_object_acl.put(ACL='private')
>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}]
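Reading the grants list by eye gets tedious once you have more than a few objects. The following is a small sketch of a check for public read access, based only on the structure of the two outputs above. The helper name `is_publicly_readable` is an assumption made for this example:

```python
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'

def is_publicly_readable(grants):
    """Return True if any grant gives READ or FULL_CONTROL to the AllUsers group."""
    for grant in grants:
        grantee = grant.get('Grantee', {})
        if (grantee.get('Type') == 'Group'
                and grantee.get('URI') == ALL_USERS_URI
                and grant.get('Permission') in ('READ', 'FULL_CONTROL')):
            return True
    return False

# Grants copied from the two outputs shown above.
owner_grant = {
    'Grantee': {
        'DisplayName': 'name',
        'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5',
        'Type': 'CanonicalUser'},
    'Permission': 'FULL_CONTROL'}
public_grant = {
    'Grantee': {'Type': 'Group', 'URI': ALL_USERS_URI},
    'Permission': 'READ'}

print(is_publicly_readable([owner_grant, public_grant]))  # True
print(is_publicly_readable([owner_grant]))                # False
```

With a live object, you would pass second_object_acl.grants to the same function.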

You have seen how you can use ACLs to manage access to individual objects. Next, you’ll see how you can add an extra layer of security to your objects by using encryption.

Note: If you’re looking to split your data into multiple categories, have a look at tags. You can grant access to the objects based on their tags.

Encryption

With S3, you can protect your data using encryption. You’ll explore server-side encryption using the AES-256 algorithm where AWS manages both the encryption and the keys.

Create a new file and upload it using ServerSideEncryption:

third_file_name = create_temp_file(300, 'thirdfile.txt', 't')
third_object = s3_resource.Object(first_bucket_name, third_file_name)
third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256'})

You can check the algorithm that was used to encrypt the file, in this case AES256:

>>> third_object.server_side_encryption
'AES256'

You now understand how to add an extra layer of protection to your objects using the AES-256 server-side encryption algorithm offered by AWS.

Storage

Every object that you add to your S3 bucket is associated with a storage class. All the available storage classes offer high durability. You choose how you want to store your objects based on your application’s performance and access requirements.

At present, you can use the following storage classes with S3:

  • STANDARD: default for frequently accessed data
  • STANDARD_IA: for infrequently used data that needs to be retrieved rapidly when requested
  • ONEZONE_IA: for the same use case as STANDARD_IA, but stores the data in one Availability Zone instead of three
  • REDUCED_REDUNDANCY: for frequently used noncritical data that is easily reproducible

If you want to change the storage class of an existing object, you need to recreate the object.

For example, reupload the third_object and set its storage class to STANDARD_IA:

third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256',
                         'StorageClass': 'STANDARD_IA'})

Note: If you make changes to your object, you might find that your local instance doesn’t show them. What you need to do at that point is call .reload() to fetch the newest version of your object.

Reload the object, and you can see its new storage class:

>>> third_object.reload()
>>> third_object.storage_class
'STANDARD_IA'
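Reuploading is not the only way to recreate an object. As an alternative to the approach above, you can ask S3 to copy the object onto itself with a new storage class, using the Object's copy_from() method. This is a sketch under that assumption rather than the method used in this tutorial. `change_storage_class` is a hypothetical helper, and other settings such as encryption may need to be passed again, so check the copy_from() documentation before relying on it:

```python
def change_storage_class(s3_object, storage_class):
    """Rewrite an object in place with a new storage class (hypothetical helper).

    s3_object is expected to be a Boto3 s3.Object. The copy happens inside S3,
    so the file content is not sent over the network again.
    """
    return s3_object.copy_from(
        CopySource={'Bucket': s3_object.bucket_name, 'Key': s3_object.key},
        StorageClass=storage_class,
    )

# Example call, assuming third_object from the steps above:
#   change_storage_class(third_object, 'STANDARD_IA')
#   third_object.reload()
```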

Note: Use LifeCycle Configurations to transition objects through the different classes as you find the need for them. They will automatically transition these objects for you.

Versioning

You should use versioning to keep a complete record of your objects over time. It also acts as a protection mechanism against accidental deletion of your objects. When you request a versioned object, Boto3 will retrieve the latest version.

When you add a new version of an object, the storage that object takes in total is the sum of the size of its versions. So if you’re storing an object of 1 GB, and you create 10 versions, then you have to pay for 10 GB of storage.
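To see how the versions add up, you can sum their sizes. The sketch below assumes each version exposes a `size` attribute, as the items returned by a Bucket's object_versions collection do. The helper name `total_stored_bytes` and the sample data are made up so the example runs without AWS:

```python
from collections import namedtuple

def total_stored_bytes(object_versions):
    """Sum the size of every stored version.

    object_versions can be any iterable whose items have a `size` attribute,
    for example bucket.object_versions.all(). Entries without a size
    count as zero.
    """
    return sum(version.size or 0 for version in object_versions)

# Stand-in data: three versions of one key, sizes in bytes.
Version = namedtuple('Version', ['object_key', 'size'])
sample = [
    Version('report.txt', 300),
    Version('report.txt', 300),
    Version('report.txt', 450),
]
print(total_stored_bytes(sample))  # 1050
```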

Enable versioning for the first bucket. To do this, you need to use the BucketVersioning class:

def enable_bucket_versioning(bucket_name):
    bkt_versioning = s3_resource.BucketVersioning(bucket_name)
    bkt_versioning.enable()
    print(bkt_versioning.status)

>>> enable_bucket_versioning(first_bucket_name)
Enabled

Then create two new versions for the first file Object, one with the contents of the original file and one with the contents of the third file:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    first_file_name)
s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    third_file_name)

Now reupload the second file, which will create a new version:

s3_resource.Object(first_bucket_name, second_file_name).upload_file(
    second_file_name)

You can retrieve the latest available version of your objects like so:

>>> s3_resource.Object(first_bucket_name, first_file_name).version_id
'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'

In this section, you’ve seen how to work with some of the most important S3 attributes and add them to your objects. Next, you’ll see how to easily traverse your buckets and objects.

Traversals

If you need to retrieve information from or apply an operation to all your S3 resources, Boto3 gives you several ways to iteratively traverse your buckets and your objects. You’ll start by traversing all your created buckets.

Bucket Traversal

To traverse all the buckets in your account, you can use the resource’s buckets attribute alongside .all(), which gives you the complete list of Bucket instances:

>>> for bucket in s3_resource.buckets.all():
...     print(bucket.name)
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You can use the client to retrieve the bucket information as well, but the code is more complex, as you need to extract it from the dictionary that the client returns:

>>> for bucket_dict in s3_resource.meta.client.list_buckets().get('Buckets'):
...     print(bucket_dict['Name'])
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You have seen how to iterate through the buckets you have in your account. In the upcoming section, you’ll pick one of your buckets and iteratively view the objects it contains.

Object Traversal

If you want to list all the objects from a bucket, the following code will generate an iterator for you:

>>> for obj in first_bucket.objects.all():
...     print(obj.key)
...
127367firstfile.txt
616abesecondfile.txt
fb937cthirdfile.txt

The obj variable is an ObjectSummary. This is a lightweight representation of an Object. The summary version doesn’t support all of the attributes that the Object has. If you need to access them, use the Object() sub-resource to create a new reference to the underlying stored key. Then you’ll be able to extract the missing attributes:

>>> for obj in first_bucket.objects.all():
...     subsrc = obj.Object()
...     print(obj.key, obj.storage_class, obj.last_modified,
...           subsrc.version_id, subsrc.metadata)
...
127367firstfile.txt STANDARD 2018-10-05 15:09:46+00:00 eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv {}
616abesecondfile.txt STANDARD 2018-10-05 15:09:47+00:00 WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6 {}
fb937cthirdfile.txt STANDARD_IA 2018-10-05 15:09:05+00:00 null {}

You can now iteratively perform operations on your buckets and objects. You’re almost done. There’s one more thing you should know at this stage: how to delete all the resources you’ve created in this tutorial.

Deleting Buckets and Objects

To remove all the buckets and objects you have created, you must first make sure that your buckets have no objects within them.

Deleting a Non-empty Bucket

To be able to delete a bucket, you must first delete every single object within the bucket, or else the BucketNotEmpty exception will be raised. When you have a versioned bucket, you need to delete every object and all its versions.

If you find that a LifeCycle rule that will do this automatically for you isn’t suitable to your needs, here’s how you can programmatically delete the objects:

def delete_all_objects(bucket_name):
    res = []
    bucket = s3_resource.Bucket(bucket_name)
    for obj_version in bucket.object_versions.all():
        res.append({'Key': obj_version.object_key,
                    'VersionId': obj_version.id})
    print(res)
    bucket.delete_objects(Delete={'Objects': res})

The above code works whether or not you have enabled versioning on your bucket. If you haven’t, the version of the objects will be null. You can batch up to 1000 deletions in one API call, using .delete_objects() on your Bucket instance, which is more cost-effective than individually deleting each object.
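Because of that 1000-item limit, a bucket holding more versions than that needs its deletions split into several calls. Here is a minimal sketch of the splitting step. The `chunked` helper and the sample list are assumptions for illustration, and only the commented lines at the end would touch S3:

```python
def chunked(items, size=1000):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for the `res` list built in delete_all_objects().
pending = [{'Key': f'file-{n}.txt', 'VersionId': 'null'} for n in range(2500)]
batch_sizes = [len(batch) for batch in chunked(pending)]
print(batch_sizes)  # [1000, 1000, 500]

# Inside delete_all_objects(), the single call could then become:
#   for batch in chunked(res):
#       bucket.delete_objects(Delete={'Objects': batch})
```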

Run the new function against the first bucket to remove all the versioned objects:

>>> delete_all_objects(first_bucket_name)
[{'Key': '127367firstfile.txt', 'VersionId': 'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'}, {'Key': '127367firstfile.txt', 'VersionId': 'UnQTaps14o3c1xdzh09Cyqg_hq4SjB53'}, {'Key': '127367firstfile.txt', 'VersionId': 'null'}, {'Key': '616abesecondfile.txt', 'VersionId': 'WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6'}, {'Key': '616abesecondfile.txt', 'VersionId': 'null'}, {'Key': 'fb937cthirdfile.txt', 'VersionId': 'null'}]

As a final test, you can upload a file to the second bucket. This bucket doesn’t have versioning enabled, and thus the version will be null. Apply the same function to remove the contents:

>>> s3_resource.Object(second_bucket_name, first_file_name).upload_file(
...     first_file_name)
>>> delete_all_objects(second_bucket_name)
[{'Key': '9c8b44firstfile.txt', 'VersionId': 'null'}]

You’ve successfully removed all the objects from both your buckets. You’re now ready to delete the buckets.

Deleting Buckets

To finish off, you’ll use .delete() on your Bucket instance to remove the first bucket:

s3_resource.Bucket(first_bucket_name).delete()

If you want, you can use the client version to remove the second bucket:

s3_resource.meta.client.delete_bucket(Bucket=second_bucket_name)

Both the operations were successful because you emptied each bucket before attempting to delete it.

You’ve now run some of the most important operations that you can perform with S3 and Boto3. Congratulations on making it this far! As a bonus, let’s explore some of the advantages of managing S3 resources with Infrastructure as Code.

Python Code or Infrastructure as Code (IaC)?

As you’ve seen, most of the interactions you’ve had with S3 in this tutorial had to do with objects. You didn’t see many bucket-related operations, such as adding policies to the bucket; adding a LifeCycle rule to transition your objects through the storage classes, archive them to Glacier, or delete them altogether; or enforcing that all objects be encrypted by configuring Bucket Encryption.

Manually managing the state of your buckets via Boto3’s clients or resources becomes increasingly difficult as your application starts adding other services and grows more complex. To monitor your infrastructure in concert with Boto3, consider using an Infrastructure as Code (IaC) tool such as CloudFormation or Terraform to manage your application’s infrastructure. Either one of these tools will maintain the state of your infrastructure and inform you of the changes that you’ve applied.

If you decide to go down this route, keep the following in mind:

  • Any bucket-related operation that modifies the bucket in any way should be done via IaC.
  • If you want all your objects to act in the same way (all encrypted, or all public, for example), usually there is a way to do this directly using IaC, by adding a Bucket Policy or a specific Bucket property.
  • Bucket read operations, such as iterating through the contents of a bucket, should be done using Boto3.
  • Object-related operations at an individual object level should be done using Boto3.

Conclusion

Congratulations on making it to the end of this tutorial!

You’re now equipped to start working programmatically with S3. You now know how to create objects, upload them to S3, download their contents and change their attributes directly from your script, all while avoiding common pitfalls with Boto3.

May this tutorial be a stepping stone in your journey to building something great using AWS!




Stack Abuse: Python Dictionary Tutorial


Introduction

Python comes with a variety of built-in data structures, capable of storing different types of data. A Python dictionary is one such data structure that can store data in the form of key-value pairs. The values in a Python dictionary can be accessed using the keys. In this article, we will be discussing the Python dictionary in detail.

Creating a Dictionary

To create a Python dictionary, we need to pass a sequence of items inside curly braces {}, and separate them using a comma (,). Each item has a key and a value expressed as a "key:value" pair.

The values can belong to any data type and they can repeat, but the keys must be unique and hashable (for example strings, numbers, or tuples).

The following examples demonstrate how to create Python dictionaries:

Creating an empty dictionary:

dict_sample = {}  

Creating a dictionary with integer keys:

dict_sample = {1: 'mango', 2: 'pawpaw'}  

Creating a dictionary with mixed keys:

dict_sample = {'fruit': 'mango', 1: [4, 6, 8]}  

We can also create a dictionary by explicitly calling Python's dict() constructor:

dict_sample = dict({1:'mango', 2:'pawpaw'})  

A dictionary can also be created from a sequence as shown below:

dict_sample = dict([(1,'mango'), (2,'pawpaw')])  

Dictionaries can also be nested, which means that we can create a dictionary inside another dictionary. For example:

dict_sample = {1: {'student1' : 'Nicholas', 'student2' : 'John', 'student3' : 'Mercy'},  
        2: {'course1' : 'Computer Science', 'course2' : 'Mathematics', 'course3' : 'Accounting'}}
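To read a value from a nested dictionary, we chain the keys: the first key selects the inner dictionary and the second selects a value inside it. A short sketch using the same data (the variable names `second_student` and `first_course` are only illustrative):

```python
dict_sample = {
    1: {'student1': 'Nicholas', 'student2': 'John', 'student3': 'Mercy'},
    2: {'course1': 'Computer Science', 'course2': 'Mathematics', 'course3': 'Accounting'},
}

# The first key selects the inner dictionary, the second selects a value in it.
second_student = dict_sample[1]['student2']
first_course = dict_sample[2]['course1']

print(second_student)  # John
print(first_course)    # Computer Science
```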

To print the dictionary contents, we can use Python's print() function and pass the dictionary as the argument. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012}

Accessing Elements

To access dictionary items, pass the key inside square brackets []. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample["model"]  
print(x)  

Output:

Premio  

We created a dictionary named dict_sample. We then created a variable named x and set it to the value stored under the key "model" in the dictionary.

Here is another example:

# Avoid naming the variable "dict": that would shadow the built-in dict type.
student = {'Name': 'Mercy', 'Age': 23, 'Course': 'Accounting'}  
print("Student Name:", student['Name'])  
print("Course:", student['Course'])  
print("Age:", student['Age'])  

Output:

Student Name: Mercy  
Course: Accounting  
Age: 23  

The dictionary object also provides the get() method, which can be used to access dictionary elements as well. We call it on the dictionary using the dot operator and pass the key as the argument. Unlike square brackets, get() returns None instead of raising a KeyError when the key is missing. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample.get("model")  
print(x)  

Output:

Premio  
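The difference between the two access styles shows up when a key is missing. The sketch below uses a key, "color", that is deliberately absent from the dictionary; the variable names are only illustrative:

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

# get() returns None for a missing key unless a fallback value is supplied.
missing = dict_sample.get("color")
fallback = dict_sample.get("color", "unknown")

# Square brackets raise KeyError for the same missing key.
try:
    dict_sample["color"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)  # None unknown True
```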

Now we know how to access dictionary elements using a few different methods. In the next section we'll discuss how to add new elements to an already existing dictionary.

Adding Elements

There are numerous ways to add new elements to a dictionary. We can use a new index key and assign a value to it. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample["Capacity"] = "1800CC"  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012, 'Capacity': '1800CC'}

The new element has "Capacity" as the key and "1800CC" as its corresponding value. Since Python 3.7, dictionaries preserve insertion order, so it has been added as the last element of the dictionary.

Here is another example. First, let's create an empty dictionary:

MyDictionary = {}  
print("An Empty Dictionary: ")  
print(MyDictionary)  

Output:

An Empty Dictionary:  
{}

The dictionary prints as an empty pair of braces, {}, because it has nothing stored yet. Let us add some elements to it, one at a time:

MyDictionary[0] = 'Apples'  
MyDictionary[2] = 'Mangoes'  
MyDictionary[3] = 20  
print("\n3 elements have been added: ")  
print(MyDictionary)  

Output:

3 elements have been added:  
{0: 'Apples', 2: 'Mangoes', 3: 20}

To add the elements, we specified keys as well as the corresponding values. For example:

MyDictionary[0] = 'Apples'  

In the above example, 0 is the key while "Apples" is the value.

It is even possible for us to store several values under one key. For example:

MyDictionary['Values'] = 1, "Pairs", 4  
print("\n3 elements have been added: ")  
print(MyDictionary)  

Output:

3 elements have been added:  
{0: 'Apples', 2: 'Mangoes', 3: 20, 'Values': (1, 'Pairs', 4)}

In the above example, the name of the key is "Values" while everything after the = sign is packed into a single tuple, which becomes the value for that key.
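When several key-value pairs need to be added at once, the built-in update() method is an alternative to assigning keys one at a time. A minimal sketch, assuming Python 3.7 or later for the ordering shown in the comment:

```python
dict_sample = {"Company": "Toyota", "model": "Premio"}

# update() adds new keys and overwrites the values of keys that already exist.
dict_sample.update({"year": 2012, "model": "Allion"})

print(dict_sample)
# {'Company': 'Toyota', 'model': 'Allion', 'year': 2012}
```

Note that "model" keeps its original position and only its value changes, while "year" is appended at the end.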

Other than adding new elements to a dictionary, dictionary elements can also be updated/changed, which we'll go over in the next section.

Updating Elements

After adding an element to a dictionary, we can modify it. Use the key of the element to change the corresponding value. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

dict_sample["year"] = 2014

print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2014}

In this example you can see that we have updated the value for the key "year" from the old value of 2012 to a new value of 2014.

Removing Elements

The removal of an element from a dictionary can be done in several ways, which we'll discuss one-by-one in this section:

The del keyword can be used to remove the element with the specified key. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
del dict_sample["year"]  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

We used the del statement followed by the dictionary name. Inside the square brackets that follow the dictionary name, we passed the key of the element we need to delete from the dictionary, which in this example was "year". The entry for "year" in the dictionary was then deleted.

Another way to delete a key-value pair is to use the pop() function and pass the key of the entry to be deleted as the argument to the function. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.pop("year")  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

We invoked pop() on the dictionary using the dot operator. Again, in this example the entry for "year" is deleted from the dictionary.
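Unlike del, pop() also returns the value it removed, and it accepts an optional default that is returned when the key is missing. A short sketch (the key "color" is deliberately absent, and the variable names are only illustrative):

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

removed = dict_sample.pop("year")        # returns the removed value
absent = dict_sample.pop("color", None)  # no KeyError because a default is given

print(removed, absent)  # 2012 None
print(dict_sample)      # {'Company': 'Toyota', 'model': 'Premio'}
```

Without the second argument, popping a missing key raises a KeyError.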

The popitem() function removes the last item inserted into the dictionary, without needing to specify the key (this ordering is guaranteed from Python 3.7 onward; earlier versions removed an arbitrary item). Take a look at the following example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.popitem()  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

The last entry into the dictionary was "year". It has been removed after calling the popitem() function.
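popitem() returns the removed entry as a (key, value) tuple, so it can be used to drain a dictionary in last-in, first-out order. A minimal sketch, assuming Python 3.7 or later (`removed_pairs` is just an illustrative name):

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

removed_pairs = []
while dict_sample:  # an empty dictionary is falsy, so the loop stops by itself
    removed_pairs.append(dict_sample.popitem())

print(removed_pairs)
# [('year', 2012), ('model', 'Premio'), ('Company', 'Toyota')]
```

Calling popitem() on an empty dictionary raises a KeyError, which is why the loop checks the dictionary first.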

But what if you want to delete the entire dictionary? It would be difficult and cumbersome to use one of these methods on every single key. Instead, you can use the del keyword to delete the entire dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
del dict_sample  
print(dict_sample)  

Output:

NameError: name 'dict_sample' is not defined  

The code raises an error. The reason is that we are trying to access a name that no longer exists, since the dictionary it referred to has been deleted.

However, your use-case may require you to just remove all dictionary elements and be left with an empty dictionary. This can be achieved by calling the clear() function on the dictionary:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.clear()  
print(dict_sample)  

Output:

{}

The code returns an empty dictionary since all the dictionary elements have been removed.
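The difference between clear() and simply assigning a new empty dictionary matters when more than one name refers to the same dictionary. The following sketch uses illustrative variable names to show both cases:

```python
first = {"Company": "Toyota", "model": "Premio"}
first_alias = first   # both names refer to the same dictionary object
first.clear()         # empties that shared object in place
print(first_alias)    # {}

second = {"Company": "Toyota", "model": "Premio"}
second_alias = second
second = {}           # rebinds the name only; the old object is untouched
print(second_alias)   # {'Company': 'Toyota', 'model': 'Premio'}
```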

Other Common Methods

The len() Function

With this built-in function, you can count the number of elements in a dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
print(len(dict_sample))  

Output:

3  

There are three entries in the dictionary, hence the function returned 3.

The copy() Method

This method returns a copy of the existing dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample.copy()

print(x)  

Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012}

We created a copy of the dictionary named dict_sample and assigned it to the variable x. If x is printed on the console, you will see that it contains the same elements as those stored by the dict_sample dictionary.

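Note that this is useful because adding, removing, or reassigning keys in the copied dictionary won't affect the original one. The copy is shallow, however: mutable values such as lists are shared between the copy and the original.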
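The following sketch contrasts copy() with copy.deepcopy() from the standard library. The "colors" list is hypothetical sample data added only to give the dictionary a mutable value:

```python
import copy

original = {"model": "Premio", "colors": ["Gray", "White"]}

shallow = original.copy()
deep = copy.deepcopy(original)

shallow["model"] = "Allion"        # top-level change: original is unaffected
shallow["colors"].append("Black")  # the nested list is shared with the original
deep["colors"].append("Red")       # the deep copy owns its own list

print(original)
# {'model': 'Premio', 'colors': ['Gray', 'White', 'Black']}
print(deep["colors"])
# ['Gray', 'White', 'Red']
```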

The items() Method

When called, this method returns an iterable view object. The view yields the dictionary's key-value pairs as tuples. This method is primarily used when you want to iterate through a dictionary.

The method is simply called on the dictionary object name as shown below:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

for k, v in dict_sample.items():  
  print(k, v)

Output:

Company Toyota
model Premio
year 2012

The object returned by items() is a live view, so it reflects changes made to the dictionary after the view was created. This is demonstrated below:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.items()

print(x)

dict_sample["model"] = "Mark X"

print(x)  

Output:

dict_items([('Company', 'Toyota'), ('model', 'Premio'), ('year', 2012)])  
dict_items([('Company', 'Toyota'), ('model', 'Mark X'), ('year', 2012)])  

The output shows that when you change a value in the dictionary, the items object is also updated to reflect this change.
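Because items() yields (key, value) tuples, it also pairs naturally with dictionary comprehensions. Here is a minimal sketch using a hypothetical `prices` dictionary, assuming Python 3.7 or later for the ordering shown in the comments:

```python
prices = {"mango": 30, "pawpaw": 12, "apple": 25}  # hypothetical sample data

# Keep only the pairs whose value passes a test.
expensive = {name: price for name, price in prices.items() if price >= 25}

# Swap keys and values (safe here because the values are unique).
by_price = {price: name for name, price in prices.items()}

print(expensive)  # {'mango': 30, 'apple': 25}
print(by_price)   # {30: 'mango', 12: 'pawpaw', 25: 'apple'}
```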

The fromkeys() Method

This method returns a new dictionary with the specified keys, all mapped to the same value. It takes the syntax given below:

dictionary.fromkeys(keys, value)  

The required keys parameter is an iterable that specifies the keys for the new dictionary. The value parameter is optional and specifies the value assigned to every key. It defaults to None.

Suppose we need to create a dictionary of three keys all with the same value. We can do so as follows:

name = ('John', 'Nicholas', 'Mercy')  
age = 25

dict_sample = dict.fromkeys(name, age)


print(dict_sample)  

Output:

{'John': 25, 'Nicholas': 25, 'Mercy': 25}

In the script above, we specified the keys and one value. The fromkeys() method was able to pick the keys and combine them with this value to create a populated dictionary.

The keys parameter is mandatory. The following example demonstrates what happens when the value parameter is not specified:

name = ('John', 'Nicholas', 'Mercy')

dict_sample = dict.fromkeys(name)


print(dict_sample)  

Output:

{'John': None, 'Nicholas': None, 'Mercy': None}

The default value, which is None, was used.
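One thing to watch for: fromkeys() assigns the very same value object to every key. With an immutable value such as 25 or None this makes no difference, but with a mutable value such as a list, every key shares one list. The sketch below illustrates this, with a dictionary comprehension as one possible alternative:

```python
names = ('John', 'Nicholas', 'Mercy')

shared = dict.fromkeys(names, [])
shared['John'].append('Maths')  # every key points at the same list
print(shared)
# {'John': ['Maths'], 'Nicholas': ['Maths'], 'Mercy': ['Maths']}

separate = {name: [] for name in names}  # a fresh list per key
separate['John'].append('Maths')
print(separate)
# {'John': ['Maths'], 'Nicholas': [], 'Mercy': []}
```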

The setdefault() Method

This method is applicable when we need to get the value of the element with the specified key. If the key is not found, it will be inserted into the dictionary alongside the specified value.

The method takes the following syntax:

dictionary.setdefault(keyname, value)  

In this method the keyname parameter is required. It is the key of the item whose value you want returned. The value parameter is optional and defaults to None. If the dictionary already has the key, this parameter won't have any effect. If the key doesn't exist, it is inserted with the given value.

For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.setdefault("color", "Gray")

print(x)  

Output:

Gray  

The dictionary doesn't have a "color" key. The setdefault() method has inserted this key with the specified value, "Gray".

The following example demonstrates how the method behaves if the key already exists:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.setdefault("model", "Allion")

print(x)  

Output:

Premio  

The value "Allion" has no effect on the dictionary since we already have a value for the key.
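A common use of setdefault() is grouping items under a key without first checking whether the key exists. The `students` list below is hypothetical sample data, and the ordering in the comment assumes Python 3.7 or later:

```python
students = [('Accounting', 'Mercy'), ('Mathematics', 'John'), ('Accounting', 'Nicholas')]

by_course = {}
for course, name in students:
    # Create the list the first time a course is seen, then append to it.
    by_course.setdefault(course, []).append(name)

print(by_course)
# {'Accounting': ['Mercy', 'Nicholas'], 'Mathematics': ['John']}
```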

The keys() Method

This method also returns an iterable view object. The view contains all the keys in the dictionary. And just like with the items() method, the returned object reflects changes made to the dictionary.

To use this method, we only call it on the name of the dictionary, as shown below:

dictionary.keys()  

For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.keys()

print(x)  

Output:

dict_keys(['Company', 'model', 'year'])  

Often this method is used to iterate through each key in your dictionary, like so:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

for k in dict_sample.keys():  
  print(k)

Output:

Company  
model  
year  
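Key views also behave like sets, so they can be compared across dictionaries with the usual set operators. A brief sketch, where the `bike` dictionary is hypothetical data introduced only for the comparison:

```python
car = {"Company": "Toyota", "model": "Premio", "year": 2012}
bike = {"Company": "Yamaha", "model": "MT-07", "engine": "689cc"}  # hypothetical data

common = car.keys() & bike.keys()    # keys present in both dictionaries
only_car = car.keys() - bike.keys()  # keys present only in car

print(sorted(common))    # ['Company', 'model']
print(sorted(only_car))  # ['year']
print("year" in car)     # True
```

The results of & and - are ordinary sets, which is why they are sorted before printing.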

Conclusion

This marks the end of this tutorial on Python dictionaries. These dictionaries store data in "key:value" pairs. The "key" acts as the identifier for the item while "value" is the value of the item. The Python dictionary comes with a variety of methods that can be applied for retrieval or manipulation of data. In this article, we saw how a Python dictionary can be created, modified and deleted, along with some of the most commonly used dictionary methods.

The No Title® Tech Blog: Haiku R1/beta1 review - revisiting BeOS, 18 years after its latest official release


Having experimented with and used BeOS R5 Pro back in the early 2000s, when the company that created it was just going down, I have been following with some interest the development of Haiku during all these years. While one can argue that both the old BeOS and Haiku miss some important features to be considered modern OSes these days, the fact is that a lightweight operating system can always be, for instance, an excellent way to bring new life into old, or new but less powerful, hardware.

Talk Python to Me: #182 Picture Python at Shutterfly

Join me and Doug Farrell as we discuss his career and what he's up to at Shutterfly. You'll learn about the Python stack he's using to work not just with bits and bytes, but with physical devices on a production line for creating all sorts of picturesque items. You'll also hear how both he and I feel it's a great time to be a developer, even if you're on the older side of 30 or 40 or beyond.

Davy Wybiral: LoRa IoT Network Programming | RYLR896

Hey everyone, so I just got some LoRa modules from REYAX to experiment with long range network applications and these things are so cool! So far I've made a long range security alarm, a button to water plants on the other side of my property, and some bridge code to interact with IP and BLE networks.

Just thought I'd do a quick video update on this stuff:



The module I wrote is part of the Espruino collection now: https://www.espruino.com/RYLR

I got these LoRa devices from REYAX: https://reyax.com/products/rylr896/

They seem to only sell them on eBay right now: RYLR896

Mike Driscoll: Python 101: Episode #29 – Installing Packages


In this screencast we will learn how to install 3rd party modules and packages using easy_install, pip and from source.

You can also read the chapter this video is based on here or get the book on Leanpub

PyBites: Data Analysis of Pybites Community Branch Activity

0.5em 0.4em; color: #000; box-sizing: border-box; -moz-box-sizing: border-box; -webkit-box-sizing: border-box; } a.anchor-link:link { text-decoration: none; padding: 0px 20px; visibility: hidden; } h1:hover .anchor-link, h2:hover .anchor-link, h3:hover .anchor-link, h4:hover .anchor-link, h5:hover .anchor-link, h6:hover .anchor-link { visibility: visible; } .text_cell.rendered .input_area { display: none; } .text_cell.rendered .text_cell.unrendered .text_cell_render { display: none; } .cm-header-1, .cm-header-2, .cm-header-3, .cm-header-4, .cm-header-5, .cm-header-6 { font-weight: bold; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; } .cm-header-1 { font-size: 185.7%; } .cm-header-2 { font-size: 157.1%; } .cm-header-3 { font-size: 128.6%; } .cm-header-4 { font-size: 110%; } .cm-header-5 { font-size: 100%; font-style: italic; } .cm-header-6 { font-size: 100%; font-style: italic; } .highlight .hll { background-color: #ffffcc } .highlight { background: #f8f8f8; } .highlight .c { color: #408080; font-style: italic } /* Comment */ .highlight .err { border: 1px solid #FF0000 } /* Error */ .highlight .k { color: #008000; font-weight: bold } /* Keyword */ .highlight .o { color: #666666 } /* Operator */ .highlight .ch { color: #408080; font-style: italic } /* Comment.Hashbang */ .highlight .cm { color: #408080; font-style: italic } /* Comment.Multiline */ .highlight .cp { color: #BC7A00 } /* Comment.Preproc */ .highlight .cpf { color: #408080; font-style: italic } /* Comment.PreprocFile */ .highlight .c1 { color: #408080; font-style: italic } /* Comment.Single */ .highlight .cs { color: #408080; font-style: italic } /* Comment.Special */ .highlight .gd { color: #A00000 } /* Generic.Deleted */ .highlight .ge { font-style: italic } /* Generic.Emph */ .highlight .gr { color: #FF0000 } /* Generic.Error */ .highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */ .highlight .gi { color: #00A000 } /* Generic.Inserted */ .highlight .go { color: 
#888888 } /* Generic.Output */ .highlight .gp { color: #000080; font-weight: bold } /* Generic.Prompt */ .highlight .gs { font-weight: bold } /* Generic.Strong */ .highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */ .highlight .gt { color: #0044DD } /* Generic.Traceback */ .highlight .kc { color: #008000; font-weight: bold } /* Keyword.Constant */ .highlight .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */ .highlight .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */ .highlight .kp { color: #008000 } /* Keyword.Pseudo */ .highlight .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */ .highlight .kt { color: #B00040 } /* Keyword.Type */ .highlight .m { color: #666666 } /* Literal.Number */ .highlight .s { color: #BA2121 } /* Literal.String */ .highlight .na { color: #7D9029 } /* Name.Attribute */ .highlight .nb { color: #008000 } /* Name.Builtin */ .highlight .nc { color: #0000FF; font-weight: bold } /* Name.Class */ .highlight .no { color: #880000 } /* Name.Constant */ .highlight .nd { color: #AA22FF } /* Name.Decorator */ .highlight .ni { color: #999999; font-weight: bold } /* Name.Entity */ .highlight .ne { color: #D2413A; font-weight: bold } /* Name.Exception */ .highlight .nf { color: #0000FF } /* Name.Function */ .highlight .nl { color: #A0A000 } /* Name.Label */ .highlight .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */ .highlight .nt { color: #008000; font-weight: bold } /* Name.Tag */ .highlight .nv { color: #19177C } /* Name.Variable */ .highlight .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */ .highlight .w { color: #bbbbbb } /* Text.Whitespace */ .highlight .mb { color: #666666 } /* Literal.Number.Bin */ .highlight .mf { color: #666666 } /* Literal.Number.Float */ .highlight .mh { color: #666666 } /* Literal.Number.Hex */ .highlight .mi { color: #666666 } /* Literal.Number.Integer */ .highlight .mo { color: #666666 } /* Literal.Number.Oct */ .highlight .sa 
{ color: #BA2121 } /* Literal.String.Affix */ .highlight .sb { color: #BA2121 } /* Literal.String.Backtick */ .highlight .sc { color: #BA2121 } /* Literal.String.Char */ .highlight .dl { color: #BA2121 } /* Literal.String.Delimiter */ .highlight .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */ .highlight .s2 { color: #BA2121 } /* Literal.String.Double */ .highlight .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */ .highlight .sh { color: #BA2121 } /* Literal.String.Heredoc */ .highlight .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */ .highlight .sx { color: #008000 } /* Literal.String.Other */ .highlight .sr { color: #BB6688 } /* Literal.String.Regex */ .highlight .s1 { color: #BA2121 } /* Literal.String.Single */ .highlight .ss { color: #19177C } /* Literal.String.Symbol */ .highlight .bp { color: #008000 } /* Name.Builtin.Pseudo */ .highlight .fm { color: #0000FF } /* Name.Function.Magic */ .highlight .vc { color: #19177C } /* Name.Variable.Class */ .highlight .vg { color: #19177C } /* Name.Variable.Global */ .highlight .vi { color: #19177C } /* Name.Variable.Instance */ .highlight .vm { color: #19177C } /* Name.Variable.Magic */ .highlight .il { color: #666666 } /* Literal.Number.Integer.Long */ /* Temporary definitions which will become obsolete with Notebook release 5.0 */ .ansi-black-fg { color: #3E424D; } .ansi-black-bg { background-color: #3E424D; } .ansi-black-intense-fg { color: #282C36; } .ansi-black-intense-bg { background-color: #282C36; } .ansi-red-fg { color: #E75C58; } .ansi-red-bg { background-color: #E75C58; } .ansi-red-intense-fg { color: #B22B31; } .ansi-red-intense-bg { background-color: #B22B31; } .ansi-green-fg { color: #00A250; } .ansi-green-bg { background-color: #00A250; } .ansi-green-intense-fg { color: #007427; } .ansi-green-intense-bg { background-color: #007427; } .ansi-yellow-fg { color: #DDB62B; } .ansi-yellow-bg { background-color: #DDB62B; } .ansi-yellow-intense-fg { 
color: #B27D12; } .ansi-yellow-intense-bg { background-color: #B27D12; } .ansi-blue-fg { color: #208FFB; } .ansi-blue-bg { background-color: #208FFB; } .ansi-blue-intense-fg { color: #0065CA; } .ansi-blue-intense-bg { background-color: #0065CA; } .ansi-magenta-fg { color: #D160C4; } .ansi-magenta-bg { background-color: #D160C4; } .ansi-magenta-intense-fg { color: #A03196; } .ansi-magenta-intense-bg { background-color: #A03196; } .ansi-cyan-fg { color: #60C6C8; } .ansi-cyan-bg { background-color: #60C6C8; } .ansi-cyan-intense-fg { color: #258F8F; } .ansi-cyan-intense-bg { background-color: #258F8F; } .ansi-white-fg { color: #C5C1B4; } .ansi-white-bg { background-color: #C5C1B4; } .ansi-white-intense-fg { color: #A1A6B2; } .ansi-white-intense-bg { background-color: #A1A6B2; } .ansi-bold { font-weight: bold; }

Pybites Community Branch Activity

I wanted to play around with a dataset and see what I could find out about it. I decided on analyzing the little bit of data that I could collect from Github without having to use an OAuth key, which limits it to just 300 events.

To Run All of The Cells

You have the option of running each of the cells one at a time, or you can just run them all in sequential order. Selecting a cell and either clicking the Run button on the menu or pressing Shift+Enter will run that cell if it is a code cell.

To run them all you will have to use the menu: Cell > Run All

In [1]:
import json
from collections import Counter
from pathlib import Path

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from dateutil.parser import parse
from matplotlib import rc
from matplotlib.pyplot import figure
In [2]:
data_location = Path.cwd().joinpath("data")

Retrieving and Importing the Data

The following code will load the three event JSON files in the data directory if the data directory exists. If the directory is not found, it will be created, and the files will be pulled down from GitHub and then loaded into memory.

In [3]:
def retrieve_data():
    # create the data directory on first use
    if not data_location.exists():
        data_location.mkdir()
    url = "https://api.github.com/repos/pybites/challenges/events?page={}&per_page=1000"
    # the events are spread over three pages
    for page in range(1, 4):
        response = requests.get(url.format(page))
        if response.ok:
            file_name = data_location.joinpath(f"events{page}.json")
            try:
                file_name.write_text(json.dumps(response.json()))
                print(f"  Created: {file_name.name}")
            except Exception as e:
                print(e)
        else:
            print(f"Something went wrong: {response.status_code}: {response.reason}")


def load_data():
    if data_location.exists():
        for page in range(1, 4):
            file_name = data_location.joinpath(f"events{page}.json")
            events.extend(json.loads(file_name.read_text()))
            print(f"  Loaded: {file_name.name}")
    else:
        # nothing on disk yet: download the files, then load them
        print("Data directory was not found:")
        retrieve_data()
        load_data()

NOTE: If you want to work with the latest data, just remove the data directory and all its contents to have it pulled down once again.

In [4]:
events = []
load_data()
print(f"Total Events Loaded: {len(events)}")
  Loaded: events1.json
  Loaded: events2.json
  Loaded: events3.json
Total Events Loaded: 300

Parsing the Data

From what I hear, we should just get used to cleaning data up before we can use it, and it's no exception here. I'm interested in exploring a few key points from the data. Mostly I'm interested in the following:

  • Pull Request Events
  • Date that they were created
  • The username of the developer
  • The amount of time spent on the challenge
  • How difficult they found the challenge to be
In [5]:
# helper function
def parse_data(line):
    if '[' in line:
        data = line.split(': [')[1].replace(']', '').strip()
    else:
        data = line.split(': ')[1].strip()
    return data


# lists to store the data
created = []
devs = []
diff_levels = []
time_spent = []

for event in events:
    # only interested in pull request events
    if event['type'] == 'PullRequestEvent':
        # developer username
        dev = event['actor']['login']

        # ignore pybites ;)
        if dev != 'pybites':
            # store developer username
            devs.append(dev)

            # store the date
            created.append(event['created_at'].split('T')[0])

            # parse comment from user for data
            comment = event['payload']['pull_request']['body']
            for line in comment.split('\n'):
                # get difficulty level and time spent
                if 'Difficulty level (1-10):' in line:
                    diff = parse_data(line)
                elif 'Estimated time spent (hours):' in line:
                    spent = parse_data(line)

            # pandas DataFrames require that all columns are the same length
            # so if we have a missing value, None is used in its place
            if diff:
                diff_levels.append(int(diff))
            else:
                diff_levels.append(None)

            if spent:
                time_spent.append(int(spent))
            else:
                time_spent.append(None)
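To make the helper's behaviour concrete, here is a small self-contained sketch that runs a copy of parse_data on three made-up template lines: one with square brackets around the value, one without, and one left unfilled. The sample lines are assumptions for illustration and are not taken from the dataset.

```python
def parse_data(line):
    """Return the value after the colon, with optional square brackets removed."""
    if '[' in line:
        data = line.split(': [')[1].replace(']', '').strip()
    else:
        data = line.split(': ')[1].strip()
    return data


# hypothetical template lines, as they might appear in a pull request body
samples = [
    'Difficulty level (1-10): [4]',
    'Estimated time spent (hours): 2',
    'Difficulty level (1-10): [ ]',
]
parsed = [parse_data(sample) for sample in samples]
print(parsed)
```

The unfilled line comes back as an empty string, which is falsy, so a check such as "if diff:" would store None for it.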

Creating The DataFrame

Now that we have the lists with the data that we parsed, a DataFrame can be created with them.

In [6]:
df = pd.DataFrame({
    'Developers': devs,
    'Difficulty_Levels': diff_levels,
    'Time_Spent': time_spent,
    'Date': created,
})

Data Exploration

Here, we can start exploring the data. To take a quick peek at how it's looking, there is no better choice than head().

In [7]:
df.head()
Out[7]:
   Developers  Difficulty_Levels  Time_Spent        Date
0   cod3Ghoul                4.0        20.0  2018-10-17
1   YauheniKr                4.0         2.0  2018-10-16
2   YauheniKr                4.0         2.0  2018-10-16
3    clamytoe                6.0         6.0  2018-10-15
4   vipinreyo                4.0         4.0  2018-10-15

To get some quick statistical metrics on the dataset, describe() can be used.

In [8]:
df.describe()
Out[8]:
       Difficulty_Levels  Time_Spent
count          44.000000   44.000000
mean            3.681818    3.090909
std             1.639239    3.297767
min             1.000000    1.000000
25%             2.000000    1.000000
50%             4.000000    2.000000
75%             5.000000    4.000000
max             8.000000   20.000000

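Based on what I could see above, I wanted to get a feel for the following figures. The median difficulty level is shown above, next to 50%, but I also wanted to show you how to pull that out individually. Note that the code labels it "Average Difficulty" even though it calls median(); the mean is the 3.68 reported by describe().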

In [9]:
print(f'Developers: {len(df["Developers"])}')
print(f'Average Difficulty: {df["Difficulty_Levels"].median()}')
print(f'Time Spent: {df["Time_Spent"].sum()}')
Developers: 53
Average Difficulty: 4.0
Time Spent: 136.0

The following Counters are just me exploring the data even further.

In [10]:
developers = Counter(df['Developers']).most_common(6)
developers
Out[10]:
[('clamytoe', 8),
 ('sorian', 8),
 ('vipinreyo', 7),
 ('demarcoz', 4),
 ('bbelderbos', 3),
 ('mridubhatnagar', 3)]
In [11]:
bite_difficulty = Counter(df['Difficulty_Levels'].dropna()).most_common()
bite_difficulty
Out[11]:
[(4.0, 13), (2.0, 8), (3.0, 7), (6.0, 6), (5.0, 5), (1.0, 4), (8.0, 1)]
In [12]:
bite_duration = Counter(df['Time_Spent'].dropna()).most_common()
bite_duration
Out[12]:
[(1.0, 16),
 (2.0, 10),
 (3.0, 6),
 (4.0, 4),
 (8.0, 3),
 (6.0, 2),
 (5.0, 2),
 (20.0, 1)]
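
As a cross-check, the two tallies above are enough to recompute the summary figures without pandas. The sketch below copies the counts by hand into Counter objects and uses the standard statistics module; the variable names are made up for this illustration.

```python
from collections import Counter
from statistics import mean, median

# tallies copied from the two Counter outputs above: {value: occurrences}
difficulty_counts = Counter({4.0: 13, 2.0: 8, 3.0: 7, 6.0: 6, 5.0: 5, 1.0: 4, 8.0: 1})
duration_counts = Counter({1.0: 16, 2.0: 10, 3.0: 6, 4.0: 4, 8.0: 3, 6.0: 2, 5.0: 2, 20.0: 1})

# Counter.elements() repeats each value as many times as it was counted
difficulties = sorted(difficulty_counts.elements())
durations = sorted(duration_counts.elements())

print("rows:", len(difficulties), len(durations))
print("difficulty mean/median:", round(mean(difficulties), 6), median(difficulties))
print("time spent sum/mean:", sum(durations), round(mean(durations), 6))
```

Both columns have 44 non-null rows, the difficulty mean (about 3.68) and median (4.0) differ, and the time spent adds up to 136 hours, in line with the describe() table and the printed totals.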
In [13]:
created_at = sorted(Counter(df['Date'].dropna()).most_common())
created_at
Out[13]:
[('2018-10-01', 1),
 ('2018-10-02', 6),
 ('2018-10-03', 3),
 ('2018-10-04', 4),
 ('2018-10-05', 8),
 ('2018-10-07', 7),
 ('2018-10-08', 4),
 ('2018-10-09', 2),
 ('2018-10-10', 1),
 ('2018-10-11', 1),
 ('2018-10-12', 4),
 ('2018-10-13', 3),
 ('2018-10-14', 3),
 ('2018-10-15', 3),
 ('2018-10-16', 2),
 ('2018-10-17', 1)]

Hmm, how many days are we looking at?

In [14]:
len(created_at)
Out[14]:
16
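
Sixteen is the number of days that saw at least one pull request, which is not necessarily the length of the period. A minimal sketch, with the dates copied by hand from the output above, that compares the active days against the full calendar span:

```python
from datetime import date, timedelta

# the sixteen dates listed in the Counter output above
active_days = [
    '2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
    '2018-10-05', '2018-10-07', '2018-10-08', '2018-10-09',
    '2018-10-10', '2018-10-11', '2018-10-12', '2018-10-13',
    '2018-10-14', '2018-10-15', '2018-10-16', '2018-10-17',
]

parsed_days = {date.fromisoformat(day) for day in active_days}
first, last = min(parsed_days), max(parsed_days)

# every calendar day between the first and last pull request, inclusive
span = [first + timedelta(days=offset) for offset in range((last - first).days + 1)]
quiet_days = [day.isoformat() for day in span if day not in parsed_days]

print(len(span), quiet_days)
```

The period covers 17 calendar days, and 2018-10-06 is the only one without a pull request.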

Time To Get Down To Business

Now that we've loaded our data and cleaned it up, let's see what it can tell us.

Number of Pull Request per Day

Pretty amazing that the Pybites Blog Challenges had at least 300 distinct GitHub interactions (the most that could be collected) in such a short time!

In [15]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# gather data into a custom DataFrame
dates = [day[0] for day in created_at]
prs = [pr[1] for pr in created_at]
df_prs = pd.DataFrame({'xvalues': dates, 'yvalues': prs})

# plot
plt.plot('xvalues', 'yvalues', data=df_prs)

# labels
plt.xticks(rotation='vertical', fontweight='bold')

# title
plt.title('Number of Pull Request per Day')

# show the graphic
plt.show()

Top Blog Challenge Ninjas

Although there are many more contributors, I had to limit the count so that the data would be easier to visualize.

In [16]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# create labels
labels = [dev[0] for dev in developers]

# get a count of the pull requests
prs = [dev[1] for dev in developers]

# pull out top ninja slice
explode = [0] * len(developers)
explode[0] = 0.1

# create the pie chart
plt.pie(prs, explode=explode, labels=labels, shadow=True, startangle=90)

# add title and center
plt.axis('equal')
plt.title('Top Blog Challenge Ninjas')

# show the graphic
plt.show()

Time Spent vs Difficulty Level per Pull Request

Finally, I wanted to explore the relation between the time spent per PR and how difficult the developer found the challenge to be.

In [17]:
# resize graph
figure(num=None, figsize=(15, 6), dpi=80, facecolor='w', edgecolor='k')

# drop null values
df_clean = df.dropna()

# add legend
diff = mpatches.Patch(color='#557f2d', label='Difficulty Level')
time = mpatches.Patch(color='#2d7f5e', label='Time Spent')
plt.legend(handles=[time, diff])

# y-axis in bold
rc('font', weight='bold')

# values of each group
bars1 = df_clean['Difficulty_Levels']
bars2 = df_clean['Time_Spent']

# heights of bars1 + bars2
bars = df_clean['Difficulty_Levels'] + df_clean['Time_Spent']

# position of the bars on the x-axis
r = range(len(df_clean))

# names of group and bar width
names = df_clean['Developers']
barWidth = 1

# create green bars (bottom)
plt.bar(r, bars1, color='#557f2d', edgecolor='white', width=barWidth)

# create green bars (top), on top of the first ones
plt.bar(r, bars2, bottom=bars1, color='#2d7f5e', edgecolor='white', width=barWidth)

# custom X axis
plt.xticks(r, names, rotation='vertical', fontweight='bold')
plt.xlabel("Developers", fontweight='bold')

# title
plt.title('Time Spent vs Difficulty Level per Pull Request')

# show graphic
plt.show()

Conclusions

As you can see, the Pybites Ninjas are an active bunch. Even with such a small dataset, it's plain to see that some good information can be extracted from it. It would be interesting to see which challenges are getting the most action, but I'll leave that as an exercise for you to explore!

PyCharm: PyCharm 2018.3 EAP 7

PyCharm 2018.3 EAP 7 is out! Get it now from the JetBrains website.

In this EAP we have introduced a host of new features as well as fixed bugs for various subsystems.

Read the Release Notes

New in This Version

WSL Support

We have some great news for Windows users: PyCharm now supports Windows Subsystem for Linux (WSL). With support for WSL, you can select a WSL-based Python interpreter in PyCharm’s project interpreter settings and then run and debug your project or perform any other actions as if you had a local interpreter set up. There’s only one exception: you won’t be able to create virtual environments with WSL-based interpreters. All packages have to be installed on the corresponding WSL system interpreter. Before trying this new type of Python interpreter in PyCharm, please make sure you have properly installed WSL.

Read more about WSL support in the PyCharm Documentation.

Structure of ‘from’ Imports

The new “Structure of ‘from’ imports” set of style options is available under Settings (Preferences) | Editor | Code Style | Python. Using these options, you can control the code style for imports when optimizing imports (Ctrl(Cmd)+Alt+O): either join imports into one line or split them by placing each on a new line.

Read more about the other code style options available.

Support for Python Stub Files and PEP-561

PyCharm has been supporting Python stub files (.pyi) for a while. These files let you specify type hints using Python 3 syntax for both Python 2 and 3. PyCharm shows an asterisk in the left-hand gutter for those code elements that have stubs. Clicking the asterisk results in jumping to the corresponding stub:

[Screenshot: elements with stubs]

With the PEP-561 support introduced in this PyCharm 2018.3 EAP build, you can install stubs as packages for a Python 3.7 interpreter:

[Screenshot: installing a stub package]

Read more about the Python stub files support in the PyCharm Documentation.
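
As a rough illustration of what goes into a stub file, assume a hypothetical module greet.py that defines a greet function and a Greeter class. The matching greet.pyi would hold only the signatures, written in Python 3 annotation syntax. All names below are invented for this sketch and do not come from PyCharm or any real package.

```python
# greet.pyi: hypothetical stub for an imagined greet.py module.
# A stub holds signatures only; "..." stands in for bodies and default values.
from typing import List, Optional


def greet(name: str, times: int = ...) -> str: ...


class Greeter:
    greeting: str

    def __init__(self, greeting: str = ...) -> None: ...
    def greet_all(self, names: List[str], sep: Optional[str] = ...) -> str: ...
```

A type checker or IDE reads these signatures in place of the ones in the implementation file, which is how Python 2 code can still be described with Python 3 style hints.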

Time Tracking

With PyCharm’s built-in Time Tracking plugin, you can track the amount of time you spend on a task when working in the editor. To enable this feature, go to Settings/Preferences | Tools | Tasks | Time Tracking and select the Enable Time Tracking checkbox. Once enabled, you can start using the tool to track and record your productivity.

Read more about the Time Tracking tool in the PyCharm documentation.

Copyright Notices in Project Files

Inserting copyright notices in the project files can be daunting. PyCharm makes it easier with its new “Copyright”-related set of settings and features. Set different copyright profiles along with the project scopes that they apply to in Settings (Preferences) | Copyright. After you have your copyright profiles in place, generate copyright notices by pressing Alt + Insert anywhere in a file.

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to keep up to date with the latest releases throughout the entire EAP.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP, and stay up to date. You can find the installation instructions on our website.

PyCharm 2018.3 is in constant development during the EAP phase, therefore not all new features are already available. More features will be added in the coming weeks. As PyCharm 2018.3 is pre-release software, it is worth noting that it is not as stable as the release versions. Furthermore, we may decide to change and/or drop certain features as the EAP progresses.

All EAP versions will ship with a built-in EAP license, which means that these versions are free to use for up to 30 days after the day that they are built. As EAPs are released weekly, you’ll be able to use PyCharm Professional Edition EAP for free for the duration of the EAP program, as long as you upgrade at least once every 30 days.

 

Stack Abuse: Getting User Input in Python

Introduction

The way in which information is obtained and handled is one of the most important aspects in the ethos of any programming language, more so for the information supplied and obtained from the user.

Python, while slower in this regard than programming languages like C or Java, contains robust tools to obtain, analyze, and process data obtained directly from the end user.

This article briefly explains how different Python functions can be used to obtain information from the user through the keyboard, with the help of some code snippets to serve as examples.

Input in Python

To receive information through the keyboard, Python uses either the input() or raw_input() functions (more about the difference between the two in the following section). These functions have an optional parameter, commonly known as prompt, which is a string that will be printed on the screen whenever the function is called.

When one of the input() or raw_input() functions is called, the program flow stops until the user enters the input via the command line. To actually enter the data, the user needs to press the ENTER key after typing their string. While hitting the ENTER key usually inserts a newline character ("\n"), it does not in this case. The entered string will simply be submitted to the application.
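
One way to see that the newline is dropped, without typing anything, is to let input() read from an in-memory text stream instead of the keyboard. The sketch below is for Python 3 and temporarily swaps sys.stdin for an io.StringIO; the helper name read_with_fake_stdin is made up for this illustration.

```python
import io
import sys


def read_with_fake_stdin(typed):
    """Call input() while sys.stdin is temporarily replaced by the given text."""
    original_stdin = sys.stdin
    sys.stdin = io.StringIO(typed)
    try:
        return input()
    finally:
        # always restore the real stdin, even if input() raises
        sys.stdin = original_stdin


# the trailing "\n" plays the role of the ENTER key
answer = read_with_fake_stdin("Let the Code be with you!\n")
print(repr(answer))
```

The repr shows the string without a trailing "\n", so no stripping is needed before using the value.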

On a curious note, reading a line of text from the user works much the same way in Python 2 and 3; what changed is which function name provides it, as explained in the next section.

Comparing the input and raw_input Functions

The difference when using these functions only depends on what version of Python is being used. For Python 2, the function raw_input() is used to get string input from the user via the command line, while the input() function will actually evaluate the input string as a Python expression and return the result.

In Python 3, the raw_input() function has been renamed to input(), which is used to obtain a user's string through the keyboard. The evaluating input() function of Python 2 is discontinued in version 3. To obtain the same functionality that was provided by Python 2's input() function, the statement eval(input()) must be used in Python 3. Keep in mind that eval() runs whatever the user types, so it should not be used on untrusted input.
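
When the goal is only to accept literals such as numbers or lists, a safer alternative to eval() is ast.literal_eval from the standard library, which refuses anything that is not a plain literal. This is a swapped-in technique, not what Python 2's input() did; the helper name parse_literal is an assumption made for this sketch.

```python
import ast


def parse_literal(text):
    """Turn typed text into a Python literal without executing any code."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # not a plain literal: keep the raw string rather than running it
        return text


print(repr(parse_literal("13")))
print(repr(parse_literal("[1, 2, 3]")))
print(repr(parse_literal("__import__('os').getcwd()")))
```

In a program this would typically wrap the call to input(), for example parse_literal(input("Enter a value: ")). The third call shows that a function call is handed back as a plain string instead of being executed.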

Take a look at an example of raw_input function in Python 2.

# Python 2

txt = raw_input("Type something to test this out: ")  
print "Is this what you just said?", txt  

Output

Type something to test this out: Let the Code be with you!

Is this what you just said? Let the Code be with you!  

Similarly, take a look at an example of input function in Python 3.

# Python 3

txt = input("Type something to test this out: ")

# Note that in version 3, the print() function
# requires the use of parentheses.
print("Is this what you just said?", txt)

Output

Type something to test this out: Let the Code be with you!  
Is this what you just said? Let the Code be with you!  

From here onwards this article will use the input method from Python 3, unless specified otherwise.

String and Numeric input

The input() function, by default, will convert all the information it receives into a string. The previous example we showed demonstrates this behavior.

Numbers, on the other hand, need to be explicitly handled as such since they come in as strings originally. The following example demonstrates how numeric type information is received:

# An input is requested and stored in a variable
test_text = input("Enter a number: ")

# Converts the string into an integer. If you need
# to convert the user input into decimal format,
# the float() function is used instead of int()
test_number = int(test_text)

# Prints in the console the variable as requested
print("The number you entered is:", test_number)

Output

Enter a number: 13  
The number you entered is: 13  

Another way to do the same thing is as follows:

test_number = int(input("Enter a number: "))  

Here we directly save the input, after immediate conversion, into a variable.

Keep in mind that if the user doesn't actually enter an integer then this code will throw an exception, even if the entered string is a floating point number.
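
A minimal sketch of one way around this, assuming you want whole numbers to stay integers and everything else to fall back to float(). The helper name to_number is invented for this example.

```python
def to_number(text):
    """Return an int when possible, otherwise a float; raise ValueError if neither fits."""
    try:
        return int(text)
    except ValueError:
        # int("3.5") fails, but float("3.5") succeeds
        return float(text)


print(to_number("13"))
print(to_number("3.5"))
```

Input such as "Three" still raises ValueError, this time from float(), so the caller needs to handle that case.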

Input Exception Handling

There are several ways to ensure that the user enters valid information. One of the ways is to handle all the possible errors that may occur while the user enters the data.

In this section we'll demonstrate some good methods of error handling when taking input.

But first, here is some unsafe code:

test2word = input("Tell me your age: ")  
test2num = int(test2word)  
print("Wow! Your age is ", test2num)  

When running this code, let's say you enter the following:

Tell me your age: Three  

Here, when the int() function is called with the "Three" string, a ValueError exception is thrown and the program will stop and/or crash.

Now let's see how we would make this code safer to handle user input:

test3word = input("Tell me your lucky number: ")

try:  
    test3num = int(test3word)
    print("This is a valid number! Your lucky number is: ", test3num)
except ValueError:  
    print("This is not a valid number. It isn't a number at all! This is a string, go and try again. Better luck next time!")

This code block will evaluate the new input. If the input is an integer represented as a string then the int() function will convert it into a proper integer. If not, an exception will be raised, but instead of crashing the application it will be caught and the second print statement is run.

Here is an example of this code running when an exception is raised:

Tell me your lucky number: Seven  
This is not a valid number. It isn't a number at all! This is a string, go and try again. Better luck next time!  

This is how input-related errors can be handled in Python. You can combine this code with another construct, like a while loop to ensure that the code is repeatedly run until you receive the valid integer input that your program requires.

A Complete Example

# Makes a function that will contain the
# desired program.
def example():
    test4word = input("What's your name? ")

    # Starts an infinite loop that keeps executing
    # until a valid integer is entered
    while True:
        try:
            test4num = int(input("From 1 to 7, how many hours do you play on your mobile? "))

        # If something else that is not the string
        # version of a number is introduced, the
        # ValueError exception will be raised.
        except ValueError:
            # The cycle will go on until validation
            print("Error! This is not a number. Try again.")

        # When successfully converted to an integer,
        # the loop will end.
        else:
            print(f"Impressive, {test4word}! You spent {test4num * 60} minutes or {test4num * 60 * 60} seconds on your mobile!")
            break

# The function is called
example()

The output will be:

What's your name? Francis
From 1 to 7, how many hours do you play on your mobile? 3
Impressive, Francis! You spent 180 minutes or 10800 seconds on your mobile!
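
The prompt asks for a number from 1 to 7, but the example accepts any integer. A minimal sketch of adding a range check follows. The helper name ask_int and its read parameter are assumptions made for this illustration; read defaults to input() and is replaced here with scripted replies so the loop can be run without a keyboard.

```python
def ask_int(prompt, low, high, read=input):
    """Keep asking until the reply is an integer between low and high (inclusive)."""
    while True:
        reply = read(prompt)
        try:
            number = int(reply)
        except ValueError:
            print("Error! This is not a number. Try again.")
            continue
        if low <= number <= high:
            return number
        print(f"Error! Please enter a number from {low} to {high}.")


# scripted replies stand in for a person typing at the keyboard
replies = iter(["three", "9", "3"])
hours = ask_int("From 1 to 7, how many hours? ", 1, 7, read=lambda prompt: next(replies))
print(hours * 60, "minutes")
```

The first two replies are rejected (one is not a number, the other is out of range), and the loop ends on the third. In a real program you would call ask_int without the read argument.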

Conclusion

In this article, we saw how the built-in Python utilities can be used to get user input in a variety of formats. We also saw how we can handle the exceptions and errors that can possibly occur while obtaining user input.

Catalin George Festila: The ebooklib python module.

Happy new year 2018!
The official webpage of this python module comes with this intro:
EbookLib is a Python library for managing EPUB2/EPUB3 and Kindle files. It's capable of reading and writing EPUB files programmatically (Kindle support is under development).
First, install this Python module, named ebooklib:
C:\>cd Python27

C:\Python27>cd Script
The system cannot find the path specified.

C:\Python27>cd Scripts

C:\Python27\Scripts>pip install ebooklib
Collecting ebooklib
Downloading EbookLib-0.16.tar.gz
Requirement already satisfied: lxml in c:\python27\lib\site-packages (from ebooklib)
Requirement already satisfied: six in c:\python27\lib\site-packages (from ebooklib)
Installing collected packages: ebooklib
Running setup.py install for ebooklib ... done
Successfully installed ebooklib-0.16
If you don't see the Scripts folder in your Python27 folder, you need to install the pip tool.
Just download the get-pip.py script into your Python27 folder and run it with python:
C:\Python27>python.exe get-pip.py
The next step is to test a simple example:
from ebooklib import epub

book = epub.EpubBook()

# set metadata
book.set_identifier('id123456')
book.set_title('Sample book')
book.set_language('en')

book.add_author('Author Python')
book.add_author('catafest', file_as='', role='writer', uid='author')

# create chapter
c1 = epub.EpubHtml(title='Intro', file_name='chap_01.xhtml', lang='hr')
c1.content = u'<h1>Intro heading.</h1><p>Python is an interpreted high-level programming language ...</p>'

# add chapter
book.add_item(c1)

# define Table Of Contents
book.toc = (epub.Link('chap_01.xhtml', 'Introduction', 'intro'),
            (epub.Section('Simple book'),
             (c1, ))
            )

# add default NCX and Nav file
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())

# define CSS style
style = 'BODY {color: white;}'
nav_css = epub.EpubItem(uid="style_nav", file_name="style/nav.css", media_type="text/css", content=style)

# add CSS file
book.add_item(nav_css)

# basic spine
book.spine = ['nav', c1]

# write to the file
epub.write_epub('test.epub', book, {})
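An EPUB file is a ZIP archive, so the result can be inspected with nothing but the standard library. The sketch below uses Python 3's zipfile module rather than ebooklib; the two helper names are made up for this illustration, and the check relies on the "mimetype" entry that EPUB containers carry.

```python
import zipfile


def list_epub_entries(path):
    """Return the names of the files stored inside an EPUB (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def looks_like_epub(path):
    """Check for the 'mimetype' entry that an EPUB container is expected to carry."""
    with zipfile.ZipFile(path) as archive:
        if 'mimetype' not in archive.namelist():
            return False
        return archive.read('mimetype').strip() == b'application/epub+zip'
```

After running the script above, list_epub_entries('test.epub') should include the chapter file chap_01.xhtml and the style/nav.css item among its entries, although the exact folder layout depends on the ebooklib version.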
You can improve your EPUB book further with HTML5 tags.
I used this example with headings and a paragraph to change the text.

Catalin George Festila: Python 2.7 : InsecurePlatformWarning error.

This is not a common error, and it can be solved as easily as any other Python issue.
The error looks like the following example:
c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:318: 
SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension
to TLS is not available on this platform. This may cause the server to present an incorrect TLS
certificate, which can cause validation failures. You can upgrade to a newer version of Python to
solve this. For more information, see
https://urllib3.readthedocs.io/en/latest/security.html#snimissingwarning.
SNIMissingWarning
c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:122:
InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from
configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade
to a newer version of Python to solve this. For more information, see
https://urllib3.readthedocs.io/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
A simple way to deal with this Python error is to install these Python modules:
pip install urllib3 
pip install requests
The last Python module, named requests, comes with these packages:
Successfully installed certifi-2017.11.5 chardet-3.0.4 idna-2.6 requests-2.18.4
What is this Python module named requests?
As a security measure, the requests module injects pyOpenSSL into urllib3 when pyOpenSSL is available.
C:\Python27>python
Python 2.7 (r27:82525, Jul 4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> help()

Welcome to Python 2.7! This is the online help utility.

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules. To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, or topics, type "modules",
"keywords", or "topics". Each module also comes with a one-line summary
of what it does; to list the modules whose summaries contain a given word
such as "spam", type "modules spam".

help> modules requests

Here is a list of matching modules. Enter any module name to get more help.

pip._vendor.cachecontrol.controller - The httplib2 algorithms ported for use with requests.
pip._vendor.requests - Requests HTTP library
pip._vendor.requests.adapters - requests.adapters
pip._vendor.requests.api - requests.api
pip._vendor.requests.auth - requests.auth
pip._vendor.requests.certs - requests.certs
pip._vendor.requests.compat - requests.compat
pip._vendor.requests.cookies - requests.cookies
pip._vendor.requests.exceptions - requests.exceptions
pip._vendor.requests.hooks - requests.hooks
pip._vendor.requests.models - requests.models
pip._vendor.requests.packages
pip._vendor.requests.packages.chardet
pip._vendor.requests.packages.chardet.big5freq
pip._vendor.requests.packages.chardet.big5prober
pip._vendor.requests.packages.chardet.chardetect - Script which takes one or more file paths
and reports on their detected
pip._vendor.requests.packages.chardet.chardistribution
pip._vendor.requests.packages.chardet.charsetgroupprober
pip._vendor.requests.packages.chardet.charsetprober
pip._vendor.requests.packages.chardet.codingstatemachine
pip._vendor.requests.packages.chardet.compat
pip._vendor.requests.packages.chardet.constants
pip._vendor.requests.packages.chardet.cp949prober
pip._vendor.requests.packages.chardet.escprober
pip._vendor.requests.packages.chardet.escsm
pip._vendor.requests.packages.chardet.eucjpprober
pip._vendor.requests.packages.chardet.euckrfreq
pip._vendor.requests.packages.chardet.euckrprober
pip._vendor.requests.packages.chardet.euctwfreq
pip._vendor.requests.packages.chardet.euctwprober
pip._vendor.requests.packages.chardet.gb2312freq
pip._vendor.requests.packages.chardet.gb2312prober
pip._vendor.requests.packages.chardet.hebrewprober
pip._vendor.requests.packages.chardet.jisfreq
pip._vendor.requests.packages.chardet.jpcntx
pip._vendor.requests.packages.chardet.langbulgarianmodel
pip._vendor.requests.packages.chardet.langcyrillicmodel
pip._vendor.requests.packages.chardet.langgreekmodel
pip._vendor.requests.packages.chardet.langhebrewmodel
pip._vendor.requests.packages.chardet.langhungarianmodel
pip._vendor.requests.packages.chardet.langthaimodel
pip._vendor.requests.packages.chardet.latin1prober
pip._vendor.requests.packages.chardet.mbcharsetprober
pip._vendor.requests.packages.chardet.mbcsgroupprober
pip._vendor.requests.packages.chardet.mbcssm
pip._vendor.requests.packages.chardet.sbcharsetprober
pip._vendor.requests.packages.chardet.sbcsgroupprober
pip._vendor.requests.packages.chardet.sjisprober
pip._vendor.requests.packages.chardet.universaldetector
pip._vendor.requests.packages.chardet.utf8prober
pip._vendor.requests.packages.urllib3 - urllib3 - Thread-safe connection pooling and re-using.
pip._vendor.requests.packages.urllib3._collections
pip._vendor.requests.packages.urllib3.connection
pip._vendor.requests.packages.urllib3.connectionpool
pip._vendor.requests.packages.urllib3.contrib
pip._vendor.requests.packages.urllib3.contrib.appengine
pip._vendor.requests.packages.urllib3.contrib.ntlmpool - NTLM authenticating pool,
contributed by erikcederstran
pip._vendor.requests.packages.urllib3.contrib.pyopenssl
pip._vendor.requests.packages.urllib3.contrib.socks - SOCKS support for urllib3
pip._vendor.requests.packages.urllib3.exceptions
pip._vendor.requests.packages.urllib3.fields
pip._vendor.requests.packages.urllib3.filepost
pip._vendor.requests.packages.urllib3.packages
pip._vendor.requests.packages.urllib3.packages.ordered_dict
pip._vendor.requests.packages.urllib3.packages.six - Utilities for writing code that runs on
Python 2 and 3
pip._vendor.requests.packages.urllib3.packages.ssl_match_hostname
pip._vendor.requests.packages.urllib3.packages.ssl_match_hostname._implementation - The match_hostname()
function from Python 3.3.3, essential when using SSL.
pip._vendor.requests.packages.urllib3.poolmanager
pip._vendor.requests.packages.urllib3.request
pip._vendor.requests.packages.urllib3.response
pip._vendor.requests.packages.urllib3.util
pip._vendor.requests.packages.urllib3.util.connection
pip._vendor.requests.packages.urllib3.util.request
pip._vendor.requests.packages.urllib3.util.response
pip._vendor.requests.packages.urllib3.util.retry
pip._vendor.requests.packages.urllib3.util.ssl_
pip._vendor.requests.packages.urllib3.util.timeout
pip._vendor.requests.packages.urllib3.util.url
pip._vendor.requests.sessions - requests.session
pip._vendor.requests.status_codes
pip._vendor.requests.structures - requests.structures
pip._vendor.requests.utils - requests.utils
requests - Requests HTTP Library
requests.__version__
requests._internal_utils - requests._internal_utils
requests.adapters - requests.adapters
requests.api - requests.api
requests.auth - requests.auth
requests.certs - requests.certs
requests.compat - requests.compat
requests.cookies - requests.cookies
requests.exceptions - requests.exceptions
requests.help - Module containing bug report helper(s).
requests.hooks - requests.hooks
requests.models - requests.models
requests.packages
requests.sessions - requests.session
requests.status_codes
requests.structures - requests.structures
requests.utils - requests.utils
help>
You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)". Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.
>>>
...
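The two warnings shown earlier refer to two capabilities of the standard ssl module: SNI support and a real SSLContext class. A minimal sketch for checking them on your own interpreter follows. It assumes that urllib3 bases its warnings on these same attributes, and it uses getattr with a default because ssl.HAS_SNI does not exist on old Python 2.7 builds.

```python
import ssl

# SNI support flag of the ssl module (absent on old Python 2.7 builds).
has_sni = getattr(ssl, 'HAS_SNI', False)
# A real SSLContext class exists in Python 2.7.9+ and in Python 3.
has_ssl_context = hasattr(ssl, 'SSLContext')

print('OpenSSL version: {0}'.format(ssl.OPENSSL_VERSION))
print('SNI available: {0}'.format(has_sni))
print('SSLContext available: {0}'.format(has_ssl_context))
```

If either value is False, upgrading Python, as the warning text suggests, is the most direct fix.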

Catalin George Festila: Python 2.7 : Python and BigQuery service object.

Here's another tutorial about Python and Google. I thought it would be useful for the beginning of 2018.
The Google team tells us:

What is BigQuery?

Storing and querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.


This tutorial closely follows the steps from here.
First of all, you must create an authentication file by creating a service account key in your Google project.
Go to the Google Console and navigate to the Create service account key page.
From the Service account drop-down, select New service account.
Enter a name in the form field.
From the Role drop-down, select Project and then Owner.
The result is a JSON file (used for authenticating with Google). Download it, rename it, and put it into your project folder.
As in the next image:

Now, select Library from the left area to add the BigQuery API; try this link.
Search for BigQuery API and then use the button ENABLE to use it.
The next step is to install these python modules: pyopenssl and google-cloud-bigquery.
C:\Python27\Scripts>pip install -U pyopenssl
C:\Python27\Scripts>pip install --upgrade google-cloud-bigquery
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable in Windows to this JSON file from my test folder:
set GOOGLE_APPLICATION_CREDENTIALS=C:\test\python_doc.json
Because my JSON file is named python_doc.json, this is the name I will use in my Python script.
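Before running a query, it can help to confirm that the environment variable really points to a usable key file. The sketch below uses only the standard library. The function name missing_credential_keys and the list of expected fields are assumptions made for illustration, based on the fields a service account key file normally contains.

```python
import json
import os

# Fields a Google service account key file is expected to contain (assumed list).
EXPECTED_KEYS = ('type', 'project_id', 'private_key', 'client_email')


def missing_credential_keys(path):
    """Return the expected keys that are absent from the JSON key file."""
    with open(path) as handle:
        data = json.load(handle)
    return [key for key in EXPECTED_KEYS if key not in data]


path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
if path and os.path.isfile(path):
    print('Missing keys: {0}'.format(missing_credential_keys(path)))
else:
    print('GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file')
```

An empty list means the file has the expected shape; it does not prove that the key is still valid on the Google side.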
Let's see the script:
from google.cloud import bigquery


def query_shakespeare():
    # Build the client from the service account key file.
    client = bigquery.Client.from_service_account_json('python_doc.json')
    query_job = client.query("""
        #standardSQL
        SELECT corpus AS title, COUNT(*) AS unique_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY title
        ORDER BY unique_words DESC
        LIMIT 10""")

    results = query_job.result()  # Waits for job to complete.

    for row in results:
        print("{}: {}".format(row.title, row.unique_words))


if __name__ == '__main__':
    query_shakespeare()
The result is:
C:\Python27>python.exe goo_test_bquerry.py
hamlet: 5318
kinghenryv: 5104
cymbeline: 4875
troilusandcressida: 4795
kinglear: 4784
kingrichardiii: 4713
2kinghenryvi: 4683
coriolanus: 4653
2kinghenryiv: 4605
antonyandcleopatra: 4582
NOTE: Take care of the JSON file because it gives access to your Google account, and try to restrict its permissions according to the application's requirements.


Catalin George Festila: The trinket website for learning.

This website comes with these features:

  • Just create your free account, then use the web interface to play with the turtle Python module.
  • Trinket lets you run and write code in any browser, on any device.
  • Trinkets work instantly, with no need to log in, download plugins, or install software.
  • Easily share or embed the code with your changes when you're done.

Catalin George Festila: The collections python module .

This module named collections implements some nice data structures which will help you to solve various real-life problems.
Let's start by looking at the contents of this Python module:
C:\Users\catafest>python

C:\Users\catafest>cd C:\Python27\

C:\Python27>python
Python 2.7 (r27:82525, Jul 4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> from collections import *
>>> dir(collections)
['Callable', 'Container', 'Counter', 'Hashable', 'ItemsView', 'Iterable', 'Iterator', 'KeysView',
'Mapping', 'MappingView', 'MutableMapping', 'MutableSequence', 'MutableSet', 'OrderedDict', 'Sequence',
'Set', 'Sized', 'ValuesView', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__'
, '_abcoll', '_chain', '_eq', '_heapq', '_ifilter', '_imap', '_iskeyword', '_itemgetter', '_repeat',
'_starmap', '_sys', 'defaultdict', 'deque', 'namedtuple']
Now I will tell you about some of them.
First is Counter, a dict subclass that helps to count hashable objects.
The elements are stored as dictionary keys and the counts are stored as values, which can be zero or negative.
Next is defaultdict, a dict subclass that provides all the methods of a regular dictionary.
Its first argument (default_factory) is a callable that supplies the default value for missing keys.
The namedtuple gives a name to each position in a tuple.
This allows us to write more readable, self-documenting code.
Let's try some examples:
>>> from collections import Counter
>>> from collections import defaultdict
>>> from collections import namedtuple
>>> import re
>>> path = 'C:/yara_reg_rundll32.txt'
>>> output = re.findall('\w+', open(path).read().lower())
>>> Counter(output).most_common(5)
[('a', 2), ('nocase', 2), ('javascript', 2), ('b', 2), ('rundll32', 2)]
>>>
>>> d = defaultdict(list)
>>> colors = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
>>> for k, v in colors:
...     d[k].append(v)
...
>>> d.items()
[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
>>>
>>> Vertex = namedtuple('vertex', ['x', 'y'])
>>> v = Vertex(5,y = 9)
>>> v
vertex(x=5, y=9)
>>> v.x*v.y
45
>>> v[0]
5
>>> v[0]+v[1]
14
>>> x,y = v
>>> v
vertex(x=5, y=9)
>>> x
5
>>> y
9
>>>
The content of the yara_reg_rundll32.txt file is:
rule poweliks_rundll32_exe_javascript
{
meta:
description = "detect Poweliks' autorun rundll32.exe javascript:..."
string:
$a = "rundll32.exe" nocase
$b = "javascript" nocase
condition:
$a and $b
}

I used a vertex in my example because it can be used with Blender 3D.
You can see many more examples on the official documentation website.
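The behaviours described earlier (Counter values dropping to zero or below, defaultdict calling default_factory for a missing key, and namedtuple access by name or position) can be sketched in a few lines. The sample strings and the names counts, tally and moved are placeholders chosen for this illustration.

```python
from collections import Counter, defaultdict, namedtuple

# Counter: counts can drop to zero or below after subtract().
counts = Counter('abracadabra')   # a=5, b=2, r=2, c=1, d=1
counts.subtract('aaaaaa')         # a becomes 5 - 6 = -1
print(counts['a'])                # -1
print(counts['z'])                # 0: missing keys report zero without being stored

# defaultdict: default_factory (here int) creates the value for a missing key.
tally = defaultdict(int)
for word in ['red', 'blue', 'red']:
    tally[word] += 1
print(dict(tally))

# namedtuple: fields are reachable by name or by position.
Vertex = namedtuple('Vertex', ['x', 'y'])
v = Vertex(5, y=9)
moved = v._replace(x=7)           # tuples are immutable, so this returns a new one
print(moved.x, moved[1])
```

Using int as the factory makes defaultdict behave like a simple counter, while list (as in the session above) groups values under each key.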





Catalin George Festila: Use IMDB website with IMDbPY python module .

This Python package is written in pure Python 3 and lets you access and use data from the IMDb database.

You can read about this Python module on its GitHub docs webpage.
The development team comes with this DISCLAIMER:
# DISCLAIMER

IMDbPY and the authors are not affiliated with Internet Movie Database Inc.

IMDb is a trademark of Internet Movie Database Inc. and all contents
and data included on the IMDb's site is the property of IMDb or its
content suppliers and protected by United States and international
copyright laws.

Please, read the IMDb's conditions of use in their website:
- http://www.imdb.com/conditions
- http://www.imdb.com/licensing
- any other notice in the http://www.imdb.com/ site.

First I start the install process with the pip tool:
C:\Python364\Scripts>pip install IMDbPY
Requirement already satisfied: IMDbPY in c:\python364\lib\site-packages
Requirement already satisfied: lxml in c:\python364\lib\site-packages (from IMDbPY)
Requirement already satisfied: sqlalchemy-migrate in c:\python364\lib\site-packages (from IMDbPY)
Requirement already satisfied: SQLAlchemy in c:\python364\lib\site-packages (from IMDbPY)
Requirement already satisfied: pbr>=1.8 in c:\python364\lib\site-packages (from sqlalchemy-migrate->IMDbPY)
Requirement already satisfied: decorator in c:\python364\lib\site-packages (from sqlalchemy-migrate->IMDbPY)
Requirement already satisfied: six>=1.7.0 in c:\python364\lib\site-packages (from sqlalchemy-migrate->IMDbPY)
Requirement already satisfied: sqlparse in c:\python364\lib\site-packages (from sqlalchemy-migrate->IMDbPY)
Requirement already satisfied: Tempita>=0.4 in c:\python364\lib\site-packages (from sqlalchemy-migrate->IMDbPY)
This is the source code I used to test it, and it works well.
# start with IMDb python class
from imdb import IMDb
imd = IMDb('http')
print("-===-")
# search movies by title
# and show the long imdb canonical title and movieID of the results.
title = imd.search_movie("Under the Dome")
for item in title:
    print(item['long imdb canonical title'], item.movieID)
print("-===-")
# search for a person
for person in imd.search_person("Ana de Armas"):
    print(person.personID, person['name'])
print("-===-")
# get 5 movies tagged with a keyword
movies_keyword = imd.get_keyword('novel', results=5)
for item in movies_keyword:
    print(item['long imdb canonical title'], item.movieID)
print("-===-")
# get top 250 from top movies
top250 = imd.get_top250_movies()
print("top 250 -=> ")
for item in top250:
    print(item['long imdb canonical title'], item.movieID)
print("-===-")
# get bottom 100 movies
bottom100 = imd.get_bottom100_movies()
print("bottom 100 -=> ")
for item in bottom100:
    print(item['long imdb canonical title'], item.movieID)

Catalin George Festila: News: The Spyder IDE - new release .

Many python users use the Spyder IDE.
This IDE comes with many features and is easy to use; see the Wikipedia page:
Spyder (formerly Pydee) is an open-source cross-platform integrated development environment (IDE) for scientific programming in the Python language. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software. It is released under the MIT license.
Six days ago, a release of this IDE with version 3.2.7 was announced.
This IDE can be downloaded from its GitHub page.

Catalin George Festila: The regex online tool for python and any programming languages.

Today I tested this online tool.
It is a tool for testing regular expressions (regex or regexp for short) in several programming languages.
These programming languages are PHP, JavaScript, Go and Python.
The tool is easy to use.
First, you need to select the programming language that is used for the regular expression.
The next step is to put the regular expression into the edit box and add your text to be parsed by this regular expression.
For example, if you use this input for a regular expression:
([a-zA-Z]+) \d+
and this text example:
March 7 1976, June 1, August 9, Dec 25
the result output will be this:
March , June , August , Dec
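The same pattern can be tried locally with Python's re module. This is a small sketch, not the online tool's own code; it lists the captured group (the word before each number), so its output format differs slightly from the line shown by the tool.

```python
import re

pattern = re.compile(r'([a-zA-Z]+) \d+')
text = 'March 7 1976, June 1, August 9, Dec 25'

# findall returns the captured group for every match in the text.
words = pattern.findall(text)
print(words)  # ['March', 'June', 'August', 'Dec']
```

Note that 1976 is not matched because it is preceded by a digit and a space, not by a word.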