Python Software Foundation: PyPI Security Q4 2019 Request for Information period opens.

August 28, 2019, 10:23 am

The Python Software Foundation Packaging Working Group has received funding from Facebook Research to develop and deploy enhanced security features for PyPI.

PyPI is a foundational component of the Python ecosystem and broader computer software and technology landscape. This project aims to improve the security and accessibility of PyPI for all users worldwide, whether they are direct users like project maintainers and pip installers or indirect users. The impact of this work will be highly visible and improve crucial features of the service.

Specifically, this project aims to implement verifiable cryptographic signing of artifacts and infrastructure to support automated detection of malicious uploads to the index.

We plan to begin the project in December 2019. Because of the size of the project, funding has been allocated to secure one or more contractors to complete the development, testing, and verification, and to assist in the rollout of the necessary features.

To receive notification when our Request for Information period closes and the Request for Proposals period opens, please register your interest here.

What is the Request for Information period?

A Request for Information (RFI) is a process intended to allow us (The Python Software Foundation) and potential contractors to openly share information to improve the scope and definition of the project at hand. Also, we encourage stakeholders in the community with expertise in the project areas to contribute their viewpoints on open questions for the scope of the work.

We hope that it will help potential contractors better understand the work to be completed and develop better-specified proposals. Additionally, we have designed the RFI to be open in order to expose the project to multiple perspectives and help shape the direction of some choices in the project.

The Request for Information period opens today, August 28, 2019, and is scheduled to close September 18, 2019.

After the RFI period closes, we will use the results of the process to prepare and open a Request for Proposals to solicit proposals from contractors to complete the work.

More Information

The full version of our Request for Information document can be found here.

Participate!

Our RFI will be conducted on the Python Community Discussion Forum. Participants will need to create an account in order to propose new topics of discussion or respond to existing topics.

All discussions will remain public and available for review by potential proposal authors who do not wish to or cannot create an account to participate directly.

↧

Talk Python to Me: #227 Maintainable data science: Tips for non-developers

August 28, 2019, 1:00 am

Did you come to software development outside of traditional computer science? This is common, and even how I got into programming myself. I think it's especially true for data science and scientific computing. That's why I'm thrilled to bring you an episode with Daniel Chen about maintainable data science tips and techniques.

↧

IslandT: Find the maximum value within a string with Python

August 28, 2019, 8:33 pm

In this chapter we are going to solve the problem in the title with a Python function. Given a string that consists of words and numbers, we will extract the numbers embedded in those words, compare them, and return the largest one.

These are the steps we need to follow.

Turn the string into a list of characters.
Build a new string in which every non-digit character is replaced by a space, so that only the numbers remain, separated by spaces.
Split that string, convert the pieces that are numbers to integers, and return the largest one.

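def solve(s):
    # Replace every non-digit character with a space so that runs of
    # digits stay together ("str" is avoided as a name: it is a built-in).
    digits_only = ''
    for ch in s:
        if ch.isdigit():
            digits_only += ch
        else:
            digits_only += ' '
    # split() with no argument drops the empty pieces between spaces.
    numbers = [int(x) for x in digits_only.split()]
    # max() raises ValueError if the string contains no digits at all.
    return max(numbers)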

The built-in max function returns the largest number in the list.
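As a point of comparison, the same steps can be sketched with a regular expression. This is an alternative under the same assumptions as the function above (non-negative integers only), not the approach used in the post; the name max_number and the sample strings are made up for illustration.

```python
import re


def max_number(s):
    # Every run of consecutive digits is one number, e.g. "gh12cii3" -> ["12", "3"].
    numbers = [int(n) for n in re.findall(r"\d+", s)]
    # Like the loop-based version, this raises ValueError when s has no digits.
    return max(numbers)


print(max_number("gh12cii3kk105"))
```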

↧

Ned Batchelder: Don’t omit tests from coverage

August 29, 2019, 3:43 am

There’s a common idea out there that I want to refute. It’s this: when measuring coverage, you should omit your tests from measurement. Searching GitHub shows that lots of people do this.

This is a bad idea. Your tests are real code, and the whole point of coverage is to give you information about your code. Why wouldn’t you want that information about your tests?

You might say, “but all my tests run all their code, so it’s useless information.” Consider this scenario: you have three tests written, and you need a fourth, similar to the third. You copy/paste the third test, tweak the details, and now you have four tests. Except oops, you forgot to change the name of the test.

Tests are weird: you have to name them, but the names don’t matter. Nothing calls the name directly. It’s really easy to end up with two same-named tests. Which means you only have one test, because the new one overwrites the old. Coverage would alert you to the problem.
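A small sketch of that copy/paste mistake, using the standard unittest module. The class and test names here are invented for illustration; the point is only that the second definition silently replaces the first, so the loader sees a single test.

```python
import unittest


class TestMath(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)

    # Copy/pasted from above, but the name was never changed:
    # this definition replaces the first one in the class body.
    def test_add(self):
        self.assertEqual(2 + 2, 4)


names = unittest.TestLoader().getTestCaseNames(TestMath)
print(names)
```

With the tests included in coverage measurement, the body of the first test_add would be reported as never executed, which is the hint that something is wrong.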

Also, if your test suite is large, you likely have helper code in there as well as straight-up tests. Are you sure you need all that helper code? If you run coverage on the tests (and the helpers), you’d know about some weird clause in there that is never used. That’s odd, why is that? It’s probably useful to know. Maybe it’s a case you no longer need to consider. Maybe your tests aren’t exercising everything you thought.

The only argument against running coverage on tests is that it “artificially” inflates the results. True, it’s much easier to get 100% coverage on a test file than a product file. But so what? Your coverage goal was chosen arbitrarily anyway. Instead of aiming for 90% coverage, you should include your tests and aim for 95% coverage. 90% doesn’t have a magical meaning.

What’s the downside of including tests in coverage? “People will write more tests as a way to get the easy coverage.” Sounds good to me. If your developers are trying to game the stats, they’ll find a way, and you have bigger problems.

True, it makes the reports larger, but if your tests are 100% covered, you can exclude those files from the report with the [report] skip_covered setting.

Your tests are important. You’ve put significant work into them. You want to know everything you can about them. Coverage can help. Don’t omit tests from coverage.

↧

py.CheckIO: New Python on CheckiO

August 28, 2019, 5:44 pm

↧

Thibauld Nion: 7 years of Django in 7-ish days

August 29, 2019, 1:24 pm

Spring was quite an "interesting time" for my personal project: WaterOnMars.

Indeed, I started to work on adding a new feature (a first in a while, and maybe the topic of another post), but each time I pushed or deployed code I suddenly got back warnings that were unrelated to my changes and pointed at core components like, err... the Python or Django versions being deprecated.

So kudos to the Python and GitHub developers for making clever use of warnings and, yes, I admit that using Python 2.7 (ending its life in 2020) and Django 1.4 (published 7 years ago) in 2019 is lame.

So... migrations!

↧

Continuum Analytics Blog: Canaries Can Tweet: Preview New Features with Conda Canary

August 29, 2019, 1:33 pm

Conda-canary is the pre-defaults-release channel for conda — it has the most recent version of conda. On occasion it will also have the latest pre-defaults-release of conda-build and other conda dependencies such as ruamel.yaml. Normally,…

The post Canaries Can Tweet: Preview New Features with Conda Canary appeared first on Anaconda.

↧

Python Insider: Python 3.8.0b4 is now available for testing

August 29, 2019, 9:42 pm

It's time for the last beta release of Python 3.8. Go find it at:
https://www.python.org/downloads/release/python-380b4/

This release is the last of four planned beta release previews. Beta release previews are intended to give the wider community the opportunity to test new features and bug fixes and to prepare their projects to support the new feature release. The next pre-release of Python 3.8 will be 3.8.0c1, the first release candidate, currently scheduled for 2019-09-30.

Call to action

We strongly encourage maintainers of third-party Python projects to test with 3.8 during the beta phase and report issues found to the Python bug tracker as soon as possible. Please note that this is the last beta release, so there is not much time left to identify and fix issues before the release of 3.8.0. If you were hesitating to try it out before, now is the time.
While the release is planned to be feature complete entering the beta phase, it is possible that features may be modified or, in rare cases, deleted up until the start of the release candidate phase (2019-09-30). Our goal is to have no ABI changes after beta 3 and no code changes after 3.8.0c1, the release candidate. To achieve that, it will be extremely important to get as much exposure for 3.8 as possible during the beta phase.

Please keep in mind that this is a preview release and its use is not recommended for production environments.

Acknowledgments

Many developers worked hard for the past four weeks to squash remaining bugs, some requiring non-obvious decisions. Many thanks to the most active, namely Raymond Hettinger, Steve Dower, Victor Stinner, Terry Jan Reedy, Serhiy Storchaka, Pablo Galindo Salgado, Tal Einat, Zackery Spytz, Ronald Oussoren, Neil Schemenauer, Inada Naoki, Christian Heimes, and Andrew Svetlov.

3.8.0 would not reach the Last Beta without you. Thank you!

↧

PyCharm: PyCharm 2019.2.2 Preview

August 30, 2019, 7:08 am

PyCharm 2019.2.2 Preview is now available!

Fixed in this Version

Some code insight fixes were implemented for Python 3.8:
- The “continue” statement is now allowed inside “finally” clauses.
- Support for unicode characters in the re module was added.
An error on the Python Console that was not showing documentation for functions was resolved.
Some issues were solved for IPython that were causing the debugger not to work properly.
Some debugger regressions were solved: breakpoints were being ignored and/or throwing exceptions, and the data viewer was not showing the proper information.
A problem that caused PyCharm to stall when a Docker server was configured as remote python interpreter was fixed.
Jupyter Notebooks got some fixes: kernel specification selection is now based on the Python version for the module where a new notebook is created and in case the kernel specification is missing from the metadata a proper error message will be shown.
An issue that prevented one remote interpreter from being used from two different machines was solved as well.
And many more fixes, see the release notes for more information.

Getting the New Version

Download the Preview from Confluence.

↧

Python Bytes: #145 The Python 3 “Y2K” problem

August 30, 2019, 9:46 pm

↧

IslandT: Combine two strings with Python method

August 30, 2019, 9:55 pm

In this example, we are going to create a function which will do the following:

Extract the unique characters from two strings and group them into two separate lists.
Create a new list made up of the characters in those two lists. Each character must appear only once in the list, and only lowercase a-z characters are allowed.

Below is the solution.

Create two lists of non-repeated characters from the two given strings.
Loop through all the lowercase characters (from a to z) and, if a character appears in either of those two lists, append it to a new character list.
Turn that new list into a string and return it.

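import string


def longest(s1, s2):
    # Sets drop the repeated characters from each string.
    s1 = list(set(s1))
    s2 = list(set(s2))
    s3 = []

    # Walking a-z in order keeps the result sorted and skips everything else.
    for character in string.ascii_lowercase:
        if character in s1 or character in s2:
            s3.append(character)
    return ''.join(s3)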

We use the string module's ascii_lowercase constant (a string containing the letters a to z) to save typing out the lowercase letters ourselves.
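The same result can also be sketched with set operations instead of an explicit loop. This is an alternative formulation, not the post's solution; the function name combine and the sample inputs are chosen only for illustration.

```python
import string


def combine(s1, s2):
    # Union of the characters of both strings, restricted to a-z.
    allowed = set(string.ascii_lowercase)
    letters = (set(s1) | set(s2)) & allowed
    # sorted() puts the surviving letters back into alphabetical order.
    return ''.join(sorted(letters))


print(combine("xyaabbbccccdefww", "xxxxyyyyabklmopq"))
```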

Homework :

Create a new string which only consists of non-repeated digits in the ascending order from two given strings. For example, s1 = “agy569” and s2 = “gyou5370” will produce s3 = “035679”. Write your solution in the comment box below this article.

Did you finish the homework all on your own?

↧

Kushal Das: Announcing lymworkbook project

August 31, 2019, 7:17 am

In 2017, I started working on a new book to teach the Linux command line in our online summer training. The goal was to have the basics covered in the book, and at the same time not to try to explain things which can be learned better via man pages (yes, we encourage people to read man pages).

Where to practice

This one question always came up. Many times, students managed to destroy their systems by doing random things, and rm -rf is always one of the commands involved.

Introducing lymworkbook

Now, the book has a new chapter, LYM Workbook, where the reader can set up VMs in the local machine via Vagrant, and go through a series of problems in those machines. One can then verify if the solution they worked on is correct or not. For example:

sudo lymsetup copypaste
sudo lymverify copypaste

We are starting with only a few problems, but I (and a group of volunteers) will slowly add many more. We will also increase the complexity by increasing the number of machines and setting up more difficult systems. This will include basic system administration tasks.

How can you help

Have a look at the issues, and feel free to pick up any open issue or create issues with problems which you think are good to learn from. Things can be as easy as rsyncing a directory to another system, or setting up Tor and using it as a system proxy.

Just adding one problem as an issue is also a big help, so please spend 5 minutes of your free time, and add any problem you like.

↧

PyCon: PyCon 2020 Conference Site is here!

August 31, 2019, 6:04 am

After 2 successful years in Cleveland, OH, PyCon 2020 and PyCon 2021 will be moving to Pittsburgh, PA!

Head over to us.pycon.org/2020 to check out the look for PyCon 2020.
Our bold design includes the Roberto Clemente Bridge, also known as the Sixth Street Bridge, which spans the Allegheny River in downtown Pittsburgh. The Pittsburgh Steelmark was originally created for United States Steel Corporation to promote the attributes of steel: yellow lightens your work; orange brightens your leisure; and blue widens your world. The PPG Building is a complex in downtown Pittsburgh consisting of six buildings within three city blocks and five and a half acres. Named for its anchor tenant, PPG Industries, which initiated the project for its headquarters, the buildings share a matching glass design consisting of 19,750 pieces of glass. Also included in the design are a fun snake, a terminal window, and hardware-related items.

Sponsor Opportunities

Sponsors help keep PyCon affordable and accessible to the widest possible audience. Sponsors are what make this conference possible. From low ticket prices to financial aid, to video recording, the organizations who step forward to support PyCon, in turn, support the entire Python community. They make it possible for so many to attend, for so many to be presenters, and for the people at home to watch along.

As with any sponsorship, the benefits go both ways. Organizations have many options for sponsorship packages, and they all benefit from exposure to an ever-growing audience of Python programmers, from those just getting started to 20-year veterans and every walk of life in between. If you're hiring, the Job Fair puts your organization within reach of a few thousand dedicated people who came to PyCon looking to sharpen their skills.

For more details on sponsorship opportunities go to the Sponsor Prospectus. If you are interested in becoming a PyCon sponsor go to the application form.

We look forward to sharing more news on the call for proposals, financial aid applications, registration, and more, so stay tuned! Also follow us here on the PyCon Blog and @PyCon on Twitter.

↧

Weekly Python StackOverflow Report: (cxcii) stackoverflow python report

August 31, 2019, 3:41 pm

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2019-08-31 22:41:03 GMT

↧

IslandT: Multiply according to the number of times

August 31, 2019, 10:05 pm

In this example, a function receives a count m and a root number n, and returns a list of the first m multiples of n. For example, if we call multiples(3, 5), we will receive the first three multiples of 5: [5, 10, 15].

def multiples(m, n):
    # Collect n*1, n*2, ..., n*m.
    arr = []
    for number in range(1, m+1):
        arr.append(n * number)
    return arr

As you can see, the m parameter is passed to range to decide how many multiples end up in the list.
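For readers who prefer a more compact form, the loop can be sketched as a list comprehension. This is an equivalent rewrite under the same argument order (count first, root number second), with a couple of sample calls added for illustration.

```python
def multiples(m, n):
    # range(1, m + 1) yields 1..m, so the result holds exactly m multiples of n.
    return [n * k for k in range(1, m + 1)]


print(multiples(3, 5))
print(multiples(0, 7))
```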

Your Homework :

Create a method which will only return the multiplied numbers that are not larger than the power of m for the given n number. Provide your answer in the tweet below.

Create a method which will only return the multiplied numbers that are not larger than the power of m for the given n number. Given multiples(m, n) #Python Provide answer below this tweet!
— TechLikin (@ChooWhei) September 1, 2019

If you are serious about learning Python, then homework is the best method to improve your Python skill!

↧

Tryton News: Newsletter September 2019

August 31, 2019, 11:00 pm

@ced wrote:

This month, as everyone comes back, Tryton has kept improving: for users by simplifying some usage, and for developers by providing more tools.

Contents:
Changes for users
Changes for developers

Changes For The User

The mobile contacts can now be clicked, similar to phone contacts.
The update unit price flag of taxes is now also supported on child taxes.
In addition to the country, the tax rules can now be written using the subdivisions of origin and/or destination. A child subdivision will match a rule based on an upper-level subdivision. This is useful for countries that have different tax rates for some subdivisions.
It is now possible to define default values for the customer and supplier tax rules. This can be useful to apply a local tax rule based on subdivision by default.
It is now possible to configure a sequence for the product code, which is used to fill in the code at creation time. This can be used to ensure a unique code per product, even when the product is duplicated.
The product cost price can be used in the price list. It uses the cost price of the company set in the context. This allows price lists to be built by defining a margin to apply to the cost.
It is now possible to configure the customer code of the current company on the supplier party. The code will be displayed on the request for quotation.

Changes For The Developer

We added partial support for TO_CHAR on date and datetime for SQLite databases. We support only date and datetime, which are the most useful cases, and only formats that can easily be converted into Python strftime format. So we can now use it without breaking tests on SQLite.
We added a new function on Report to format timedelta. It uses the same representation as the clients to format duration field values.
As we now keep a link between the inventory moves and the outgoing moves, we can simplify the synchronization algorithm to use this link. Another advantage is that if the product is changed on the inventory move, the outgoing move is also updated instead of a new move being created.
If you forget to set a context on your RPC calls, Tryton will raise a better error message.
Now we have a lazy_gettext method which allows the translation to be deferred by using a LazyString. It can be used as the label or help text of Fields. This is useful for base Model classes and Mixins to limit the duplication of the translation of the same string for each derived class.

↧

Codementor: Motor control PLC in Python

September 1, 2019, 8:26 am

Multi processing is very slow using Multi thread

↧

Codementor: Useful Development Tools For Beginners

September 1, 2019, 8:32 am

Quick guide for beginners to learn faster and have some useful tools/reading materials.

↧

Ed Crewe: Teaching an old Pythonista new Gopher tricks

September 1, 2019, 5:27 am

I recently got a new job where I need to write a lot of Golang, so I needed to learn it.
I figured that you don't really learn a language unless you try to write code that actually does something useful. However, having been to a recent Golang meetup where someone had come to a similar conclusion and had written a full Game Boy emulator in Go, I also figured I wanted to do something that was not quite so complex or low level, i.e. something that could hopefully be done in a week.

So I decided to take the plunge by creating an open source package that does the same job as a Python one I released many years ago called django-csvimport: a simple add-on for the Django ORM that caters for loading data to models from CSV files, with the option to generate the model code from scratch for a CSV file by checking the data fields and determining the data type for each column.

Also, doing a task where I had already solved the problems in another language would mean I could focus on how Golang might approach the problem, not the problem itself. So this post is about the practical differences between writing a Python and a Golang solution. As such it compares the languages as tools for a certain job, which I hope is complementary to the many posts that compare the languages themselves. Suffice it to say, they differ in many ways, most significantly in static vs. dynamic typing, whilst being most similar in regarding readable, consistent, simple syntax as paramount, where other languages have different priorities. Hence for both, auto-formatting code is good practice, with Go's built-in go fmt doing the job of Python's black or yapf.

So firstly, Django is one of the leading full web frameworks for Python, so what is the equivalent for Go? Gorilla, Gin, Buffalo, etc.: there are plenty of frameworks, but which is the leading one with an ORM? I tried out a couple, but reading around it became apparent that if you choose to develop a web app in Go, then the majority of devs don't use a framework at all! So already the differences between the languages were becoming apparent. Reasons? If you choose Go for creating a web app then performance may be a significant requirement, and even micro frameworks can be slower than raw code. Go is a recent language and as such has lots of web-related features built into the core already (templating, etc.), and even imports are URL based, so a web framework in Go gives you less than it does in Python.

So instead I checked out Go ORMs and decided to write an extension package for Gorm as one of the leading Go ORMs.

So, ditching the web framework / UI integration features of django-csvimport as an unnecessary extra, the problem just consists of two parts: creating ORM model definitions that create relational database tables, and parsing the CSV files to import the data to those tables.

From this high level spec. the core functional components that compose the tool that we want to rebuild in Golang are:

CLI interface to take arguments specifying source files and actions to perform
An ORM to manage vendor independent database schema creation and population
Utility to inspect data sources and determine data types
Template tool to create ORM models (metaprogramming)
CSV parser to read in CSV files - ideally capable of handling various formats and poor or inconsistent formatting - ie real CSV files!

For all of these we would hope that language-level packages are available to do the heavy lifting. Then the package can just knit them together into a CSV to relational database import utility.

So stepping through these and rating Go vs Python...

CLI framework (draw)

As a minimum, our task requires a command line utility to point to the CSV data files to be imported.

Django comes with a CLI framework in the form of management commands. For our Go CSV import, gormcsv, we just have the ORM, so we could roll our own CLI handling, but in this case that is probably not a great idea, since, like Python, Go has a dominant CLI framework: Cobra equates to Python's Click. It uses the Viper config framework, which is like Python's core configparser lib with extras. Within the gormcsv module I created the CLI command Go files as a cmd package via Cobra's autogenerate feature and used them to wrap the importcsv.go and inspectcsv.go source files in the importcsv package that do the real work.
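To make the shape of such a two-command CLI concrete, here is a rough Python sketch using the standard library's argparse rather than Click or Django management commands. The program name csvtool, the --table option and the file name are assumptions made for illustration; only the importcsv and inspectcsv command names come from the text above, and this is not gormcsv's or django-csvimport's actual interface.

```python
import argparse


def build_parser():
    # One top-level program with a sub-command per action, as Cobra and Click do.
    parser = argparse.ArgumentParser(prog="csvtool")
    commands = parser.add_subparsers(dest="command", required=True)

    importer = commands.add_parser("importcsv", help="load a CSV file into a table")
    importer.add_argument("path", help="CSV file to import")
    importer.add_argument("--table", default=None, help="hypothetical target table name")

    inspector = commands.add_parser("inspectcsv", help="guess the column types of a CSV file")
    inspector.add_argument("path", help="CSV file to inspect")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["importcsv", "people.csv", "--table", "people"])
print(args.command, args.path, args.table)
```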

ORM (draw)

Any language's leading ORMs should cope with the database management and data population tasks, and GORM is functionally similar in its capabilities to the Django ORM.

Data source introspection tool (Python win)

Messytables is a mature package designed for the task of scraping in data from various heterogeneous third-party sources, possibly of poor quality. As such it is one of the many utilities created around Python's well-established role in the data analytics realm. Go has no such tool. There is no third-party package to cater for inspecting, type checking and cleaning up data sources :-(

So we have to make our own much simpler data inspector that will hopefully cope OK with the most common data types if they are reasonably consistently formatted.
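To give a feel for what a "much simpler data inspector" might involve, here is a minimal sketch written in Python rather than Go. It is not the gormcsv or messytables implementation: the function names, the set of candidate types, the single date format and the sample data are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime


def _parses(values, parser):
    """Return True if parser accepts every value without raising ValueError."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_column_types(csv_text):
    """Guess a type name per column: 'int', 'float', 'date' or 'str'."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Candidates are tried from most to least specific; 'str' is the fallback.
    candidates = [
        ("int", int),
        ("float", float),
        ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
    ]
    types = {}
    for index, name in enumerate(header):
        # Blank or missing cells are ignored so they do not force 'str'.
        values = [row[index].strip() for row in body
                  if index < len(row) and row[index].strip()]
        types[name] = "str"
        if not values:
            continue
        for type_name, parser in candidates:
            if _parses(values, parser):
                types[name] = type_name
                break
    return types


sample = "id,price,joined,name\n1,9.99,2019-08-28,Ann\n2,12,2019-09-01,Bob\n"
print(infer_column_types(sample))
```

A real inspector would also need to sample large files, try several date formats and tolerate stray bad values, which is where a package like messytables earns its keep.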

Templating tool for creating models (Go win)

For GORM and Django the ORM models are implemented directly as classes in the language rather than using an intermediate DSL, XML, etc. So to create models based on introspecting source data, metaprogramming must be used to generate code.
Templates are available in the core of Go. Also, given that it is statically typed and has no generics, for some problems that generics would solve the best alternative is to use metaprogramming. Hence templated generation of Go code is a normal Go pattern, so arguably this is better supported (in the core) in Go than in Python. For Python, code generation is rarely needed, and my original django-csvimport implementation just uses string construction and doesn't even employ one of Python's many add-on template packages, e.g. Django or Jinja2 templates (hmm, needs a rewrite!).
Note that both languages have fully functional reflection / introspection libraries in the core.
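As an illustration of generating model code by plain string construction, here is a short Python sketch. It is written in the spirit of the approach described above and is not the actual django-csvimport code: the function name, the type-name mapping and the sample columns are assumptions, although the Django field classes named in the output do exist.

```python
# Hypothetical mapping from an inferred type name to a Django field declaration.
FIELD_FOR_TYPE = {
    "int": "models.IntegerField()",
    "float": "models.FloatField()",
    "date": "models.DateField()",
    "str": "models.CharField(max_length=255)",
}


def model_source(class_name, column_types):
    """Render Django model source code from a {column name: type name} mapping."""
    lines = [
        "from django.db import models",
        "",
        "",
        f"class {class_name}(models.Model):",
    ]
    for column, type_name in column_types.items():
        lines.append(f"    {column} = {FIELD_FOR_TYPE[type_name]}")
    return "\n".join(lines) + "\n"


source = model_source("Customer", {"name": "str", "joined": "date"})
print(source)
```

The result is only text: Django itself is not imported here, so the generated module would still need to be written to a models.py file inside a configured Django app.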

CSV Parser (Python win)

Most important to this application is the quality of the CSV parser. This is where Go is sadly completely let down. Its CSV parser is frankly inadequate and can only cope with CSV that is strictly formatted according to RFC 4180.

To quote from Python's csv parser library ...

CSV format was used for many years prior to attempts to describe the format in a standardized way in RFC 4180. The lack of a well-defined standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources.

TBH Python 3's CSV parser is itself significantly stricter about format than the old Python 2 one, so certain CSV files that Python 2 happily dealt with cannot be parsed, largely because the switch to Unicode results in more critical failures related to character encoding. However, the Go parser is a whole other level of strict, and realistically it can probably handle less than 10% of the real-world CSV source files out there that you might want to scrape data from into a database, whilst Python 3's can probably cope with over 80%.
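As a small illustration of the Python side of this comparison, the standard csv module can be told about common deviations from RFC 4180 through reader parameters. The sample data below is invented; it uses a semicolon delimiter and a space after each delimiter, two of the variations that turn up in real files.

```python
import csv
import io

# Semicolon-separated, with a space after each delimiter and doubled quotes.
messy = 'name; note\n"Ann"; "said ""hi"""\nBob; plain text\n'

reader = csv.reader(io.StringIO(messy), delimiter=";", skipinitialspace=True)
rows = list(reader)
print(rows)
```

This only shows that such variations are configurable; it does not measure how many real-world files either language's parser can handle.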

I also investigated third-party Go libraries that cater for parsing a more realistic range of CSV formatting, but found none that did so.

Conclusion

So in conclusion, Python may not be a Gopher Snake but for this task it does rather eat Go for breakfast. There is no ready made third party package to deal with ingesting unknown or badly formatted data like Python's aptly named messytables. Golang may sometimes be used for writing performant concurrent data processing in data science ... but it isn't used for the scraping and cleaning data sources part of the job! However this is a minor issue compared to the major blocker of not having an existing library that can import real world (ie sloppy format) CSV files.

So I have written my Go package for pushing CSV files to databases, gormcsv, and due to Go's great concurrency features it could certainly beat django-csvimport hands down in speed terms where big data quantities of CSV sources need ingesting. But I have yet to release it, because with such poor compatibility with real CSV files there doesn't seem to be much point. However, I will hopefully persist in finishing things off, probably with a less performant workaround that pre-cleans CSV files into strict RFC 4180 prior to parsing, since implementing my own CSV parser from scratch for Go would likely break my original goal of coming up with an open source project in the language that would take no longer than a week!

Oh, and what do I think of Go? Well, I like it. I most like the concept of classes just being data structs with bags of composed methods loosely coupled to them. I least like the error handling being unseparated from normal code flow, since it can lead to poor readability due to the excessive error boilerplate stuck within the program flow. It is my new favourite (statically typed) language, but it hasn't replaced Python as my overall favourite.

↧