<p><strong>Watch the live stream:</strong></p>
<a href='https://www.youtube.com/watch?v=2dhBSF6EL-M' style='font-weight: bold;'>Watch on YouTube</a><br>
<br>
<p><strong>About the show</strong></p>
<p>Sponsored by FusionAuth: <a href="http://pythonbytes.fm/fusionauth">pythonbytes.fm/fusionauth</a></p>
<p>Special guest: <a href="https://twitter.com/ianhellen"><strong>Ian Hellen</strong></a></p>
<p><strong>Brian #1:</strong> <a href="https://radimrehurek.com/gensim/parsing/preprocessing.html"><strong>gensim.parsing.preprocessing</strong></a></p>
<ul>
<li>Problem I’m working on
<ul>
<li>Turn a blog title into a possible url
<ul>
<li>example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”</li>
<li>would like, perhaps: “twisted-testing-event-driven-asynchronous-applications”</li>
</ul></li>
</ul></li>
<li>Sub-problem: remove stop words ← this is the hard part</li>
<li>I started with an article called <a href="https://stackabuse.com/removing-stop-words-from-strings-in-python/">Removing Stop Words from Strings in Python</a>
<ul>
<li>It covered how to do this with NLTK, Gensim, and SpaCy</li>
<li>I was most successful with <code>remove_stopwords()</code> from Gensim
<ul>
<li><code>from gensim.parsing.preprocessing import remove_stopwords</code></li>
<li>It’s part of the <code>gensim.parsing.preprocessing</code> module</li>
</ul></li>
</ul></li>
<li>I wonder what’s all in there?
<ul>
<li>a treasure trove</li>
<li><code>gensim.parsing.preprocessing.preprocess_string</code> is one</li>
<li>this function applies filters to a string, with the defaults almost being just what I want:
<ul>
<li>strip_tags() </li>
<li>strip_punctuation() </li>
<li>strip_multiple_whitespaces() </li>
<li>strip_numeric() </li>
<li>remove_stopwords() </li>
<li>strip_short() </li>
<li>stem_text() ← I think I want everything except this
<ul>
<li>this one turns “Twisted” into “Twist”, not good.</li>
</ul></li>
</ul></li>
</ul></li>
<li>There’s lots of other text processing goodies in there also.</li>
<li>Oh, yeah, and Gensim is also cool.
<ul>
<li>topic modeling for training semantic NLP models</li>
</ul></li>
<li>So, I think I found a really big hammer for my little problem.
<ul>
<li>But I’m good with that</li>
</ul></li>
</ul>
<p><strong>Michael #2:</strong> <a href="https://devdocs.io/"><strong>DevDocs</strong></a></p>
<ul>
<li>via Loic Thomson</li>
<li>Gather and search a bunch of technology docs together at once</li>
<li>For example: Python + Flask + JavaScript + Vue + CSS</li>
<li>Has an offline mode for laptops / tablets</li>
<li>Installs as a PWA (sadly not on Firefox)
<img src="https://paper-attachments.dropbox.com/s_BE4BBD89C4EBAA44BDD490C3B77ECD9ACCC7896BE76EDE2104EEB73E4D14D4A9_1647539299232_offline-pwa.jpg" alt="" /></li>
</ul>
<p><strong>Ian</strong> <strong>#3:</strong> <a href="https://msticpy.readthedocs.io/"><strong>MSTICPy</strong></a></p>
<ul>
<li>MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks.</li>
<li>What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it’s a real threat or not.</li>
<li>Why Jupyter notebooks?
<ul>
<li>SOC (Security Ops Center) tools can be excellent but all have limitations</li>
<li>You can get data from anywhere</li>
<li>Use custom analysis and visualizations</li>
<li>Control the workflow, and the workflow is repeatable</li>
</ul></li>
<li>Open source package - created originally to support MS Sentinel Notebooks but now supports lots of providers. When I started this 3+ years ago I thought a lot of this would already be on PyPI - but no 😞 </li>
<li>MSTICPy has 4 main functional areas:
<ul>
<li>Data querying - import log data (Sentinel, Splunk, MS Defender, others…working on Elastic Search)</li>
<li>Enrichment - is this IP Address or domain known to be malicious?</li>
<li>Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)</li>
<li>Visualization - more specialized than traditional graphs - timelines, process trees.</li>
</ul></li>
<li>All components use pandas, Bokeh for visualizations</li>
<li>Current focus on usability, discovery of functionality and being able to chain</li>
<li>Always looking for collaborators and contributors - code, docs, queries, critiques</li>
<li><a href="https://github.com/microsoft/msticpy">https://github.com/microsoft/msticpy</a></li>
<li><a href="https://msticpy.readthedocs.io/">https://msticpy.readthedocs.io/</a></li>
</ul>
<p><img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390760844_timeseries.png" alt="Time series analysis for identifying anomalies" /></p>
<p><img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390867204_ProcessTree.png" alt="Process tree visualizer" /></p>
<p><img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390899220_ThreatIntel.png" alt="Threat intelligence browser" /></p>
<hr />
<p><strong>Brian #4:</strong> <a href="https://davidamos.dev/the-right-way-to-compare-floats-in-python/"><strong>The Right Way To Compare Floats in Python</strong></a></p>
<ul>
<li>David Amos</li>
<li>Definitely an easier read than the classic <a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a>
<ul>
<li>What many of us remember
<ul>
<li>floating point numbers aren’t exact due to representation limitations and rounding error</li>
<li>errors can accumulate</li>
<li>comparison is tricky</li>
</ul></li>
</ul></li>
<li>Be careful when comparing floating point numbers, even simple comparisons, like:
<pre><code>&gt;&gt;&gt; 0.1 + 0.2 == 0.3
False
&gt;&gt;&gt; 0.1 + 0.2 &lt;= 0.3
False</code></pre></li>
<li>David has a short but nice introduction to the problems of representation and rounding.</li>
<li>Three reasons for rounding
<ul>
<li>more significant digits than floating point allows</li>
<li>irrational numbers</li>
<li>rational but non-terminating</li>
</ul></li>
<li>So how do you compare:
<ul>
<li><code>math.isclose()</code>
<ul>
<li>be aware of <code>rel_tol</code> and <code>abs_tol</code> and when to use each.</li>
</ul></li>
<li><code>numpy.allclose()</code>, returns a boolean comparing two arrays</li>
<li><code>numpy.isclose()</code>, returns an array of booleans</li>
<li><code>pytest.approx()</code>, used a bit differently
<ul>
<li><code>0.1 + 0.2 == pytest.approx(0.3)</code></li>
<li>Also allows <code>rel</code> and <code>abs</code> comparisons</li>
</ul></li>
</ul></li>
<li>Discussion of <code>Decimal</code> and <code>Fraction</code> types
<ul>
<li>And the memory and speed hit you take when using them.</li>
</ul></li>
</ul>
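<p>A short standard-library sketch of the comparisons discussed above. The variable names are illustrative only. It shows where the default relative tolerance of <code>math.isclose()</code> helps, where <code>abs_tol</code> is needed near zero, and how <code>Decimal</code> and <code>Fraction</code> avoid the problem for these particular values.</p>

```python
import math
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

# Exact equality fails: the sum is not stored as exactly 0.3.
exact_equal = total == 0.3

# The default relative tolerance (rel_tol=1e-09) works for values away from zero.
close_by_rel = math.isclose(total, 0.3)

# Near zero a relative tolerance is too strict, so pass abs_tol as well.
tiny = 1e-12
close_to_zero_default = math.isclose(tiny, 0.0)
close_to_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)

# Decimal (built from strings) and Fraction represent these values exactly.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

print(exact_equal, close_by_rel)                  # False True
print(close_to_zero_default, close_to_zero_abs)   # False True
print(decimal_equal, fraction_equal)              # True True
```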
<p><strong>Michael #5:</strong> <a href="https://pypyr.io"><strong>Pypyr</strong></a></p>
<ul>
<li>Task runner for automation pipelines</li>
<li>For when your shell scripts get out of hand. Less tricky than a makefile.</li>
<li>Script sequential task workflow steps in yaml</li>
<li>Conditional execution, loops, error handling &amp; retries</li>
<li>Have a look at <a href="https://pypyr.io/docs/getting-started/run-your-first-pipeline/">the getting started</a>.</li>
</ul>
<p><strong>Ian</strong> <strong>#6:</strong> <a href="https://pygments.org/"><strong>Pygments</strong></a></p>
<ul>
<li>Python package that’s useful for anyone who wants to display code
<ul>
<li>Jupyter notebook Markdown and GitHub markdown let you display code with syntax highlighting. (Jupyter uses Pygments behind the scenes to do this.)</li>
<li>There are tools that convert code to image format (PNG, JPG, etc) but you lose the ability to copy/paste the code</li>
</ul></li>
<li>Pygments can intelligently render syntax-highlighted code to HTML (and other formats)</li>
<li>Applications:
<ul>
<li>Documentation (used by Sphinx/Read the Docs) - render code to HTML + CSS</li>
<li>Displaying code snippets dynamically in readable form</li>
</ul></li>
<li>Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML</li>
<li>Easy to use - 3 lines of code - example:</li>
</ul>
<pre><code>from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# Sample source to highlight
code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    print(message)
"""

# Render to a complete HTML document and show it in the notebook
display(HTML(
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
))

# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes
</code></pre>
<p><img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647392921561_Pygments-Output.png" alt="" /></p>
<ul>
<li>Output to HTML, LaTeX, image formats.</li>
<li>We use it in MSTICPy for displaying scripts used in attacks. Example:
<img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647392332348_Pygments-Code.png" alt="" /></li>
</ul>
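<p>Outside a notebook, a similar pattern might look like the sketch below, assuming Pygments is installed. It looks up a lexer by name (here JSON, one of the data formats mentioned above), renders an HTML fragment rather than a full page, and fetches the matching CSS separately. The variable names are illustrative only.</p>

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

# Sample input; any text in the lexer's language would do.
snippet = '{"name": "pygments", "formats": ["html", "latex"]}'

# Pick the lexer by its short name instead of importing a lexer class.
lexer = get_lexer_by_name("json")

# Without full=True the formatter emits only a fragment wrapped in a div
# whose CSS class is "highlight" (set explicitly here for clarity).
formatter = HtmlFormatter(cssclass="highlight")

html_fragment = highlight(snippet, lexer, formatter)

# The style rules are generated separately and scoped to that CSS class.
css_rules = formatter.get_style_defs(".highlight")

print(html_fragment)
print(css_rules[:200])
```

<p>Keeping the fragment and the CSS separate lets a page include the styles once and reuse them for many snippets.</p>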
<p><strong>Extras</strong> </p>
<p>Brian:</p>
<ul>
<li><a href="https://pypi.org/project/smart-open/"><strong>smart-open</strong></a>
<ul>
<li>one of the 3 Gensim dependencies</li>
<li>It’s for streaming large files, from really anywhere, and looks just like Python’s <code>open()</code>.</li>
</ul></li>
</ul>
<p>Michael:</p>
<ul>
<li><a href="https://docs.python.org/release/3.10.3/whatsnew/changelog.html#python-3-10-3-final"><strong>Python 3.10.3 is out</strong></a>.</li>
<li><a href="https://jordanelver.co.uk/blog/2020/06/04/fixing-commits-with-git-commit-fixup-and-git-rebase-autosquash/"><strong>git fixup</strong></a> (follow up from last week, via Adam Parkin)</li>
</ul>
<p><strong>Joke:</strong> <a href="https://twitter.com/PR0GRAMMERHUM0R/status/1504231058165882888"><strong>What’s your secret?</strong></a></p>