Python Bytes: #276 Tracking cyber intruders with Jupyter and Python

Watch the live stream: <a href='https://www.youtube.com/watch?v=2dhBSF6EL-M' style='font-weight: bold;'>Watch on YouTube</a> About the show Sponsored by FusionAuth: <a href="http://pythonbytes.fm/fusionauth">pythonbytes.fm/fusionauth</a> Special guest: <a href="https://twitter.com/ianhellen">Ian Hellen</a> Brian #1: <a href="https://radimrehurek.com/gensim/parsing/preprocessing.html">gensim.parsing.preprocessing</a> <ul> <li>Problem I’m working on <ul> <li>Turn a blog title into a possible URL <ul> <li>example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”</li> <li>would like, perhaps: “twisted-testing-event-driven-asynchronous-applications”</li> </ul></li> </ul></li> <li>Sub-problem: remove stop words ← this is the hard part</li> <li>I started with an article called <a href="https://stackabuse.com/removing-stop-words-from-strings-in-python/">Removing Stop Words from Strings in Python</a> <ul> <li>It covered how to do this with NLTK, Gensim, and spaCy</li> <li>I was most successful with <code>remove_stopwords()</code> from Gensim <ul> <li><code>from gensim.parsing.preprocessing import remove_stopwords</code></li> <li>It’s part of the <code>gensim.parsing.preprocessing</code> module</li> </ul></li> </ul></li> <li>I wonder what else is in there? <ul> <li>a treasure trove</li> <li><code>gensim.parsing.preprocessing.preprocess_string</code> is one</li> <li>this function applies filters to a string, with the defaults almost being just what I want: <ul> <li>strip_tags() </li> <li>strip_punctuation() </li> <li>strip_multiple_whitespaces() </li> <li>strip_numeric() </li> <li>remove_stopwords() </li> <li>strip_short() </li> <li>stem_text() ← I think I want everything except this <ul> <li>this one turns “Twisted” into “Twist”, not good.</li> </ul></li> </ul></li> </ul></li> <li>There are lots of other text processing goodies in there too.</li> <li>Oh, yeah, and Gensim is also cool. 
<ul> <li>topic modeling for training semantic NLP models</li> </ul></li> <li>So, I think I found a really big hammer for my little problem. <ul> <li>But I’m good with that</li> </ul></li> </ul> Michael #2: <a href="https://devdocs.io/">DevDocs</a> <ul> <li>via Loic Thomson</li> <li>Gather and search a bunch of technology docs together at once</li> <li>For example: Python + Flask + JavaScript + Vue + CSS</li> <li>Has an offline mode for laptops / tablets</li> <li>Installs as a PWA (sadly not on Firefox) <img src="https://paper-attachments.dropbox.com/s_BE4BBD89C4EBAA44BDD490C3B77ECD9ACCC7896BE76EDE2104EEB73E4D14D4A9_1647539299232_offline-pwa.jpg" alt="" /></li> </ul> Ian #3: <a href="https://msticpy.readthedocs.io/">MSTICPy</a> <ul> <li>MSTICPy is a toolset for cybersecurity investigations and hunting in Jupyter notebooks.</li> <li>What is cybersecurity hunting/investigating? Responding to security alerts and threat intelligence reports, and trawling through security logs from cloud services and hosts to determine whether something is a real threat or not.</li> <li>Why Jupyter notebooks? <ul> <li>SOC (Security Ops Center) tools can be excellent but all have limitations</li> <li>You can get data from anywhere</li> <li>Use custom analysis and visualizations</li> <li>Control the workflow, and the workflow is repeatable</li> </ul></li> <li>Open source package - created originally to support MS Sentinel Notebooks but now supports lots of providers. 
When I started this 3+ years ago I thought a lot of this would already be on PyPI - but no 😞 </li> <li>MSTICPy has 4 main functional areas: <ul> <li>Data querying - import log data (Sentinel, Splunk, MS Defender, others… working on Elasticsearch)</li> <li>Enrichment - is this IP address or domain known to be malicious?</li> <li>Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)</li> <li>Visualization - more specialized than traditional graphs - timelines, process trees.</li> </ul></li> <li>All components use pandas, and Bokeh for visualizations</li> <li>Current focus on usability, discovery of functionality and being able to chain</li> <li>Always looking for collaborators and contributors - code, docs, queries, critiques</li> <li><a href="https://github.com/microsoft/msticpy">https://github.com/microsoft/msticpy</a></li> <li><a href="https://msticpy.readthedocs.io/">https://msticpy.readthedocs.io/</a></li> </ul> <img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390760844_timeseries.png" alt="Time series analysis for identifying anomalies" /> <img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390867204_ProcessTree.png" alt="Process tree visualizer" /> <img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647390899220_ThreatIntel.png" alt="Threat intelligence browser" /> <hr /> Brian #4: <a href="https://davidamos.dev/the-right-way-to-compare-floats-in-python/">The Right Way To Compare Floats in Python</a> <ul> <li>David Amos</li> <li>Definitely an easier read than the classic <a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a> <ul> <li>What many of us remember <ul> <li>floating point numbers aren’t exact due to representation limitations and 
rounding error,</li> <li>errors can accumulate</li> <li>comparison is tricky</li> </ul></li> </ul></li> <li>Be careful when comparing floating point numbers, even with simple comparisons: <code>0.1 + 0.2 == 0.3</code> and <code>0.1 + 0.2 &lt;= 0.3</code> both evaluate to <code>False</code>.</li> <li>David has a short but nice introduction to the problems of representation and rounding.</li> <li>Three reasons for rounding <ul> <li>more significant digits than floating point allows</li> <li>irrational numbers</li> <li>rational but non-terminating</li> </ul></li> <li>So how do you compare: <ul> <li><code>math.isclose()</code> <ul> <li>be aware of <code>rel_tol</code> and <code>abs_tol</code> and when to use each.</li> </ul></li> <li><code>numpy.allclose()</code>, returns a single boolean comparing two arrays</li> <li><code>numpy.isclose()</code>, returns an array of booleans</li> <li><code>pytest.approx()</code>, used a bit differently <ul> <li><code>0.1 + 0.2 == pytest.approx(0.3)</code></li> <li>Also accepts <code>rel</code> and <code>abs</code> tolerances</li> </ul></li> </ul></li> <li>Discussion of <code>Decimal</code> and <code>Fraction</code> types <ul> <li>And the memory and speed hit you take when using them.</li> </ul></li> </ul> Michael #5: <a href="https://pypyr.io">Pypyr</a> <ul> <li>Task runner for automation pipelines</li> <li>For when your shell scripts get out of hand. Less tricky than a Makefile.</li> <li>Script sequential task workflow steps in YAML</li> <li>Conditional execution, loops, error handling &amp; retries</li> <li>Have a look at <a href="https://pypyr.io/docs/getting-started/run-your-first-pipeline/">the getting started</a>.</li> </ul> Ian #6: <a href="https://pygments.org/">Pygments</a> <ul> <li>Python package that’s useful for anyone who wants to display code <ul> <li>Jupyter notebook Markdown and GitHub Markdown let you display code with syntax highlighting. 
(Jupyter uses Pygments behind the scenes to do this.)</li> <li>There are tools that convert code to image format (PNG, JPG, etc.) but you lose the ability to copy/paste the code</li> </ul></li> <li>Pygments can intelligently render syntax-highlighted code to HTML (and other formats)</li> <li>Applications: <ul> <li>Documentation (used by Sphinx/Read the Docs) - render code to HTML + CSS</li> <li>Displaying code snippets dynamically in readable form</li> </ul></li> <li>Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML</li> <li>Easy to use - 3 lines of code - example:</li> </ul> <pre><code>from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    print(message)
"""

display(HTML(
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
))
# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes
</code></pre> <img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647392921561_Pygments-Output.png" alt="" /> <ul> <li>Output to HTML, LaTeX, image formats.</li> <li>We use it in MSTICPy for displaying scripts used in attacks. 
Example: <img src="https://paper-attachments.dropbox.com/s_9A0DA79EC37A4E770481D3239436EF54DFD2523769D91AD4EA62243A17D1D803_1647392332348_Pygments-Code.png" alt="" /></li> </ul> Extras Brian: <ul> <li><a href="https://pypi.org/project/smart-open/">smart-open</a> <ul> <li>one of the 3 Gensim dependencies</li> <li>It’s for streaming large files, from really anywhere, and looks just like Python’s <code>open()</code>.</li> </ul></li> </ul> Michael: <ul> <li><a href="https://docs.python.org/release/3.10.3/whatsnew/changelog.html#python-3-10-3-final">Python 3.10.3 is out</a>.</li> <li><a href="https://jordanelver.co.uk/blog/2020/06/04/fixing-commits-with-git-commit-fixup-and-git-rebase-autosquash/">git fixup</a> (follow up from last week, via Adam Parkin)</li> </ul> Joke: <a href="https://twitter.com/PR0GRAMMERHUM0R/status/1504231058165882888">What’s your secret?</a>
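Code sketch for Brian #4 (comparing floats): a minimal, standard-library-only illustration of the comparisons mentioned in that item. The variable names and the tolerance value <code>1e-9</code> are assumptions chosen for illustration and are not taken from David Amos's article.

```python
import math
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

# Exact comparison fails because of binary representation error.
exact_equal = total == 0.3

# math.isclose() uses a relative tolerance by default (rel_tol=1e-09).
close_default = math.isclose(total, 0.3)

# A relative tolerance is useless when comparing against zero,
# so an absolute tolerance is needed there (1e-9 is an arbitrary choice).
tiny = 1e-12
close_to_zero_rel = math.isclose(tiny, 0.0)
close_to_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)

# Decimal and Fraction represent these values exactly, at a cost in
# memory and speed.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

print(exact_equal, close_default)
print(close_to_zero_rel, close_to_zero_abs)
print(decimal_equal, fraction_equal)
```

The rough rule of thumb this is meant to show: rely on <code>rel_tol</code> for values away from zero, and add <code>abs_tol</code> when either side may be zero or very small.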
