HCI 574 - lecture 23 - glob, regex (Mar. 7, 2014)

● finish Python file system operations from lecture 22, open filesystem.py
● get and unzip lecture23.zip - will contain scripts and some play around files/folders
● scripts: using_glob.py and regex.py
● Python "demo" applications: folder_tree.py and redemo.py
● HW 6 - find files with the same name that live in different folders
● Optional: shutil module for shell commands on files/folders (copy, move, delete, etc.)
● Optional: Using the zipfile module to compress files

glob module - file/folder name pattern matching with wildcards:
● task: list all files in the current folder starting with a and ending in .txt
● use a "glob", a pattern that contains special pattern matching characters (wildcards) such as *, ?, [0-9], [a-z], !
● http://docs.python.org/library/glob.html (Modeled after UNIX style wildcard pattern matching)
● *: matches any sequence of characters (including none): *.txt finds stuff.txt and bla.txt but not bla.xml
● ?: matches exactly one character: bl?.txt finds bla.txt and blo.txt

The Python glob module (using_glob.py inside lecture23 folder)
● import glob # global module
● glob.glob() function - filename pattern matching for current folder via special "wildcards"
● files = glob.glob("*.txt") # returns a list of the files that match the pattern
● glob() returns an empty list [] if no matches are found
● pattern must be a single string: "*.txt" or r"..\*.*" or r"c:\temp\*.*"
● Note: you can use / for glob patterns, even in Windows (no need for \\)
● to glue together parts, use os.sep: "stuff" + os.sep + "folderA" + os.sep + "*.jpg"
● glob("*.txt") returns files bla.txt and blo.txt but not bla.doc
● glob("f*") returns files and folders starting with f
● */*.txt finds all txt files in all sub-folders (one level down)
● */*/* finds everything (files and folders) inside the sub-folders' sub-folders (two levels down)

more complex glob patterns
● [0-9] means a single digit from 0 to 9 ( - sets up a range)
  ○ img[1-4].jpg finds img1.jpg, img2.jpg, ..., img4.jpg
  ○ img[135].jpg finds only img1.jpg, img3.jpg and img5.jpg (no - here!)
● [a-c]* finds all files starting with a, b or c
● [!a-c]* finds files NOT starting with a, b, or c (! means not), e.g. files starting with d-z or with a digit
● brainteasers: (looking at files in lecture 23 folder):
  - what does img[0-9][0-9].jpg return?
  - what pattern returns all report files with a 3 letter month and are from 2008 or 2009?

Regular expressions (re) - complex pattern matching in Python (also called Perl style reg. expr.)
Uses another pattern matching syntax that is different(!!!) from the glob() syntax shown above! Regular expressions (re or regex) are a lot more powerful for pattern matching than glob(), but they are also quite a bit more complex. I'll only go over a tiny fraction of what you can do with re, but here are some links:
● http://docs.python.org/2/library/re.html
● docs.python.org/dev/howto/regex.html
● https://developers.google.com/edu/python/regular-expressions
● http://www.noah.org/wiki/RegEx_Python
● http://effbot.org/librarybook/re.htm

First, let's play around with the more complex pattern matching syntax that Perl style regular expressions use. Run the script redemo_GUI.py (in your lecture23 folder). Paste this into the middle window (text is also in Dear Grandson.txt) and make sure that MULTILINE is checked ON!

Dear Grandson,
My current email is grama.write@com. Or is is grama@write.com?
Pa's email is grumpyoldman@write.com. Or maybe it's grumpy@old@man@write.com?
Sorry, those funny @ signs are confusing!
Please write us soon!

We will extract all syntactically valid email addresses from this text. First manually in redemo, then in our own script.
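Before building the email pattern, it may help to see the two syntaxes side by side, since they are easy to mix up. The following is only a rough sketch for Python 3: the function name glob_vs_regex and the file names are made up for illustration, and the files are created empty in a temporary folder so nothing in your lecture23 folder is touched.

```python
import glob
import os
import re
import tempfile


def glob_vs_regex():
    """Match the same made-up file names once with glob and once with re."""
    names = ["bla.txt", "blo.txt", "blue.txt", "bla.xml", "cat.txt"]
    with tempfile.TemporaryDirectory() as folder:
        for name in names:
            # create empty play-around files
            open(os.path.join(folder, name), "w").close()

        # glob: ? stands for exactly one character, the dot is just a dot
        paths = glob.glob(os.path.join(folder, "bl?.txt"))
        glob_one_char = sorted(os.path.basename(p) for p in paths)

        # regex: . stands for exactly one character, a literal dot is written \.
        regex = re.compile(r"bl.\.txt$")
        regex_one_char = sorted(n for n in os.listdir(folder) if regex.match(n))

        # glob: [!a-b] means "one character that is not a or b", * is "anything"
        paths = glob.glob(os.path.join(folder, "[!a-b]*"))
        glob_not_ab = sorted(os.path.basename(p) for p in paths)

        # regex: the same idea is written [^a-b] followed by .*
        regex_not_ab = sorted(n for n in os.listdir(folder)
                              if re.match(r"[^a-b].*", n))

    return glob_one_char, regex_one_char, glob_not_ab, regex_not_ab


one_g, one_r, not_g, not_r = glob_vs_regex()
print(one_g, one_r)   # both: ['bla.txt', 'blo.txt']
print(not_g, not_r)   # both: ['cat.txt']
```

The main traps when switching from glob to re: the glob ? becomes a regex dot, the glob * becomes .* and the negation inside brackets changes from ! to ^.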
The pattern describing a syntactically valid email address (for this example: one that ends in com) is this:

[A-Za-z0-9.]+@[A-Za-z0-9.]+com

Paste this into the first line of redemo (check: show all matches)
● A-Z : all letters from A to Z (a range)
● [A-Za-z0-9] : [] => glue together several ranges: A-Z or a-z or 0-9 - this gives the allowed letters
● [A-Za-z0-9.] : also allow the dot (but: no space => space acts as separator!)
● + : means that any allowed letter must occur one or more times
● [A-Za-z0-9.]+ : defines a word (here: dot(s) are allowed, but spaces, dashes, etc. are not!)
● [A-Za-z0-9.]+@ : a literal letter @ that must be to the right of a word
● [A-Za-z0-9.]+com : a literal sequence of letters (com) that must be to the right of a word
● [A-Za-z0-9.]+@[A-Za-z0-9.]+com : a sequence of a word, the @, a word and the com
● (\w "word" is short for A-Za-z0-9_ and \d "digit" is short for 0-9)

Now let's use this inside Python (open reg_expr.py):

import re

s = """
Dear Grandson,
My current email is grama.write@com. Or is is grama@write.com?
Pa's email is grumpyoldman@write.com. Or maybe it's grumpy@old@man@write.com?
Sorry, those funny @ signs are confusing!
Please write us soon!
"""

# this string describes the pattern to match
pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+com"
all_matches = re.findall(pattern, s)
print(all_matches) # => ['grama@write.com', 'grumpyoldman@write.com', 'man@write.com']

# replace matches with another string
new_s = re.sub(pattern, "grandparents@write.com", s)
print(new_s)

Optional: shutil (shell utility) module - copying, moving, deleting files and folders (OS independent)
● shutil.copy("hey.txt", "folderA") # copy file hey.txt into folderA
● shutil.copy("hey.txt", "folderA/copy_of_hey.txt") # hey.txt -> folderA/copy_of_hey.txt
● http://docs.python.org/library/shutil.html

Compressing files into a zip file archive
● http://docs.python.org/library/zipfile.html
● http://www.doughellmann.com/PyMOTW/zipfile/
● uses a zipfile object called ZipFile
● make an empty zip archive, add (write) files into archive, close archive
● actual file compression must be set via ZIP_DEFLATED (this needs Python's zlib module to be available)

import zipfile
zf = zipfile.ZipFile("myzip.zip", mode="w") # make empty zip file object
zf.write("bla.txt", compress_type=zipfile.ZIP_DEFLATED) # put in zip file
zf.close() # closes write stream but object still exists!

● files (bla.txt) can have a path ("lecture23/bla.txt") but cannot be a folder
● write() caveat: does NOT automatically add sub-folders, only adds files
● Unzipping: create and open ZipFile object for read, extractall() to folder, close():

import os
zf2 = zipfile.ZipFile("myzip.zip") # open same file for reading
os.makedirs("test") # make a test folder (raises an error if it already exists)
zf2.extractall("test") # extract content of zf2 into folder test
zf2.close() # done reading

● zf.infolist() returns a list with one ZipInfo object per file in the archive; each contains date/time, comment, compressed size, etc.
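Because write() only adds single files, zipping a whole folder means walking it yourself. Here is one possible sketch (Python 3), not the only way to do it: the helpers zip_folder and list_archive are hypothetical names, and the demo folder with its two small files is made up and lives in a temporary folder.

```python
import os
import tempfile
import zipfile


def zip_folder(folder, zip_path):
    """Hypothetical helper: add every file below folder to a new zip archive."""
    with zipfile.ZipFile(zip_path, mode="w") as zf:
        for root, dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # store the path relative to folder, so the archive does
                # not contain the absolute path on this machine
                arc_name = os.path.relpath(full_path, folder)
                zf.write(full_path, arc_name, compress_type=zipfile.ZIP_DEFLATED)


def list_archive(zip_path):
    """Return (name, original size, compressed size) for each archive member."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size, info.compress_size)
                for info in zf.infolist()]


with tempfile.TemporaryDirectory() as tmp:
    # made-up play-around folder: stuff/bla.txt and stuff/folderA/hey.txt
    src = os.path.join(tmp, "stuff")
    os.makedirs(os.path.join(src, "folderA"))
    with open(os.path.join(src, "bla.txt"), "w") as f:
        f.write("bla " * 1000)
    with open(os.path.join(src, "folderA", "hey.txt"), "w") as f:
        f.write("hey")

    archive = os.path.join(tmp, "myzip.zip")  # kept outside of src on purpose
    zip_folder(src, archive)
    for name, size, packed in list_archive(archive):
        print(name, size, packed)
```

The with statement closes the archive automatically, so no explicit close() is needed. The archive is written next to the source folder rather than into it, otherwise os.walk() could pick up the half-written zip file itself.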