Text files come from various sources (e.g. local or remote) and in various formats (e.g. txt, csv/tsv, json, pkl). This post assumes that the reader is using Colab (https://colab.research.google.com/). Some parts of the code are based on the NLTK guide (NLTK - Processing Raw Text), while others use the sample data files that ship with Colab.
Sources:
(Source.1) Local machine
- Get the file path and read the content.
# print the current directory from within Python
import os
print(os.listdir('.'))

# assuming this is a Colab project that starts
# in the /content/ directory, which contains
# a sub-directory sample_data, run a magic
# command to change into the sample_data directory
%cd sample_data

# open a raw file and print its content
f = open('README.md')
raw = f.read()
print(raw[:150])
(Source.2) Remote machine
- Read the content directly from the remote location, or download it to the local machine first, then get the path and read the content.
There are three commonly used packages for fetching remote files, i.e. urllib, http.client and Requests. The Requests package is recommended for a higher-level HTTP client interface (Python Docs).
# open directly from remote storage
import requests

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = requests.get(url).content.decode('utf-8-sig')
print(raw)
# download from remote storage and open locally
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
file = url.split("/")[-1]
!wget {url}

f = open(file)
raw = f.read()
print(raw[:150])
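Since urllib was mentioned above but not demonstrated, here is a minimal sketch of the same direct fetch using the standard-library urllib.request (no third-party package needed), against the same Project Gutenberg URL:

# open directly from remote storage using urllib
from urllib.request import urlopen

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read().decode('utf-8-sig')
print(raw[:150])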
(Source.3) Web Page
# open a web page and scrape the list of
# emotion types using Beautiful Soup
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Emotion_classification"
response = requests.get(url).content.decode('utf-8-sig')
soup = BeautifulSoup(response, 'html.parser')

listEmoType = []
# the first div-col division holds the list of emotion types
for a in soup.find_all("div", class_="div-col")[:1]:
    for b in a.find_all("a"):
        listEmoType.append(b.text)

print(len(listEmoType))
print("\n".join(listEmoType))
Formats:
(Format.1) Text (TXT).
filepath = '/content/sample_data/README.md'
f = open(filepath)
raw = f.read()
print(raw[:150])
(Format.2) Comma Separated Value (CSV).
# import csv with the default delimiter
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
with open(sourcefilepath, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
print(data[:150])
# export csv to tsv format
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
targetfilepath = sourcefilepath.replace(".csv", ".tsv")
csv.writer(open(targetfilepath, 'w+'), delimiter='\t') \
   .writerows(csv.reader(open(sourcefilepath)))
# import csv with tab delimiter
import csv

filepath = '/content/sample_data/california_housing_test.tsv'
with open(filepath, newline='') as f:
    reader = csv.reader(f, delimiter="\t")  # set the delimiter
    data = list(reader)
print(data[:150])
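As a side note, when columns are better addressed by header name than by position, the csv module's DictReader can be used instead; a minimal sketch against the same sample file:

# import csv with rows keyed by the column headers
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
with open(sourcefilepath, newline='') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
print(rows[0])  # first data row as a dict keyed by column name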
(Format.3) JavaScript Object Notation (JSON).
JSON is a format for storing and exchanging data. Python has a built-in module called json for working with JSON data.
# open a json file
import json

filepath = '/content/sample_data/anscombe.json'
with open(filepath, 'r') as f:
    data = json.load(f)
print(data)
# continue from the previous exercise:
# save the json data into a new file
import json

targetfilepath = '/content/sample_data/newdata.json'
with open(targetfilepath, 'w') as json_file:
    json.dump(data, json_file)
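JSON is also commonly exchanged as a plain string (e.g. over HTTP); the json module provides dumps and loads for this. A minimal sketch continuing from the data object loaded above:

# serialize to / from a JSON string (no file involved)
import json

jsonString = json.dumps(data)       # Python object -> JSON string
dataAgain = json.loads(jsonString)  # JSON string -> Python object
print(dataAgain == data)            # True: the round trip is lossless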
(Format.4) Pickle (PKL).
Pickle is a Python module that serializes data for storage. The data is stored in a binary format. Pickle is useful for efficient data loading in machine learning activities.
# write pickle
import pickle

# create data
listData = [{'good': 10, 'morning': 4}]
print(type(listData))

# open a file and dump the data;
# a new file will be created
with open('fileData.pkl', 'wb') as fileData:
    pickle.dump(listData, fileData)
# read pickle
import pickle

# open the file in binary mode and load the data
with open('fileData.pkl', 'rb') as fileData:
    listData = pickle.load(fileData)
print(type(listData))
print(listData)